Computer Vision and Pattern Recognition 139
☆ GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectories Generation in End-to-End Autonomous Driving
We propose GoalFlow, an end-to-end autonomous driving method for generating
high-quality multimodal trajectories. In autonomous driving scenarios, there is
rarely a single suitable trajectory. Recent methods have increasingly focused
on modeling multimodal trajectory distributions. However, they suffer from
trajectory selection complexity and reduced trajectory quality due to high
trajectory divergence and inconsistencies between guidance and scene
information. To address these issues, we introduce GoalFlow, a novel method
that effectively constrains the generative process to produce high-quality,
multimodal trajectories. To resolve the trajectory divergence problem inherent
in diffusion-based methods, GoalFlow constrains the generated trajectories by
introducing a goal point. GoalFlow establishes a novel scoring mechanism that
selects the most appropriate goal point from the candidate points based on
scene information. Furthermore, GoalFlow employs an efficient generative
method, Flow Matching, to generate multimodal trajectories, and incorporates a
refined scoring mechanism to select the optimal trajectory from the candidates.
Our experimental results, validated on the Navsim benchmark \cite{Dauner2024_navsim},
demonstrate that GoalFlow achieves state-of-the-art performance, delivering
robust multimodal trajectories for autonomous driving. GoalFlow achieves a PDMS
of 90.3, significantly surpassing other methods. Compared with other
diffusion-policy-based methods, our approach requires only a single denoising
step to obtain excellent performance. The code is available at
https://github.com/YvanYin/GoalFlow.
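The abstract above does not give implementation details, but the flow-matching objective it builds on is standard. The sketch below is a minimal, hypothetical illustration of a conditional flow-matching training step with a goal-point/scene condition; the network VelocityNet, the tensor shapes, and the way the condition is injected are assumptions, not GoalFlow's actual implementation.

```python
import torch
import torch.nn as nn

# Hypothetical velocity field: predicts d(traj)/dt given the noisy trajectory,
# the time t, and a conditioning vector (e.g., an encoded goal point + scene).
class VelocityNet(nn.Module):
    def __init__(self, traj_dim=40, cond_dim=64, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(traj_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, traj_dim),
        )

    def forward(self, x_t, t, cond):
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def flow_matching_loss(model, traj, cond):
    """Rectified-flow-style objective: regress the straight-line velocity."""
    noise = torch.randn_like(traj)            # x_0 ~ N(0, I)
    t = torch.rand(traj.shape[0], 1)          # t ~ U(0, 1)
    x_t = (1 - t) * noise + t * traj          # linear interpolation between noise and data
    target_v = traj - noise                   # constant velocity along the straight path
    pred_v = model(x_t, t, cond)
    return ((pred_v - target_v) ** 2).mean()

# One-step sampling (the abstract notes a single denoising step suffices):
# x_1 ~= x_0 + v(x_0, t=0, cond), with cond built from the selected goal point.
model = VelocityNet()
traj = torch.randn(8, 40)   # 8 flattened trajectories (e.g., 20 waypoints x 2), assumed shape
cond = torch.randn(8, 64)   # goal-point / scene embedding, assumed shape
loss = flow_matching_loss(model, traj, cond)
loss.backward()
```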
☆ Fairness-Aware Low-Rank Adaptation Under Demographic Privacy Constraints
Pre-trained foundation models can be adapted for specific tasks using
Low-Rank Adaptation (LoRA). However, the fairness properties of these adapted
classifiers remain underexplored. Existing fairness-aware fine-tuning methods
rely on direct access to sensitive attributes or their predictors, but in
practice, these sensitive attributes are often held under strict consumer
privacy controls, and neither the attributes nor their predictors are available
to model developers, hampering the development of fair models. To address this
issue, we introduce a set of LoRA-based fine-tuning methods that can be trained
in a distributed fashion, where model developers and fairness auditors
collaborate without sharing sensitive attributes or predictors. In this paper,
we evaluate three such methods - sensitive unlearning, adversarial training,
and orthogonality loss - against a fairness-unaware baseline, using experiments
on the CelebA and UTK-Face datasets with an ImageNet pre-trained ViT-Base
model. We find that orthogonality loss consistently reduces bias while
maintaining or improving utility, whereas adversarial training improves False
Positive Rate Parity and Demographic Parity in some cases, and sensitive
unlearning provides no clear benefit. In tasks where significant biases are
present, distributed fairness-aware fine-tuning methods can effectively
eliminate bias without compromising consumer privacy and, in most cases,
improve model utility.
☆ Task-oriented Uncertainty Collaborative Learning for Label-Efficient Brain Tumor Segmentation
Zhenxuan Zhang, Hongjie Wu, Jiahao Huang, Baihong Xie, Zhifan Gao, Junxian Du, Pete Lally, Guang Yang
Multi-contrast magnetic resonance imaging (MRI) plays a vital role in brain
tumor segmentation and diagnosis by leveraging complementary information from
different contrasts. Each contrast highlights specific tumor characteristics,
enabling a comprehensive understanding of tumor morphology, edema, and
pathological heterogeneity. However, existing methods still face the challenges
of multi-level specificity perception across different contrasts, especially
with limited annotations. These challenges include data heterogeneity,
granularity differences, and interference from redundant information. To
address these limitations, we propose a Task-oriented Uncertainty Collaborative
Learning (TUCL) framework for multi-contrast MRI segmentation. TUCL introduces
a task-oriented prompt attention (TPA) module with intra-prompt and
cross-prompt attention mechanisms to dynamically model feature interactions
across contrasts and tasks. Additionally, a cyclic process is designed to map
the predictions back to the prompt to ensure that the prompts are effectively
utilized. In the decoding stage, the TUCL framework proposes a dual-path
uncertainty refinement (DUR) strategy which ensures robust segmentation by
refining predictions iteratively. Extensive experimental results on limited
labeled data demonstrate that TUCL significantly improves segmentation accuracy
(88.2% in Dice and 10.853 mm in HD95). These results show that TUCL has the potential to
extract multi-contrast information and reduce the reliance on extensive
annotations. The code is available at:
https://github.com/Zhenxuan-Zhang/TUCL_BrainSeg.
☆ AIM-Fair: Advancing Algorithmic Fairness via Selectively Fine-Tuning Biased Models with Contextual Synthetic Data CVPR 2025
Recent advances in generative models have sparked research on improving model
fairness with AI-generated data. However, existing methods often face
limitations in the diversity and quality of synthetic data, leading to
compromised fairness and overall model accuracy. Moreover, many approaches rely
on the availability of demographic group labels, which are often costly to
annotate. This paper proposes AIM-Fair, aiming to overcome these limitations
and harness the potential of cutting-edge generative models in promoting
algorithmic fairness. We investigate a fine-tuning paradigm starting from a
biased model initially trained on real-world data without demographic
annotations. This model is then fine-tuned using unbiased synthetic data
generated by a state-of-the-art diffusion model to improve its fairness. Two
key challenges are identified in this fine-tuning paradigm: 1) the low quality
of synthetic data, which can still happen even with advanced generative models,
and 2) the domain and bias gap between real and synthetic data. To address the
limitation of synthetic data quality, we propose Contextual Synthetic Data
Generation (CSDG) to generate data using a text-to-image diffusion model (T2I)
with prompts generated by a context-aware LLM, ensuring both data diversity and
control of bias in synthetic data. To resolve domain and bias shifts, we
introduce a novel selective fine-tuning scheme in which only model parameters
more sensitive to bias and less sensitive to domain shift are updated.
Experiments on CelebA and UTKFace datasets show that our AIM-Fair improves
model fairness while maintaining utility, outperforming both fully and
partially fine-tuned approaches to model fairness.
comment: Accepted at CVPR 2025. Github:
https://github.com/zengqunzhao/AIM-Fair. Project page:
https://zengqunzhao.github.io/AIMFair
☆ NoT: Federated Unlearning via Weight Negation
Federated unlearning (FU) aims to remove a participant's data contributions
from a trained federated learning (FL) model, ensuring privacy and regulatory
compliance. Traditional FU methods often depend on auxiliary storage on either
the client or server side or require direct access to the data targeted for
removal, a dependency that may not be feasible if the data is no longer
available. To overcome these limitations, we propose NoT, a novel and efficient
FU algorithm based on weight negation (multiplying by -1), which circumvents
the need for additional storage and access to the target data. We argue that
effective and efficient unlearning can be achieved by perturbing model
parameters away from the set of optimal parameters while keeping them
well-positioned for quick re-optimization. This technique, though seemingly contradictory, is
theoretically grounded: we prove that the weight negation perturbation
effectively disrupts inter-layer co-adaptation, inducing unlearning while
preserving an approximate optimality property, thereby enabling rapid recovery.
Experimental results across three datasets and three model architectures
demonstrate that NoT significantly outperforms existing baselines in unlearning
efficacy as well as in communication and computational efficiency.
comment: The 42nd IEEE/CVF Conference on Computer Vision and Pattern
Recognition, Nashville TN, US. 2025
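As a rough illustration of the perturbation described in the NoT abstract above (weight negation, i.e. multiplying parameters by -1), the sketch below negates a model's weights before the federation continues training to recover. Which layers are negated and how recovery is scheduled are assumptions, not the paper's exact procedure.

```python
import torch

@torch.no_grad()
def negate_weights(model, layer_filter=lambda name: True):
    """Multiply selected parameters by -1 (the weight-negation perturbation
    described in the abstract). Layer selection here is an assumption."""
    for name, param in model.named_parameters():
        if layer_filter(name):
            param.mul_(-1.0)

# Usage sketch: perturb the global model, then continue standard federated
# rounds (e.g., FedAvg) on the remaining clients so the model re-optimizes.
model = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, 10))
negate_weights(model, layer_filter=lambda n: "weight" in n)
```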
☆ BEHAVIOR Robot Suite: Streamlining Real-World Whole-Body Manipulation for Everyday Household Activities
Yunfan Jiang, Ruohan Zhang, Josiah Wong, Chen Wang, Yanjie Ze, Hang Yin, Cem Gokmen, Shuran Song, Jiajun Wu, Li Fei-Fei
Real-world household tasks present significant challenges for mobile
manipulation robots. An analysis of existing robotics benchmarks reveals that
successful task performance hinges on three key whole-body control
capabilities: bimanual coordination, stable and precise navigation, and
extensive end-effector reachability. Achieving these capabilities requires
careful hardware design, but the resulting system complexity further
complicates visuomotor policy learning. To address these challenges, we
introduce the BEHAVIOR Robot Suite (BRS), a comprehensive framework for
whole-body manipulation in diverse household tasks. Built on a bimanual,
wheeled robot with a 4-DoF torso, BRS integrates a cost-effective whole-body
teleoperation interface for data collection and a novel algorithm for learning
whole-body visuomotor policies. We evaluate BRS on five challenging household
tasks that not only emphasize the three core capabilities but also introduce
additional complexities, such as long-range navigation, interaction with
articulated and deformable objects, and manipulation in confined spaces. We
believe that BRS's integrated robotic embodiment, data collection interface,
and learning framework mark a significant step toward enabling real-world
whole-body manipulation for everyday household tasks. BRS is open-sourced at
https://behavior-robot-suite.github.io/
comment: Project website: https://behavior-robot-suite.github.io/
☆ VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control
Video inpainting, which aims to restore corrupted video content, has
experienced substantial progress. Despite these advances, existing methods,
whether propagating unmasked region pixels through optical flow and receptive
field priors, or extending image-inpainting models temporally, face challenges
in generating fully masked objects or balancing the competing objectives of
background context preservation and foreground generation in one model,
respectively. To address these limitations, we propose a novel dual-stream
paradigm VideoPainter that incorporates an efficient context encoder
(comprising only 6% of the backbone parameters) to process masked videos and
inject backbone-aware background contextual cues into any pre-trained video DiT,
producing semantically consistent content in a plug-and-play manner. This
architectural separation significantly reduces the model's learning complexity
while enabling nuanced integration of crucial background context. We also
introduce a novel target region ID resampling technique that enables any-length
video inpainting, greatly enhancing its practical applicability. Additionally,
we establish a scalable dataset pipeline leveraging current vision
understanding models, contributing VPData and VPBench, the largest video
inpainting dataset and benchmark to date with over 390K diverse clips, to
facilitate segmentation-based inpainting training and assessment. Using
inpainting as a pipeline basis, we also explore downstream applications
including video editing and video editing pair data generation, demonstrating
competitive performance and significant practical potential. Extensive
experiments demonstrate VideoPainter's superior performance in both any-length
video inpainting and editing, across eight key metrics, including video
quality, mask region preservation, and textual coherence.
comment: Project page available at
https://yxbian23.github.io/project/video-painter
☆ TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models
We present TrajectoryCrafter, a novel approach to redirect camera
trajectories for monocular videos. By disentangling deterministic view
transformations from stochastic content generation, our method achieves precise
control over user-specified camera trajectories. We propose a novel dual-stream
conditional video diffusion model that concurrently integrates point cloud
renders and source videos as conditions, ensuring accurate view transformations
and coherent 4D content generation. Instead of leveraging scarce multi-view
videos, we curate a hybrid training dataset combining web-scale monocular
videos with static multi-view datasets, by our innovative double-reprojection
strategy, significantly fostering robust generalization across diverse scenes.
Extensive evaluations on multi-view and large-scale monocular videos
demonstrate the superior performance of our method.
comment: Project webpage: https://trajectorycrafter.github.io/
☆ Joint 3D Point Cloud Segmentation using Real-Sim Loop: From Panels to Trees and Branches ICRA 2025
Modern orchards are planted in structured rows with distinct panel divisions
to improve management. Accurate and efficient joint segmentation of point cloud
from Panel to Tree and Branch (P2TB) is essential for robotic operations.
However, most current segmentation methods focus on single instance
segmentation and depend on a sequence of deep networks to perform joint tasks.
This strategy hinders the use of hierarchical information embedded in the data,
leading to both error accumulation and increased costs for annotation and
computation, which limits its scalability for real-world applications. In this
study, we proposed a novel approach that incorporated a Real2Sim L-TreeGen for
training data generation and a joint model (J-P2TB) designed for the P2TB task.
The J-P2TB model, trained on the generated simulation dataset, was used for
joint segmentation of real-world panel point clouds via zero-shot learning.
Compared to representative methods, our model outperformed them in most
segmentation metrics while using 40% fewer learnable parameters. This Sim2Real
result highlighted the efficacy of L-TreeGen in model training and the
performance of J-P2TB for joint segmentation, demonstrating its strong
accuracy, efficiency, and generalizability for real-world applications. These
improvements would not only greatly benefit the development of robots for
automated orchard operations but also advance digital twin technology.
comment: Accepted by ICRA 2025
☆ FMT: A Multimodal Pneumonia Detection Model Based on Stacking MOE Framework
Artificial intelligence has shown the potential to improve diagnostic
accuracy through medical image analysis for pneumonia diagnosis. However,
traditional multimodal approaches often fail to address real-world challenges
such as incomplete data and modality loss. In this study, a Flexible Multimodal
Transformer (FMT) was proposed, which uses ResNet-50 and BERT for joint
representation learning, followed by a dynamic masked attention strategy that
simulates clinical modality loss to improve robustness; finally, a sequential
mixture of experts (MOE) architecture was used to achieve multi-level decision
refinement. After evaluation on a small multimodal pneumonia dataset, FMT
achieved state-of-the-art performance with 94% accuracy, 95% recall, and 93% F1
score, outperforming single-modal baselines (ResNet: 89%; BERT: 79%) and the
medical benchmark CheXMed (90%), providing a scalable solution for multimodal
diagnosis of pneumonia in resource-constrained medical settings.
☆ Conformal Prediction for Image Segmentation Using Morphological Prediction Sets
Image segmentation is a challenging task influenced by multiple sources of
uncertainty, such as the data labeling process or the sampling of training
data. In this paper we focus on binary segmentation and address these
challenges using conformal prediction, a family of model- and data-agnostic
methods for uncertainty quantification that provide finite-sample theoretical
guarantees and are applicable to any pretrained predictor. Our approach involves
computing nonconformity scores, a type of prediction residual, on held-out
calibration data not used during training. We use dilation, one of the
fundamental operations in mathematical morphology, to construct a margin added
to the borders of predicted segmentation masks. At inference, the predicted set
formed by the mask and its margin contains the ground-truth mask with high
probability, at a confidence level specified by the user. The size of the
margin serves as an indicator of predictive uncertainty for a given model and
dataset. We work in a regime of minimal information as we do not require any
feedback from the predictor: only the predicted masks are needed for computing
the prediction sets. Hence, our method is applicable to any segmentation model,
including those based on deep learning; we evaluate our approach on several
medical imaging applications.
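A minimal sketch of the calibration logic described in the abstract above, assuming 2D binary masks as NumPy arrays: the nonconformity score of a calibration pair is taken here as the smallest number of dilation steps after which the predicted mask covers the ground truth, and the prediction set at inference is the mask dilated by the calibrated quantile radius. The exact score, structuring element, and quantile handling are assumptions, not the paper's definitions.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def covering_radius(pred_mask, gt_mask, max_radius=50):
    """Smallest number of dilation steps after which pred_mask covers gt_mask."""
    covered = pred_mask.astype(bool)
    for r in range(max_radius + 1):
        if gt_mask[~covered].sum() == 0:   # every ground-truth pixel is covered
            return r
        covered = binary_dilation(covered)
    return max_radius

def calibrate_margin(pred_masks, gt_masks, alpha=0.1):
    """Conformal quantile of the scores on held-out calibration data."""
    scores = np.array([covering_radius(p, g) for p, g in zip(pred_masks, gt_masks)])
    n = len(scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n             # finite-sample correction
    return int(np.quantile(scores, min(q, 1.0), method="higher"))

def prediction_set(pred_mask, radius):
    """Dilate the predicted mask by the calibrated margin (no-op if radius is 0)."""
    mask = pred_mask.astype(bool)
    return binary_dilation(mask, iterations=radius) if radius > 0 else mask
```

At the user-chosen level alpha, the dilated mask then contains the ground truth with probability at least 1 - alpha on exchangeable data, which is the standard split-conformal guarantee.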
☆ CACTUS: An Open Dataset and Framework for Automated Cardiac Assessment and Classification of Ultrasound Images Using Deep Transfer Learning
Hanae Elmekki, Ahmed Alagha, Hani Sami, Amanda Spilkin, Antonela Mariel Zanuttini, Ehsan Zakeri, Jamal Bentahar, Lyes Kadem, Wen-Fang Xie, Philippe Pibarot, Rabeb Mizouni, Hadi Otrok, Shakti Singh, Azzam Mourad
Cardiac ultrasound (US) scanning is a commonly used technique in cardiology
to diagnose the health of the heart and its proper functioning. Therefore, it
is necessary to consider ways to automate these tasks and assist medical
professionals in classifying and assessing cardiac US images. Machine learning
(ML) techniques are regarded as a prominent solution due to their success in
numerous applications aimed at enhancing the medical field, including
addressing the shortage of echography technicians. However, the limited
availability of medical data presents a significant barrier to applying ML in
cardiology, particularly regarding US images of the heart. This paper addresses
this challenge by introducing the first open graded dataset for Cardiac
Assessment and ClassificaTion of UltraSound (CACTUS), which is available
online. This dataset contains images obtained from scanning a CAE Blue Phantom
and representing various heart views and different quality levels, exceeding
the conventional cardiac views typically found in the literature. Additionally,
the paper introduces a Deep Learning (DL) framework consisting of two main
components. The first component classifies cardiac US images based on the heart
view using a Convolutional Neural Network (CNN). The second component uses
Transfer Learning (TL) to fine-tune the knowledge from the first component and
create a model for grading and assessing cardiac images. The framework
demonstrates high performance in both classification and grading, achieving up
to 99.43% accuracy and as low as 0.3067 error, respectively. To showcase its
robustness, the framework is further fine-tuned using new images representing
additional cardiac views and compared to several other state-of-the-art
architectures. The framework's outcomes and performance in handling real-time
scans were also assessed using a questionnaire answered by cardiac experts.
☆ D2GV: Deformable 2D Gaussian Splatting for Video Representation in 400FPS
Implicit Neural Representations (INRs) have emerged as a powerful approach
for video representation, offering versatility across tasks such as compression
and inpainting. However, their implicit formulation limits both
interpretability and efficacy, undermining their practicality as a
comprehensive solution. We propose a novel video representation based on
deformable 2D Gaussian splatting, dubbed D2GV, which aims to achieve three key
objectives: 1) improved efficiency while delivering superior quality; 2)
enhanced scalability and interpretability; and 3) increased friendliness for
downstream tasks. Specifically, we initially divide the video sequence into
fixed-length Groups of Pictures (GoP) to allow parallel training and linear
scalability with video length. For each GoP, D2GV represents video frames by
applying differentiable rasterization to 2D Gaussians, which are deformed from
a canonical space into their corresponding timestamps. Notably, leveraging
efficient CUDA-based rasterization, D2GV converges fast and decodes at speeds
exceeding 400 FPS, while delivering quality that matches or surpasses
state-of-the-art INRs. Moreover, we incorporate a learnable pruning and
quantization strategy to streamline D2GV into a more compact representation. We
demonstrate D2GV's versatility in tasks including video interpolation,
inpainting and denoising, underscoring its potential as a promising solution
for video representation. Code is available at:
https://github.com/Evan-sudo/D2GV.
☆ Anti-Diffusion: Preventing Abuse of Modifications of Diffusion-Based Models
Although diffusion-based techniques have shown remarkable success in image
generation and editing tasks, their abuse can lead to severe negative social
impacts. Recently, some works have been proposed to provide defense against the
abuse of diffusion-based methods. However, their protection may be limited in
specific scenarios by manually defined prompts or the stable diffusion (SD)
version. Furthermore, these methods solely focus on tuning methods, overlooking
editing methods that could also pose a significant threat. In this work, we
propose Anti-Diffusion, a privacy protection system designed for general
diffusion-based methods, applicable to both tuning and editing techniques. To
mitigate the limitations of manually defined prompts on defense performance, we
introduce the prompt tuning (PT) strategy that enables precise expression of
original images. To provide defense against both tuning and editing methods, we
propose the semantic disturbance loss (SDL) to disrupt the semantic information
of protected images. Given the limited research on the defense against editing
methods, we develop a dataset named Defense-Edit to assess the defense
performance of various methods. Experiments demonstrate that our Anti-Diffusion
achieves superior defense performance across a wide range of diffusion-based
techniques in different scenarios.
☆ QArtSR: Quantization via Reverse-Module and Timestep-Retraining in One-Step Diffusion based Image Super-Resolution
Libo Zhu, Haotong Qin, Kaicheng Yang, Wenbo Li, Yong Guo, Yulun Zhang, Susanto Rahardja, Xiaokang Yang
One-step diffusion-based image super-resolution (OSDSR) models are showing
increasingly superior performance nowadays. However, although their denoising
steps are reduced to one and they can be quantized to 8-bit to reduce the costs
further, there is still significant potential to quantize OSDSR to lower
bits. To explore more possibilities of quantized OSDSR, we propose an efficient
method, Quantization via reverse-module and timestep-retraining for OSDSR,
named QArtSR. Firstly, we investigate the influence of timestep value on the
performance of quantized models. Then, we propose Timestep Retraining
Quantization (TRQ) and Reversed Per-module Quantization (RPQ) strategies to
calibrate the quantized model. Meanwhile, we adopt the module and image losses
to update all quantized modules. We only update the parameters in quantization
finetuning components, excluding the original weights. To ensure that all
modules are fully finetuned, we add an extended end-to-end training stage after
the per-module stage. Our 4-bit and 2-bit quantization experimental results
indicate that QArtSR obtains superior effects against the recent leading
comparison methods. The performance of 4-bit QArtSR is close to the
full-precision one. Our code will be released at
https://github.com/libozhu03/QArtSR.
☆ Novel Object 6D Pose Estimation with a Single Reference View
Existing novel object 6D pose estimation methods typically rely on CAD models
or dense reference views, which are both difficult to acquire. Using only a
single reference view is more scalable, but challenging due to large pose
discrepancies and limited geometric and spatial information. To address these
issues, we propose a Single-Reference-based novel object 6D (SinRef-6D) pose
estimation method. Our key idea is to iteratively establish point-wise
alignment in the camera coordinate system based on state space models (SSMs).
Specifically, iterative camera-space point-wise alignment can effectively
handle large pose discrepancies, while our proposed RGB and Points SSMs can
capture long-range dependencies and spatial information from a single view,
offering linear complexity and superior spatial modeling capability. Once
pre-trained on synthetic data, SinRef-6D can estimate the 6D pose of a novel
object using only a single reference view, without requiring retraining or a
CAD model. Extensive experiments on six popular datasets and real-world robotic
scenes demonstrate that we achieve on-par performance with CAD-based and dense
reference view-based methods, despite operating in the more challenging single
reference setting. Code will be released at
https://github.com/CNJianLiu/SinRef-6D.
comment: 17 pages, 12 figures (including supplementary material)
☆ TomatoScanner: phenotyping tomato fruit based on only RGB image
In tomato greenhouses, phenotypic measurement helps researchers
and farmers monitor crop growth and precisely control environmental
conditions in time, leading to better quality and higher yield. Traditional
phenotyping mainly relies on manual measurement, which is accurate but
inefficient and, more importantly, endangers the health and safety of people.
Several studies have explored computer vision-based methods to replace manual
phenotyping. However, 2D-based methods need extra calibration, cause destruction
to the fruit, or can only measure limited and less meaningful traits, while
3D-based methods need an extra depth camera, which is expensive and unacceptable for most farmers. In
this paper, we propose a non-contact tomato fruit phenotyping method, titled
TomatoScanner, where RGB image is all you need for input. First, pixel feature
is extracted by instance segmentation of our proposed EdgeYOLO with
preprocessing of individual separation and pose correction. Second, depth
feature is extracted by depth estimation of Depth Pro. Third, pixel and depth
feature are fused to output phenotype results in reality. We establish
self-built Tomato Phenotype Dataset to test TomatoScanner, which achieves
excellent phenotyping on width, height, vertical area and volume, with median
relative error of 5.63%, 7.03%, -0.64% and 37.06%, respectively. We propose and
add three innovative modules - EdgeAttention, EdgeLoss and EdgeBoost - into
EdgeYOLO, to enhance the segmentation accuracy on edge portion. Precision and
mean Edge Error greatly improve from 0.943 and 5.641% to 0.986 and 2.963%,
respectively. Meanwhile, EdgeYOLO keeps lightweight and efficient, with 48.7 M
weights size and 76.34 FPS. Codes and datasets:
https://github.com/AlexTraveling/TomatoScanner.
comment: 12 pages, 37 figures. Codes and datasets are open-sourced in
https://github.com/AlexTraveling/TomatoScanner
☆ Stereo Any Video: Temporally Consistent Stereo Matching
This paper introduces Stereo Any Video, a powerful framework for video stereo
matching. It can estimate spatially accurate and temporally consistent
disparities without relying on auxiliary information such as camera poses or
optical flow. The strong capability is driven by rich priors from monocular
video depth models, which are integrated with convolutional features to produce
stable representations. To further enhance performance, key architectural
innovations are introduced: all-to-all-pairs correlation, which constructs
smooth and robust matching cost volumes, and temporal convex upsampling, which
improves temporal coherence. These components collectively ensure robustness,
accuracy, and temporal consistency, setting a new standard in video stereo
matching. Extensive experiments demonstrate that our method achieves
state-of-the-art performance across multiple datasets both qualitatively and
quantitatively in zero-shot settings, as well as strong generalization to
real-world indoor and outdoor scenarios.
☆ Pi-GPS: Enhancing Geometry Problem Solving by Unleashing the Power of Diagrammatic Information
Geometry problem solving has garnered increasing attention due to its
potential applications in the field of intelligent education. Inspired by the
observation that text often introduces ambiguities that diagrams can clarify,
this paper presents Pi-GPS, a novel framework that unleashes the power of
diagrammatic information to resolve textual ambiguities, an aspect largely
overlooked in prior research. Specifically, we design a micro module comprising
a rectifier and verifier: the rectifier employs MLLMs to disambiguate text
based on the diagrammatic context, while the verifier ensures that the rectified
output adheres to geometric rules, mitigating model hallucinations.
Additionally, we explore the impact of LLMs on the theorem predictor based on the
disambiguated formal language. Empirical results demonstrate that Pi-GPS
surpasses state-of-the-art models, achieving a nearly 10% improvement on
Geometry3K over prior neural-symbolic approaches. We hope this work highlights
the significance of resolving textual ambiguity in multimodal mathematical
reasoning, a crucial factor limiting performance.
☆ Disconnect to Connect: A Data Augmentation Method for Improving Topology Accuracy in Image Segmentation
Juan Miguel Valverde, Maja Østergaard, Adrian Rodriguez-Palomo, Peter Alling Strange Vibe, Nina Kølln Wittig, Henrik Birkedal, Anders Bjorholm Dahl
Accurate segmentation of thin, tubular structures (e.g., blood vessels) is
challenging for deep neural networks. These networks classify individual
pixels, and even minor misclassifications can break the thin connections within
these structures. Existing methods for improving topology accuracy, such as
topology loss functions, rely on very precise, topologically-accurate training
labels, which are difficult to obtain. This is because annotating images,
especially 3D images, is extremely laborious and time-consuming. Low image
resolution and contrast further complicate the annotation by causing tubular
structures to appear disconnected. We present CoLeTra, a data augmentation
strategy that integrates into models the prior knowledge that structures that
appear broken are actually connected. This is achieved by creating images with
the appearance of disconnected structures while maintaining the original
labels. Our extensive experiments, involving different architectures, loss
functions, and datasets, demonstrate that CoLeTra leads to topologically more
accurate segmentations while often improving the Dice coefficient and
Hausdorff distance. CoLeTra's hyper-parameters are intuitive to tune, and our
sensitivity analysis shows that CoLeTra is robust to changes in these
hyper-parameters. We also release a dataset specifically suited for image
segmentation methods with a focus on topology accuracy. CoLeTra's code can be
found at https://github.com/jmlipman/CoLeTra.
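The abstract above does not specify the augmentation itself; as a rough, hypothetical sketch of the general idea (making connected structures look broken while keeping the original label), the code below overwrites a few small patches centered on foreground pixels of a 2D image with a background-like intensity and leaves the label untouched. Patch shape, size, and intensity model are assumptions, not CoLeTra's actual procedure.

```python
import numpy as np

def coletra_like_augment(image, label, n_patches=5, patch_size=7, rng=None):
    """Simulate apparent disconnections on a 2D image: overwrite a few small
    regions lying on the foreground with a background-like intensity while
    keeping the label unchanged. Illustrative sketch only."""
    rng = rng or np.random.default_rng()
    aug = image.copy()
    fg = np.argwhere(label > 0)
    if len(fg) == 0:
        return aug, label
    bg_value = float(np.median(image[label == 0])) if (label == 0).any() else 0.0
    half = patch_size // 2
    for _ in range(n_patches):
        y, x = fg[rng.integers(len(fg))]
        y0, y1 = max(0, y - half), min(image.shape[0], y + half + 1)
        x0, x1 = max(0, x - half), min(image.shape[1], x + half + 1)
        aug[y0:y1, x0:x1] = bg_value     # image looks disconnected here
    return aug, label                    # label is intentionally unchanged
```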
☆ S4M: Segment Anything with 4 Extreme Points
Adrien Meyer, Lorenzo Arboit, Giuseppe Massimiani, Francesco Brucchi, Luca Emanuele Amodio, Didier Mutter, Nicolas Padoy
The Segment Anything Model (SAM) has revolutionized open-set interactive
image segmentation, inspiring numerous adapters for the medical domain.
However, SAM primarily relies on sparse prompts such as point or bounding box,
which may be suboptimal for fine-grained instance segmentation, particularly in
endoscopic imagery, where precise localization is critical and existing prompts
struggle to capture object boundaries effectively. To address this, we
introduce S4M (Segment Anything with 4 Extreme Points), which augments SAM by
leveraging extreme points -- the top-, bottom-, left-, and right-most points of
an instance -- as prompts. These points are intuitive to identify and provide a
faster, structured alternative to box prompts. However, a naïve use of
extreme points degrades performance, due to SAM's inability to interpret their
semantic roles. To resolve this, we introduce dedicated learnable embeddings,
enabling the model to distinguish extreme points from generic free-form points
and better reason about their spatial relationships. We further propose an
auxiliary training task through the Canvas module, which operates solely on
prompts -- without vision input -- to predict a coarse instance mask. This
encourages the model to internalize the relationship between extreme points and
mask distributions, leading to more robust segmentation. S4M outperforms other
SAM-based approaches on three endoscopic surgical datasets, demonstrating its
effectiveness in complex scenarios. Finally, we validate our approach through a
human annotation study on surgical endoscopic videos, confirming that extreme
points are faster to acquire than bounding boxes.
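Extreme-point prompts are straightforward to derive from an instance mask; the helper below is a hypothetical utility (not part of the S4M code) that extracts the top-, bottom-, left-, and right-most foreground pixels, which could then be fed to a SAM-style prompt encoder together with the learned role embeddings the abstract describes.

```python
import numpy as np

def extreme_points(mask):
    """Return the top-, bottom-, left-, and right-most foreground pixels of a
    binary mask as (row, col) coordinates. Ties are broken by first occurrence."""
    ys, xs = np.nonzero(mask)
    if len(ys) == 0:
        raise ValueError("empty mask")
    top = (int(ys.min()), int(xs[ys.argmin()]))
    bottom = (int(ys.max()), int(xs[ys.argmax()]))
    left = (int(ys[xs.argmin()]), int(xs.min()))
    right = (int(ys[xs.argmax()]), int(xs.max()))
    return top, bottom, left, right

# Quick check on a toy rectangular mask.
mask = np.zeros((64, 64), dtype=np.uint8)
mask[20:40, 10:50] = 1
print(extreme_points(mask))   # ((20, 10), (39, 10), (20, 10), (20, 49))
```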
☆ State-of-the-Art Stroke Lesion Segmentation at 1/1000th of Parameters
Efficient and accurate whole-brain lesion segmentation remains a challenge in
medical image analysis. In this work, we revisit MeshNet, a parameter-efficient
segmentation model, and introduce a novel multi-scale dilation pattern with an
encoder-decoder structure. This innovation enables capturing broad contextual
information and fine-grained details without traditional downsampling,
upsampling, or skip-connections. Unlike previous approaches processing
subvolumes or slices, we operate directly on whole-brain $256^3$ MRI volumes.
Evaluations on the Aphasia Recovery Cohort (ARC) dataset demonstrate that
MeshNet achieves superior or comparable Dice scores to state-of-the-art
architectures such as MedNeXt and U-MAMBA with 1/1000th of the parameters. Our
results validate MeshNet's strong balance of efficiency and performance, making
it particularly suitable for resource-limited environments such as web-based
applications and opening new possibilities for the widespread deployment of
advanced medical image analysis tools.
comment: International Symposium on Biomedical Imaging, April 14-17, 2025
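The abstract above describes a multi-scale dilation pattern with an encoder-decoder-like structure and no down- or upsampling. The sketch below is one plausible reading: a stack of full-resolution 3D dilated convolutions whose dilation rates first grow and then shrink. Channel widths, rates, and normalization choices are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

def dilated_block(in_ch, out_ch, dilation):
    # Full-resolution 3D conv; padding = dilation keeps the spatial size fixed.
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=dilation, dilation=dilation),
        nn.BatchNorm3d(out_ch),
        nn.ReLU(inplace=True),
    )

class MeshNetLike(nn.Module):
    """Toy MeshNet-style model: dilation rates grow then shrink ("encoder-decoder")
    without any down/upsampling or skip connections. Values are illustrative only."""
    def __init__(self, in_ch=1, width=16, n_classes=2,
                 dilations=(1, 2, 4, 8, 4, 2, 1)):
        super().__init__()
        layers, ch = [], in_ch
        for d in dilations:
            layers.append(dilated_block(ch, width, d))
            ch = width
        layers.append(nn.Conv3d(width, n_classes, kernel_size=1))
        self.net = nn.Sequential(*layers)

    def forward(self, x):                 # x: (B, 1, D, H, W), e.g. a 256^3 volume
        return self.net(x)

model = MeshNetLike()
logits = model(torch.zeros(1, 1, 32, 32, 32))   # small volume for a quick shape check
```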
☆ Post-Hoc Concept Disentanglement: From Correlated to Isolated Concept Representations
Concept Activation Vectors (CAVs) are widely used to model
human-understandable concepts as directions within the latent space of neural
networks. They are trained by identifying directions from the activations of
concept samples to those of non-concept samples. However, this method often
produces similar, non-orthogonal directions for correlated concepts, such as
"beard" and "necktie" within the CelebA dataset, which frequently co-occur in
images of men. This entanglement complicates the interpretation of concepts in
isolation and can lead to undesired effects in CAV applications, such as
activation steering. To address this issue, we introduce a post-hoc concept
disentanglement method that employs a non-orthogonality loss, facilitating the
identification of orthogonal concept directions while preserving directional
correctness. We evaluate our approach with real-world and controlled correlated
concepts in CelebA and a synthetic FunnyBirds dataset with VGG16 and ResNet18
architectures. We further demonstrate the superiority of orthogonalized concept
representations in activation steering tasks, allowing (1) the insertion of
isolated concepts into input images through generative models and (2) the
removal of concepts for effective shortcut suppression with reduced impact on
correlated concepts in comparison to baseline CAVs.
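A minimal sketch of a post-hoc non-orthogonality penalty in the spirit of the abstract above: starting from trained CAVs, it minimizes pairwise cosine similarity while keeping each vector close to its original direction ("directional correctness"). The loss weights and optimization schedule are assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def disentangle_cavs(cav_init, n_steps=500, lr=1e-2, lam=1.0):
    """Post-hoc orthogonalization sketch: start from trained CAVs (k x d) and
    minimize pairwise cosine similarity while staying close to the original
    directions. Loss weights are assumptions."""
    ref = F.normalize(cav_init.detach(), dim=1)
    cavs = cav_init.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([cavs], lr=lr)
    for _ in range(n_steps):
        v = F.normalize(cavs, dim=1)
        gram = v @ v.t()                                # pairwise cosine similarities
        off_diag = gram - torch.diag(torch.diag(gram))
        non_ortho = (off_diag ** 2).sum()               # push concept pairs apart
        fidelity = (1 - (v * ref).sum(dim=1)).sum()     # stay aligned with original CAVs
        loss = non_ortho + lam * fidelity
        opt.zero_grad()
        loss.backward()
        opt.step()
    return F.normalize(cavs.detach(), dim=1)

# Usage: orthogonalized = disentangle_cavs(torch.randn(5, 512))
```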
☆ Removing Geometric Bias in One-Class Anomaly Detection with Adaptive Feature Perturbation WACV 2025
One-class anomaly detection aims to detect objects that do not belong to a
predefined normal class. In practice, training data lack such anomalous
samples; hence, state-of-the-art methods are trained to discriminate between
normal and synthetically generated pseudo-anomalous data. Most methods use data
augmentation techniques on normal images to simulate anomalies. However, the
best-performing ones implicitly leverage a geometric bias present in the
benchmarking datasets, which limits their usability in more general conditions.
Others rely on basic noising schemes that may be suboptimal in capturing
the underlying structure of normal data. In addition, most still favour the
image domain to generate pseudo-anomalies, training models end-to-end from only
the normal class and overlooking richer representations of the information. To
overcome these limitations, we consider frozen yet rich feature spaces given by
pretrained models and create pseudo-anomalous features with a novel adaptive
linear feature perturbation technique. It adapts the noise distribution to each
sample, applies decaying linear perturbations to feature vectors, and further
guides the classification process using a contrastive learning objective.
Experimental evaluation conducted on both standard and geometric bias-free
datasets demonstrates the superiority of our approach with respect to
comparable baselines. The codebase is accessible via our public repository.
comment: Published in WACV 2025
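The abstract above describes sample-adaptive, decaying linear perturbations in a frozen feature space. As a hypothetical sketch of that idea, the function below adds Gaussian noise scaled by each sample's feature statistics, with a magnitude that decays linearly over training; the noise model and the decay schedule are assumptions, not the paper's formulation.

```python
import torch

def pseudo_anomalous_features(feats, step, total_steps,
                              start_scale=1.0, end_scale=0.1):
    """Create pseudo-anomalies from frozen features (B x D): add sample-adaptive
    Gaussian noise whose magnitude decays linearly over training.
    The noise model and schedule here are illustrative assumptions."""
    progress = min(step / max(total_steps, 1), 1.0)
    scale = start_scale + (end_scale - start_scale) * progress   # linear decay
    per_sample_std = feats.std(dim=1, keepdim=True)              # adapt to each sample
    noise = torch.randn_like(feats) * per_sample_std * scale
    return feats + noise

# Usage inside a training loop: the perturbed features receive the "anomalous"
# label, and a classifier plus a contrastive objective separates them from feats.
```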
☆ FastMap: Fast Queries Initialization Based Vectorized HD Map Reconstruction Framework
Reconstruction of high-definition maps is a crucial task in perceiving the
autonomous driving environment, as its accuracy directly impacts the
reliability of prediction and planning capabilities in downstream modules.
Current vectorized map reconstruction methods based on the DETR framework
encounter limitations due to the redundancy in the decoder structure,
necessitating the stacking of six decoder layers to maintain performance, which
significantly hampers computational efficiency. To tackle this issue, we
introduce FastMap, an innovative framework designed to reduce decoder
redundancy in existing approaches. FastMap optimizes the decoder architecture
by employing a single-layer, two-stage transformer that achieves multilevel
representation capabilities. Our framework eliminates the conventional practice
of randomly initializing queries and instead incorporates a heatmap-guided
query generation module during the decoding phase, which effectively maps image
features into structured query vectors using learnable positional encoding.
Additionally, we propose a geometry-constrained point-to-line loss mechanism
for FastMap, which adeptly addresses the challenge of distinguishing highly
homogeneous features that often arise in traditional point-to-point loss
computations. Extensive experiments demonstrate that FastMap achieves
state-of-the-art performance in both nuScenes and Argoverse2 datasets, with its
decoder operating 3.2x faster than the baseline. Code and more demos are
available at https://github.com/hht1996ok/FastMap.
☆ DecoupledGaussian: Object-Scene Decoupling for Physics-Based Interaction CVPR2025
We present DecoupledGaussian, a novel system that decouples static objects
from their contacted surfaces captured in in-the-wild videos, a key prerequisite
for realistic Newtonian-based physical simulations. Unlike prior methods
focused on synthetic data or elastic jittering along the contact surface, which
prevent objects from fully detaching or moving independently, DecoupledGaussian
allows for significant positional changes without being constrained by the
initial contacted surface. Recognizing the limitations of current 2D inpainting
tools for restoring 3D locations, our approach proposes joint Poisson fields to
repair and expand the Gaussians of both objects and contacted scenes after
separation. This is complemented by a multi-carve strategy to refine the
object's geometry. Our system enables realistic simulations of decoupling
motions, collisions, and fractures driven by user-specified impulses,
supporting complex interactions within and across multiple scenes. We validate
DecoupledGaussian through a comprehensive user study and quantitative
benchmarks. This system enhances digital interaction with objects and scenes in
real-world environments, benefiting industries such as VR, robotics, and
autonomous driving. Our project page is at:
https://wangmiaowei.github.io/DecoupledGaussian.github.io/.
comment: CVPR2025 Accepted
☆ Automatic Teaching Platform on Vision Language Retrieval Augmented Generation
Automating teaching presents unique challenges, as replicating human
interaction and adaptability is complex. Automated systems often cannot provide
nuanced, real-time feedback that aligns with students' individual learning
paces or comprehension levels, which can hinder effective support for diverse
needs. This is especially challenging in fields where abstract concepts require
adaptive explanations. In this paper, we propose a vision language retrieval
augmented generation (named VL-RAG) system that has the potential to bridge
this gap by delivering contextually relevant, visually enriched responses that
can enhance comprehension. By leveraging a database of tailored answers and
images, the VL-RAG system can dynamically retrieve information aligned with
specific questions, creating a more interactive and engaging experience that
fosters deeper understanding and active student participation. It allows
students to explore concepts visually and verbally, promoting deeper
understanding and reducing the need for constant human oversight while
maintaining flexibility to expand across different subjects and course
material.
☆ Towards Locally Explaining Prediction Behavior via Gradual Interventions and Measuring Property Gradients
Deep learning models achieve high predictive performance but lack intrinsic
interpretability, hindering our understanding of the learned prediction
behavior. Existing local explainability methods focus on associations,
neglecting the causal drivers of model predictions. Other approaches adopt a
causal perspective but primarily provide more general global explanations.
However, for specific inputs, it's unclear whether globally identified factors
apply locally. To address this limitation, we introduce a novel framework for
local interventional explanations by leveraging recent advances in
image-to-image editing models. Our approach performs gradual interventions on
semantic properties to quantify the corresponding impact on a model's
predictions using a novel score, the expected property gradient magnitude. We
demonstrate the effectiveness of our approach through an extensive empirical
evaluation on a wide range of architectures and tasks. First, we validate it in
a synthetic scenario and demonstrate its ability to locally identify biases.
Afterward, we apply our approach to analyze network training dynamics,
investigate medical skin lesion classifiers, and study a pre-trained CLIP model
with real-life interventional data. Our results highlight the potential of
interventional explanations on the property level to reveal new insights into
the behavior of deep models.
comment: 44 pages, 39 figures, 14 tables
☆ Semantic Shift Estimation via Dual-Projection and Classifier Reconstruction for Exemplar-Free Class-Incremental Learning
Exemplar-Free Class-Incremental Learning (EFCIL) aims to sequentially learn
from distinct categories without retaining exemplars but easily suffers from
catastrophic forgetting of learned knowledge. While existing EFCIL methods
leverage knowledge distillation to alleviate forgetting, they still face two
critical challenges: semantic shift and decision bias. Specifically, the
embeddings of old tasks shift in the embedding space after learning new tasks,
and the classifier becomes biased towards new tasks due to training solely with
new data, thereby hindering the balance between old and new knowledge. To
address these issues, we propose the Dual-Projection Shift Estimation and
Classifier Reconstruction (DPCR) approach for EFCIL. DPCR effectively estimates
semantic shift through a dual-projection, which combines a learnable
transformation with a row-space projection to capture both task-wise and
category-wise shifts. Furthermore, to mitigate decision bias, DPCR employs
ridge regression to reformulate classifier training as a reconstruction
process. This reconstruction exploits previous information encoded in
the covariance and prototype of each class after calibration with the estimated shift,
thereby reducing decision bias. Extensive experiments demonstrate that, across
various datasets, DPCR effectively balances old and new tasks, outperforming
state-of-the-art EFCIL methods.
comment: 14 pages, 7 figures
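The abstract above reformulates classifier training as ridge regression over per-class covariance and prototype statistics. The sketch below is a generic closed-form ridge classifier built from such statistics; the shift-calibration step is omitted, and the exact statistics used (e.g., class-count weighting) are assumptions, not DPCR's procedure.

```python
import torch

def ridge_classifier_from_stats(class_covs, prototypes, lam=1e-2):
    """Rebuild a linear classifier from per-class feature statistics.
    class_covs: list of (D x D) uncentered covariance (Gram) matrices per class.
    prototypes: (C x D) class-mean features. In DPCR these statistics would first
    be calibrated with the estimated semantic shift; that step is omitted here,
    and class counts are ignored for simplicity."""
    D = prototypes.shape[1]
    G = torch.stack(class_covs).sum(dim=0)        # pooled second-order statistics
    A = G + lam * torch.eye(D)                    # ridge-regularized Gram matrix
    # Solve A W = prototypes^T, i.e. one ridge regression per class target.
    W = torch.linalg.solve(A, prototypes.t())     # (D x C) weight matrix
    return W

# Usage: logits = features @ ridge_classifier_from_stats(covs, protos)
```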
☆ Self-Modeling Robots by Photographing
Self-modeling enables robots to build task-agnostic models of their
morphology and kinematics based on data that can be automatically collected,
with minimal human intervention and prior information, thereby enhancing
machine intelligence. Recent research has highlighted the potential of
data-driven technology in modeling the morphology and kinematics of robots.
However, existing self-modeling methods suffer from either low modeling quality
or excessive data acquisition costs. Beyond morphology and kinematics, texture
is also a crucial component of robots, which is challenging to model and
remains unexplored. In this work, a high-quality, texture-aware, and link-level
method is proposed for robot self-modeling. We utilize three-dimensional (3D)
Gaussians to represent the static morphology and texture of robots, and cluster
the 3D Gaussians to construct neural ellipsoid bones, whose deformations are
controlled by the transformation matrices generated by a kinematic neural
network. The 3D Gaussians and kinematic neural network are trained using data
pairs composed of joint angles, camera parameters and multi-view images without
depth information. By feeding the kinematic neural network with joint angles,
we can utilize the well-trained model to describe the corresponding morphology,
kinematics and texture of robots at the link level, and render robot images
from different perspectives with the aid of 3D Gaussian splatting. Furthermore,
we demonstrate that the established model can be exploited to perform
downstream tasks such as motion planning and inverse kinematics.
☆ R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcing Learning
In this work, we present the first application of Reinforcement Learning with
Verifiable Reward (RLVR) to an Omni-multimodal large language model in the
context of emotion recognition, a task where both visual and audio modalities
play crucial roles. We leverage RLVR to optimize the Omni model, significantly
enhancing its performance in three key aspects: reasoning capability, emotion
recognition accuracy, and generalization ability. The introduction of RLVR not
only improves the model's overall performance on in-distribution data but also
demonstrates superior robustness when evaluated on out-of-distribution
datasets. More importantly, the improved reasoning capability enables clear
analysis of the contributions of different modalities, particularly visual and
audio information, in the emotion recognition process. This provides valuable
insights into the optimization of multimodal large language models.
☆ Multi-Grained Feature Pruning for Video-Based Human Pose Estimation
Human pose estimation, with its broad applications in action recognition and
motion capture, has experienced significant advancements. However, current
Transformer-based methods for video pose estimation often face challenges in
managing redundant temporal information and achieving fine-grained perception
because they only focus on processing low-resolution features. To address these
challenges, we propose a novel multi-scale resolution framework that encodes
spatio-temporal representations at varying granularities and executes
fine-grained perception compensation. Furthermore, we employ a density peaks
clustering method to dynamically identify and prioritize tokens that offer
important semantic information. This strategy effectively prunes redundant
feature tokens, especially those arising from multi-frame features, thereby
optimizing computational efficiency without sacrificing semantic richness.
Empirically, it sets new benchmarks for both performance and efficiency on
three large-scale datasets. Our method achieves a 93.8% improvement in
inference speed compared to the baseline, while also enhancing pose estimation
accuracy, reaching 87.4 mAP on the PoseTrack2017 dataset.
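Density peaks clustering (Rodriguez and Laio, 2014) ranks points by a local density rho and the distance delta to the nearest higher-density point. A plausible, hypothetical use for token pruning, sketched below, is to keep the tokens with the largest rho x delta scores; the distance metric, kernel, and keep ratio are assumptions, not the paper's exact procedure.

```python
import torch

def density_peaks_scores(tokens, cutoff=1.0):
    """Compute density-peaks scores (rho * delta) for token features (N x D),
    following the clustering-by-density-peaks idea of Rodriguez and Laio."""
    dist = torch.cdist(tokens, tokens)                       # pairwise distances
    rho = torch.exp(-(dist / cutoff) ** 2).sum(dim=1) - 1    # Gaussian local density
    # delta_i: distance to the nearest token with strictly higher density.
    higher = rho.unsqueeze(0) > rho.unsqueeze(1)             # [i, j] True if rho_j > rho_i
    masked = dist.masked_fill(~higher, float("inf"))
    delta = masked.min(dim=1).values
    delta[torch.isinf(delta)] = dist.max()                   # densest token gets max distance
    return rho * delta

def prune_tokens(tokens, keep_ratio=0.5):
    """Keep the tokens with the highest density-peaks scores (illustrative only)."""
    scores = density_peaks_scores(tokens)
    k = max(1, int(keep_ratio * tokens.shape[0]))
    idx = scores.topk(k).indices
    return tokens[idx], idx
```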
☆ Pretext Task Adversarial Learning for Unpaired Low-field to Ultra High-field MRI Synthesis
Given the scarcity and cost of high-field MRI, the synthesis of high-field
MRI from low-field MRI holds significant potential when there is limited data
for training downstream tasks (e.g. segmentation). Low-field MRI often suffers
from a reduced signal-to-noise ratio (SNR) and spatial resolution compared to
high-field MRI. However, synthesizing high-field MRI data presents challenges.
These involve aligning image features across domains while preserving
anatomical accuracy and enhancing fine details. To address these challenges, we
propose a Pretext Task Adversarial (PTA) learning framework for high-field MRI
synthesis from low-field MRI data. The framework comprises three processes: (1)
The slice-wise gap perception (SGP) network aligns the slice inconsistencies of
low-field and high-field datasets based on contrastive learning. (2) The local
structure correction (LSC) network extracts local structures by restoring the
locally rotated and masked images. (3) The pretext task-guided adversarial
training process introduces additional supervision and incorporates a
discriminator to improve image realism. Extensive experiments on the low-field
to ultra high-field task demonstrate the effectiveness of our method, achieving
state-of-the-art performance (16.892 in FID, 1.933 in IS, and 0.324 in
MS-SSIM). This enables the generation of high-quality high-field-like MRI data
from low-field MRI data to augment training datasets for downstream tasks. The
code is available at:
https://github.com/Zhenxuan-Zhang/PTA4Unpaired_HF_MRI_SYN.
☆ New multimodal similarity measure for image registration via modeling local functional dependence with linear combination of learned basis functions
The deformable registration of images of different modalities, essential in
many medical imaging applications, remains challenging. The main challenge is
developing a robust measure for image overlap despite the compared images
capturing different aspects of the underlying tissue. Here, we explore
similarity metrics based on functional dependence between intensity values of
registered images. Although functional dependence is too restrictive on the
global scale, earlier work has shown competitive performance in deformable
registration when such measures are applied over small enough contexts. We
confirm this finding and further develop the idea by modeling local functional
dependence via the linear basis function model with the basis functions learned
jointly with the deformation. The measure can be implemented via convolutions,
making it efficient to compute on GPUs. We release the method as an easy-to-use
tool and show good performance on three datasets compared to well-established
baseline and earlier functional dependence-based methods.
☆ PhysicsGen: Can Generative Models Learn from Images to Predict Complex Physical Relations?
The image-to-image translation abilities of generative learning models have
recently made significant progress in the estimation of complex (steered)
mappings between image distributions. While appearance based tasks like image
in-painting or style transfer have been studied at length, we propose to
investigate the potential of generative models in the context of physical
simulations. Providing a dataset of 300k image-pairs and baseline evaluations
for three different physical simulation tasks, we propose a benchmark to
investigate the following research questions: i) are generative models able to
learn complex physical relations from input-output image pairs? ii) what
speedups can be achieved by replacing differential equation based simulations?
While baseline evaluations of different current models show the potential for
high speedups (ii), these results also show strong limitations toward the
physical correctness (i). This underlines the need for new methods to enforce
physical correctness. Data, baseline models, and evaluation code are available
at http://www.physics-gen.org.
☆ CoMoGaussian: Continuous Motion-Aware Gaussian Splatting from Motion-Blurred Images
Jungho Lee, Donghyeong Kim, Dogyoon Lee, Suhwan Cho, Minhyeok Lee, Wonjoon Lee, Taeoh Kim, Dongyoon Wee, Sangyoun Lee
3D Gaussian Splatting (3DGS) has gained significant attention for their
high-quality novel view rendering, motivating research to address real-world
challenges. A critical issue is the camera motion blur caused by movement
during exposure, which hinders accurate 3D scene reconstruction. In this study,
we propose CoMoGaussian, a Continuous Motion-Aware Gaussian Splatting that
reconstructs precise 3D scenes from motion-blurred images while maintaining
real-time rendering speed. Considering the complex motion patterns inherent in
real-world camera movements, we predict continuous camera trajectories using
neural ordinary differential equations (ODEs). To ensure accurate modeling, we
employ rigid body transformations, which preserve the shape and size of the
object but rely on the discrete integration of sampled frames. To better approximate
the continuous nature of motion blur, we introduce a continuous motion
refinement (CMR) transformation that refines rigid transformations by
incorporating additional learnable parameters. By revisiting fundamental camera
theory and leveraging advanced neural ODE techniques, we achieve precise
modeling of continuous camera trajectories, leading to improved reconstruction
accuracy. Extensive experiments demonstrate state-of-the-art performance both
quantitatively and qualitatively on benchmark datasets, which include a wide
range of motion blur scenarios, from moderate to extreme blur.
comment: Revised Version of CRiM-GS, Github:
https://github.com/Jho-Yonsei/CoMoGaussian
☆ Attenuation artifact detection and severity classification in intracoronary OCT using mixed image representations
Pierandrea Cancian, Simone Saitta, Xiaojin Gu, Rudolf L. M. van Herten, Thijs J. Luttikholt, Jos Thannhauser, Rick H. J. A. Volleberg, Ruben G. A. van der Waerden, Joske L. van der Zande, Clarisa I. Sánchez, Bram van Ginneken, Niels van Royen, Ivana Išgum
In intracoronary optical coherence tomography (OCT), blood residues and gas
bubbles cause attenuation artifacts that can obscure critical vessel
structures. The presence and severity of these artifacts may warrant
re-acquisition, prolonging procedure time and increasing use of contrast agent.
Accurate detection of these artifacts can guide targeted re-acquisition,
reducing the amount of repeated scans needed to achieve diagnostically viable
images. However, the highly heterogeneous appearance of these artifacts poses a
challenge for the automated detection of the affected image regions. To enable
automatic detection of the attenuation artifacts caused by blood residues and
gas bubbles based on their severity, we propose a convolutional neural network
that performs classification of the attenuation lines (A-lines) into three
classes: no artifact, mild artifact and severe artifact. Our model extracts and
merges features from OCT images in both Cartesian and polar coordinates, where
each column of the image represents an A-line. Our method detects the presence
of attenuation artifacts in OCT frames reaching F-scores of 0.77 and 0.94 for
mild and severe artifacts, respectively. The inference time over a full OCT
scan is approximately 6 seconds. Our experiments show that analysis of images
represented in both Cartesian and polar coordinate systems outperforms the
analysis in polar coordinates only, suggesting that these representations
contain complementary features. This work lays the foundation for automated
artifact assessment and image acquisition guidance in intracoronary OCT
imaging.
☆ Robust Multimodal Learning for Ophthalmic Disease Grading via Disentangled Representation
Xinkun Wang, Yifang Wang, Senwei Liang, Feilong Tang, Chengzhi Liu, Ming Hu, Chao Hu, Junjun He, Zongyuan Ge, Imran Razzak
This paper discusses how ophthalmologists often rely on multimodal data to
improve diagnostic accuracy. However, complete multimodal data is rare in
real-world applications due to a lack of medical equipment and concerns about
data privacy. Traditional deep learning methods typically address these issues
by learning representations in latent space. However, the paper highlights two
key limitations of these approaches: (i) Task-irrelevant redundant information
(e.g., numerous slices) in complex modalities leads to significant redundancy
in latent space representations. (ii) Overlapping multimodal representations
make it difficult to extract unique features for each modality. To overcome
these challenges, the authors propose the Essence-Point and Disentangle
Representation Learning (EDRL) strategy, which integrates a self-distillation
mechanism into an end-to-end framework to enhance feature selection and
disentanglement for more robust multimodal learning. Specifically, the
Essence-Point Representation Learning module selects discriminative features
that improve disease grading performance. The Disentangled Representation
Learning module separates multimodal data into modality-common and
modality-unique representations, reducing feature entanglement and enhancing
both robustness and interpretability in ophthalmic disease diagnosis.
Experiments on multimodal ophthalmology datasets show that the proposed EDRL
strategy significantly outperforms current state-of-the-art methods.
comment: 10pages
☆ Frequency Autoregressive Image Generation with Continuous Tokens
Autoregressive (AR) models for image generation typically adopt a two-stage
paradigm of vector quantization and raster-scan ``next-token prediction'',
inspired by its great success in language modeling. However, due to the huge
modality gap, image autoregressive models may require a systematic reevaluation
from two perspectives: tokenizer format and regression direction. In this
paper, we introduce the frequency progressive autoregressive (\textbf{FAR})
paradigm and instantiate FAR with the continuous tokenizer. Specifically, we
identify spectral dependency as the desirable regression direction for FAR,
wherein higher-frequency components build upon the lower ones to progressively
construct a complete image. This design seamlessly fits the causality
requirement for autoregressive models and preserves the unique spatial locality
of image data. Besides, we delve into the integration of FAR and the continuous
tokenizer, introducing a series of techniques to address optimization
challenges and improve the efficiency of training and inference processes. We
demonstrate the efficacy of FAR through comprehensive experiments on the
ImageNet dataset and verify its potential on text-to-image generation.
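To make the coarse-to-fine spectral ordering concrete, here is a minimal NumPy sketch
of a low-to-high frequency decomposition; the band masks and residual targets are
illustrative stand-ins, not FAR's actual tokenizer or training procedure.
    import numpy as np

    def cumulative_band_reconstructions(image, n_bands=4):
        """Reconstructions of an image from growing low-frequency bands (coarse to fine)."""
        F = np.fft.fftshift(np.fft.fft2(image))
        h, w = image.shape
        yy, xx = np.mgrid[-(h // 2):h - h // 2, -(w // 2):w - w // 2]
        radius = np.sqrt(yy ** 2 + xx ** 2)
        stages = []
        for k in range(1, n_bands + 1):
            mask = radius <= (k / n_bands) * radius.max()
            stages.append(np.fft.ifft2(np.fft.ifftshift(F * mask)).real)
        return stages  # stages[0] is coarsest; stages[-1] ~= original image

    img = np.random.rand(64, 64)
    stages = cumulative_band_reconstructions(img)
    residuals = [stages[0]] + [b - a for a, b in zip(stages, stages[1:])]
    # a frequency-progressive AR model would predict residuals[k] conditioned on residuals[:k]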
☆ Escaping Plato's Cave: Towards the Alignment of 3D and Text Latent Spaces CVPR 2025
Souhail Hadgi, Luca Moschella, Andrea Santilli, Diego Gomez, Qixing Huang, Emanuele Rodolà, Simone Melzi, Maks Ovsjanikov
Recent works have shown that, when trained at scale, uni-modal 2D vision and
text encoders converge to learned features that share remarkable structural
properties, despite arising from different representations. However, the role
of 3D encoders with respect to other modalities remains unexplored.
Furthermore, existing 3D foundation models that leverage large datasets are
typically trained with explicit alignment objectives with respect to frozen
encoders from other representations. In this work, we investigate the
possibility of a posteriori alignment of representations obtained from
uni-modal 3D encoders compared to text-based feature spaces. We show that naive
post-training feature alignment of uni-modal text and 3D encoders results in
limited performance. We then focus on extracting subspaces of the corresponding
feature spaces and discover that by projecting learned representations onto
well-chosen lower-dimensional subspaces the quality of alignment becomes
significantly higher, leading to improved accuracy on matching and retrieval
tasks. Our analysis further sheds light on the nature of these shared
subspaces, which roughly separate between semantic and geometric data
representations. Overall, ours is the first work that helps to establish a
baseline for post-training alignment of 3D uni-modal and text feature spaces,
and helps to highlight both the shared and unique properties of 3D data
compared to other representations.
comment: Accepted at CVPR 2025
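A minimal sketch of the general recipe on random stand-in features (our assumptions:
PCA subspaces per modality and a least-squares linear map; the paper's subspace
selection may differ), showing only the mechanics of post-hoc subspace alignment and
matching.
    import numpy as np

    def pca_basis(X, dim):
        """Mean and top-`dim` principal directions of row-wise features."""
        mean = X.mean(axis=0, keepdims=True)
        _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
        return mean, Vt[:dim]

    rng = np.random.default_rng(0)
    # stand-ins for frozen uni-modal features of paired (3D shape, caption) data
    Z3d, Ztxt = rng.normal(size=(1000, 768)), rng.normal(size=(1000, 512))
    train, test = slice(0, 800), slice(800, 1000)

    k = 64
    mu3, V3 = pca_basis(Z3d[train], k)        # subspace of the 3D feature space
    mut, Vtx = pca_basis(Ztxt[train], k)      # subspace of the text feature space
    A3, At = (Z3d - mu3) @ V3.T, (Ztxt - mut) @ Vtx.T

    # least-squares linear map aligning the two subspaces on the training pairs
    W, *_ = np.linalg.lstsq(A3[train], At[train], rcond=None)

    P = A3[test] @ W
    P /= np.linalg.norm(P, axis=1, keepdims=True)
    T = At[test] / np.linalg.norm(At[test], axis=1, keepdims=True)
    top1 = (np.argmax(P @ T.T, axis=1) == np.arange(P.shape[0])).mean()
    print(f"top-1 matching accuracy (mechanics only, random features): {top1:.3f}")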
☆ CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation
Guanghao Zhang, Tao Zhong, Yan Xia, Zhelun Yu, Haoyuan Li, Wanggui He, Fangxun Shu, Mushui Liu, Dong She, Yi Wang, Hao Jiang
While previous multimodal slow-thinking methods have demonstrated remarkable
success in single-image understanding scenarios, their effectiveness becomes
fundamentally constrained when extended to more complex multi-image
comprehension tasks. This limitation stems from their predominant reliance on
text-based intermediate reasoning processes. Humans, by contrast, when engaging in
sophisticated multi-image analysis, typically perform two complementary
cognitive operations: (1) continuous cross-image visual comparison through
region-of-interest matching, and (2) dynamic memorization of critical visual
concepts throughout the reasoning chain. Motivated by these observations, we
propose the Complex Multi-Modal Chain-of-Thought (CMMCoT) framework, a
multi-step reasoning framework that mimics human-like "slow thinking" for
multi-image understanding. Our approach incorporates two key innovations: 1.
The construction of interleaved multimodal multi-step reasoning chains, which
utilize critical visual region tokens, extracted from intermediate reasoning
steps, as supervisory signals. This mechanism not only facilitates
comprehensive cross-modal understanding but also enhances model
interpretability. 2. The introduction of a test-time memory augmentation module
that expands the model reasoning capacity during inference while preserving
parameter efficiency. Furthermore, to facilitate research in this direction, we
have curated a novel multi-image slow-thinking dataset. Extensive experiments
demonstrate the effectiveness of our model.
☆ ColFigPhotoAttnNet: Reliable Finger Photo Presentation Attack Detection Leveraging Window-Attention on Color Spaces WACV
Finger photo Presentation Attack Detection (PAD) can significantly strengthen
smartphone device security. However, these algorithms are trained to detect
certain types of attacks. Furthermore, they are designed to operate on images
acquired by specific capture devices, leading to poor generalization and a lack
of robustness in handling the evolving nature of mobile hardware. The proposed
investigation is the first to systematically analyze the performance
degradation of existing deep learning PAD systems, both convolutional and
transformer-based, in cross-capture-device settings. In this paper, we introduce the
ColFigPhotoAttnNet architecture designed based on window attention on color
channels, followed by the nested residual network as the predictor to achieve a
reliable PAD. Extensive experiments using various capture devices, including
iPhone13 Pro, GooglePixel 3, Nokia C5, and OnePlusOne, were carried out to
evaluate the performance of proposed and existing methods on three publicly
available databases. The findings underscore the effectiveness of our approach.
comment: Accepted in Winter Conference on Applications of Computer Vision
(WACV) 2025
☆ L-FUSION: Laplacian Fetal Ultrasound Segmentation & Uncertainty Estimation
Johanna P. Müller, Robert Wright, Thomas G. Day, Lorenzo Venturini, Samuel F. Budd, Hadrien Reynaud, Joseph V. Hajnal, Reza Razavi, Bernhard Kainz
Accurate analysis of prenatal ultrasound (US) is essential for early
detection of developmental anomalies. However, operator dependency and
technical limitations (e.g. intrinsic artefacts and effects, setting errors)
can complicate image interpretation and the assessment of diagnostic
uncertainty. We present L-FUSION (Laplacian Fetal US Segmentation with
Integrated FoundatiON models), a framework that integrates uncertainty
quantification through unsupervised, normative learning and large-scale
foundation models for robust segmentation of fetal structures in normal and
pathological scans. We propose to utilise the aleatoric logit distributions of
Stochastic Segmentation Networks and Laplace approximations with fast Hessian
estimations to estimate epistemic uncertainty only from the segmentation head.
This enables us to achieve reliable abnormality quantification for instant
diagnostic feedback. Combined with an integrated Dropout component, L-FUSION
enables reliable differentiation of lesions from normal fetal anatomy with
enhanced uncertainty maps and segmentation counterfactuals in US imaging. It
improves epistemic and aleatoric uncertainty interpretation and removes the
need for manual disease-labelling. Evaluations across multiple datasets show
that L-FUSION achieves superior segmentation accuracy and consistent
uncertainty quantification, supporting on-site decision-making and offering a
scalable solution for advancing fetal ultrasound analysis in clinical settings.
comment: Under Review
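For intuition on head-only epistemic uncertainty, the sketch below fits a diagonal
empirical-Fisher Laplace approximation over a toy 1x1-conv segmentation head and
samples head weights to obtain a per-pixel variance map; it is a simplified stand-in,
not the authors' Stochastic Segmentation Networks or fast Hessian estimators.
    import torch, torch.nn as nn, torch.nn.functional as F

    torch.manual_seed(0)
    backbone = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU())
    head = nn.Conv2d(8, 2, kernel_size=1)                 # toy 2-class segmentation head
    images = torch.randn(4, 1, 32, 32)
    labels = torch.randint(0, 2, (4, 32, 32))

    # 1) Diagonal empirical-Fisher approximation of the Hessian, head parameters only.
    fisher = [torch.zeros_like(p) for p in head.parameters()]
    for x, y in zip(images, labels):
        head.zero_grad()
        F.cross_entropy(head(backbone(x[None])), y[None]).backward()
        for f, p in zip(fisher, head.parameters()):
            f += p.grad.detach() ** 2
    prior_precision = 1.0
    posterior_std = [1.0 / torch.sqrt(f + prior_precision) for f in fisher]

    # 2) Epistemic uncertainty: variance of softmax outputs under sampled head weights.
    mean_params = [p.detach().clone() for p in head.parameters()]
    probs = []
    with torch.no_grad():
        feats = backbone(images)
        for _ in range(20):
            for p, m, s in zip(head.parameters(), mean_params, posterior_std):
                p.copy_(m + s * torch.randn_like(m))      # draw one head from the Laplace posterior
            probs.append(F.softmax(head(feats), dim=1))
        for p, m in zip(head.parameters(), mean_params):
            p.copy_(m)                                    # restore the MAP weights
    epistemic_map = torch.stack(probs).var(dim=0).mean(dim=1)  # (B, H, W) uncertainty map
    print(epistemic_map.shape)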
☆ Unified Reward Model for Multimodal Understanding and Generation
Recent advances in human preference alignment have significantly enhanced
multimodal generation and understanding. A key approach is training reward
models to guide preference optimization. However, existing models are often
task-specific, limiting their adaptability across diverse visual applications.
We also argue that jointly learning to assess multiple tasks may foster a
synergistic effect, where improved image understanding enhances image
generation assessment, and refined image evaluation benefits video assessment
through better frame analysis. To this end, this paper proposes UnifiedReward,
the first unified reward model for multimodal understanding and generation
assessment, enabling both pairwise ranking and pointwise scoring, which can be
employed for vision model preference alignment. Specifically, (1) we first
develop UnifiedReward on our constructed large-scale human preference dataset,
including both image and video generation/understanding tasks. (2) Then, it is
utilized to automatically construct high-quality preference pair data based on
the vision models, progressively filtering their outputs through pair ranking
and point sifting. (3) Finally, these data are used for their preference
alignment through Direct Preference Optimization (DPO). Experimental results
demonstrate that joint learning to assess diverse visual tasks can lead to
substantial mutual benefits and we apply our pipeline to both image and video
understanding/generation tasks, significantly improving the performance in each
domain.
comment: project page: https://codegoat24.github.io/UnifiedReward/
☆ RecipeGen: A Benchmark for Real-World Recipe Image Generation
Ruoxuan Zhang, Hongxia Xie, Yi Yao, Jian-Yu Jiang-Lin, Bin Wen, Ling Lo, Hong-Han Shuai, Yung-Hui Li, Wen-Huang Cheng
Recipe image generation is an important challenge in food computing, with
applications from culinary education to interactive recipe platforms. However,
there is currently no real-world dataset that comprehensively connects recipe
goals, sequential steps, and corresponding images. To address this, we
introduce RecipeGen, the first real-world goal-step-image benchmark for recipe
generation, featuring diverse ingredients, varied recipe steps, multiple
cooking styles, and a broad collection of food categories. The data is available at
https://github.com/zhangdaxia22/RecipeGen.
☆ DiVISe: Direct Visual-Input Speech Synthesis Preserving Speaker Characteristics And Intelligibility NAACL 25
Video-to-speech (V2S) synthesis, the task of generating speech directly from
silent video input, is inherently more challenging than other speech synthesis
tasks due to the need to accurately reconstruct both speech content and speaker
characteristics from visual cues alone. Recently, audio-visual pre-training has
eliminated the need for additional acoustic hints in V2S, which previous
methods often relied on to ensure training convergence. However, even with
pre-training, existing methods continue to face challenges in achieving a
balance between acoustic intelligibility and the preservation of
speaker-specific characteristics. We analyzed this limitation and were
motivated to introduce DiVISe (Direct Visual-Input Speech Synthesis), an
end-to-end V2S model that predicts Mel-spectrograms directly from video frames
alone. Despite not taking any acoustic hints, DiVISe effectively preserves
speaker characteristics in the generated audio, and achieves superior
performance on both objective and subjective metrics across the LRS2 and LRS3
datasets. Our results demonstrate that DiVISe not only outperforms existing V2S
models in acoustic intelligibility but also scales more effectively with
increased data and model parameters. Code and weights can be found at
https://github.com/PussyCat0700/DiVISe.
comment: to be published in NAACL 25
☆ Separability Membrane: 3D Active Contour for Point Cloud Surface Reconstruction
This paper proposes Separability Membrane, a robust 3D active contour for
extracting a surface from a 3D point cloud object. Our approach defines the
surface of a 3D object as the boundary that maximizes the separability of point
features, such as intensity, color, or local density, between its inner and
outer regions based on Fisher's ratio. Separability Membrane identifies the
exact surface of a 3D object by maximizing class separability while controlling
the rigidity of the 3D surface model with an adaptive B-spline surface that
adjusts its properties based on the local and global separability. A key
advantage of our method is its ability to accurately reconstruct surface
boundaries even when they are ambiguous due to noise or outliers, without
requiring any training data or conversion to volumetric representation.
Evaluations on a synthetic 3D point cloud dataset and the 3DNet dataset
demonstrate the membrane's effectiveness and robustness under diverse
conditions.
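The core separability criterion can be illustrated in a few lines; in this sketch the
candidate surface is a simple sphere swept over radii, rather than the paper's
adaptive B-spline membrane.
    import numpy as np

    def fisher_separability(inner_vals, outer_vals):
        """Fisher's ratio between feature values of points inside vs. outside a candidate surface."""
        mu_i, mu_o = inner_vals.mean(), outer_vals.mean()
        var_i, var_o = inner_vals.var(), outer_vals.var()
        return (mu_i - mu_o) ** 2 / (var_i + var_o + 1e-12)

    rng = np.random.default_rng(0)
    points = rng.uniform(-1, 1, size=(5000, 3))
    # synthetic point feature: bright inside a sphere of radius 0.6, plus noise
    intensity = (np.linalg.norm(points, axis=1) < 0.6).astype(float) + 0.1 * rng.normal(size=5000)

    # Sweep candidate sphere radii; the separability peaks near the true boundary (r = 0.6).
    for r in [0.3, 0.45, 0.6, 0.75, 0.9]:
        inside = np.linalg.norm(points, axis=1) < r
        print(r, fisher_separability(intensity[inside], intensity[~inside]))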
☆ Gaussian Random Fields as an Abstract Representation of Patient Metadata for Multimodal Medical Image Segmentation
Bill Cassidy, Christian McBride, Connah Kendrick, Neil D. Reeves, Joseph M. Pappachan, Shaghayegh Raad, Moi Hoon Yap
The growing rate of chronic wound occurrence, especially in patients with
diabetes, has become a concerning trend in recent years. Chronic wounds are
difficult and costly to treat, and have become a serious burden on health care
systems worldwide. Chronic wounds can have devastating consequences for the
patient, with infection often leading to reduced quality of life and increased
mortality risk. Innovative deep learning methods for the detection and
monitoring of such wounds have the potential to reduce the impact on both
patient and clinician. We present a novel multimodal segmentation method which
allows for the introduction of patient metadata into the training workflow
whereby the patient data are expressed as Gaussian random fields. Our results
indicate that the proposed method improved performance when utilising multiple
models, each trained on different metadata categories. Using the Diabetic Foot
Ulcer Challenge 2022 test set, when compared to the baseline results
(intersection over union = 0.4670, Dice similarity coefficient = 0.5908) we
demonstrate improvements of +0.0220 and +0.0229 for intersection over union and
Dice similarity coefficient respectively. This paper presents the first study
to focus on integrating patient data into a chronic wound segmentation
workflow. Our results show significant performance gains when training
individual models using specific metadata categories, followed by average
merging of prediction masks using distance transforms. All source code for this
study is available at:
https://github.com/mmu-dermatology-research/multimodal-grf
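A hedged sketch of the general idea of turning categorical patient metadata into an
extra image channel via a Gaussian random field; the seeding and smoothing choices
below, and the metadata keys, are illustrative assumptions rather than the paper's
exact encoding.
    import zlib
    import numpy as np
    from scipy.ndimage import gaussian_filter

    def metadata_to_grf(metadata, shape=(256, 256)):
        """Encode categorical patient metadata as a smooth Gaussian random field."""
        seed = zlib.crc32(repr(sorted(metadata.items())).encode())  # deterministic per metadata combination
        rng = np.random.default_rng(seed)
        field = gaussian_filter(rng.normal(size=shape), sigma=8.0)
        return (field - field.mean()) / (field.std() + 1e-8)

    grf = metadata_to_grf({"wound_site": "left_foot", "diabetes_type": "type_2"})  # hypothetical categories
    image = np.random.rand(256, 256, 3)                                  # stand-in wound photograph
    multimodal_input = np.concatenate([image, grf[..., None]], axis=-1)  # (256, 256, 4) network input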
☆ Data-Efficient Generalization for Zero-shot Composed Image Retrieval
Zero-shot Composed Image Retrieval (ZS-CIR) aims to retrieve the target image
based on a reference image and a text description without requiring
in-distribution triplets for training. One prevalent approach follows the
vision-language pretraining paradigm that employs a mapping network to transfer
the image embedding to a pseudo-word token in the text embedding space.
However, this approach tends to impede network generalization due to modality
discrepancy and distribution shift between training and inference. To this end,
we propose a Data-efficient Generalization (DeG) framework, including two novel
designs, namely, Textual Supplement (TS) module and Semantic-Set (S-Set). The
TS module exploits compositional textual semantics during training, enhancing
the pseudo-word token with more linguistic semantics and thus mitigating the
modality discrepancy effectively. The S-Set exploits the zero-shot capability
of pretrained Vision-Language Models (VLMs), alleviating the distribution shift
and mitigating the overfitting issue from the redundancy of the large-scale
image-text data. Extensive experiments over four ZS-CIR benchmarks show that
DeG outperforms the state-of-the-art (SOTA) methods with much less training
data, and saves substantial training and inference time for practical usage.
☆ STGA: Selective-Training Gaussian Head Avatars
We propose selective-training Gaussian head avatars (STGA) to enhance the
details of dynamic head Gaussian. The dynamic head Gaussian model is trained
based on the FLAME parameterized model. Each Gaussian splat is embedded within
the FLAME mesh to achieve mesh-based animation of the Gaussian model. Before
training, our selection strategy determines the 3D Gaussian splats to be
optimized in each frame. The parameters of these splats are optimized during
the training of each frame, while those of the other splats are
frozen. This means that the splats participating in the optimization process
differ in each frame, to improve the realism of fine details. Compared with
network-based methods, our method achieves better results with shorter training
time. Compared with mesh-based methods, our method produces more realistic
details within the same training time. Additionally, the ablation experiment
confirms that our method effectively enhances the quality of details.
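A minimal PyTorch sketch of per-frame selective optimisation; the random 20%
selection and the toy loss are placeholders for the paper's selection strategy and
rendering loss.
    import torch

    # Toy per-splat Gaussian parameters (positions only, for brevity).
    n_splats = 10000
    positions = torch.nn.Parameter(torch.randn(n_splats, 3))
    optimizer = torch.optim.Adam([positions], lr=1e-3)

    def frame_selection_mask(frame_idx, n_splats):
        """Placeholder for the paper's selection strategy: which splats to optimise this frame."""
        g = torch.Generator().manual_seed(frame_idx)
        return torch.rand(n_splats, generator=g) < 0.2   # here: a random 20% subset

    for frame_idx in range(3):
        mask = frame_selection_mask(frame_idx, n_splats)
        loss = (positions ** 2).sum()                    # stand-in for the per-frame rendering loss
        optimizer.zero_grad()
        loss.backward()
        positions.grad[~mask] = 0.0                      # freeze non-selected splats for this frame
        optimizer.step()                                 # (a full impl. would also mask Adam's state updates)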
☆ Partially Supervised Unpaired Multi-Modal Learning for Label-Efficient Medical Image Segmentation
Unpaired Multi-Modal Learning (UMML) which leverages unpaired multi-modal
data to boost model performance on each individual modality has attracted a lot
of research interests in medical image analysis. However, existing UMML methods
require multi-modal datasets to be fully labeled, which incurs tremendous
annotation cost. In this paper, we investigate the use of partially labeled
data for label-efficient unpaired multi-modal learning, which can reduce the
annotation cost by up to one half. We term the new learning paradigm as
Partially Supervised Unpaired Multi-Modal Learning (PSUMML) and propose a novel
Decomposed partial class adaptation with snapshot Ensembled Self-Training
(DEST) framework for it. Specifically, our framework consists of a compact
segmentation network with modality specific normalization layers for learning
with partially labeled unpaired multi-modal data. The key challenge in PSUMML
lies in the complex partial class distribution discrepancy due to partial class
annotation, which hinders effective knowledge transfer across modalities. We
theoretically analyze this phenomenon with a decomposition theorem and propose
a decomposed partial class adaptation technique to precisely align the
partially labeled classes across modalities to reduce the distribution
discrepancy. We further propose a snapshot ensembled self-training technique to
leverage the valuable snapshot models during training to assign pseudo-labels
to partially labeled pixels for self-training to boost model performance. We
perform extensive experiments under different scenarios of PSUMML for two
medical image segmentation tasks, namely cardiac substructure segmentation and
abdominal multi-organ segmentation. Our framework outperforms existing methods
significantly.
comment: Accepted to MLMI 2024
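The snapshot-ensembled self-training step can be sketched as follows; the confidence
threshold, ignore index, and snapshot list are illustrative assumptions.
    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def snapshot_pseudo_labels(snapshots, images, threshold=0.9, ignore_index=255):
        """Average softmax predictions of snapshot models and keep only confident pixels."""
        probs = torch.stack([F.softmax(m(images), dim=1) for m in snapshots]).mean(dim=0)  # (B, C, H, W)
        confidence, labels = probs.max(dim=1)                                              # (B, H, W)
        labels[confidence < threshold] = ignore_index   # low-confidence pixels are ignored in the loss
        return labels

    # usage sketch: pseudo = snapshot_pseudo_labels([model_ep10, model_ep20, model_ep30], batch)
    #               loss = F.cross_entropy(model(batch), pseudo, ignore_index=255)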
☆ Narrating the Video: Boosting Text-Video Retrieval via Comprehensive Utilization of Frame-Level Captions CVPR 2025
In recent text-video retrieval, the use of additional captions from
vision-language models has shown promising effects on the performance. However,
existing models using additional captions often have struggled to capture the
rich semantics, including temporal changes, inherent in the video. In addition,
incorrect information caused by generative models can lead to inaccurate
retrieval. To address these issues, we propose a new framework, Narrating the
Video (NarVid), which strategically leverages the comprehensive information
available from frame-level captions, the narration. The proposed NarVid
exploits narration in multiple ways: 1) feature enhancement through cross-modal
interactions between narration and video, 2) query-aware adaptive filtering to
suppress irrelevant or incorrect information, 3) dual-modal matching score by
adding query-video similarity and query-narration similarity, and 4)
hard-negative loss to learn discriminative features from multiple perspectives
using the two similarities from different views. Experimental results
demonstrate that NarVid achieves state-of-the-art performance on various
benchmark datasets.
comment: Accepted at CVPR 2025
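A small sketch of the dual-modal matching score described above; the weighting alpha
and the use of mean-pooled narration features are assumptions.
    import torch
    import torch.nn.functional as F

    def dual_modal_score(q, v, n, alpha=0.5):
        """Combine query-video and query-narration similarities into one matching score.
        q: (B, D) query text features; v: (B, D) video features; n: (B, D) pooled narration features."""
        q, v, n = (F.normalize(x, dim=-1) for x in (q, v, n))
        sim_qv = q @ v.T          # (B, B) query-video similarity matrix
        sim_qn = q @ n.T          # (B, B) query-narration similarity matrix
        return alpha * sim_qv + (1 - alpha) * sim_qn

    B, D = 8, 512
    scores = dual_modal_score(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
    retrieved = scores.argmax(dim=1)   # index of the best-matching video for each query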
☆ Spectral-Spatial Extraction through Layered Tensor Decomposition for Hyperspectral Anomaly Detection
Low rank tensor representation (LRTR) methods are very useful for
hyperspectral anomaly detection (HAD). To overcome the limitations that they
often overlook spectral anomaly and rely on large-scale matrix singular value
decomposition, we first apply non-negative matrix factorization (NMF) to
alleviate spectral dimensionality redundancy and extract spectral anomaly and
then employ LRTR to extract spatial anomaly while mitigating spatial
redundancy, yielding a highly efficient layered tensor decomposition (LTD)
framework for HAD. An iterative algorithm based on proximal alternating
minimization is developed to solve the proposed LTD model, with convergence
guarantees provided. Moreover, we introduce a rank reduction strategy with
validation mechanism that adaptively reduces data size while preventing
excessive reduction. Theoretically, we rigorously establish the equivalence
between the tensor tubal rank and tensor group sparsity regularization (TGSR)
and, under mild conditions, demonstrate that the relaxed formulation of TGSR
shares the same global minimizers and optimal values as its original
counterpart. Experimental results on the Airport-Beach-Urban and MVTec datasets
demonstrate that our approach outperforms state-of-the-art methods in the HAD
task.
☆ MGSR: 2D/3D Mutual-boosted Gaussian Splatting for High-fidelity Surface Reconstruction under Various Light Conditions
Novel view synthesis (NVS) and surface reconstruction (SR) are essential
tasks in 3D Gaussian Splatting (3D-GS). Despite recent progress, these tasks
are often addressed independently, with GS-based rendering methods struggling
under diverse light conditions and failing to produce accurate surfaces, while
GS-based reconstruction methods frequently compromise rendering quality. This
raises a central question: must rendering and reconstruction always involve a
trade-off? To address this, we propose MGSR, a 2D/3D Mutual-boosted Gaussian
splatting for Surface Reconstruction that enhances both rendering quality and
3D reconstruction accuracy. MGSR introduces two branches--one based on 2D-GS
and the other on 3D-GS. The 2D-GS branch excels in surface reconstruction,
providing precise geometry information to the 3D-GS branch. Leveraging this
geometry, the 3D-GS branch employs a geometry-guided illumination decomposition
module that captures reflected and transmitted components, enabling realistic
rendering under varied light conditions. Using the transmitted component as
supervision, the 2D-GS branch also achieves high-fidelity surface
reconstruction. Throughout the optimization process, the 2D-GS and 3D-GS
branches undergo alternating optimization, providing mutual supervision. Prior
to this, each branch completes an independent warm-up phase, with an early
stopping strategy implemented to reduce computational costs. We evaluate MGSR
on a diverse set of synthetic and real-world datasets, at both object and scene
levels, demonstrating strong performance in rendering and surface
reconstruction.
comment: 11 pages, 7 figures
☆ SplatPose: Geometry-Aware 6-DoF Pose Estimation from Single RGB Image via 3D Gaussian Splatting IROS 2025
6-DoF pose estimation is a fundamental task in computer vision with
wide-ranging applications in augmented reality and robotics. Existing single
RGB-based methods often compromise accuracy due to their reliance on initial
pose estimates and susceptibility to rotational ambiguity, while approaches
requiring depth sensors or multi-view setups incur significant deployment
costs. To address these limitations, we introduce SplatPose, a novel framework
that synergizes 3D Gaussian Splatting (3DGS) with a dual-branch neural
architecture to achieve high-precision pose estimation using only a single RGB
image. Central to our approach is the Dual-Attention Ray Scoring Network
(DARS-Net), which innovatively decouples positional and angular alignment
through geometry-domain attention mechanisms, explicitly modeling directional
dependencies to mitigate rotational ambiguity. Additionally, a coarse-to-fine
optimization pipeline progressively refines pose estimates by aligning dense 2D
features between query images and 3DGS-synthesized views, effectively
correcting feature misalignment and depth errors from sparse ray sampling.
Experiments on three benchmark datasets demonstrate that SplatPose achieves
state-of-the-art 6-DoF pose estimation accuracy in single RGB settings,
rivaling approaches that depend on depth or multi-view images.
comment: Submitted to IROS 2025
☆ Spatial Context-Driven Positive Pair Sampling for Enhanced Histopathology Image Classification
Deep learning has demonstrated great promise in cancer classification from
whole-slide images (WSIs) but remains constrained by the need for extensive
annotations. Annotation-free methods, such as multiple instance learning (MIL)
and self-supervised learning (SSL), have emerged to address this challenge;
however, current SSL techniques often depend on synthetic augmentations or
temporal context, which may not adequately capture the intricate spatial
relationships inherent to histopathology. In this work, we introduce a novel
spatial context-driven positive pair sampling strategy for SSL that leverages
the natural coherence of adjacent patches in WSIs. By constructing biologically
relevant positive pairs from spatially proximate patches, our approach
harnesses inherent spatial coherence to enhance patch-level representations,
ultimately boosting slide-level classification performance. Experiments on
multiple datasets reveal that our strategy improves classification accuracy by
5\% to 10\% over the standard method, paving the way for more clinically
relevant AI models in cancer diagnosis. The code is available at
https://anonymous.4open.science/r/contextual-pairs-E72F/.
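A minimal sketch of spatially grounded positive-pair sampling on a WSI patch grid;
the grid spacing and neighbourhood radius are illustrative choices.
    import random

    def sample_positive_pair(patch_coords, patch_size, max_offset=1):
        """Pick a patch and a spatially adjacent neighbour as an SSL positive pair.
        patch_coords: set of (x, y) top-left coordinates of tissue patches on the WSI grid."""
        coords = list(patch_coords)
        while True:
            x, y = random.choice(coords)
            neighbours = [
                (x + dx * patch_size, y + dy * patch_size)
                for dx in range(-max_offset, max_offset + 1)
                for dy in range(-max_offset, max_offset + 1)
                if (dx, dy) != (0, 0) and (x + dx * patch_size, y + dy * patch_size) in patch_coords
            ]
            if neighbours:  # resample if the chosen patch has no tissue neighbours
                return (x, y), random.choice(neighbours)

    grid = {(i * 224, j * 224) for i in range(50) for j in range(50)}
    anchor, positive = sample_positive_pair(grid, patch_size=224)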
☆ EvolvingGS: High-Fidelity Streamable Volumetric Video via Evolving 3D Gaussian Representation
We have recently seen great progress in 3D scene reconstruction through
explicit point-based 3D Gaussian Splatting (3DGS), notable for its high quality
and fast rendering speed. However, reconstructing dynamic scenes such as
complex human performances with long durations remains challenging. Prior
efforts fall short of modeling a long-term sequence with drastic motions,
frequent topology changes or interactions with props, and resort to segmenting
the whole sequence into groups of frames that are processed independently,
which undermines temporal stability and thereby leads to an unpleasant viewing
experience and an inefficient storage footprint. In view of this, we introduce
EvolvingGS, a two-stage strategy that first deforms the Gaussian model to
coarsely align with the target frame, and then refines it with minimal point
addition/subtraction, particularly in fast-changing areas. Owing to the
flexibility of the incrementally evolving representation, our method
outperforms existing approaches in terms of both per-frame and temporal quality
metrics while maintaining fast rendering through its purely explicit
representation. Moreover, by exploiting temporal coherence between successive
frames, we propose a simple yet effective compression algorithm that achieves
over 50x compression rate. Extensive experiments on both public benchmarks and
challenging custom datasets demonstrate that our method significantly advances
the state-of-the-art in dynamic scene reconstruction, particularly for extended
sequences with complex human performances.
☆ GaussianCAD: Robust Self-Supervised CAD Reconstruction from Three Orthographic Views Using 3D Gaussian Splatting
Zheng Zhou, Zhe Li, Bo Yu, Lina Hu, Liang Dong, Zijian Yang, Xiaoli Liu, Ning Xu, Ziwei Wang, Yonghao Dang, Jianqin Yin
The automatic reconstruction of 3D computer-aided design (CAD) models from
CAD sketches has recently gained significant attention in the computer vision
community. Most existing methods, however, rely on vector CAD sketches and 3D
ground truth for supervision, which are often difficult to obtain in
industrial applications and are sensitive to noisy inputs. We propose viewing
CAD reconstruction as a specific instance of sparse-view 3D reconstruction to
overcome these limitations. While this reformulation offers a promising
perspective, existing 3D reconstruction methods typically require natural
images and corresponding camera poses as inputs, which introduces two
significant challenges: (1) modality discrepancy between CAD sketches and
natural images, and (2) difficulty of accurate camera pose estimation for CAD
sketches. To solve these issues, we first transform the CAD sketches into
representations resembling natural images and extract corresponding masks.
Next, we manually calculate the camera poses for the orthographic views to
ensure accurate alignment within the 3D coordinate system. Finally, we employ a
customized sparse-view 3D reconstruction method to achieve high-quality
reconstructions from aligned orthographic views. By leveraging raster CAD
sketches for self-supervision, our approach eliminates the reliance on vector
CAD sketches and 3D ground truth. Experiments on the Sub-Fusion360 dataset
demonstrate that our proposed method significantly outperforms previous
approaches in CAD reconstruction performance and exhibits strong robustness to
noisy inputs.
☆ Accelerating Diffusion Transformer via Gradient-Optimized Cache
Feature caching has emerged as an effective strategy to accelerate diffusion
transformer (DiT) sampling through temporal feature reuse. It is a challenging
problem since (1) Progressive error accumulation from cached blocks
significantly degrades generation quality, particularly when over 50\% of
blocks are cached; (2) Current error compensation approaches neglect dynamic
perturbation patterns during the caching process, leading to suboptimal error
correction. To solve these problems, we propose the Gradient-Optimized Cache
(GOC) with two key innovations: (1) Cached Gradient Propagation: A gradient
queue dynamically computes the gradient differences between cached and
recomputed features. These gradients are weighted and propagated to subsequent
steps, directly compensating for the approximation errors introduced by
caching. (2) Inflection-Aware Optimization: Through statistical analysis of
feature variation patterns, we identify critical inflection points where the
denoising trajectory changes direction. By aligning gradient updates with these
detected phases, we prevent conflicting gradient directions during error
correction. Extensive evaluations on ImageNet demonstrate GOC's superior
trade-off between efficiency and quality. With 50\% cached blocks, GOC achieves
IS 216.28 (26.3\% higher) and FID 3.907 (43\% lower) compared to baseline DiT,
while maintaining identical computational costs. These improvements persist
across various cache ratios, demonstrating robust adaptability to different
acceleration requirements.
☆ Development and Enhancement of Text-to-Image Diffusion Models
This research focuses on the development and enhancement of text-to-image
denoising diffusion models, addressing key challenges such as limited sample
diversity and training instability. By incorporating Classifier-Free Guidance
(CFG) and Exponential Moving Average (EMA) techniques, this study significantly
improves image quality, diversity, and stability. Utilizing Hugging Face's
state-of-the-art text-to-image generation model, the proposed enhancements
establish new benchmarks in generative AI. This work explores the underlying
principles of diffusion models, implements advanced strategies to overcome
existing limitations, and presents a comprehensive evaluation of the
improvements achieved. Results demonstrate substantial progress in generating
stable, diverse, and high-quality images from textual descriptions, advancing
the field of generative artificial intelligence and providing new foundations
for future applications.
Keywords: Text-to-image, Diffusion model, Classifier-free guidance,
Exponential moving average, Image generation.
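Both enhancements are standard techniques; a minimal sketch of how they are commonly
implemented follows (the model(x_t, t, cond) signature is a placeholder, not the
Hugging Face diffusers API).
    import torch

    @torch.no_grad()
    def cfg_noise_prediction(model, x_t, t, text_emb, null_emb, guidance_scale=7.5):
        """Classifier-free guidance: push the prediction from unconditional toward conditional."""
        eps_uncond = model(x_t, t, null_emb)
        eps_cond = model(x_t, t, text_emb)
        return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

    @torch.no_grad()
    def ema_update(ema_model, model, decay=0.9999):
        """Keep an exponential moving average of the weights for stable sampling."""
        for p_ema, p in zip(ema_model.parameters(), model.parameters()):
            p_ema.mul_(decay).add_(p, alpha=1.0 - decay)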
★ R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model
Recently, DeepSeek R1 demonstrated how reinforcement learning with simple
rule-based incentives can enable autonomous development of complex reasoning in
large language models, characterized by the "aha moment", in which the model
manifests self-reflection and increased response length during training.
However, attempts to extend this success to multimodal reasoning often failed
to reproduce these key characteristics. In this report, we present the first
successful replication of these emergent characteristics for multimodal
reasoning on only a non-SFT 2B model. Starting with Qwen2-VL-2B and applying
reinforcement learning directly on the SAT dataset, our model achieves 59.47%
accuracy on CVBench, outperforming the base model by approximately 30% and
exceeding both SFT settings by ~2%. In addition, we share our failed attempts
and insights from trying to achieve R1-like reasoning using RL with instruct
models, aiming to shed light on the challenges involved. Our key observations
include: (1) applying RL on instruct models often results in trivial reasoning
trajectories, and (2) naive length rewards are ineffective in eliciting
reasoning capabilities. The project code is available at
https://github.com/turningpoint-ai/VisualThinker-R1-Zero
comment: 10 pages, 6 figures
☆ HexPlane Representation for 3D Semantic Scene Understanding
In this paper, we introduce the HexPlane representation for 3D semantic scene
understanding. Specifically, we first design the View Projection Module (VPM)
to project the 3D point cloud into six planes to maximally retain the original
spatial information. Features of six planes are extracted by the 2D encoder and
sent to the HexPlane Association Module (HAM) to adaptively fuse the most
informative information for each point. The fused point features are further
fed to the task head to yield the ultimate predictions. Compared to the popular
point and voxel representation, the HexPlane representation is efficient and
can utilize highly optimized 2D operations to process sparse and unordered 3D
point clouds. It can also leverage off-the-shelf 2D models, network weights,
and training recipes to achieve accurate scene understanding in 3D space. On
ScanNet and SemanticKITTI benchmarks, our algorithm, dubbed HexNet3D, achieves
competitive performance with previous algorithms. In particular, on the ScanNet
3D segmentation task, our method obtains 77.0 mIoU on the validation set,
surpassing Point Transformer V2 by 1.6 mIoU. We also observe encouraging
results in indoor 3D detection tasks. Note that our method can be seamlessly
integrated into existing voxel-based, point-based, and range-based approaches
and brings considerable gains without bells and whistles. The codes will be
available upon publication.
comment: 7 pages, 2 figures
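A rough sketch of the six-plane projection idea (per-view depth rasterisation here;
the paper's View Projection Module may retain richer per-point features).
    import numpy as np

    def project_to_six_planes(points, resolution=64):
        """Rasterise a point cloud (N, 3) into depth images for six axis-aligned views."""
        p = (points - points.min(0)) / (points.max(0) - points.min(0) + 1e-8)  # normalise to [0, 1]
        idx = np.minimum((p * resolution).astype(int), resolution - 1)
        planes = np.zeros((6, resolution, resolution), dtype=np.float32)
        # (in-plane axes, depth axis, keep min depth?) for the +x, -x, +y, -y, +z, -z views
        specs = [((1, 2), 0, False), ((1, 2), 0, True),
                 ((0, 2), 1, False), ((0, 2), 1, True),
                 ((0, 1), 2, False), ((0, 1), 2, True)]
        for v, ((a, b), d, use_min) in enumerate(specs):
            if use_min:
                planes[v] = 1.0                                          # initialise to "far"
                np.minimum.at(planes[v], (idx[:, a], idx[:, b]), p[:, d])
            else:
                np.maximum.at(planes[v], (idx[:, a], idx[:, b]), p[:, d])
        return planes  # (6, R, R), ready for a shared 2D encoder

    print(project_to_six_planes(np.random.rand(2048, 3)).shape)  # (6, 64, 64)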
☆ EDM: Efficient Deep Feature Matching
Recent feature matching methods have achieved remarkable performance but lack
efficiency consideration. In this paper, we revisit the mainstream
detector-free matching pipeline and improve all its stages considering both
accuracy and efficiency. We propose an Efficient Deep feature Matching network,
EDM. We first adopt a deeper CNN with fewer dimensions to extract multi-level
features. Then we present a Correlation Injection Module that conducts feature
transformation on high-level deep features, and progressively injects feature
correlations from global to local for efficient multi-scale feature
aggregation, improving both speed and performance. In the refinement stage, a
novel lightweight bidirectional axis-based regression head is designed to
directly predict subpixel-level correspondences from latent features, avoiding
the significant computational cost of explicitly locating keypoints on
high-resolution local feature heatmaps. Moreover, effective selection
strategies are introduced to enhance matching accuracy. Extensive experiments
show that our EDM achieves competitive matching accuracy on various benchmarks
and exhibits excellent efficiency, offering valuable best practices for
real-world applications. The code is available at
https://github.com/chicleee/EDM.
☆ SMILENet: Unleashing Extra-Large Capacity Image Steganography via a Synergistic Mosaic InvertibLE Hiding Network
Jun-Jie Huang, Zihan Chen, Tianrui Liu, Wentao Zhao, Xin Deng, Xinwang Liu, Meng Wang, Pier Luigi Dragotti
Existing image steganography methods face fundamental limitations in hiding
capacity (typically $1\sim7$ images) due to severe information interference and
uncoordinated capacity-distortion trade-off. We propose SMILENet, a novel
synergistic framework that achieves 25 image hiding through three key
innovations: (i) A synergistic network architecture coordinates reversible and
non-reversible operations to efficiently exploit information redundancy in both
secret and cover images. The reversible Invertible Cover-Driven Mosaic (ICDM)
module and Invertible Mosaic Secret Embedding (IMSE) module establish
cover-guided mosaic transformations and representation embedding with
mathematically guaranteed invertibility for distortion-free embedding. The
non-reversible Secret Information Selection (SIS) module and Secret Detail
Enhancement (SDE) module implement learnable feature modulation for critical
information selection and enhancement. (ii) A unified training strategy that
coordinates complementary modules to achieve 3.0x higher capacity than existing
methods with superior visual quality. (iii) Last but not least, we introduce a
new metric that models the capacity-distortion trade-off for evaluating image
steganography algorithms; it jointly considers hiding capacity and distortion
and provides a unified evaluation approach for assessing results with different
numbers of secret images. Extensive experiments on DIV2K, Paris StreetView and
ImageNet1K show that SMILENet outperforms state-of-the-art methods in terms of
hiding capacity, recovery quality as well as security against steganalysis
methods.
☆ We Care Each Pixel: Calibrating on Medical Segmentation Model
Medical image segmentation is fundamental for computer-aided diagnostics,
providing accurate delineation of anatomical structures and pathological
regions. While common metrics such as Accuracy, DSC, IoU, and HD primarily
quantify spatial agreement between predictions and ground-truth labels, they do
not assess the calibration quality of segmentation models, which is crucial for
clinical reliability. To address this limitation, we propose pixel-wise
Expected Calibration Error (pECE), a novel metric that explicitly measures
miscalibration at the pixel level, thereby ensuring both spatial precision and
confidence reliability. We further introduce a morphological adaptation
strategy that applies morphological operations to ground-truth masks before
computing calibration losses, particularly benefiting margin-based losses such
as Margin SVLS and NACL. Additionally, we present the Signed Distance
Calibration Loss (SDC), which aligns boundary geometry with calibration
objectives by penalizing discrepancies between predicted and ground-truth
signed distance functions (SDFs). Extensive experiments demonstrate that our
method not only enhances segmentation performance but also improves calibration
quality, yielding more trustworthy confidence estimates. Code is available at:
https://github.com/EagleAdelaide/SDC-Loss.
comment: Under Review
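A generic per-pixel expected-calibration-error computation in the spirit of pECE
(binary case; the binning scheme and exact definition may differ from the paper's).
    import numpy as np

    def pixelwise_ece(probs, labels, n_bins=10):
        """Pixel-wise ECE for binary segmentation.
        probs: (N, H, W) predicted foreground probabilities; labels: (N, H, W) in {0, 1}."""
        conf = np.maximum(probs, 1 - probs).ravel()                    # per-pixel confidence
        correct = ((probs > 0.5) == labels.astype(bool)).ravel()       # per-pixel correctness
        bins = np.linspace(0.5, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(bins[:-1], bins[1:]):
            in_bin = (conf >= lo) & (conf < hi) if hi < 1.0 else (conf >= lo) & (conf <= hi)
            if in_bin.any():
                ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
        return ece

    probs = np.random.rand(2, 64, 64)
    labels = (np.random.rand(2, 64, 64) > 0.5).astype(int)
    print(pixelwise_ece(probs, labels))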
☆ Visual Cues of Gender and Race are Associated with Stereotyping in Vision-Language Models
Current research on bias in Vision Language Models (VLMs) has important
limitations: it is focused exclusively on trait associations while ignoring
other forms of stereotyping, it examines specific contexts where biases are
expected to appear, and it conceptualizes social categories like race and
gender as binary, ignoring the multifaceted nature of these identities. Using
standardized facial images that vary in prototypicality, we test four VLMs for
both trait associations and homogeneity bias in open-ended contexts. We find
that VLMs consistently generate more uniform stories for women compared to men,
with people who are more gender prototypical in appearance being represented
more uniformly. By contrast, VLMs represent White Americans more uniformly than
Black Americans. Unlike with gender prototypicality, race prototypicality was
not related to stronger uniformity. In terms of trait associations, we find
limited evidence of stereotyping: Black Americans were consistently linked with
basketball across all models, while other racial associations (i.e., art,
healthcare, appearance) varied by specific VLM. These findings demonstrate that
VLM stereotyping manifests in ways that go beyond simple group membership,
suggesting that conventional bias mitigation strategies may be insufficient to
address VLM stereotyping and that homogeneity bias persists even when trait
associations are less apparent in model outputs.
☆ Fake It To Make It: Virtual Multiviews to Enhance Monocular Indoor Semantic Scene Completion IROS 2025
Monocular Indoor Semantic Scene Completion (SSC) aims to reconstruct a 3D
semantic occupancy map from a single RGB image of an indoor scene, inferring
spatial layout and object categories from 2D image cues. The challenge of this
task arises from the depth, scale, and shape ambiguities that emerge when
transforming a 2D image into 3D space, particularly within the complex and
often heavily occluded environments of indoor scenes. Current SSC methods often
struggle with these ambiguities, resulting in distorted or missing object
representations. To overcome these limitations, we introduce an innovative
approach that leverages novel view synthesis and multiview fusion.
Specifically, we demonstrate how virtual cameras can be placed around the scene
to emulate multiview inputs that enhance contextual scene information. We also
introduce a Multiview Fusion Adaptor (MVFA) to effectively combine the
multiview 3D scene predictions into a unified 3D semantic occupancy map.
Finally, we identify and study the inherent limitation of generative techniques
when applied to SSC, specifically the Novelty-Consistency tradeoff. Our system,
GenFuSE, demonstrates IoU score improvements of up to 2.8% for Scene Completion
and 4.9% for Semantic Scene Completion when integrated with existing SSC
networks on the NYUv2 dataset. This work introduces GenFuSE as a standard
framework for advancing monocular SSC with synthesized inputs.
comment: Submitted to IROS 2025
☆ Taming Video Diffusion Prior with Scene-Grounding Guidance for 3D Gaussian Splatting from Sparse Inputs CVPR2025
Despite recent successes in novel view synthesis using 3D Gaussian Splatting
(3DGS), modeling scenes with sparse inputs remains a challenge. In this work,
we address two critical yet overlooked issues in real-world sparse-input
modeling: extrapolation and occlusion. To tackle these issues, we propose to
use a reconstruction by generation pipeline that leverages learned priors from
video diffusion models to provide plausible interpretations for regions outside
the field of view or occluded. However, the generated sequences exhibit
inconsistencies that do not fully benefit subsequent 3DGS modeling. To address
the challenge of inconsistencies, we introduce a novel scene-grounding guidance
based on rendered sequences from an optimized 3DGS, which tames the diffusion
model to generate consistent sequences. This guidance is training-free and does
not require any fine-tuning of the diffusion model. To facilitate holistic
scene modeling, we also propose a trajectory initialization method. It
effectively identifies regions that are outside the field of view and occluded.
We further design a scheme tailored for 3DGS optimization with generated
sequences. Experiments demonstrate that our method significantly improves upon
the baseline and achieves state-of-the-art performance on challenging
benchmarks.
comment: Accepted by CVPR2025. The project page is available at
https://zhongyingji.github.io/guidevd-3dgs/
☆ Lightweight Hypercomplex MRI Reconstruction: A Generalized Kronecker-Parameterized Approach
Magnetic Resonance Imaging (MRI) is crucial for clinical diagnostics but is
hindered by prolonged scan times. Current deep learning models enhance MRI
reconstruction but are often memory-intensive and unsuitable for
resource-limited systems. This paper introduces a lightweight MRI
reconstruction model leveraging Kronecker-Parameterized Hypercomplex Neural
Networks to achieve high performance with reduced parameters. By integrating
Kronecker-based modules, including Kronecker MLP, Kronecker Window Attention,
and Kronecker Convolution, the proposed model efficiently extracts spatial
features while preserving representational power. We introduce Kronecker U-Net
and Kronecker SwinMR, which maintain high reconstruction quality with
approximately 50% fewer parameters compared to existing models. Experimental
evaluation on the FastMRI dataset demonstrates competitive PSNR, SSIM, and
LPIPS metrics, even at high acceleration factors (8x and 16x), with no
significant performance drop. Additionally, Kronecker variants exhibit superior
generalization and reduced overfitting on limited datasets, facilitating
efficient MRI reconstruction on hardware-constrained systems. This approach
sets a new benchmark for parameter-efficient medical imaging models.
comment: 11 pages, 3 figures. Submitted for publication
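The parameter saving of Kronecker parameterization is easy to see in a toy layer
(shapes are illustrative; the paper's Kronecker MLP, window attention, and
convolution modules are more elaborate).
    import torch
    import torch.nn as nn

    class KroneckerLinear(nn.Module):
        """Linear layer whose weight is W = kron(A, B), cutting parameters from in*out to
        a1*a2 + b1*b2. Factor shapes must satisfy in = a2*b2 and out = a1*b1."""
        def __init__(self, a_shape, b_shape):
            super().__init__()
            self.A = nn.Parameter(torch.randn(*a_shape) * 0.02)
            self.B = nn.Parameter(torch.randn(*b_shape) * 0.02)
            self.bias = nn.Parameter(torch.zeros(a_shape[0] * b_shape[0]))

        def forward(self, x):
            W = torch.kron(self.A, self.B)      # (out, in); materialised here only for clarity
            return x @ W.T + self.bias

    layer = KroneckerLinear(a_shape=(16, 16), b_shape=(32, 48))   # behaves like Linear(768, 512)
    y = layer(torch.randn(4, 768))
    print(y.shape, sum(p.numel() for p in layer.parameters()))    # far fewer params than 768*512
In practice the Kronecker product is usually not materialised; the same product can be
applied via reshapes and two small matrix multiplications.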
☆ Accelerated Patient-specific Non-Cartesian MRI Reconstruction using Implicit Neural Representations
Di Xu, Hengjie Liu, Xin Miao, Daniel O'Connor, Jessica E. Scholey, Wensha Yang, Mary Feng, Michael Ohliger, Hui Lin, Dan Ruan, Yang Yang, Ke Sheng
The scanning time for a fully sampled MRI can be undesirably lengthy.
Compressed sensing has been developed to minimize image artifacts in
accelerated scans, but the required iterative reconstruction is computationally
complex and difficult to generalize on new cases. Image-domain-based deep
learning methods (e.g., convolutional neural networks) emerged as a faster
alternative but face challenges in modeling continuous k-space, a problem
amplified with non-Cartesian sampling commonly used in accelerated acquisition.
In comparison, implicit neural representations can model continuous signals in
the frequency domain and thus are compatible with arbitrary k-space sampling
patterns. The current study develops a novel generative-adversarially trained
implicit neural representation (k-GINR) for de novo undersampled non-Cartesian
k-space reconstruction. k-GINR consists of two stages: 1) supervised training
on an existing patient cohort; 2) self-supervised patient-specific
optimization. In stage 1, the network is trained with the
generative-adversarial network on diverse patients of the same anatomical
region supervised by fully sampled acquisition. In stage 2, undersampled
k-space data of individual patients is used to tailor the prior-embedded
network for patient-specific optimization. The proposed framework was evaluated
on the UCSF StarVIBE T1-weighted liver dataset. k-GINR is compared with an
image-domain deep learning method, Deep Cascade CNN, and a compressed sensing
method. k-GINR consistently outperformed the baselines with a larger
performance advantage observed at very high accelerations (e.g., 20 times).
k-GINR offers great value for direct non-Cartesian k-space reconstruction of
liver anatomy for new incoming patients across a wide range of accelerations.
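A minimal sketch of an implicit neural representation over continuous k-space
coordinates (a Fourier-feature MLP; the adversarial stage-1 pre-training and the
patient-specific stage-2 fitting are only indicated in comments).
    import torch
    import torch.nn as nn

    class KSpaceINR(nn.Module):
        """Maps a (possibly non-Cartesian) k-space coordinate to a complex sample value,
        returned here as (real, imag) outputs."""
        def __init__(self, in_dim=2, hidden=256, n_layers=4, n_freq=10):
            super().__init__()
            self.n_freq = n_freq
            layers, d = [], in_dim * 2 * n_freq
            for _ in range(n_layers):
                layers += [nn.Linear(d, hidden), nn.ReLU()]
                d = hidden
            layers.append(nn.Linear(d, 2))   # real and imaginary parts
            self.net = nn.Sequential(*layers)

        def forward(self, coords):
            # Fourier-feature encoding of continuous k-space coordinates
            freqs = 2.0 ** torch.arange(self.n_freq, device=coords.device) * torch.pi
            ang = coords[..., None] * freqs                          # (N, in_dim, n_freq)
            enc = torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(1)
            return self.net(enc)

    model = KSpaceINR()
    k_coords = torch.rand(4096, 2) * 2 - 1        # radial/spiral sample locations in [-1, 1]^2
    pred = model(k_coords)                        # (4096, 2): predicted complex k-space values
    # Training would regress pred against acquired undersampled samples (stage 2, patient-specific),
    # after cohort-level adversarial pre-training of the prior (stage 1, not shown).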
♻ ☆ Tell Me What to Track: Infusing Robust Language Guidance for Enhanced Referring Multi-Object Tracking
Referring multi-object tracking (RMOT) is an emerging cross-modal task that
aims to localize an arbitrary number of targets based on a language expression
and continuously track them in a video. This intricate task involves reasoning
on multi-modal data and precise target localization with temporal association.
However, prior studies overlook the imbalanced data distribution between
newborn targets and existing targets due to the nature of the task. In
addition, they only indirectly fuse multi-modal features, struggling to deliver
clear guidance on newborn target detection. To solve the above issues, we
conduct a collaborative matching strategy to alleviate the impact of the
imbalance, boosting the ability to detect newborn targets while maintaining
tracking performance. In the encoder, we integrate and enhance the cross-modal
and multi-scale fusion, overcoming the bottlenecks in previous work, where
limited multi-modal information is shared and interacted between feature maps.
In the decoder, we also develop a referring-infused adaptation that provides
explicit referring guidance through the query tokens. The experiments showcase
the superior performance of our model (+3.42%) compared to prior works,
demonstrating the effectiveness of our designs.
♻ ☆ Motion by Queries: Identity-Motion Trade-offs in Text-to-Video Generation
Text-to-video diffusion models have shown remarkable progress in generating
coherent video clips from textual descriptions. However, the interplay between
motion, structure, and identity representations in these models remains
under-explored. Here, we investigate how self-attention query features (a.k.a.
Q features) simultaneously govern motion, structure, and identity and examine
the challenges arising when these representations interact. Our analysis
reveals that Q affects not only layout, but that during denoising Q also has a
strong effect on subject identity, making it hard to transfer motion without
the side-effect of transferring identity. Understanding this dual role enabled
us to control query feature injection (Q injection) and demonstrate two
applications: (1) a zero-shot motion transfer method that is 20 times more
efficient than existing approaches, and (2) a training-free technique for
consistent multi-shot video generation, where characters maintain identity
across multiple video shots while Q injection enhances motion fidelity.
comment: (1) Project page:
https://research.nvidia.com/labs/par/MotionByQueries/ (2) The methods and
results in section 5, "Consistent multi-shot video generation", are based on
the arXiv version 1 (v1) of this work. Here, in version 2 (v2), we extend and
further analyze those findings to efficient motion transfer
♻ ☆ NeRF-Aug: Data Augmentation for Robotics with Neural Radiance Fields
Training a policy that can generalize to unknown objects is a long-standing
challenge within the field of robotics. The performance of a policy often drops
significantly in situations where an object in the scene was not seen during
training. To solve this problem, we present NeRF-Aug, a novel method that is
capable of teaching a policy to interact with objects that are not present in
the dataset. This approach differs from existing approaches by leveraging the
speed, photorealism, and 3D consistency of a neural radiance field for
augmentation. NeRF-Aug both creates more photorealistic data and runs 63%
faster than existing methods. We demonstrate the effectiveness of our method on
5 tasks with 9 novel objects that are not present in the expert demonstrations.
We achieve an average performance boost of 55.6% when comparing our method to
the next best method. You can see video results at https://nerf-aug.github.io.
♻ ☆ Real-Time Incremental Explanations for Object Detectors in Autonomous Driving
Object detectors are widely used in safety-critical real-time applications
such as autonomous driving. Explainability is especially important for
safety-critical applications, and due to the variety of object detectors and
their often proprietary nature, black-box explainability tools are needed.
However, existing black-box explainability tools for AI models rely on multiple
model calls, rendering them impractical for real-time use.
In this paper, we introduce IncX, an algorithm and a tool for real-time
black-box explainability for object detectors. The algorithm is based on linear
transformations of saliency maps, producing sufficient explanations. We
evaluate our implementation on four widely used video datasets of autonomous
driving and demonstrate that IncX's explanations are comparable in quality to
the state-of-the-art and are computed two orders of magnitude faster than the
state-of-the-art, making them usable in real time.
♻ ☆ DepthCues: Evaluating Monocular Depth Perception in Large Vision Models CVPR 2025
Large-scale pre-trained vision models are becoming increasingly prevalent,
offering expressive and generalizable visual representations that benefit
various downstream tasks. Recent studies on the emergent properties of these
models have revealed their high-level geometric understanding, in particular in
the context of depth perception. However, it remains unclear how depth
perception arises in these models without explicit depth supervision provided
during pre-training. To investigate this, we examine whether the monocular
depth cues, similar to those used by the human visual system, emerge in these
models. We introduce a new benchmark, DepthCues, designed to evaluate depth cue
understanding, and present findings across 20 diverse and representative
pre-trained vision models. Our analysis shows that human-like depth cues emerge
in more recent larger models. We also explore enhancing depth perception in
large vision models by fine-tuning on DepthCues, and find that even without
dense depth supervision, this improves depth estimation. To support further
research, our benchmark and evaluation code will be made publicly available for
studying depth perception in vision models.
comment: Accepted to CVPR 2025. Project page:
https://danier97.github.io/depthcues/
♻ ☆ AutoLUT: LUT-Based Image Super-Resolution with Automatic Sampling and Adaptive Residual Learning CVPR2025
In recent years, the increasing popularity of Hi-DPI screens has driven a
rising demand for high-resolution images. However, the limited computational
power of edge devices poses a challenge in deploying complex super-resolution
neural networks, highlighting the need for efficient methods. While prior works
have made significant progress, they have not fully exploited pixel-level
information. Moreover, their reliance on fixed sampling patterns limits both
accuracy and the ability to capture fine details in low-resolution images. To
address these challenges, we introduce two plug-and-play modules designed to
capture and leverage pixel information effectively in Look-Up Table (LUT) based
super-resolution networks. Our method introduces Automatic Sampling
(AutoSample), a flexible LUT sampling approach where sampling weights are
automatically learned during training to adapt to pixel variations and expand
the receptive field without added inference cost. We also incorporate Adaptive
Residual Learning (AdaRL) to enhance inter-layer connections, enabling detailed
information flow and improving the network's ability to reconstruct fine
details. Our method achieves significant performance improvements on both MuLUT
and SPF-LUT while maintaining similar storage sizes. Specifically, for MuLUT,
we achieve a PSNR improvement of approximately +0.20 dB on average
across five datasets. For SPF-LUT, with more than a 50% reduction in storage
space and about a 2/3 reduction in inference time, our method still maintains
performance comparable to the original. The code is available at
https://github.com/SuperKenVery/AutoLUT.
comment: Accepted by CVPR2025
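As a rough illustration of the AutoSample idea (a hedged sketch under our own assumptions, not the authors' code; the module name and window size are illustrative), learned sampling weights over a small window can replace a fixed 2x2 pattern while remaining a tiny, fixed-cost operation once training ends.

```python
# Illustrative sketch: n_samples weighted combinations of a 3x3 window,
# learned end-to-end, so the effective receptive field of each LUT lookup
# grows while inference cost stays that of a small frozen convolution.
import torch
import torch.nn as nn

class AutoSample(nn.Module):
    def __init__(self, window=3, n_samples=4):
        super().__init__()
        self.weights = nn.Conv2d(1, n_samples, kernel_size=window, bias=False)

    def forward(self, x):           # x: (B, 1, H, W) low-resolution luminance
        s = self.weights(x)         # (B, n_samples, H-2, W-2) sampled values
        return s.clamp(0, 1)        # kept in the range the LUT is indexed by

x = torch.rand(1, 1, 32, 32)
print(AutoSample()(x).shape)        # torch.Size([1, 4, 30, 30])
```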
♻ ☆ Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset
Yingzi Ma, Jiongxiao Wang, Fei Wang, Siyuan Ma, Jiazhao Li, Jinsheng Pan, Xiujun Li, Furong Huang, Lichao Sun, Bo Li, Yejin Choi, Muhao Chen, Chaowei Xiao
Machine unlearning has emerged as an effective strategy for forgetting
specific information in the training data. However, with the increasing
integration of visual data, privacy concerns in Vision Language Models (VLMs)
remain underexplored. To address this, we introduce Facial Identity Unlearning
Benchmark (FIUBench), a novel VLM unlearning benchmark designed to robustly
evaluate the effectiveness of unlearning algorithms under the Right to be
Forgotten setting. Specifically, we formulate the VLM unlearning task via
constructing the Fictitious Facial Identity VQA dataset and apply a two-stage
evaluation pipeline that is designed to precisely control the sources of
information and their exposure levels. In terms of evaluation, since VLMs
support various ways of asking questions with the same semantic meaning,
we also provide robust evaluation metrics including membership inference
attacks and carefully designed adversarial privacy attacks to evaluate the
performance of algorithms. Through the evaluation of four baseline VLM
unlearning algorithms within FIUBench, we find that all methods remain limited
in their unlearning performance, with significant trade-offs between model
utility and forget quality. Furthermore, our findings also highlight the
importance of privacy attacks for robust evaluations. We hope FIUBench will
drive progress in developing more effective VLM unlearning algorithms.
♻ ☆ Spatial regularisation for improved accuracy and interpretability in keypoint-based registration
Benjamin Billot, Ramya Muthukrishnan, Esra Abaci-Turk, P. Ellen Grant, Nicholas Ayache, Hervé Delingette, Polina Golland
Unsupervised registration strategies bypass requirements in ground truth
transforms or segmentations by optimising similarity metrics between fixed and
moved volumes. Among these methods, a recent subclass of approaches based on
unsupervised keypoint detection stands out as very promising for
interpretability. Specifically, these methods train a network to predict
feature maps for fixed and moving images, from which explainable centres of
mass are computed to obtain point clouds that are then aligned in closed form.
However, the features returned by the network often yield spatially diffuse
patterns that are hard to interpret, thus undermining the purpose of
keypoint-based registration. Here, we propose a three-fold loss to regularise
the spatial distribution of the features. First, we use the KL divergence to
model features as point spread functions that we interpret as probabilistic
keypoints. Then, we sharpen the spatial distributions of these features to
increase the precision of the detected landmarks. Finally, we introduce a new
repulsive loss across keypoints to encourage spatial diversity. Overall, our
loss considerably improves the interpretability of the features, which now
correspond to precise and anatomically meaningful landmarks. We demonstrate our
three-fold loss in foetal rigid motion tracking and brain MRI affine
registration tasks, where it not only outperforms state-of-the-art unsupervised
strategies, but also bridges the gap with state-of-the-art supervised methods.
Our code is available at https://github.com/BenBillot/spatial_regularisation.
comment: under review
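A minimal sketch of the three ingredients described above, assuming non-negative feature maps that can be normalised into distributions (the exact loss forms and hyper-parameters here are illustrative, not the paper's):

```python
# Soft-argmax centres of mass, an entropy term that sharpens each map, and a
# pairwise repulsive term that pushes keypoints apart.
import torch

def keypoint_losses(feat, eps=1e-8, sigma=0.05):
    # feat: (K, H, W) non-negative feature maps, one per keypoint
    K, H, W = feat.shape
    p = feat.flatten(1)
    p = p / (p.sum(dim=1, keepdim=True) + eps)               # probabilistic keypoints
    ys = torch.linspace(0, 1, H)
    xs = torch.linspace(0, 1, W)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    coords = torch.stack([gy.flatten(), gx.flatten()], dim=1)  # (HW, 2)
    centres = p @ coords                                       # (K, 2) centres of mass
    sharpen = -(p * (p + eps).log()).sum(dim=1).mean()         # low entropy = peaky maps
    d2 = torch.cdist(centres, centres).pow(2)                  # pairwise squared distances
    off_diag = ~torch.eye(K, dtype=torch.bool)
    repulsive = torch.exp(-d2[off_diag] / sigma).mean()        # penalise nearby keypoints
    return centres, sharpen, repulsive
```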
♻ ☆ Depth Completion with Multiple Balanced Bases and Confidence for Dense Monocular SLAM
Weijian Xie, Guanyi Chu, Quanhao Qian, Yihao Yu, Hai Li, Danpeng Chen, Shangjin Zhai, Nan Wang, Hujun Bao, Guofeng Zhang
Dense SLAM based on monocular cameras has immense application
value in the field of AR/VR, especially when it is performed on a mobile
device. In this paper, we propose a novel method that integrates a light-weight
depth completion network into a sparse SLAM system using a multi-basis depth
representation, so that dense mapping can be performed online even on a mobile
phone. Specifically, we present an optimized multi-basis depth
completion network, called BBC-Net, tailored to the characteristics of
traditional sparse SLAM systems. BBC-Net can predict multiple balanced bases
and a confidence map from a monocular image with sparse points generated by
off-the-shelf keypoint-based SLAM systems. The final depth is a linear
combination of predicted depth bases that can be optimized by tuning the
corresponding weights. To seamlessly incorporate the weights into traditional
SLAM optimization and ensure efficiency and robustness, we design a set of
depth weight factors, which makes our network a versatile plug-in module,
facilitating easy integration into various existing sparse SLAM systems and
significantly enhancing global depth consistency through bundle adjustment. To
verify the portability of our method, we integrate BBC-Net into two
representative SLAM systems. The experimental results on various datasets show
that the proposed method achieves better performance in monocular dense mapping
than the state-of-the-art methods. We provide an online demo running on a
mobile phone, which verifies the efficiency and mapping quality of the proposed
method in real-world scenarios.
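The "linear combination of bases" step can be pictured as a small closed-form fit against the sparse SLAM points; the sketch below is our own illustration with hypothetical variable names, not BBC-Net's code.

```python
# The dense depth is a weighted sum of predicted bases; the weights are solved
# by least squares against the sparse depths of SLAM map points.
import numpy as np

def solve_basis_weights(bases, sparse_uv, sparse_depth):
    # bases: (B, H, W) depth bases; sparse_uv: (N, 2) pixel coords; sparse_depth: (N,)
    A = bases[:, sparse_uv[:, 1], sparse_uv[:, 0]].T      # (N, B) basis values at the points
    w, *_ = np.linalg.lstsq(A, sparse_depth, rcond=None)  # least-squares weights
    dense = np.tensordot(w, bases, axes=1)                # (H, W) fused depth
    return w, dense

bases = np.random.rand(4, 48, 64) + 1.0
uv = np.random.randint(0, [64, 48], size=(30, 2))         # (u, v) within a 64x48 image
gt = np.random.rand(30) * 5 + 1
w, dense = solve_basis_weights(bases, uv, gt)
```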
♻ ☆ MicroMIL: Graph-based Contextual Multiple Instance Learning for Patient Diagnosis Using Microscopy Images
Cancer diagnosis has greatly benefited from the integration of whole-slide
images (WSIs) with multiple instance learning (MIL), enabling high-resolution
analysis of tissue morphology. Graph-based MIL (GNN-MIL) approaches have
emerged as powerful solutions for capturing spatial and relational structures
in WSIs, thereby improving diagnostic accuracy. However, despite their
effectiveness, WSIs require significant computational and infrastructural
resources, limiting accessibility in resource-constrained settings. Microscopy
imaging provides a cost-effective alternative, but applying GNN-MIL to
microscopy imaging is challenging due to the absence of spatial coordinates and
the high redundancy in pathologist-acquired images. To address these issues, we
introduce MicroMIL, the first weakly-supervised MIL framework specifically
designed for microscopy imaging. MicroMIL leverages a representative image
extractor (RIE) that employs deep cluster embedding (DCE) and hard
Gumbel-Softmax to dynamically reduce redundancy and select representative
images. These selected images serve as graph nodes, with edges determined by
cosine similarity, eliminating the need for spatial coordinates while
preserving relational structure. Extensive experiments on a real-world colon
cancer dataset and the BreakHis dataset demonstrate that MicroMIL achieves
state-of-the-art performance, improving both diagnostic accuracy and robustness
to redundancy. The code is available at
https://anonymous.4open.science/r/MicroMIL-6C7C
comment: The first two authors contributed equally to this work
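One way to read the selection-and-graph step, as a hedged sketch (the aggregation, threshold, and function names are our own assumptions): hard Gumbel-Softmax assigns images to clusters, per-cluster representatives become graph nodes, and cosine similarity defines the edges.

```python
import torch
import torch.nn.functional as F

def build_graph(feats, cluster_logits, sim_thresh=0.7):
    # feats: (N, D) image embeddings; cluster_logits: (N, C) soft cluster assignments
    onehot = F.gumbel_softmax(cluster_logits, tau=1.0, hard=True, dim=-1)      # (N, C)
    reps = onehot.T @ feats / (onehot.sum(dim=0).unsqueeze(1) + 1e-8)          # (C, D) cluster reps
    sim = F.cosine_similarity(reps.unsqueeze(1), reps.unsqueeze(0), dim=-1)    # (C, C)
    adj = (sim > sim_thresh).float()                                           # graph adjacency
    return reps, adj

reps, adj = build_graph(torch.randn(100, 256), torch.randn(100, 8))
```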
♻ ☆ Completion as Enhancement: A Degradation-Aware Selective Image Guided Network for Depth Completion CVPR 2025
In this paper, we introduce the Selective Image Guided Network (SigNet), a
novel degradation-aware framework that transforms depth completion into depth
enhancement for the first time. Moving beyond direct completion using
convolutional neural networks (CNNs), SigNet initially densifies sparse depth
data through non-CNN densification tools to obtain coarse yet dense depth. This
approach eliminates the mismatch and ambiguity caused by direct convolution
over irregularly sampled sparse data. Subsequently, SigNet redefines completion
as enhancement, establishing a self-supervised degradation bridge between the
coarse depth and the targeted dense depth for effective RGB-D fusion. To
achieve this, SigNet leverages the implicit degradation to adaptively select
high-frequency components (e.g., edges) of RGB data to compensate for the
coarse depth. This degradation is further integrated into a multi-modal
conditional Mamba, dynamically generating the state parameters to enable
efficient global high-frequency information interaction. We conduct extensive
experiments on the NYUv2, DIML, SUN RGBD, and TOFDC datasets, demonstrating the
state-of-the-art (SOTA) performance of SigNet.
comment: CVPR 2025
♻ ☆ PRAM: Place Recognition Anywhere Model for Efficient Visual Localization
Visual localization is a key technique to a variety of applications, e.g.,
autonomous driving, AR/VR, and robotics. For these real applications, both
efficiency and accuracy are important especially on edge devices with limited
computing resources. However, previous frameworks, e.g., absolute pose
regression (APR), scene coordinate regression (SCR), and the hierarchical
method (HM), are limited in either accuracy or efficiency in both indoor and
outdoor environments. In this paper, we propose the place recognition anywhere
model (PRAM), a new framework, to perform visual localization efficiently and
accurately by recognizing 3D landmarks. Specifically, PRAM first generates
landmarks directly in 3D space in a self-supervised manner. Without relying on
commonly used classic semantic labels, these 3D landmarks can be defined in any
place in indoor and outdoor scenes with higher generalization ability.
Representing the map with 3D landmarks, PRAM discards global descriptors,
repetitive local descriptors, and redundant 3D points, increasing the memory
efficiency significantly. Then, sparse keypoints, rather than dense pixels, are
utilized as the input tokens to a transformer-based recognition module for
landmark recognition, which enables PRAM to recognize hundreds of landmarks
with high time and memory efficiency. At test time, sparse keypoints and
predicted landmark labels are utilized for outlier removal and landmark-wise
2D-3D matching as opposed to exhaustive 2D-2D matching, which further increases
the time efficiency. A comprehensive evaluation of APRs, SCRs, HMs, and PRAM on
both indoor and outdoor datasets demonstrates that PRAM outperforms APRs and
SCRs in large-scale scenes by a large margin and gives competitive accuracy
to HMs while reducing memory cost by over 90% and running 2.4 times faster, leading to
a better balance between efficiency and accuracy.
comment: project page: https://feixue94.github.io/pram-project/
♻ ☆ OASIS Uncovers: High-Quality T2I Models, Same Old Stereotypes ICLR 2025
Images generated by text-to-image (T2I) models often exhibit visual biases
and stereotypes of concepts such as culture and profession. Existing
quantitative measures of stereotypes are based on statistical parity that does
not align with the sociological definition of stereotypes and, therefore,
incorrectly categorizes biases as stereotypes. Instead of oversimplifying
stereotypes as biases, we propose a quantitative measure of stereotypes that
aligns with its sociological definition. We then propose OASIS to measure the
stereotypes in a generated dataset and understand their origins within the T2I
model. OASIS includes two scores to measure stereotypes from a generated image
dataset: (M1) Stereotype Score to measure the distributional violation of
stereotypical attributes, and (M2) WALS to measure spectral variance in the
images along a stereotypical attribute. OASIS also includes two methods to
understand the origins of stereotypes in T2I models: (U1) StOP to discover
attributes that the T2I model internally associates with a given concept, and
(U2) SPI to quantify the emergence of stereotypical attributes in the latent
space of the T2I model during image generation. Despite the considerable
progress in image fidelity, using OASIS, we conclude that newer T2I models such
as FLUX.1 and SDv3 contain strong stereotypical predispositions about concepts
and still generate images with widespread stereotypical attributes.
Additionally, the quantity of stereotypes worsens for nationalities with lower
Internet footprints.
comment: Accepted as a Spotlight paper at ICLR 2025
♻ ☆ ATRNet-STAR: A Large Dataset and Benchmark Towards Remote Sensing Object Recognition in the Wild
Yongxiang Liu, Weijie Li, Li Liu, Jie Zhou, Bowen Peng, Yafei Song, Xuying Xiong, Wei Yang, Tianpeng Liu, Zhen Liu, Xiang Li
The absence of publicly available, large-scale, high-quality datasets for
Synthetic Aperture Radar Automatic Target Recognition (SAR ATR) has
significantly hindered the application of rapidly advancing deep learning
techniques, which hold huge potential to unlock new capabilities in this field.
This is primarily because collecting large volumes of diverse target samples
from SAR images is prohibitively expensive, largely due to privacy concerns,
the characteristics of microwave radar imagery perception, and the need for
specialized expertise in data annotation. Throughout the history of SAR ATR
research, there have been only a handful of small datasets, mainly including
targets such as ships, airplanes, and buildings. The only vehicle dataset,
MSTAR, was collected in the 1990s and has long been a valuable source for SAR
ATR. To fill this gap, this paper introduces a new large-scale dataset named
ATRNet-STAR with 40 different vehicle categories collected under various
realistic imaging conditions and scenes. It marks a substantial advancement in
dataset scale and diversity, comprising over 190,000 well-annotated samples, 10
times larger than its predecessor, the famous MSTAR. Building such a large
dataset is a challenging task, and we detail the data collection scheme.
We then illustrate the value of ATRNet-STAR by extensively evaluating the
performance of 15 representative methods with 7 different experimental settings
on challenging classification and detection benchmarks derived from the
dataset. Finally, based on our extensive experiments, we identify valuable
insights for SAR ATR and discuss potential future research directions in this
field. We hope that the scale, diversity, and benchmark of ATRNet-STAR can
significantly facilitate the advancement of SAR ATR.
comment: 17 pages, 14 figures; ATRNet-STAR:
https://github.com/waterdisappear/ATRNet-STAR
♻ ☆ Vulnerabilities in AI-generated Image Detection: The Challenge of Adversarial Attacks
Recent advancements in image synthesis, particularly with the advent of GAN
and Diffusion models, have amplified public concerns regarding the
dissemination of disinformation. To address such concerns, numerous
AI-generated Image (AIGI) Detectors have been proposed and achieved promising
performance in identifying fake images. However, a systematic understanding of
the adversarial robustness of AIGI detectors is still lacking. In this paper,
we examine the vulnerability of state-of-the-art AIGI detectors against
adversarial attacks under white-box and black-box settings, which has been
rarely investigated so far. To this end, we propose a new method to attack AIGI
detectors. First, inspired by the obvious difference between real images and
fake images in the frequency domain, we add perturbations under the frequency
domain to push the image away from its original frequency distribution. Second,
we explore the full posterior distribution of the surrogate model to further
narrow this gap between heterogeneous AIGI detectors, e.g. transferring
adversarial examples across CNNs and ViTs. This is achieved by introducing a
novel post-train Bayesian strategy that turns a single surrogate into a
Bayesian one, capable of simulating diverse victim models using one pre-trained
surrogate, without the need for re-training. We name our method as
Frequency-based Post-train Bayesian Attack, or FPBA. Through FPBA, we show that
adversarial attacks pose a real threat to AIGI detectors, because FPBA can
deliver successful black-box attacks across models, generators, defense
methods, and even evade cross-generator detection, which is a crucial
real-world detection scenario. The code will be shared upon acceptance.
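To illustrate only the frequency-domain part (a minimal sketch under our own assumptions; the step size and clipping are illustrative and this is not FPBA itself), one can perturb the FFT magnitude of an image and map it back to pixel space:

```python
import torch

def frequency_perturb(img, epsilon=0.05):
    # img: (C, H, W) tensor in [0, 1]
    spec = torch.fft.fft2(img)
    mag, phase = spec.abs(), spec.angle()
    noise = torch.randn_like(mag)
    mag_adv = mag * (1.0 + epsilon * noise.sign())   # push the magnitude off-distribution
    spec_adv = torch.polar(mag_adv, phase)           # recombine magnitude and phase
    adv = torch.fft.ifft2(spec_adv).real
    return adv.clamp(0, 1)

adv = frequency_perturb(torch.rand(3, 224, 224))
```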
♻ ☆ MVCTrack: Boosting 3D Point Cloud Tracking via Multimodal-Guided Virtual Cues ICRA 2025
3D single object tracking is essential in autonomous driving and robotics.
Existing methods often struggle with sparse and incomplete point cloud
scenarios. To address these limitations, we propose a Multimodal-guided Virtual
Cues Projection (MVCP) scheme that generates virtual cues to enrich sparse
point clouds. Additionally, we introduce an enhanced tracker MVCTrack based on
the generated virtual cues. Specifically, the MVCP scheme seamlessly integrates
RGB sensors into LiDAR-based systems, leveraging a set of 2D detections to
create dense 3D virtual cues that significantly alleviate the sparsity of point
clouds. These virtual cues can naturally integrate with existing LiDAR-based 3D
trackers, yielding substantial performance gains. Extensive experiments
demonstrate that our method achieves competitive performance on the NuScenes
dataset.
comment: Accepted by ICRA 2025
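A hedged sketch of what "virtual cues" from 2D detections could look like (our own illustration; the depth source, stride, and function name are assumptions): pixels inside a detection box are back-projected with the camera intrinsics and appended to the sparse LiDAR cloud.

```python
import numpy as np

def virtual_cues_from_box(box, depth, K, stride=4):
    # box: (x1, y1, x2, y2); depth: (H, W) metric depth map; K: 3x3 camera intrinsics
    x1, y1, x2, y2 = [int(v) for v in box]
    us, vs = np.meshgrid(np.arange(x1, x2, stride), np.arange(y1, y2, stride))
    us, vs = us.ravel(), vs.ravel()
    z = depth[vs, us]
    x = (us - K[0, 2]) * z / K[0, 0]                 # pinhole back-projection
    y = (vs - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1)               # (M, 3) virtual points

K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
cues = virtual_cues_from_box((100, 120, 180, 200), np.full((480, 640), 12.0), K)
```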
♻ ☆ Toward Robust Non-Transferable Learning: A Survey and Benchmark
Over the past decades, researchers have primarily focused on improving the
generalization abilities of models, with limited attention given to regulating
such generalization. However, the ability of models to generalize to unintended
data (e.g., harmful or unauthorized data) can be exploited by malicious
adversaries in unforeseen ways, potentially resulting in violations of model
ethics. Non-transferable learning (NTL), a task aimed at reshaping the
generalization abilities of deep learning models, was proposed to address these
challenges. While numerous methods have been proposed in this field, a
comprehensive review of existing progress and a thorough analysis of current
limitations remain lacking. In this paper, we bridge this gap by presenting the
first comprehensive survey on NTL and introducing NTLBench, the first benchmark
to evaluate NTL performance and robustness within a unified framework.
Specifically, we first introduce the task settings, general framework, and
criteria of NTL, followed by a summary of NTL approaches. Furthermore, we
emphasize the often-overlooked issue of robustness against various attacks that
can destroy the non-transferable mechanism established by NTL. Experiments
conducted via NTLBench verify the limitations of existing NTL methods in
robustness. Finally, we discuss the practical applications of NTL, along with
its future directions and associated challenges.
comment: Code is available at https://github.com/tmllab/NTLBench
♻ ☆ Revisiting the Generalization Problem of Low-level Vision Models Through the Lens of Image Deraining
Generalization remains a significant challenge for low-level vision models,
which often struggle with unseen degradations in real-world scenarios despite
their success in controlled benchmarks. In this paper, we revisit the
generalization problem in low-level vision models. Image deraining is selected
as a case study due to its well-defined and easily decoupled structure,
allowing for more effective observation and analysis. Through comprehensive
experiments, we reveal that the generalization issue is not primarily due to
limited network capacity but rather the failure of existing training
strategies, which leads networks to overfit specific degradation patterns. Our
findings show that guiding networks to focus on learning the underlying image
content, rather than the degradation patterns, is key to improving
generalization. We demonstrate that balancing the complexity of background
images and degradations in the training data helps networks better fit the
image distribution. Furthermore, incorporating content priors from pre-trained
generative models significantly enhances generalization. Experiments on both
image deraining and image denoising validate the proposed strategies. We
believe the insights and solutions will inspire further research and improve
the generalization of low-level vision models.
comment: arXiv admin note: substantial text overlap with arXiv:2305.15134
♻ ☆ A Simple and Generalist Approach for Panoptic Segmentation
Panoptic segmentation is an important computer vision task, where the current
state-of-the-art solutions require specialized components to perform well. We
propose a simple generalist framework based on a deep encoder - shallow decoder
architecture with per-pixel prediction, essentially fine-tuning a massively
pretrained image model with minimal additional components. Applied naively, this method
does not yield good results. We show that this is due to imbalance during
training and propose a novel method for reducing it - centroid regression in
the space of spectral positional embeddings. Our method achieves panoptic
quality (PQ) of 55.1 on the challenging MS-COCO dataset, state-of-the-art
performance among generalist methods.
♻ ☆ Analysis of the BraTS 2023 Intracranial Meningioma Segmentation Challenge MICCAI
Dominic LaBella, Ujjwal Baid, Omaditya Khanna, Shan McBurney-Lin, Ryan McLean, Pierre Nedelec, Arif Rashid, Nourel Hoda Tahon, Talissa Altes, Radhika Bhalerao, Yaseen Dhemesh, Devon Godfrey, Fathi Hilal, Scott Floyd, Anastasia Janas, Anahita Fathi Kazerooni, John Kirkpatrick, Collin Kent, Florian Kofler, Kevin Leu, Nazanin Maleki, Bjoern Menze, Maxence Pajot, Zachary J. Reitman, Jeffrey D. Rudie, Rachit Saluja, Yury Velichko, Chunhao Wang, Pranav Warman, Maruf Adewole, Jake Albrecht, Udunna Anazodo, Syed Muhammad Anwar, Timothy Bergquist, Sully Francis Chen, Verena Chung, Rong Chai, Gian-Marco Conte, Farouk Dako, James Eddy, Ivan Ezhov, Nastaran Khalili, Juan Eugenio Iglesias, Zhifan Jiang, Elaine Johanson, Koen Van Leemput, Hongwei Bran Li, Marius George Linguraru, Xinyang Liu, Aria Mahtabfar, Zeke Meier, Ahmed W. Moawad, John Mongan, Marie Piraud, Russell Takeshi Shinohara, Walter F. Wiggins, Aly H. Abayazeed, Rachel Akinola, András Jakab, Michel Bilello, Maria Correia de Verdier, Priscila Crivellaro, Christos Davatzikos, Keyvan Farahani, John Freymann, Christopher Hess, Raymond Huang, Philipp Lohmann, Mana Moassefi, Matthew W. Pease, Phillipp Vollmuth, Nico Sollmann, David Diffley, Khanak K. Nandolia, Daniel I. Warren, Ali Hussain, Pascal Fehringer, Yulia Bronstein, Lisa Deptula, Evan G. Stein, Mahsa Taherzadeh, Eduardo Portela de Oliveira, Aoife Haughey, Marinos Kontzialis, Luca Saba, Benjamin Turner, Melanie M. T. Brüßeler, Shehbaz Ansari, Athanasios Gkampenis, David Maximilian Weiss, Aya Mansour, Islam H. Shawali, Nikolay Yordanov, Joel M. Stein, Roula Hourani, Mohammed Yahya Moshebah, Ahmed Magdy Abouelatta, Tanvir Rizvi, Klara Willms, Dann C. Martin, Abdullah Okar, Gennaro D'Anna, Ahmed Taha, Yasaman Sharifi, Shahriar Faghani, Dominic Kite, Marco Pinho, Muhammad Ammar Haider, Alejandro Aristizabal, Alexandros Karargyris, Hasan Kassem, Sarthak Pati, Micah Sheller, Michelle Alonso-Basanta, Javier Villanueva-Meyer, Andreas M. Rauschecker, Ayman Nada, Mariam Aboian, Adam E. Flanders, Benedikt Wiestler, Spyridon Bakas, Evan Calabrese
We describe the design and results from the BraTS 2023 Intracranial
Meningioma Segmentation Challenge. The BraTS Meningioma Challenge differed from
prior BraTS Glioma challenges in that it focused on meningiomas, which are
typically benign extra-axial tumors with diverse radiologic and anatomical
presentation and a propensity for multiplicity. Nine participating teams each
developed deep-learning automated segmentation models using image data from the
largest multi-institutional systematically expert annotated multilabel
multi-sequence meningioma MRI dataset to date, which included 1000 training set
cases, 141 validation set cases, and 283 hidden test set cases. Each case
included T2, FLAIR, T1, and T1Gd brain MRI sequences with associated tumor
compartment labels delineating enhancing tumor, non-enhancing tumor, and
surrounding non-enhancing FLAIR hyperintensity. Participant automated
segmentation models were evaluated and ranked based on a scoring system
evaluating lesion-wise metrics including dice similarity coefficient (DSC) and
95% Hausdorff Distance. The top ranked team had a lesion-wise median dice
similarity coefficient (DSC) of 0.976, 0.976, and 0.964 for enhancing tumor,
tumor core, and whole tumor, respectively and a corresponding average DSC of
0.899, 0.904, and 0.871, respectively. These results serve as state-of-the-art
benchmarks for future pre-operative meningioma automated segmentation
algorithms. Additionally, we found that 1286 of 1424 cases (90.3%) had at least
1 compartment voxel abutting the edge of the skull-stripped image edge, which
requires further investigation into optimal pre-processing face anonymization
steps.
comment: Accepted for publication at the Journal of Machine Learning for
Biomedical Imaging (MELBA) https://melba-journal.org/2025:003 22 pages, 6
tables, 12 figures, MICCAI, MELBA
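For reference, the Dice similarity coefficient used in the ranking is computed as below for a single binary compartment (this is the standard definition, not the challenge's full lesion-wise pipeline):

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray, eps=1e-8) -> float:
    # DSC = 2 * |pred ∩ gt| / (|pred| + |gt|)
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return float(2.0 * inter / (pred.sum() + gt.sum() + eps))

pred = np.zeros((64, 64, 64), dtype=bool); pred[20:40, 20:40, 20:40] = True
gt = np.zeros_like(pred); gt[22:42, 20:40, 20:40] = True
print(round(dice(pred, gt), 3))  # 0.9
```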
♻ ☆ RS-vHeat: Heat Conduction Guided Efficient Remote Sensing Foundation Model
Huiyang Hu, Peijin Wang, Hanbo Bi, Boyuan Tong, Zhaozhi Wang, Wenhui Diao, Hao Chang, Yingchao Feng, Ziqi Zhang, Yaowei Wang, Qixiang Ye, Kun Fu, Xian Sun
Remote sensing foundation models largely break away from the traditional
paradigm of designing task-specific models, offering greater scalability across
multiple tasks. However, they face challenges such as low computational
efficiency and limited interpretability, especially when dealing with
large-scale remote sensing images. To overcome these, we draw inspiration from
heat conduction, a physical process modeling local heat diffusion. Building on
this idea, we are the first to explore the potential of using the parallel
computing model of heat conduction to simulate the local region correlations in
high-resolution remote sensing images, and introduce RS-vHeat, an efficient
multi-modal remote sensing foundation model. Specifically, RS-vHeat 1) applies
the Heat Conduction Operator (HCO) with a complexity of $O(N^{1.5})$ and a
global receptive field, reducing computational overhead while capturing remote
sensing object structure information to guide heat diffusion; 2) learns the
frequency distribution representations of various scenes through a
self-supervised strategy based on frequency domain hierarchical masking and
multi-domain reconstruction; 3) significantly improves efficiency and
performance over state-of-the-art techniques across 4 tasks and 10 datasets.
Compared to attention-based remote sensing foundation models, we reduce memory
usage by 84%, FLOPs by 24%, and improve throughput by 2.7 times. The code
will be made publicly available.
comment: 19 pages, 8 figures and 10 tables
♻ ☆ Language-guided Medical Image Segmentation with Target-informed Multi-level Contrastive Alignments
Medical image segmentation is crucial in modern medical image analysis and
can aid in the diagnosis of various disease conditions. Recently, language-guided
segmentation methods have shown promising results in automating image
segmentation where text reports are incorporated as guidance. These text
reports, containing image impressions and insights given by clinicians,
provide auxiliary guidance. However, these methods neglect the inherent
pattern gaps between the two distinct modalities, which leads to sub-optimal
image-text feature fusion without proper cross-modality feature alignments.
Contrastive alignments are widely used to associate image-text semantics in
representation learning; however, they have not been exploited to bridge the
pattern gaps in language-guided segmentation, which relies on subtle low-level
image details to represent diseases. Existing contrastive alignment methods
typically align high-level global image semantics without involving low-level,
localized target information, and therefore fail to explore fine-grained text
guidance for language-guided segmentation. In this study, we propose a
language-guided segmentation network with Target-informed Multi-level
Contrastive Alignments (TMCA). TMCA enables target-informed cross-modality
alignments and fine-grained text guidance to bridge the pattern gaps in
language-guided segmentation. Specifically, we introduce: 1) a target-sensitive
semantic distance module that enables granular image-text alignment modelling,
and 2) a multi-level alignment strategy that directs text guidance on low-level
image features. In addition, a language-guided target enhancement module is
proposed to leverage the aligned text to redirect attention to focus on
critical localized image features. Extensive experiments on 4 image-text
datasets, involving 3 medical imaging modalities, demonstrated that our TMCA
achieved superior performance.
♻ ☆ A Survey on 3D Gaussian Splatting
3D Gaussian splatting (GS) has emerged as a transformative technique in
explicit radiance fields and computer graphics. This innovative approach,
characterized by the use of millions of learnable 3D Gaussians, represents a
significant departure from mainstream neural radiance field approaches, which
predominantly use implicit, coordinate-based models to map spatial coordinates
to pixel values. 3D GS, with its explicit scene representation and
differentiable rendering algorithm, not only promises real-time rendering
capability but also introduces unprecedented levels of editability. This
positions 3D GS as a potential game-changer for the next generation of 3D
reconstruction and representation. In the present paper, we provide the first
systematic overview of the recent developments and critical contributions in
the domain of 3D GS. We begin with a detailed exploration of the underlying
principles and the driving forces behind the emergence of 3D GS, laying the
groundwork for understanding its significance. A focal point of our discussion
is the practical applicability of 3D GS. By enabling unprecedented rendering
speed, 3D GS opens up a plethora of applications, ranging from virtual reality
to interactive media and beyond. This is complemented by a comparative analysis
of leading 3D GS models, evaluated across various benchmark tasks to highlight
their performance and practical utility. The survey concludes by identifying
current challenges and suggesting potential avenues for future research.
Through this survey, we aim to provide a valuable resource for both newcomers
and seasoned researchers, fostering further exploration and advancement in
explicit radiance fields.
comment: Ongoing project. Paper list:
https://github.com/guikunchen/Awesome3DGS ; Benchmark:
https://github.com/guikunchen/3DGS-Benchmarks
♻ ☆ Gaussians-to-Life: Text-Driven Animation of 3D Gaussian Splatting Scenes 3DV 2025
State-of-the-art novel view synthesis methods achieve impressive results for
multi-view captures of static 3D scenes. However, the reconstructed scenes
still lack "liveliness," a key component for creating engaging 3D experiences.
Recently, novel video diffusion models generate realistic videos with complex
motion and enable animations of 2D images; however, they cannot naively be used
to animate 3D scenes as they lack multi-view consistency. To breathe life into
the static world, we propose Gaussians2Life, a method for animating parts of
high-quality 3D scenes in a Gaussian Splatting representation. Our key idea is
to leverage powerful video diffusion models as the generative component of our
model and to combine these with a robust technique to lift 2D videos into
meaningful 3D motion. We find that, in contrast to prior work, this enables
realistic animations of complex, pre-existing 3D scenes and further enables the
animation of a large variety of object classes, while related work is mostly
focused on prior-based character animation, or single 3D objects. Our model
enables the creation of consistent, immersive 3D experiences for arbitrary
scenes.
comment: Project website at https://wimmerth.github.io/gaussians2life.html.
Accepted to 3DV 2025
♻ ☆ Sustainable transparency in Recommender Systems: Bayesian Ranking of Images for Explainability
Recommender Systems have become crucial in the modern world, commonly guiding
users towards relevant content or products, and having a large influence over
the decisions of users and citizens. However, ensuring transparency and user
trust in these systems remains a challenge; personalized explanations have
emerged as a solution, offering justifications for recommendations. Among the
existing approaches for generating personalized explanations, using existing
visual content created by users is a promising option to maximize transparency
and user trust. State-of-the-art models that follow this approach, despite
leveraging highly optimized architectures, employ surrogate learning tasks that
do not efficiently model the objective of ranking images as explanations for a
given recommendation; this leads to a suboptimal training process with high
computational costs that may not be reduced without affecting model
performance. This work presents BRIE, a novel model where we leverage Bayesian
Pairwise Ranking to enhance the training process, allowing us to consistently
outperform state-of-the-art models in six real-world datasets while reducing
its model size by up to 64 times and its CO2 emissions by up to 75% in training
and inference.
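The Bayesian Pairwise Ranking criterion at the core of this training strategy reduces to the standard BPR objective; a minimal sketch (the scoring model is a stand-in, not BRIE's architecture):

```python
import torch
import torch.nn.functional as F

def bpr_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    # A positive (relevant) image should score higher than a sampled negative one.
    return -F.logsigmoid(pos_scores - neg_scores).mean()

scores_pos = torch.randn(32, requires_grad=True)   # scores of user-authored images
scores_neg = torch.randn(32)                       # scores of sampled negatives
loss = bpr_loss(scores_pos, scores_neg)
loss.backward()
```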
♻ ☆ Bridging Text and Vision: A Multi-View Text-Vision Registration Approach for Cross-Modal Place Recognition
Mobile robots necessitate advanced natural language understanding
capabilities to accurately identify locations and perform tasks such as package
delivery. However, traditional visual place recognition (VPR) methods rely
solely on single-view visual information and cannot interpret human language
descriptions. To overcome this challenge, we bridge text and vision by
proposing a multi-view (360° views of the surroundings) text-vision
registration approach called Text4VPR for the place recognition task, which is the
first method that exclusively utilizes textual descriptions to match a database
of images. Text4VPR employs the frozen T5 language model to extract global
textual embeddings. Additionally, it utilizes the Sinkhorn algorithm with
temperature coefficient to assign local tokens to their respective clusters,
thereby aggregating visual descriptors from images. During the training stage,
Text4VPR emphasizes the alignment between individual text-image pairs for
precise textual description. In the inference stage, Text4VPR uses the Cascaded
Cross-Attention Cosine Alignment (CCCA) to address the internal mismatch
between text and image groups. Subsequently, Text4VPR performs precise place
matching based on the descriptions of text-image groups. On Street360Loc, the
first text-to-image VPR dataset, which we created, Text4VPR builds a robust baseline,
achieving a leading top-1 accuracy of 57% and a leading top-10 accuracy of 92%
within a 5-meter radius on the test set, which indicates that localization from
textual descriptions to images is not only feasible but also holds significant
potential for further advancement, as shown in Figure 1.
comment: 8 pages, 4 figures, conference
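A minimal sketch of the Sinkhorn-with-temperature assignment step used to map local tokens to clusters (iteration count and temperature are illustrative; this is not the authors' implementation):

```python
import torch

def sinkhorn_assign(sim, temperature=0.1, n_iters=3):
    # sim: (N_tokens, N_clusters) similarity scores
    P = torch.exp(sim / temperature)
    for _ in range(n_iters):
        P = P / P.sum(dim=1, keepdim=True)   # each token distributes mass over clusters
        P = P / P.sum(dim=0, keepdim=True)   # each cluster receives balanced mass
    return P / P.sum(dim=1, keepdim=True)    # final row-normalised soft assignment

assign = sinkhorn_assign(torch.randn(196, 64))
```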
♻ ☆ Efficient Event-Based Object Detection: A Hybrid Neural Network with Spatial and Temporal Attention
Event cameras offer high temporal resolution and dynamic range with minimal
motion blur, making them promising for robust object detection. While Spiking
Neural Networks (SNNs) on neuromorphic hardware are often considered for
energy-efficient, low-latency event-based data processing, they typically fall short of
Artificial Neural Networks (ANNs) in accuracy and flexibility. Here, we
introduce Attention-based Hybrid SNN-ANN backbones for event-based object
detection to leverage the strengths of both SNN and ANN architectures. A novel
Attention-based SNN-ANN bridge module captures sparse spatial and temporal
relations from the SNN layer and converts them into dense feature maps for the
ANN part of the backbone. Additionally, we present a variant that integrates
DWConvLSTMs into the ANN blocks to capture slower dynamics. This multi-timescale
network combines fast SNN processing for short timesteps with long-term dense
RNN processing, effectively capturing both fast and slow dynamics. Experimental
results demonstrate that our proposed method surpasses SNN-based approaches by
significant margins, with results comparable to existing ANN and RNN-based
methods. Unlike ANN-only networks, the hybrid setup allows us to implement the
SNN blocks on digital neuromorphic hardware to investigate the feasibility of
our approach. Extensive ablation studies and implementation on neuromorphic
hardware confirm the effectiveness of our proposed modules and architectural
choices. Our hybrid SNN-ANN architectures pave the way for ANN-like performance
at a drastically reduced parameter, latency, and power budget.
♻ ☆ General Detection-based Text Line Recognition
We introduce a general detection-based approach to text line recognition, be
it printed (OCR) or handwritten (HTR), with Latin, Chinese, or ciphered
characters. Detection-based approaches have until now been largely discarded
for HTR because reading characters separately is often challenging, and
character-level annotation is difficult and expensive. We overcome these
challenges thanks to three main insights: (i) synthetic pre-training with
sufficiently diverse data enables learning reasonable character localization
for any script; (ii) modern transformer-based detectors can jointly detect a
large number of instances, and, if trained with an adequate masking strategy,
leverage consistency between the different detections; (iii) once a pre-trained
detection model with approximate character localization is available, it is
possible to fine-tune it with line-level annotation on real data, even with a
different alphabet. Our approach, dubbed DTLR, builds on a completely different
paradigm than state-of-the-art HTR methods, which rely on autoregressive
decoding, predicting character values one by one, whereas we process a complete
line in parallel. Remarkably, we demonstrate good performance on a large range
of scripts, usually tackled with specialized approaches. In particular, we
improve state-of-the-art performances for Chinese script recognition on the
CASIA v2 dataset, and for cipher recognition on the Borg and Copiale datasets.
Our code and models are available at https://github.com/raphael-baena/DTLR.
♻ ☆ METDrive: Multi-modal End-to-end Autonomous Driving with Temporal Guidance ICRA
Multi-modal end-to-end autonomous driving has shown promising advancements in
recent work. By embedding more modalities into end-to-end networks, the
system's understanding of both static and dynamic aspects of the driving
environment is enhanced, thereby improving the safety of autonomous driving. In
this paper, we introduce METDrive, an end-to-end system that leverages temporal
guidance from the embedded time series features of ego states, including
rotation angles, steering, throttle signals, and waypoint vectors. The
geometric features derived from perception sensor data and the time series
features of ego state data jointly guide the waypoint prediction with the
proposed temporal guidance loss function. We evaluated METDrive on the CARLA
leaderboard benchmarks, achieving a driving score of 70%, a route completion
score of 94%, and an infraction score of 0.78.
comment: Accepted by ICRA
♻ ☆ A Simple yet Effective Test-Time Adaptation for Zero-Shot Monocular Metric Depth Estimation
The recent development of foundation models for monocular depth estimation
such as Depth Anything paved the way to zero-shot monocular depth estimation.
Since it returns an affine-invariant disparity map, the favored technique to
recover the metric depth consists in fine-tuning the model. However, this stage
is not straightforward: it can be costly and time-consuming because of the
training and the creation of the dataset. The latter must contain images
captured by the camera that will be used at test time and the corresponding
ground truth. Moreover, the fine-tuning may also degrade the generalizing
capacity of the original model. Instead, we propose in this paper a new method
to rescale Depth Anything predictions using 3D points provided by sensors or
techniques such as low-resolution LiDAR or structure-from-motion with poses
given by an IMU. This approach avoids fine-tuning and preserves the
generalizing power of the original depth estimation model while being robust to
the noise of the sparse depth or of the depth model. Our experiments highlight
enhancements relative to zero-shot monocular metric depth estimation methods,
competitive results compared to fine-tuned approaches and a better robustness
than depth completion approaches. Code available at
https://gitlab.ensta.fr/ssh/monocular-depth-rescaling.
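The rescaling idea amounts to a closed-form fit between the affine-invariant disparity and the sparse metric measurements; a hedged sketch with hypothetical variable names (not the released code):

```python
import numpy as np

def rescale_disparity(disp, sparse_uv, sparse_depth, eps=1e-6):
    # disp: (H, W) affine-invariant disparity; sparse_uv: (N, 2); sparse_depth: (N,) metric
    d = disp[sparse_uv[:, 1], sparse_uv[:, 0]]
    A = np.stack([d, np.ones_like(d)], axis=1)
    target = 1.0 / np.maximum(sparse_depth, eps)            # metric inverse depth
    (a, b), *_ = np.linalg.lstsq(A, target, rcond=None)     # disparity -> inverse depth
    metric_depth = 1.0 / np.clip(a * disp + b, eps, None)   # convert the whole map
    return metric_depth, (a, b)
```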
♻ ☆ SimpleDepthPose: Fast and Reliable Human Pose Estimation with RGBD-Images
In the rapidly advancing domain of computer vision, accurately estimating the
poses of multiple individuals from various viewpoints remains a significant
challenge, especially when reliability is a key requirement. This paper
introduces a novel algorithm that excels in multi-view, multi-person pose
estimation by incorporating depth information. An extensive evaluation
demonstrates that the proposed algorithm not only generalizes well to unseen
datasets and runs fast, but is also adaptable to
different keypoints. To support further research, all of the work is publicly
accessible.
♻ ☆ GBT-SAM: A Parameter-Efficient Depth-Aware Model for Generalizable Brain tumour Segmentation on mp-MRI
Cecilia Diana-Albelda, Roberto Alcover-Couso, Álvaro García-Martín, Jesus Bescos, Marcos Escudero-Viñolo
Gliomas are brain tumours that stand out for their highly lethal and
aggressive nature, which demands a precise approach in their diagnosis. Medical
image segmentation plays a crucial role in the evaluation and follow-up of
these tumours, allowing specialists to analyse their morphology. However,
existing methods for automatic glioma segmentation often lack generalization
capability across other brain tumour domains, require extensive computational
resources, or fail to fully utilize the multi-parametric MRI (mp-MRI) data used
to delineate them. In this work, we introduce GBT-SAM, a novel Generalizable
Brain Tumour (GBT) framework that extends the Segment Anything Model (SAM) to
brain tumour segmentation tasks. Our method employs a two-step training
protocol: first, fine-tuning the patch embedding layer to process the entire
mp-MRI modalities, and second, incorporating parameter-efficient LoRA blocks
and a Depth-Condition block into the Vision Transformer (ViT) to capture
inter-slice correlations. GBT-SAM achieves state-of-the-art performance on the
Adult Glioma dataset (Dice Score of $93.54$) while demonstrating robust
generalization across Meningioma, Pediatric Glioma, and Sub-Saharan Glioma
datasets. Furthermore, GBT-SAM uses less than 6.5M trainable parameters, thus
offering an efficient solution for brain tumour segmentation. Our code and
models are available at https://github.com/vpulab/med-sam-brain.
♻ ☆ Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs ICLR 2025
Zijia Zhao, Haoyu Lu, Yuqi Huo, Yifan Du, Tongtian Yue, Longteng Guo, Bingning Wang, Weipeng Chen, Jing Liu
Video understanding is a crucial next step for multimodal large language
models (MLLMs). Various benchmarks are introduced for better evaluating the
MLLMs. Nevertheless, current video benchmarks are still inefficient for
evaluating video models during iterative development due to the high cost of
constructing datasets and the difficulty in isolating specific skills. In this
paper, we propose VideoNIAH (Video Needle In A Haystack), a benchmark
construction framework through synthetic video generation. VideoNIAH decouples
video content from their query-responses by inserting unrelated visual
'needles' into original videos. The framework automates the generation of
query-response pairs using predefined rules, minimizing manual labor. The
queries focus on specific aspects of video understanding, enabling more
skill-specific evaluations. The separation between video content and the
queries also allows for increased video variety and evaluations across different
lengths. Utilizing VideoNIAH, we compile a video benchmark VNBench, which
includes tasks such as retrieval, ordering, and counting to evaluate three key
aspects of video understanding: temporal perception, chronological ordering,
and spatio-temporal coherence. We conduct a comprehensive evaluation of both
proprietary and open-source models, uncovering significant differences in their
video understanding capabilities across various tasks. Additionally, we perform
an in-depth analysis of the test results and model configurations. Based on
these findings, we provide some advice for improving video MLLM training,
offering valuable insights to guide future research and model development. The
code and data are available at https://github.com/joez17/VideoNIAH.
comment: ICLR 2025
♻ ☆ Large Language Models are Strong Audio-Visual Speech Recognition Learners ICASSP 2025
Umberto Cappellazzo, Minsu Kim, Honglie Chen, Pingchuan Ma, Stavros Petridis, Daniele Falavigna, Alessio Brutti, Maja Pantic
Multimodal large language models (MLLMs) have recently become a focal point
of research due to their formidable multimodal understanding capabilities. For
example, in the audio and speech domains, an LLM can be equipped with
(automatic) speech recognition (ASR) abilities by just concatenating the audio
tokens, computed with an audio encoder, and the text tokens to achieve
state-of-the-art results. On the contrary, tasks like visual and audio-visual
speech recognition (VSR/AVSR), which also exploit noise-invariant lip movement
information, have received little or no attention. To bridge this gap, we
propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition
capabilities. It leverages pre-trained audio and video encoders to produce
modality-specific tokens which, together with the text tokens, are processed by
a pre-trained LLM (e.g., Llama3.1-8B) to yield the resulting response in an
auto-regressive fashion. Llama-AVSR requires a small number of trainable
parameters as only modality-specific projectors and LoRA modules are trained
whereas the multi-modal encoders and LLM are kept frozen. We evaluate our
proposed approach on LRS3, the largest public AVSR benchmark, and we achieve
new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.79%
and 0.77%, respectively. To bolster our results, we investigate the key factors
that underpin the effectiveness of Llama-AVSR: the choice of the pre-trained
encoders and LLM, the efficient integration of LoRA modules, and the optimal
performance-efficiency trade-off obtained via modality-aware compression rates.
comment: Accepted for publication at ICASSP 2025. The code and checkpoints are
available here: https://github.com/umbertocappellazzo/Llama-AVSR
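A toy sketch of the training recipe described above, with stand-in modules instead of the real encoders and Llama (shapes and module choices are illustrative): the backbones stay frozen and only the modality-specific projectors, plus LoRA adapters in the actual method, receive gradients.

```python
import torch
import torch.nn as nn

audio_enc = nn.Linear(80, 512)          # stand-in for a pre-trained audio encoder
video_enc = nn.Linear(1024, 512)        # stand-in for a pre-trained video encoder
llm = nn.TransformerEncoder(nn.TransformerEncoderLayer(1024, 8, batch_first=True), 2)
for m in (audio_enc, video_enc, llm):
    m.requires_grad_(False)             # frozen backbones

audio_proj = nn.Linear(512, 1024)       # trainable modality-specific projectors
video_proj = nn.Linear(512, 1024)

audio, video, text = torch.randn(2, 50, 80), torch.randn(2, 25, 1024), torch.randn(2, 20, 1024)
tokens = torch.cat([audio_proj(audio_enc(audio)),
                    video_proj(video_enc(video)),
                    text], dim=1)       # concatenated multimodal token sequence
out = llm(tokens)                       # decoded auto-regressively in the real model
```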
♻ ☆ Question-Aware Gaussian Experts for Audio-Visual Question Answering CVPR 2025
Audio-Visual Question Answering (AVQA) requires not only question-based
multimodal reasoning but also precise temporal grounding to capture subtle
dynamics for accurate prediction. However, existing methods mainly use question
information implicitly, limiting focus on question-specific details.
Furthermore, most studies rely on uniform frame sampling, which can miss key
question-relevant frames. Although recent Top-K frame selection methods aim to
address this, their discrete nature still overlooks fine-grained temporal
details. This paper proposes QA-TIGER, a novel framework that explicitly
incorporates question information and models continuous temporal dynamics. Our
key idea is to use Gaussian-based modeling to adaptively focus on both
consecutive and non-consecutive frames based on the question, while explicitly
injecting question information and applying progressive refinement. We leverage
a Mixture of Experts (MoE) to flexibly implement multiple Gaussian models,
activating temporal experts specifically tailored to the question. Extensive
experiments on multiple AVQA benchmarks show that QA-TIGER consistently
achieves state-of-the-art performance. Code is available at
https://aim-skku.github.io/QA-TIGER/
comment: CVPR 2025. Code is available at https://github.com/AIM-SKKU/QA-TIGER
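A hedged sketch of question-conditioned Gaussian temporal weighting with a small mixture of experts (shapes, parameterisation, and names are our own assumptions, not QA-TIGER's code):

```python
import torch
import torch.nn as nn

class GaussianTemporalExperts(nn.Module):
    def __init__(self, q_dim=512, n_experts=4):
        super().__init__()
        self.head = nn.Linear(q_dim, n_experts * 3)   # centre, log-width, gate per expert
        self.n_experts = n_experts

    def forward(self, frame_feats, q_emb):
        # frame_feats: (B, T, D), q_emb: (B, q_dim)
        B, T, _ = frame_feats.shape
        params = self.head(q_emb).view(B, self.n_experts, 3)
        mu = params[..., 0].sigmoid()                        # centres in [0, 1]
        sigma = params[..., 1].exp().clamp(min=1e-3)         # widths
        gate = params[..., 2].softmax(dim=-1)                # expert mixture weights
        t = torch.linspace(0, 1, T).view(1, 1, T)
        w = torch.exp(-0.5 * ((t - mu.unsqueeze(-1)) / sigma.unsqueeze(-1)) ** 2)
        w = (gate.unsqueeze(-1) * w).sum(dim=1)              # (B, T) temporal weights
        w = w / (w.sum(dim=1, keepdim=True) + 1e-8)
        return (w.unsqueeze(-1) * frame_feats).sum(dim=1)    # question-focused video feature

feat = GaussianTemporalExperts()(torch.randn(2, 60, 512), torch.randn(2, 512))
```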
♻ ☆ Nexus-O: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision
Che Liu, Yingji Zhang, Dong Zhang, Weijie Zhang, Chenggong Gong, Haohan Li, Yu Lu, Shilin Zhou, Yue Lu, Ziliang Gan, Ziao Wang, Junwei Liao, Haipang Wu, Ji Liu, André Freitas, Qifan Wang, Zenglin Xu, Rongjuncheng Zhang, Yong Dai
Human beings perceive the real world through a spectrum of sensory
modalities, encompassing auditory, visual, and linguistic faculties. The
journey towards achieving Artificial General Intelligence (AGI) necessitates
the development of models that can emulate these multifaceted perceptual
capabilities and comprehensively understand these diversified data. To this
end, we introduce Nexus-O, an industry-level omni-perceptive
and -interactive model capable of efficiently processing Audio, Image, Video,
and Text data in any combination and output audio/text in an end-to-end way. We
systematically investigate Nexus-O by addressing three key research questions:
First, how can models be efficiently designed and trained to achieve tri-modal
alignment, understanding and reasoning capabilities across multiple modalities?
Second, what approaches can be implemented to evaluate tri-modal model
robustness, ensuring reliable performance and applicability in real-world
scenarios? Third, what strategies can be employed to curate and obtain
high-quality, real-life scenario speech datasets? For the first question, we
design and pre-train Nexus-O based on the vision-language model, rather than
the language model. By pre-training the model over high-quality synthetic audio
data, our model is capable of tri-modal perception and interaction. For the
second question, we introduce a new audio testbed, Nexus-O-audio, comprising
diverse Automatic Speech Recognition (ASR) samples, spanning various real-world
scenarios, such as corporate meetings and live streams. For the third question,
we design the speech data synthesis pipeline to obtain high-quality speech
training datasets, covering various real-world scenarios. Comprehensive
experimentation and an in-depth analysis of tri-modal alignment over latent
space demonstrate the advantages of our model on downstream tasks.
♻ ☆ PseudoTouch: Efficiently Imaging the Surface Feel of Objects for Robotic Manipulation ICRA 2025
Tactile sensing is vital for human dexterous manipulation; however, it has
not been widely used in robotics. Compact, low-cost sensing platforms can
facilitate a change, but unlike their popular optical counterparts, they are
difficult to deploy in high-fidelity tasks due to their low signal
dimensionality and lack of a simulation model. To overcome these challenges, we
introduce PseudoTouch which links high-dimensional structural information to
low-dimensional sensor signals. It does so by learning a low-dimensional
visual-tactile embedding, wherein we encode a depth patch from which we decode
the tactile signal. We collect and train PseudoTouch on a dataset comprising
aligned tactile and visual data pairs obtained through random touching of eight
basic geometric shapes. We demonstrate the utility of our trained PseudoTouch
model in two downstream tasks: object recognition and grasp stability
prediction. In the object recognition task, we evaluate the learned embedding's
performance on a set of five basic geometric shapes and five household objects.
Using PseudoTouch, we achieve an object recognition accuracy of 84% after just ten
touches, surpassing a proprioception baseline. For the grasp stability task, we
use ACRONYM labels to train and evaluate a grasp success predictor using
PseudoTouch's predictions derived from virtual depth information. Our approach
yields a 32% absolute improvement in accuracy compared to the baseline relying
on partial point cloud data. We make the data, code, and trained models
publicly available at https://pseudotouch.cs.uni-freiburg.de.
comment: 7 pages, 5 figures, 2 tables, accepted at ICRA 2025
♻ ☆ DreamForge: Motion-Aware Autoregressive Video Generation for Multi-View Driving Scenes
Jianbiao Mei, Tao Hu, Xuemeng Yang, Licheng Wen, Yu Yang, Tiantian Wei, Yukai Ma, Min Dou, Botian Shi, Yong Liu
Recent advances in diffusion models have improved controllable streetscape
generation and supported downstream perception and planning tasks. However,
challenges remain in accurately modeling driving scenes and generating long
videos. To alleviate these issues, we propose DreamForge, an advanced
diffusion-based autoregressive video generation model tailored for
3D-controllable long-term generation. To enhance the lane and foreground
generation, we introduce perspective guidance and integrate object-wise
position encoding to incorporate local 3D correlation and improve foreground
object modeling. We also propose motion-aware temporal attention to capture
motion cues and appearance changes in videos. By leveraging motion frames and
an autoregressive generation paradigm, we can autoregressively generate long
videos (over 200 frames) using a model trained in short sequences, achieving
superior quality compared to the baseline in 16-frame video evaluations.
Finally, we integrate our method with the realistic simulator DriveArena to
provide more reliable open-loop and closed-loop evaluations for vision-based
driving agents. Project Page:
https://pjlab-adg.github.io/DriveArena/dreamforge.
comment: 15 figures, 9 tables
♻ ☆ Modification Takes Courage: Seamless Image Stitching via Reference-Driven Inpainting
Current image stitching methods often produce noticeable seams in challenging
scenarios such as uneven hue and large parallax. To tackle this problem, we
propose the Reference-Driven Inpainting Stitcher (RDIStitcher), which
reformulates the image fusion and rectangling as a reference-based inpainting
model, incorporating a larger modification fusion area and stronger
modification intensity than previous methods. Furthermore, we introduce a
self-supervised model training method, which enables the implementation of
RDIStitcher without requiring labeled data by fine-tuning a Text-to-Image (T2I)
diffusion model. Recognizing difficulties in assessing the quality of stitched
images, we present Multimodal Large Language Model (MLLM)-based metrics,
offering a new perspective on evaluating stitched image quality. Compared to
the state-of-the-art (SOTA) method, extensive experiments demonstrate that our
method significantly enhances content coherence and seamless transitions in the
stitched images. Especially in the zero-shot experiments, our method exhibits
strong generalization capabilities. Code:
https://github.com/yayoyo66/RDIStitcher
comment: 18 pages, 10 figures
♻ ☆ VidHal: Benchmarking Temporal Hallucinations in Vision LLMs
Vision Large Language Models (VLLMs) are widely acknowledged to be prone to
hallucinations. Existing research addressing this problem has primarily been
confined to image inputs, with limited exploration of video-based
hallucinations. Furthermore, current evaluation methods fail to capture nuanced
errors in generated responses, which are often exacerbated by the rich
spatiotemporal dynamics of videos. To address this, we introduce VidHal, a
benchmark specially designed to evaluate video-based hallucinations in VLLMs.
VidHal is constructed by bootstrapping video instances across a wide range of
common temporal aspects. A defining feature of our benchmark lies in the
careful creation of captions which represent varying levels of hallucination
associated with each video. To enable fine-grained evaluation, we propose a
novel caption ordering task requiring VLLMs to rank captions by hallucinatory
extent. We conduct extensive experiments on VidHal and comprehensively evaluate
a broad selection of models. Our results uncover significant limitations in
existing VLLMs regarding hallucination generation. Through our benchmark, we
aim to inspire further research on 1) holistic understanding of VLLM
capabilities, particularly regarding hallucination, and 2) extensive
development of advanced VLLMs to alleviate this problem.
comment: 9 pages, 10 figures. Code available at
https://github.com/Lookuz/VidHal
♻ ☆ Multi-Knowledge-oriented Nighttime Haze Imaging Enhancer for Vision-driven Intelligent Systems
Salient object detection (SOD) plays a critical role in vision-driven
measurement systems (VMS), facilitating the detection and segmentation of key
visual elements in an image. However, adverse imaging conditions such as
daytime haze, low light, and nighttime haze severely degrade image quality,
complicating the SOD process. To address these challenges, we propose a
multi-task-oriented nighttime haze imaging enhancer (MToIE), which integrates
three tasks: daytime dehazing, low-light enhancement, and nighttime dehazing.
The MToIE incorporates two key innovative components. First, the network
employs a task-oriented node learning mechanism to handle three specific
degradation types (daytime haze, low light, and nighttime haze), with an
embedded self-attention module enhancing its performance in nighttime
imaging. Second, a multi-receptive field enhancement module efficiently
extracts multi-scale features through three parallel depthwise separable
convolution branches with different dilation rates, capturing comprehensive
spatial information with minimal computational overhead. To ensure optimal
image reconstruction quality and visual characteristics, we suggest a hybrid
loss function. Extensive experiments on different types of weather/imaging
conditions illustrate that MToIE surpasses existing methods, significantly
enhancing the accuracy and reliability of vision systems across diverse imaging
scenarios. The code is available at https://github.com/Ai-Chen-Lab/MKoIE.
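The multi-receptive field enhancement module described above can be pictured as
three parallel depthwise separable convolution branches with different dilation
rates whose outputs are fused. The PyTorch sketch below illustrates that
structure; the channel counts, dilation rates (1, 2, 4), and 1x1 fusion layer
are assumptions rather than the paper's exact configuration.

    # Illustrative multi-receptive-field block: three parallel depthwise
    # separable conv branches at different dilation rates, fused by a 1x1 conv.
    import torch
    import torch.nn as nn

    class DepthwiseSeparableConv(nn.Module):
        def __init__(self, channels, dilation):
            super().__init__()
            self.depthwise = nn.Conv2d(channels, channels, kernel_size=3,
                                       padding=dilation, dilation=dilation,
                                       groups=channels, bias=False)
            self.pointwise = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
            self.act = nn.ReLU(inplace=True)

        def forward(self, x):
            return self.act(self.pointwise(self.depthwise(x)))

    class MultiReceptiveFieldBlock(nn.Module):
        def __init__(self, channels, dilations=(1, 2, 4)):
            super().__init__()
            self.branches = nn.ModuleList(
                DepthwiseSeparableConv(channels, d) for d in dilations)
            self.fuse = nn.Conv2d(channels * len(dilations), channels, kernel_size=1)

        def forward(self, x):
            return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

    block = MultiReceptiveFieldBlock(channels=32)
    print(block(torch.randn(1, 32, 64, 64)).shape)  # torch.Size([1, 32, 64, 64])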
♻ ☆ Generalized moduli of continuity under irregular or random deformations via multiscale analysis
Motivated by the problem of robustness to deformations of the input for deep
convolutional neural networks, we identify signal classes which are inherently
stable to irregular deformations induced by distortion fields $\tau\in
L^\infty(\mathbb{R}^d;\mathbb{R}^d)$, to be characterized in terms of a
generalized modulus of continuity associated with the deformation operator.
Resorting to ideas of harmonic and multiscale analysis, we prove that for
signals in multiresolution approximation spaces $U_s$ at scale $s$, stability
in $L^2$ holds in the regime $\|\tau\|_{L^\infty}/s\ll 1$ - essentially as an
effect of the uncertainty principle. Instability occurs when
$\|\tau\|_{L^\infty}/s\gg 1$, and we provide a sharp upper bound for the
asymptotic growth rate. The stability results are then extended to signals in
the Besov space $B^{d/2}_{2,1}$ tailored to the given multiresolution
approximation. We also consider the case of more general time-frequency
deformations. Finally, we provide stochastic versions of the aforementioned
results, namely we study the issue of stability in mean when $\tau(x)$ is
modeled as a random field (not bounded, in general) with identically
distributed variables $|\tau(x)|$, $x\in\mathbb{R}^d$.
comment: 25 pages
♻ ☆ DLF: Extreme Image Compression with Dual-generative Latent Fusion
Recent studies in extreme image compression have achieved remarkable
performance by compressing the tokens from generative tokenizers. However,
these methods often prioritize clustering common semantics within the dataset,
while overlooking the diverse details of individual objects. Consequently, this
results in suboptimal reconstruction fidelity, especially at low bitrates. To
address this issue, we introduce a Dual-generative Latent Fusion (DLF)
paradigm. DLF decomposes the latent into semantic and detail elements,
compressing them through two distinct branches. The semantic branch clusters
high-level information into compact tokens, while the detail branch encodes
perceptually critical details to enhance the overall fidelity. Additionally, we
propose a cross-branch interactive design to reduce redundancy between the two
branches, thereby minimizing the overall bit cost. Experimental results
demonstrate the impressive reconstruction quality of DLF even below 0.01 bits
per pixel (bpp). On the CLIC2020 test set, our method achieves bitrate savings
of up to 27.93% on LPIPS and 53.55% on DISTS compared to MS-ILLM. Furthermore,
DLF surpasses recent diffusion-based codecs in visual fidelity while
maintaining a comparable level of generative realism. Code will be available
later.
♻ ☆ RURANET++: An Unsupervised Learning Method for Diabetic Macular Edema Based on SCSE Attention Mechanisms and Dynamic Multi-Projection Head Clustering MICCAI 2025
Diabetic Macular Edema (DME), a prevalent complication among diabetic
patients, constitutes a major cause of visual impairment and blindness.
Although deep learning has achieved remarkable progress in medical image
analysis, traditional DME diagnosis still relies on extensive annotated data
and subjective ophthalmologist assessments, limiting practical applications. To
address this, we present RURANET++, an unsupervised learning-based automated
DME diagnostic system. This framework incorporates an optimized U-Net
architecture with embedded Spatial and Channel Squeeze & Excitation (SCSE)
attention mechanisms to enhance lesion feature extraction. During feature
processing, a pre-trained GoogLeNet model extracts deep features from retinal
images, followed by PCA-based dimensionality reduction to 50 dimensions for
computational efficiency. Notably, we introduce a novel clustering algorithm
employing multi-projection heads to explicitly control cluster diversity while
dynamically adjusting similarity thresholds, thereby optimizing intra-class
consistency and inter-class discrimination. Experimental results demonstrate
superior performance across multiple metrics, achieving maximum accuracy
(0.8411), precision (0.8593), recall (0.8411), and F1-score (0.8390), with
exceptional clustering quality. This work provides an efficient unsupervised
solution for DME diagnosis with significant clinical implications.
comment: 10 pages, 2 figures, 5 tables, submitted to The 28th International
Conference on Medical Image Computing and Computer Assisted Intervention
(MICCAI 2025)
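The feature-processing stage described above (deep features, PCA to 50
dimensions, then similarity-threshold clustering) can be sketched as follows.
Random vectors stand in for the GoogLeNet embeddings, and the greedy
thresholding below is only a simplified stand-in for the paper's
multi-projection-head clustering with dynamic thresholds.

    # Placeholder pipeline: features -> PCA(50) -> cosine-similarity
    # threshold clustering (a simplified stand-in, not RURANET++ itself).
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import normalize

    features = np.random.randn(200, 1024)          # stand-in for deep features
    reduced = normalize(PCA(n_components=50).fit_transform(features))

    def threshold_cluster(x, sim_threshold=0.6):
        centroids, labels = [], []
        for v in x:
            sims = [float(v @ c) for c in centroids]
            if sims and max(sims) >= sim_threshold:
                labels.append(int(np.argmax(sims)))
            else:
                centroids.append(v)
                labels.append(len(centroids) - 1)
        return np.array(labels)

    labels = threshold_cluster(reduced)
    print("clusters:", labels.max() + 1)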
♻ ☆ CLIP meets DINO for Tuning Zero-Shot Classifier using Unlabeled Image Collections
Mohamed Fazli Imam, Rufael Fedaku Marew, Jameel Hassan, Mustansar Fiaz, Alham Fikri Aji, Hisham Cholakkal
In the era of foundation models, CLIP has emerged as a powerful tool for
aligning text & visual modalities into a common embedding space. However, the
alignment objective used to train CLIP often results in subpar visual features
for fine-grained tasks. In contrast, SSL-pretrained models like DINO excel at
extracting rich visual features due to their specialized training paradigm.
Yet, these SSL models require an additional supervised linear probing step,
which relies on fully labeled data that is often expensive and difficult to
obtain at scale. In this paper, we propose a label-free prompt-tuning method
that leverages the rich visual features of self-supervised learning models
(DINO) and the broad textual knowledge of large language models (LLMs) to
largely enhance CLIP-based image classification performance using unlabeled
images. Our approach unfolds in three key steps: (1) We generate robust textual
feature embeddings that more accurately represent object classes by leveraging
class-specific descriptions from LLMs, enabling more effective zero-shot
classification compared to CLIP's default name-specific prompts. (2) These
textual embeddings are then used to produce pseudo-labels to train an alignment
module that integrates the complementary strengths of LLM description-based
textual embeddings & DINO's visual features. (3) Finally, we prompt-tune CLIP's
vision encoder through DINO-assisted supervision using the trained alignment
module. This three-step process allows us to harness the best of visual &
textual foundation models, resulting in a powerful and efficient approach that
surpasses state-of-the-art label-free classification methods. Notably, our
framework, NoLA (No Labels Attached), achieves an average absolute gain of 3.6%
over the state-of-the-art LaFTer across 11 diverse image classification
datasets. Our code & models can be found at https://github.com/fazliimam/NoLA.
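Step (2) above, producing pseudo-labels from text embeddings, amounts to
assigning each unlabeled image the class whose text embedding is most similar
to its visual feature. The sketch below illustrates that idea with random
tensors standing in for the CLIP/DINO image features and the
LLM-description-based text embeddings; the confidence filter is an added
assumption.

    # Pseudo-labeling by cosine similarity between image features and class
    # text embeddings; random tensors stand in for real model outputs.
    import torch
    import torch.nn.functional as F

    num_classes, dim, num_images = 10, 512, 32
    text_embeds = F.normalize(torch.randn(num_classes, dim), dim=-1)
    image_feats = F.normalize(torch.randn(num_images, dim), dim=-1)

    logits = image_feats @ text_embeds.t()        # cosine similarities
    pseudo_labels = logits.argmax(dim=-1)         # one pseudo-label per image
    confidence = logits.softmax(dim=-1).max(dim=-1).values

    keep = confidence > confidence.median()       # optionally keep confident ones
    print(pseudo_labels[keep].shape)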
♻ ☆ Towards Student Actions in Classroom Scenes: New Dataset and Baseline
Analyzing student actions is an important and challenging task in educational
research. Existing efforts have been hampered by the lack of accessible
datasets to capture the nuanced action dynamics in classrooms. In this paper,
we present a new multi-label Student Action Video (SAV) dataset, specifically
designed for action detection in classroom settings. The SAV dataset consists
of 4,324 carefully trimmed video clips from 758 different classrooms, annotated
with 15 distinct student actions. Compared to existing action detection
datasets, the SAV dataset stands out by providing a wide range of real
classroom scenarios, high-quality video data, and unique challenges, including
subtle movement differences, dense object engagement, significant scale
differences, varied shooting angles, and visual occlusion. These complexities
introduce new opportunities and challenges to advance action detection methods.
To benchmark this, we propose a novel baseline method based on a visual
transformer, designed to enhance attention to key local details within small
and dense object regions. Our method demonstrates excellent performance with a
mean Average Precision (mAP) of 67.9% and 27.4% on the SAV and AVA datasets,
respectively. This paper not only provides the dataset but also calls for
further research into AI-driven educational tools that may transform teaching
methodologies and learning outcomes. The code and dataset are released at
https://github.com/Ritatanz/SAV.
♻ ★ LEDiT: Your Length-Extrapolatable Diffusion Transformer without Positional Encoding
Shen Zhang, Yaning Tan, Siyuan Liang, Zhaowei Chen, Linze Li, Ge Wu, Yuhao Chen, Shuheng Li, Zhenyu Zhao, Caihua Chen, Jiajun Liang, Yao Tang
Diffusion transformers (DiTs) struggle to generate images at resolutions
higher than their training resolutions. The primary obstacle is that the
explicit positional encodings (PE), such as RoPE, need extrapolation, which
degrades performance when the inference resolution differs from training. In
this paper, we propose a Length-Extrapolatable Diffusion Transformer (LEDiT), a
simple yet powerful architecture to overcome this limitation. LEDiT needs no
explicit PEs, thereby avoiding extrapolation. The key innovations of LEDiT are
introducing causal attention to implicitly impart global positional information
to tokens, while enhancing locality to precisely distinguish adjacent tokens.
Experiments on 256x256 and 512x512 ImageNet show that LEDiT can scale the
inference resolution to 512x512 and 1024x1024, respectively, while achieving
better image quality compared to current state-of-the-art length extrapolation
methods (NTK-aware, YaRN). Moreover, LEDiT achieves strong extrapolation
performance with just 100K steps of fine-tuning on a pretrained DiT,
demonstrating its potential for integration into existing text-to-image DiTs.
Project page: https://shenzhang2145.github.io/ledit/
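The core idea above, that a causal mask can stand in for explicit positional
encodings, is sketched below: causal self-attention breaks permutation
symmetry and therefore implicitly encodes token order, and it works for any
token count. This is a generic illustration, not the LEDiT architecture, and
the dimensions are arbitrary.

    # Causal self-attention with no explicit positional encoding; the causal
    # mask alone injects order information. Illustrative shapes only.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CausalSelfAttention(nn.Module):
        def __init__(self, dim, num_heads=4):
            super().__init__()
            self.num_heads = num_heads
            self.qkv = nn.Linear(dim, dim * 3)
            self.proj = nn.Linear(dim, dim)

        def forward(self, x):                      # x: (batch, tokens, dim)
            b, n, d = x.shape
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            split = lambda t: t.view(b, n, self.num_heads, d // self.num_heads).transpose(1, 2)
            out = F.scaled_dot_product_attention(split(q), split(k), split(v),
                                                 is_causal=True)
            return self.proj(out.transpose(1, 2).reshape(b, n, d))

    attn = CausalSelfAttention(dim=64)
    print(attn(torch.randn(2, 256, 64)).shape)     # any token count works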
♻ ☆ LRSAA: Large-scale Remote Sensing Image Target Recognition and Automatic Annotation
This paper presents LRSAA, a method for object recognition and automatic
labeling in large-area remote sensing images. The method integrates YOLOv11
and MobileNetV3-SSD object detection algorithms through ensemble learning to
enhance model performance. Furthermore, it employs Poisson disk sampling
segmentation techniques and the EIOU metric to optimize the training and
inference processes of segmented images, followed by the integration of
results. This approach not only reduces the demand for computational resources
but also achieves a good balance between accuracy and speed. The source code
for this project has been made publicly available on
https://github.com/anaerovane/LRSAA.
comment: arXiv admin note: text overlap with arXiv:2411.07802
♻ ★ Raccoon: Multi-stage Diffusion Training with Coarse-to-Fine Curating Videos
Text-to-video generation has demonstrated promising progress with the advent
of diffusion models, yet existing approaches are limited by dataset quality and
computational resources. To address these limitations, this paper presents a
comprehensive approach that advances both data curation and model design. We
introduce CFC-VIDS-1M, a high-quality video dataset constructed through a
systematic coarse-to-fine curation pipeline. The pipeline first evaluates video
quality across multiple dimensions, followed by a fine-grained stage that
leverages vision-language models to enhance text-video alignment and semantic
richness. Building upon the curated dataset's emphasis on visual quality and
temporal coherence, we develop RACCOON, a transformer-based architecture with
decoupled spatial-temporal attention mechanisms. The model is trained through a
progressive four-stage strategy designed to efficiently handle the complexities
of video generation. Extensive experiments demonstrate that our integrated
approach of high-quality data curation and efficient training strategy
generates visually appealing and temporally coherent videos while maintaining
computational efficiency. We will release our dataset, code, and models.
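Decoupled spatial-temporal attention, as used in the architecture above, can be
pictured as one attention pass within each frame followed by one attention pass
across frames at each spatial location. The PyTorch sketch below shows that
factorization; the layer sizes and residual wiring are assumptions, not
RACCOON's actual design.

    # Decoupled attention: spatial attention within frames, then temporal
    # attention across frames. Shapes and wiring are illustrative.
    import torch
    import torch.nn as nn

    class DecoupledSpatioTemporalBlock(nn.Module):
        def __init__(self, dim, num_heads=4):
            super().__init__()
            self.spatial = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.temporal = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        def forward(self, x):                      # x: (batch, frames, tokens, dim)
            b, t, n, d = x.shape
            s = x.reshape(b * t, n, d)             # tokens within each frame
            s = s + self.spatial(s, s, s, need_weights=False)[0]
            s = s.reshape(b, t, n, d).permute(0, 2, 1, 3).reshape(b * n, t, d)
            s = s + self.temporal(s, s, s, need_weights=False)[0]   # across frames
            return s.reshape(b, n, t, d).permute(0, 2, 1, 3)

    block = DecoupledSpatioTemporalBlock(dim=64)
    print(block(torch.randn(2, 8, 16, 64)).shape)  # torch.Size([2, 8, 16, 64])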
♻ ☆ CAT-3DGS: A Context-Adaptive Triplane Approach to Rate-Distortion-Optimized 3DGS Compression ICLR
3D Gaussian Splatting (3DGS) has recently emerged as a promising 3D
representation. Much research has been focused on reducing its storage
requirements and memory footprint. However, the need to compress and transmit
the 3DGS representation to the remote side has been overlooked. This new application
calls for rate-distortion-optimized 3DGS compression. How to quantize and
entropy encode sparse Gaussian primitives in the 3D space remains largely
unexplored. A few early attempts resort to the hyperprior framework from
learned image compression, but they fail to fully utilize the inter- and
intra-correlation inherent in Gaussian primitives. Built on ScaffoldGS, this work,
termed CAT-3DGS, introduces a context-adaptive triplane approach to their
rate-distortion-optimized coding. It features multi-scale triplanes, oriented
according to the principal axes of Gaussian primitives in the 3D space, to
capture their inter correlation (i.e. spatial correlation) for spatial
autoregressive coding in the projected 2D planes. With these triplanes serving
as the hyperprior, we further perform channel-wise autoregressive coding to
leverage the intra correlation within each individual Gaussian primitive. Our
CAT-3DGS incorporates a view frequency-aware masking mechanism. It actively
skips from coding those Gaussian primitives that potentially have little impact
on the rendering quality. When trained end-to-end to strike a good
rate-distortion trade-off, our CAT-3DGS achieves the state-of-the-art
compression performance on the commonly used real-world datasets.
comment: Accepted for Publication in International Conference on Learning
Representations (ICLR)
♻ ☆ NavRAG: Generating User Demand Instructions for Embodied Navigation through Retrieval-Augmented LLM
Vision-and-Language Navigation (VLN) is an essential skill for embodied
agents, allowing them to navigate in 3D environments following natural language
instructions. High-performance navigation models require a large amount of
training data, and the high cost of manual annotation has seriously hindered
this field. Therefore, some previous methods translate trajectory videos into
step-by-step instructions for expanding data, but such instructions do not
match well with users' communication styles that briefly describe destinations
or state specific needs. Moreover, local navigation trajectories overlook
global context and high-level task planning. To address these issues, we
propose NavRAG, a retrieval-augmented generation (RAG) framework that generates
user demand instructions for VLN. NavRAG leverages an LLM to build a hierarchical
scene description tree for 3D scene understanding from global layout to local
details, then simulates various user roles with specific demands to retrieve
from the scene tree, generating diverse instructions with the LLM. We annotate over
2 million navigation instructions across 861 scenes and evaluate the data
quality and navigation performance of trained models.
♻ ☆ Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction
Generalized feed-forward Gaussian models have achieved significant progress
in sparse-view 3D reconstruction by leveraging prior knowledge from large
multi-view datasets. However, these models often struggle to represent
high-frequency details due to the limited number of Gaussians. While the
densification strategy used in per-scene 3D Gaussian splatting (3D-GS)
optimization can be adapted to the feed-forward models, it may not be ideally
suited for generalized scenarios. In this paper, we propose Generative
Densification, an efficient and generalizable method to densify Gaussians
generated by feed-forward models. Unlike the 3D-GS densification strategy,
which iteratively splits and clones raw Gaussian parameters, our method
up-samples feature representations from the feed-forward models and generates
their corresponding fine Gaussians in a single forward pass, leveraging the
embedded prior knowledge for enhanced generalization. Experimental results on
both object-level and scene-level reconstruction tasks demonstrate that our
method outperforms state-of-the-art approaches with comparable or smaller model
sizes, achieving notable improvements in representing fine details.
comment: Project page: https://stnamjef.github.io/GenerativeDensification/
♻ ☆ Neighboring Slice Noise2Noise: Self-Supervised Medical Image Denoising from Single Noisy Image Volume
In the last few years, with the rapid development of deep learning
technologies, supervised methods based on convolutional neural networks have
greatly enhanced the performance of medical image denoising. However, these
methods require large quantities of noisy-clean image pairs for training, which
greatly limits their practicality. Although some researchers have attempted to
train denoising networks using only single noisy images, existing
self-supervised methods, including blind-spot-based and data-splitting-based
methods, heavily rely on the assumption that noise is pixel-wise independent.
However, this assumption often does not hold in real-world medical images.
Therefore, in the field of medical imaging, there remains a lack of simple and
practical denoising methods that can achieve high-quality denoising performance
using only single noisy images. In this paper, we propose a novel
self-supervised medical image denoising method, Neighboring Slice Noise2Noise
(NS-N2N). The proposed method utilizes neighboring slices within a single noisy
image volume to construct weighted training data, and then trains the denoising
network using a self-supervised scheme with regional consistency loss and
inter-slice continuity loss. NS-N2N only requires a single noisy image volume
obtained from one medical imaging procedure to achieve high-quality denoising
of the image volume itself. Extensive experiments demonstrate that the proposed
method outperforms state-of-the-art self-supervised denoising methods in both
denoising performance and processing efficiency. Furthermore, since NS-N2N
operates solely in the image domain, it is free from device-specific issues
such as reconstruction geometry, making it easier to apply in various clinical
practices.
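The pairing idea above, using neighboring slices of one noisy volume as each
other's training targets, can be sketched in a few lines. The weighting scheme
and the regional consistency and inter-slice continuity losses from the paper
are omitted; this only shows how Noise2Noise-style pairs could be formed from a
single volume.

    # Build Noise2Noise-style pairs from neighboring slices of one noisy
    # volume: slice i is the input and slice i+1 the target, and vice versa.
    import numpy as np

    def neighboring_slice_pairs(volume):
        """volume: (num_slices, H, W) noisy image volume."""
        inputs, targets = [], []
        for i in range(volume.shape[0] - 1):
            inputs.append(volume[i]);     targets.append(volume[i + 1])
            inputs.append(volume[i + 1]); targets.append(volume[i])
        return np.stack(inputs), np.stack(targets)

    noisy_volume = np.random.rand(32, 128, 128).astype(np.float32)
    x, y = neighboring_slice_pairs(noisy_volume)
    print(x.shape, y.shape)   # (62, 128, 128) (62, 128, 128)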
♻ ★ Meta Curvature-Aware Minimization for Domain Generalization
Domain generalization (DG) aims to enhance the ability of models trained on
source domains to generalize effectively to unseen domains. Recently,
Sharpness-Aware Minimization (SAM) has shown promise in this area by reducing
the sharpness of the loss landscape to obtain more generalized models. However,
SAM and its variants sometimes fail to guide the model toward a flat minimum,
and their training processes exhibit limitations, hindering further
improvements in model generalization. In this paper, we first propose an
improved model training process aimed at encouraging the model to converge to
a flat minimum. To achieve this, we design a curvature metric that has a minimal
effect when the model is far from convergence but becomes increasingly
influential in indicating the curvature of the minima as the model approaches a
local minimum. Then we derive a novel algorithm from this metric, called Meta
Curvature-Aware Minimization (MeCAM), to minimize the curvature around the
local minima. Specifically, the optimization objective of MeCAM simultaneously
minimizes the regular training loss, the surrogate gap of SAM, and the
surrogate gap of meta-learning. We provide theoretical analysis on MeCAM's
generalization error and convergence rate, and demonstrate its superiority over
existing DG methods through extensive experiments on five benchmark DG
datasets, including PACS, VLCS, OfficeHome, TerraIncognita, and DomainNet. Code
will be available on GitHub.
comment: 22 pages, 5 figures, 16 tables
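Of the three terms in the objective above, the SAM surrogate gap is the most
standard: the loss at an adversarially perturbed weight point minus the loss at
the current weights. The sketch below computes that gap for a toy model; the
meta-learning surrogate gap and MeCAM's curvature metric are not reproduced
here.

    # SAM-style surrogate gap: perturb weights along the normalized gradient
    # by radius rho, evaluate the loss, restore the weights, and subtract the
    # unperturbed loss. Toy model and data; illustrative only.
    import torch
    import torch.nn as nn

    def sam_surrogate_gap(model, loss_fn, x, y, rho=0.05):
        loss = loss_fn(model(x), y)
        grads = torch.autograd.grad(loss, list(model.parameters()))
        grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
        with torch.no_grad():
            eps = [rho * g / grad_norm for g in grads]
            for p, e in zip(model.parameters(), eps):
                p.add_(e)                      # ascend to the perturbed point
            perturbed_loss = loss_fn(model(x), y)
            for p, e in zip(model.parameters(), eps):
                p.sub_(e)                      # restore original weights
        return perturbed_loss - loss.detach()

    model = nn.Linear(10, 2)
    x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))
    print(sam_surrogate_gap(model, nn.CrossEntropyLoss(), x, y))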
♻ ☆ Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V ICRA
Peiyuan Zhi, Zhiyuan Zhang, Yu Zhao, Muzhi Han, Zeyu Zhang, Zhitian Li, Ziyuan Jiao, Baoxiong Jia, Siyuan Huang
Autonomous robot navigation and manipulation in open environments require
reasoning and replanning with closed-loop feedback. In this work, we present
COME-robot, the first closed-loop robotic system utilizing the GPT-4V
vision-language foundation model for open-ended reasoning and adaptive planning
in real-world scenarios. COME-robot incorporates two key innovative modules: (i)
a multi-level open-vocabulary perception and situated reasoning module that
enables effective exploration of the 3D environment and target object
identification using commonsense knowledge and situated information, and (ii)
an iterative closed-loop feedback and restoration mechanism that verifies task
feasibility, monitors execution success, and traces failure causes across
different modules for robust failure recovery. Through comprehensive
experiments involving 8 challenging real-world mobile and tabletop manipulation
tasks, COME-robot demonstrates a significant improvement in task success rate
(~35%) compared to state-of-the-art methods. We further conduct comprehensive
analyses to elucidate how COME-robot's design facilitates failure recovery,
free-form instruction following, and long-horizon task planning.
comment: 6 pages, Accepted at 2025 IEEE ICRA, website:
https://come-robot.github.io/
♻ ☆ FoundationStereo: Zero-Shot Stereo Matching CVPR 2025
Tremendous progress has been made in deep stereo matching to excel on
benchmark datasets through per-domain fine-tuning. However, achieving strong
zero-shot generalization - a hallmark of foundation models in other computer
vision tasks - remains challenging for stereo matching. We introduce
FoundationStereo, a foundation model for stereo depth estimation designed to
achieve strong zero-shot generalization. To this end, we first construct a
large-scale (1M stereo pairs) synthetic training dataset featuring large
diversity and high photorealism, followed by an automatic self-curation
pipeline to remove ambiguous samples. We then design a number of network
architecture components to enhance scalability, including a side-tuning feature
backbone that adapts rich monocular priors from vision foundation models to
mitigate the sim-to-real gap, and long-range context reasoning for effective
cost volume filtering. Together, these components lead to strong robustness and
accuracy across domains, establishing a new standard in zero-shot stereo depth
estimation. Project page: https://nvlabs.github.io/FoundationStereo/
comment: CVPR 2025
♻ ☆ Rethinking Pre-Trained Feature Extractor Selection in Multiple Instance Learning for Whole Slide Image Classification
Multiple instance learning (MIL) has become a preferred method for gigapixel
whole slide image (WSI) classification without requiring patch-level
annotations. Current MIL research primarily relies on embedding-based
approaches, which extract patch features using a pre-trained feature extractor
and aggregate them for slide-level prediction. Despite the critical role of
feature extraction, there is limited guidance on selecting optimal feature
extractors to maximize WSI performance. This study addresses this gap by
systematically evaluating MIL feature extractors across three dimensions:
pre-training dataset, backbone model, and pre-training method. Extensive
experiments were conducted on two public WSI datasets (TCGA-NSCLC and
Camelyon16) using four state-of-the-art (SOTA) MIL models. Our findings reveal
that: 1) selecting a robust self-supervised learning (SSL) method has a greater
impact on performance than relying solely on an in-domain pre-training dataset;
2) Transformer-based backbones with deeper architectures should be prioritized
over CNN-based models; and 3) using larger, more diverse pre-training datasets
significantly enhances classification outcomes. We hope that these insights can
provide practical guidance for optimizing WSI classification and explain the
reasons behind the performance advantages of the current SOTA pathology
foundation models. Furthermore, this work may inform the development of more
effective pathology foundation models. Our code is publicly available at
https://github.com/bryanwong17/MIL-Feature-Extractor-Selection
comment: Accepted to IEEE International Symposium on Biomedical Imaging (ISBI)
2025
♻ ☆ FastTrackTr: Towards Fast Multi-Object Tracking with Transformers
Transformer-based multi-object tracking (MOT) methods have captured the
attention of many researchers in recent years. However, these models often
suffer from slow inference speeds due to their structure or other issues. To
address this problem, we revisited the Joint Detection and Tracking (JDT)
method by looking back at past approaches. By integrating the original JDT
approach with some advanced theories, this paper employs an efficient method of
information transfer between frames on the DETR, constructing a fast and novel
JDT-type MOT framework: FastTrackTr. Thanks to the superiority of this
information transfer method, our approach not only reduces the number of
queries required during tracking but also avoids the excessive introduction of
network structures, ensuring model simplicity. Experimental results indicate
that our method has the potential to achieve real-time tracking and exhibits
competitive tracking accuracy across multiple datasets.
♻ ☆ FloNa: Floor Plan Guided Embodied Visual Navigation AAAI 2025
Humans naturally rely on floor plans to navigate in unfamiliar environments,
as they are readily available, reliable, and provide rich geometrical guidance.
However, existing visual navigation settings overlook this valuable prior
knowledge, leading to limited efficiency and accuracy. To bridge this gap, we
introduce a novel navigation task, Floor Plan Visual Navigation (FloNa), the
first attempt to incorporate floor plans into embodied visual navigation. While
the floor plan offers significant advantages, two key challenges emerge: (1)
handling the spatial inconsistency between the floor plan and the actual scene
layout for collision-free navigation, and (2) aligning observed images with the
floor plan sketch despite their distinct modalities. To address these
challenges, we propose FloDiff, a novel diffusion policy framework
incorporating a localization module to facilitate alignment between the current
observation and the floor plan. We further collect $20k$ navigation episodes
across $117$ scenes in the iGibson simulator to support the training and
evaluation. Extensive experiments demonstrate the effectiveness and efficiency
of our framework in unfamiliar scenes using floor plan knowledge. Project
website: https://gauleejx.github.io/flona/.
comment: Accepted by AAAI 2025
♻ ☆ BuildingView: Constructing Urban Building Exteriors Databases with Street View Imagery and Multimodal Large Language Models
Urban Building Exteriors are increasingly important in urban analytics,
driven by advancements in Street View Imagery and its integration with urban
research. Multimodal Large Language Models (LLMs) offer powerful tools for
urban annotation, enabling deeper insights into urban environments. However,
challenges remain in creating accurate and detailed urban building exterior
databases, identifying critical indicators for energy efficiency, environmental
sustainability, and human-centric design, and systematically organizing these
indicators. To address these challenges, we propose BuildingView, a novel
approach that integrates high-resolution visual data from Google Street View
with spatial information from OpenStreetMap via the Overpass API. This research
improves the accuracy of urban building exterior data, identifies key
sustainability and design indicators, and develops a framework for their
extraction and categorization. Our methodology includes a systematic literature
review, building and Street View sampling, and annotation using the ChatGPT-4O
API. The resulting database, validated with data from New York City, Amsterdam,
and Singapore, provides a comprehensive tool for urban studies, supporting
informed decision-making in urban planning, architectural design, and
environmental policy. The code for BuildingView is available at
https://github.com/Jasper0122/BuildingView.
comment: 15 pages, 6 figures
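The spatial half of the pipeline above, pulling building footprints from
OpenStreetMap through the Overpass API, can be reproduced with a short query.
The bounding box below is arbitrary, and the Street View sampling and MLLM
annotation steps are not shown; this is an illustrative request, not the
BuildingView code.

    # Query OpenStreetMap building footprints via the public Overpass API.
    import requests

    OVERPASS_URL = "https://overpass-api.de/api/interpreter"
    query = """
    [out:json][timeout:60];
    way["building"](40.748, -73.990, 40.752, -73.984);  // small Manhattan bbox
    out center;
    """

    response = requests.post(OVERPASS_URL, data={"data": query}, timeout=90)
    response.raise_for_status()
    for b in response.json()["elements"][:5]:
        print(b.get("center"), b["tags"].get("building"), b["tags"].get("height"))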
♻ ☆ InstaFace: Identity-Preserving Facial Editing with Single Image Inference
Facial appearance editing is crucial for digital avatars, AR/VR, and
personalized content creation, driving realistic user experiences. However,
preserving identity with generative models is challenging, especially in
scenarios with limited data availability. Traditional methods often require
multiple images and still struggle with unnatural face shifts, inconsistent
hair alignment, or excessive smoothing effects. To overcome these challenges,
we introduce a novel diffusion-based framework, InstaFace, to generate
realistic images while preserving identity using only a single image. Central
to InstaFace, we introduce an efficient guidance network that harnesses 3D
perspectives by integrating multiple 3DMM-based conditionals without
introducing additional trainable parameters. Moreover, to ensure maximum
identity retention as well as preservation of background, hair, and other
contextual features like accessories, we introduce a novel module that utilizes
feature embeddings from a facial recognition model and a pre-trained
vision-language model. Quantitative evaluations demonstrate that our method
outperforms several state-of-the-art approaches in terms of identity
preservation, photorealism, and effective control of pose, expression, and
lighting.
♻ ☆ VISION-XL: High Definition Video Inverse Problem Solver using Latent Image Diffusion Models
In this paper, we propose a novel framework for solving high-definition video
inverse problems using latent image diffusion models. Building on recent
advancements in spatio-temporal optimization for video inverse problems using
image diffusion models, our approach leverages latent-space diffusion models to
achieve enhanced video quality and resolution. To address the high
computational demands of processing high-resolution frames, we introduce a
pseudo-batch consistent sampling strategy, allowing efficient operation on a
single GPU. Additionally, to improve temporal consistency, we present
pseudo-batch inversion, an initialization technique that incorporates
informative latents from the measurement. By integrating with SDXL, our
framework achieves state-of-the-art video reconstruction across a wide range of
spatio-temporal inverse problems, including complex combinations of frame
averaging and various spatial degradations, such as deblurring,
super-resolution, and inpainting. Unlike previous methods, our approach
supports multiple aspect ratios (landscape, vertical, and square) and delivers
HD-resolution reconstructions (exceeding 1280x720) in under 6 seconds per frame
on a single NVIDIA 4090 GPU.
comment: Project page: https://vision-xl.github.io/
♻ ☆ Diff-Reg v2: Diffusion-Based Matching Matrix Estimation for Image Matching and 3D Registration
Establishing reliable correspondences is crucial for all registration tasks,
including 2D image registration, 3D point cloud registration, and 2D-3D
image-to-point cloud registration. However, these tasks are often complicated
by challenges such as scale inconsistencies, symmetry, and large deformations,
which can lead to ambiguous matches. Previous feature-based and
correspondence-based methods typically rely on geometric or semantic features
to generate or polish initial potential correspondences. Some methods leverage
specific geometric priors, such as topological preservation, to devise
strategies tailored to a given enhancement goal, but such priors cannot be
exhaustively enumerated. Additionally, many previous approaches rely
on a single-step prediction head, which can struggle with local minima in
complex matching scenarios. To address these challenges, we introduce an
innovative paradigm that leverages a diffusion model in matrix space for robust
matching matrix estimation. Our model treats correspondence estimation as a
denoising diffusion process in the matching matrix space, gradually refining
the intermediate matching matrix to the optimal one. Specifically, we apply the
diffusion model in the doubly stochastic matrix space for 3D-3D and 2D-3D
registration tasks. In the 2D image registration task, we deploy the diffusion
model in a matrix subspace where dual-softmax projection regularization is
applied. For all three registration tasks, we provide adaptive matching matrix
embedding implementations tailored to the specific characteristics of each task
while maintaining a consistent "match-to-warp" encoding pattern. Furthermore,
we adopt a lightweight design for the denoising module. In inference, once
points or image features are extracted and fixed, this module performs
multi-step denoising predictions through reverse sampling.
comment: arXiv admin note: text overlap with arXiv:2403.19919
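The doubly stochastic matrix space mentioned above is usually reached by
Sinkhorn normalization: alternately rescaling rows and columns of a positive
score matrix until both sum to one. The sketch below shows that standard
projection as background; it is a generic tool, not the paper's exact operator
or its dual-softmax variant.

    # Sinkhorn normalization toward a doubly stochastic matrix.
    import torch

    def sinkhorn(scores, num_iters=20, eps=1e-8):
        m = scores.exp()                                  # ensure positivity
        for _ in range(num_iters):
            m = m / (m.sum(dim=1, keepdim=True) + eps)    # row normalization
            m = m / (m.sum(dim=0, keepdim=True) + eps)    # column normalization
        return m

    match = sinkhorn(torch.randn(5, 5))
    print(match.sum(dim=0))   # each column sums to ~1
    print(match.sum(dim=1))   # each row sums to ~1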
♻ ☆ Attention Mechanism based Cognition-level Scene Understanding
Given a question-image input, a Visual Commonsense Reasoning (VCR) model
predicts an answer along with the corresponding rationale, which requires
real-world inference ability. The VCR task, which calls for exploiting
the multi-source information as well as learning different levels of
understanding and extensive commonsense knowledge, is a cognition-level scene
understanding task. The VCR task has aroused researchers' interest due to its
wide range of applications, including visual question answering, automated
vehicle systems, and clinical decision support. Previous approaches to solving
the VCR task generally rely on pre-training or exploiting memory with long
dependency relationship encoded models. However, these approaches suffer from a
lack of generalizability and losing information in long sequences. In this
paper, we propose a parallel attention-based cognitive VCR network PAVCR, which
fuses visual-textual information efficiently and encodes semantic information
in parallel to enable the model to capture rich information for cognition-level
inference. Extensive experiments show that the proposed model yields
significant improvements over existing methods on the benchmark VCR dataset.
Moreover, the proposed model provides intuitive interpretations of visual
commonsense reasoning.
comment: Published in Information
♻ ☆ M2Distill: Multi-Modal Distillation for Lifelong Imitation Learning ICRA 2025
Lifelong imitation learning for manipulation tasks poses significant
challenges due to distribution shifts that occur in incremental learning steps.
Existing methods often focus on unsupervised skill discovery to construct an
ever-growing skill library or distillation from multiple policies, which can
lead to scalability issues as diverse manipulation tasks are continually
introduced and may fail to ensure a consistent latent space throughout the
learning process, leading to catastrophic forgetting of previously learned
skills. In this paper, we introduce M2Distill, a multi-modal distillation-based
method for lifelong imitation learning focusing on preserving consistent latent
space across vision, language, and action distributions throughout the learning
process. By regulating the shifts in latent representations across different
modalities from previous to current steps, and reducing discrepancies in
Gaussian Mixture Model (GMM) policies between consecutive learning steps, we
ensure that the learned policy retains its ability to perform previously
learned tasks while seamlessly integrating new skills. Extensive evaluations on
the LIBERO lifelong imitation learning benchmark suites, including
LIBERO-OBJECT, LIBERO-GOAL, and LIBERO-SPATIAL, demonstrate that our method
consistently outperforms prior state-of-the-art methods across all evaluated
metrics.
comment: IEEE ICRA 2025
♻ ☆ Jointly Understand Your Command and Intention: Reciprocal Co-Evolution between Scene-Aware 3D Human Motion Synthesis and Analysis
As two intimate reciprocal tasks, scene-aware human motion synthesis and
analysis require a joint understanding between multiple modalities, including
3D body motions, 3D scenes, and textual descriptions. In this paper, we
integrate these two paired processes into a Co-Evolving Synthesis-Analysis
(CESA) pipeline and mutually benefit their learning. Specifically, scene-aware
text-to-human synthesis generates diverse indoor motion samples from the same
textual description to enrich human-scene interaction intra-class diversity,
thus significantly benefiting training a robust human motion analysis system.
Reciprocally, human motion analysis would enforce semantic scrutiny on each
synthesized motion sample to ensure its semantic consistency with the given
textual description, thus improving realistic motion synthesis. Considering
that real-world indoor human motions are goal-oriented and path-guided, we
propose a cascaded generation strategy that factorizes text-driven
scene-specific human motion generation into three stages: goal inferring, path
planning, and pose synthesizing. Coupling CESA with this powerful cascaded
motion synthesis model, we jointly improve realistic human motion synthesis and
robust human motion analysis in 3D scenes.
♻ ☆ Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery ICLR 2025
Guankun Wang, Long Bai, Wan Jun Nah, Jie Wang, Zhaoxi Zhang, Zhen Chen, Jinlin Wu, Mobarakol Islam, Hongbin Liu, Hongliang Ren
Recent advancements in Surgical Visual Question Answering (Surgical-VQA) and
related region grounding have shown great promise for robotic and medical
applications, addressing the critical need for automated methods in
personalized surgical mentorship. However, existing models primarily provide
simple structured answers and struggle with complex scenarios due to their
limited capability in recognizing long-range dependencies and aligning
multimodal information. In this paper, we introduce Surgical-LVLM, a novel
personalized large vision-language model tailored for complex surgical
scenarios. Leveraging the pre-trained large vision-language model and
specialized Visual Perception LoRA (VP-LoRA) blocks, our model excels in
understanding complex visual-language tasks within surgical contexts. In
addressing the visual grounding task, we propose the Token-Interaction (TIT)
module, which strengthens the interaction between the grounding module and the
language responses of the Large Visual Language Model (LVLM) after projecting
them into the latent space. We demonstrate the effectiveness of Surgical-LVLM
on several benchmarks, including EndoVis-17-VQLA, EndoVis-18-VQLA, and a newly
introduced EndoVis Conversations dataset, which sets new performance standards.
Our work contributes to advancing the field of automated surgical mentorship by
providing a context-aware solution.
comment: The manuscript is accepted by ICLR 2025 FM-Wild Workshop
♻ ☆ Real-Time Convolutional Neural Network-Based Star Detection and Centroiding Method for CubeSat Star Tracker
Star trackers are one of the most accurate celestial sensors used for
absolute attitude determination. The devices detect stars in captured images
and accurately compute their projected centroids on an imaging focal plane with
subpixel precision. Traditional algorithms for star detection and centroiding
often rely on threshold adjustments for star pixel detection and pixel
brightness weighting for centroid computation. However, challenges like high
sensor noise and stray light can compromise algorithm performance. This article
introduces a Convolutional Neural Network (CNN)-based approach for star
detection and centroiding, tailored to address the issues posed by noisy star
tracker images in the presence of stray light and other artifacts. Trained
using simulated star images overlaid with real sensor noise and stray light,
the CNN produces both a binary segmentation map distinguishing star pixels from
the background and a distance map indicating each pixel's proximity to the
nearest star centroid. Leveraging this distance information alongside pixel
coordinates transforms centroid calculations into a set of trilateration
problems solvable via the least squares method. Our method employs efficient
UNet variants for the underlying CNN architectures, and the variants'
performances are evaluated. Comprehensive testing has been undertaken with
synthetic image evaluations, hardware-in-the-loop assessments, and night sky
tests. The tests consistently demonstrated that our method outperforms several
existing algorithms in centroiding accuracy and exhibits superior resilience to
high sensor noise and stray light interference. An additional benefit of our
algorithms is that they can be executed in real-time on low-power edge AI
processors.
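The trilateration step described above has a simple closed form: each star
pixel i with coordinates (x_i, y_i) and predicted distance d_i to the nearest
centroid contributes a circle equation (x - x_i)^2 + (y - y_i)^2 = d_i^2, and
subtracting the first equation from the rest yields a linear system solvable by
least squares. The sketch below uses synthetic pixels and noiseless distances
rather than CNN outputs.

    # Centroid recovery by linearized trilateration and least squares.
    import numpy as np

    def trilaterate_centroid(pixels, distances):
        """pixels: (N, 2) array of (x, y); distances: (N,) to the centroid."""
        x0, y0 = pixels[0]
        d0 = distances[0]
        A = 2.0 * (pixels[1:] - pixels[0])        # rows: [2(xi - x0), 2(yi - y0)]
        b = (d0**2 - distances[1:]**2
             + pixels[1:, 0]**2 + pixels[1:, 1]**2 - x0**2 - y0**2)
        centroid, *_ = np.linalg.lstsq(A, b, rcond=None)
        return centroid

    true_centroid = np.array([10.3, 7.8])
    pixels = np.array([[9, 7], [10, 8], [11, 7], [10, 9], [12, 8]], dtype=float)
    distances = np.linalg.norm(pixels - true_centroid, axis=1)
    print(trilaterate_centroid(pixels, distances))   # ~[10.3, 7.8]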
♻ ☆ RowDetr: End-to-End Row Detection Using Polynomials
Crop row detection is essential for enabling autonomous navigation in
GPS-denied environments, such as under-canopy agricultural settings.
Traditional methods often struggle with occlusions, variable lighting
conditions, and the structural variability of crop rows. To address these
challenges, RowDetr, a novel end-to-end neural network architecture, is
introduced for robust and efficient row detection. A new dataset of
approximately 6,900 images is curated, capturing a diverse range of real-world
agricultural conditions, including occluded rows, uneven terrain, and varying
crop densities. Unlike previous approaches, RowDetr leverages smooth polynomial
functions to precisely delineate crop boundaries in the image space, ensuring a
more structured and interpretable representation of row geometry. A key
innovation of this approach is PolyOptLoss, a novel energy-based loss function
designed to enhance learning robustness, even in the presence of noisy or
imperfect labels. This loss function significantly improves model stability and
generalization by optimizing polynomial curve fitting directly in image space.
Extensive experiments demonstrate that RowDetr significantly outperforms
existing frameworks, including Agronav and RowColAttention, across key
performance metrics. Additionally, RowDetr achieves a sixfold speedup over
Agronav, making it highly suitable for real-time deployment on
resource-constrained edge devices. To facilitate better comparisons across
future studies, lane detection metrics from autonomous driving research are
adapted, providing a more standardized and meaningful evaluation framework for
crop row detection. This work establishes a new benchmark in under-canopy crop
row detection.
comment: Code will be open sourced upon publication
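Representing a crop row as a smooth polynomial in image space, as described
above, reduces at its simplest to polynomial regression: fit x = p(y) to
labeled boundary points by least squares. The sketch below shows that baseline
idea with synthetic points; it is ordinary polynomial fitting, not the paper's
PolyOptLoss or detection head.

    # Fit a crop-row boundary as a quadratic x = p(y) in image coordinates.
    import numpy as np

    rng = np.random.default_rng(0)
    y = np.linspace(0, 479, 60)                      # image rows (pixels)
    true_coeffs = np.array([1.2e-4, -0.15, 320.0])   # gentle curve
    x = np.polyval(true_coeffs, y) + rng.normal(0, 2.0, y.shape)

    fitted = np.polyfit(y, x, deg=2)                 # least-squares fit
    pred_x = np.polyval(fitted, y)
    print("fitted coeffs:", np.round(fitted, 5))
    print("mean abs error (px):", np.abs(pred_x - x).mean())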