DC-VLAQ: Query–Residual Aggregation for Robust Visual Place Recognition

1 BCCITA Provincial Key Laboratory, Hangzhou Dianzi University, China; 2 TopXGun Robotics, China; 3 ICT State Key Laboratory, Zhejiang University, China; 4 I3A, University of Zaragoza, Spain
* Equal contribution   Corresponding author

(Top) Most existing VPR methods rely on a single VFM to extract local tokens, ignoring complementary cues across different models. (Middle) Naive fusion of heterogeneous VFM tokens degrades performance due to distorted feature distributions and unstable retrieval geometry. (Bottom) Our DC-VLAQ introduces Residual-Guided Complementary Fusion to preserve the original token distribution while injecting complementary information, and Query-Residual Global Aggregation to produce stable and discriminative global descriptors.

Project overview figure

Abstract

One of the central challenges in visual place recognition (VPR) is learning a robust global representation that remains discriminative under large viewpoint changes, illumination variations, and severe domain shifts. While visual foundation models (VFMs) provide strong local features, most existing methods rely on a single model, overlooking the complementary cues offered by different VFMs. Exploiting such complementary information, however, inevitably alters token distributions, which undermines the stability of existing query-based global aggregation schemes. To address these challenges, we propose DC-VLAQ, a representation-centric framework that integrates complementary VFM fusion with robust global aggregation. Specifically, we first introduce a lightweight residual-guided complementary fusion that anchors representations in the DINOv2 feature space while injecting complementary semantics from CLIP through a learned residual correction. In addition, we propose the Vector of Local Aggregated Queries (VLAQ), a query-residual global aggregation scheme that encodes local tokens by their residual responses to learnable queries, improving stability while preserving fine-grained discriminative cues. Extensive experiments on standard VPR benchmarks, including Pitts30k, Tokyo24/7, MSLS, Nordland, SPED, and AmsterTime, show that DC-VLAQ consistently outperforms strong baselines and achieves state-of-the-art performance, particularly under challenging domain shifts and long-term appearance changes.
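The core idea behind VLAQ, encoding local tokens by their residual responses to a set of learnable queries, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the scaled dot-product softmax assignment, the exact residual form, and the final L2 normalization below are assumptions in the spirit of NetVLAD-style residual aggregation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def query_residual_aggregate(tokens, queries):
    """Aggregate local tokens (N, D) against queries (K, D) into a
    (K*D,) global descriptor via assignment-weighted residuals."""
    n, d = tokens.shape
    # Soft-assign each token to the queries by scaled dot-product similarity.
    attn = softmax(tokens @ queries.T / np.sqrt(d), axis=1)        # (N, K)
    # Per-query sum of assignment-weighted residuals: sum_i a_ik (x_i - q_k).
    resid = attn.T @ tokens - attn.sum(axis=0)[:, None] * queries  # (K, D)
    # Flatten and L2-normalize to obtain the retrieval descriptor.
    desc = resid.reshape(-1)
    return desc / (np.linalg.norm(desc) + 1e-12)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(256, 64))   # stand-in for fused patch tokens
queries = rng.normal(size=(8, 64))    # learnable queries (fixed here)
g = query_residual_aggregate(tokens, queries)
print(g.shape)  # (512,)
```

Concatenating per-query residuals keeps the token-to-query geometry that plain pooling discards, which is one way to read the stability and fine-grained-cue claims in the abstract.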

Overview of the pipeline

DC-VLAQ pipeline

An input image is first encoded by DINOv2 and CLIP to extract complementary local features. Then, a residual-guided fusion module injects semantic information from CLIP as residual corrections anchored to DINOv2 features. Finally, the fused tokens are aggregated by the proposed VLAQ aggregator to produce a compact global descriptor for nearest-neighbor place retrieval.
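The data flow above can be sketched end to end in NumPy under loud assumptions: a single linear projection stands in for the residual-guided fusion module, mean pooling stands in for the VLAQ aggregator, and all shapes, the matrix `W`, and the toy database are illustrative, not the released model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical shapes: 256 patch tokens, DINOv2 dim 768, CLIP dim 512.
dino = rng.normal(size=(256, 768))
clip = rng.normal(size=(256, 512))
W = rng.normal(size=(512, 768)) * 0.01   # learned projection (random here)

# Residual-guided fusion: keep the DINOv2 tokens as the anchor and add a
# small correction computed from the CLIP tokens.
fused = dino + clip @ W                  # (256, 768)

# Global descriptor: mean-pooled and L2-normalized, standing in for VLAQ.
def describe(tokens):
    g = tokens.mean(axis=0)
    return g / np.linalg.norm(g)

query = describe(fused)
database = np.stack([describe(rng.normal(size=(256, 768))) for _ in range(100)])
database[42] = query                     # plant the true match

# Nearest-neighbor place retrieval by cosine similarity of unit descriptors.
best = int(np.argmax(database @ query))
print(best)  # 42
```

Because all descriptors are unit-normalized, the dot product is cosine similarity, so retrieval reduces to a single matrix-vector product followed by an argmax.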

Qualitative comparison

Qualitative comparison

DC-VLAQ consistently retrieves visually and structurally consistent matches under challenging appearance changes, whereas baseline methods often fail due to over-reliance on global appearance cues.

Visualization of query activation heatmaps

Compared to pre-trained DINOv2 and CLIP, DC-VLAQ produces more focused and spatially consistent activations on stable structural elements such as building facades, road boundaries, and static landmarks, while suppressing transient or less informative regions.

Visualization of query activation heatmaps

Results

Standard VPR Benchmarks

The best is highlighted in bold, and the second best is underlined. CricaVPR uses a cross-image encoder to correlate multiple images per place and is therefore excluded from the Pitts30k comparison. SALAD-CM leverages MSLS as an additional training set and is thus excluded from the MSLS comparison.

Each cell reports R@1 / R@5 / R@10; "–" marks results not reported.

| Method | Venue & Year | Pitts30k-test | Tokyo24/7 | MSLS-val | MSLS-challenge | Nordland |
|---|---|---|---|---|---|---|
| NetVLAD | CVPR'16 | 81.9 / 91.2 / 93.7 | 60.6 / 68.9 / 74.6 | 53.1 / 66.5 / 71.1 | 35.1 / 47.4 / 51.7 | 6.4 / 10.1 / 12.5 |
| SFRS | ECCV'20 | 89.4 / 94.7 / 95.9 | 81.0 / 88.3 / 92.4 | 69.2 / 80.3 / 83.1 | 41.6 / 52.0 / 56.3 | 16.1 / 23.9 / 28.4 |
| Patch-NetVLAD | CVPR'21 | 88.7 / 94.5 / 95.9 | 86.0 / 88.6 / 90.5 | 79.5 / 86.2 / 87.7 | 48.1 / 57.6 / 60.5 | 44.9 / 50.2 / 52.2 |
| TransVPR | CVPR'22 | 89.0 / 94.9 / 96.2 | 79.0 / 82.2 / 85.1 | 86.8 / 91.2 / 92.4 | 63.9 / 74.0 / 77.5 | 63.5 / 68.5 / 70.2 |
| CosPlace | CVPR'22 | 88.4 / 94.5 / 95.7 | 81.9 / 90.2 / 92.7 | 82.8 / 89.7 / 92.0 | 61.4 / 72.0 / 76.6 | 58.5 / 73.7 / 79.4 |
| EigenPlaces | ICCV'23 | 92.5 / 96.8 / 97.6 | 93.0 / 96.2 / 97.5 | 89.1 / 93.8 / 95.0 | 67.4 / 77.1 / 81.7 | 71.2 / 83.8 / 88.1 |
| MixVPR | WACV'23 | 91.5 / 95.5 / 96.3 | 85.1 / 91.7 / 94.3 | 88.0 / 92.7 / 94.6 | 64.0 / 75.9 / 80.6 | 76.2 / 86.9 / 90.3 |
| SelaVPR | ICLR'24 | 92.8 / 96.8 / 97.7 | 94.0 / 96.8 / 97.5 | 90.8 / 96.4 / 97.2 | 73.5 / 87.5 / 90.6 | 87.3 / 93.8 / 95.6 |
| CricaVPR | CVPR'24 | 94.9 / 97.3 / 98.2 | 93.0 / 97.5 / 98.1 | 90.0 / 95.4 / 96.4 | 69.0 / 82.1 / 85.7 | 90.7 / 96.3 / 97.6 |
| BoQ | CVPR'24 | 93.7 / 97.1 / 97.9 | 98.1 / 98.1 / 98.7 | 93.8 / 96.8 / 97.0 | 79.0 / 90.3 / 92.0 | 90.6 / 96.0 / 97.5 |
| SALAD | CVPR'24 | 92.5 / 96.4 / 97.5 | 94.6 / 97.5 / 97.8 | 92.2 / 96.4 / 97.0 | 75.0 / 88.8 / 91.3 | 89.7 / 95.5 / 97.0 |
| SALAD-CM | ECCV'24 | 92.7 / 96.8 / 97.9 | 94.6 / 97.5 / 97.8 | 94.2 / 97.2 / 97.4 | 82.7 / 91.2 / 92.7 | 90.7 / 96.6 / 97.5 |
| EDTFormer | TCSVT'25 | 93.4 / 97.0 / 97.9 | 97.1 / 98.1 / 98.4 | 92.0 / 96.6 / 97.2 | 78.4 / 89.8 / 91.9 | 88.3 / 95.3 / 97.0 |
| FoL-global | AAAI'25 | – / – / – | 96.2 / 98.7 / 98.7 | 93.1 / 96.9 / 97.4 | 78.7 / 90.8 / 93.0 | 87.8 / – / – |
| DC-VLAQ (Ours) | – | 94.3 / 97.6 / 98.3 | 98.7 / 99.7 / 99.7 | 94.2 / 97.3 / 97.6 | 81.7 / 92.2 / 94.5 | 92.8 / 97.2 / 98.2 |

Robustness-Oriented Benchmarks

The best is highlighted in bold, and the second best is underlined.

Each cell reports R@1 / R@5 / R@10; "–" marks results not reported.

| Method | Venue & Year | SPED | AmsterTime |
|---|---|---|---|
| NetVLAD | CVPR'16 | 78.7 / 88.3 / 91.4 | 16.3 / – / – |
| GeM | TPAMI'18 | 64.6 / 79.4 / 83.5 | – / – / – |
| CosPlace | CVPR'22 | 75.3 / 85.9 / 88.6 | 47.7 / – / – |
| EigenPlaces | ICCV'23 | 82.4 / 91.4 / 94.7 | 48.9 / – / – |
| MixVPR | WACV'23 | 85.2 / 92.1 / 94.6 | 40.2 / – / – |
| SelaVPR | ICLR'24 | 89.5 / – / – | – / – / – |
| CricaVPR | CVPR'24 | – / – / – | 64.7 / 82.8 / 87.5 |
| BoQ | CVPR'24 | 92.5 / 95.9 / 96.7 | 63.0 / 81.6 / 85.1 |
| SALAD | CVPR'24 | 92.1 / 96.2 / – | – / – / – |
| SALAD-CM | ECCV'24 | 89.3 / – / – | – / – / – |
| EDTFormer | TCSVT'25 | – / – / – | 65.2 / 85.0 / 89.0 |
| FoL-global | AAAI'25 | 92.1 / – / – | 64.6 / – / – |
| DC-VLAQ (Ours) | – | 93.9 / 97.7 / 98.2 | 66.8 / 85.6 / 88.9 |

BibTeX

@misc{zhu2026dcvlaqqueryresidualaggregationrobust,
      title={DC-VLAQ: Query-Residual Aggregation for Robust Visual Place Recognition}, 
      author={Hanyu Zhu and Zhihao Zhan and Yuhang Ming and Liang Li and Dibo Hou and Javier Civera and Wanzeng Kong},
      year={2026},
      eprint={2601.12729},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.12729}, 
}