HQ-CLIP

Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models

Zhixiang Wei¹,²,*
Guangting Wang²,*
Xiaoxiao Ma¹
Ke Mei²
Huaian Chen¹
Yi Jin¹
Fengyun Rao²
¹University of Science and Technology of China
²WeChat Vision, Tencent Inc.
*Equal Contribution
ICCV 2025

Abstract

Large-scale but noisy image-text pair data have paved the way for the success of Contrastive Language-Image Pretraining (CLIP). As the foundation vision encoder, CLIP in turn serves as the cornerstone for most large vision-language models (LVLMs). This interdependence naturally raises an interesting question: Can we reciprocally leverage LVLMs to enhance the quality of image-text pair data, thereby opening the possibility of a self-reinforcing cycle for continuous improvement? In this work, we take a significant step toward this vision by introducing an LVLM-driven data refinement pipeline.

Our framework leverages LVLMs to process images and their raw alt-text, generating four complementary types of textual annotations: long positive descriptions, long negative descriptions, short positive tags, and short negative tags. Applying this pipeline to the curated DFN-Large dataset yields VLM-150M, a refined dataset enriched with multi-grained annotations. Based on this dataset, we further propose a training paradigm that extends conventional contrastive learning by incorporating the negative descriptions and short tags as additional supervision signals.

The resulting model, HQ-CLIP, demonstrates remarkable improvements across diverse benchmarks. Within a comparable training data scale, our approach achieves state-of-the-art performance in zero-shot classification, cross-modal retrieval, and fine-grained visual understanding tasks. In retrieval benchmarks, HQ-CLIP even surpasses standard CLIP models trained on the DFN-2B dataset, which contains 10x more training data than ours.

A visual comparison showing how HQ-CLIP refines noisy web data into high-quality, multi-grained image-text pairs.

HQ-CLIP leverages advanced LVLMs to transform noisy, raw image-text pairs from the web into a high-quality dataset with rich, multi-grained annotations, significantly boosting model performance.

Research Challenge

Problem

Existing methods for enhancing CLIP with LLMs suffer from limitations:

  • Single-modality approaches (LaCLIP, WhatIf) neglect cross-modal correlations
  • Hybrid methods (CapsFusion, VeCLIP) introduce additional computational complexity
  • Cascade pipelines risk error propagation
  • Information asymmetry between the visual and textual modalities
  • High computational cost of running state-of-the-art closed-source LVLMs at scale

Our Solution

HQ-CLIP introduces a unified approach:

  • A single LVLM processes each image and its paired text jointly
  • Generates four complementary annotations: long/short positives and negatives
  • Cost-efficient pipeline: compact LVLMs are fine-tuned (SFT) on GPT-4o-curated data and then run at scale
  • Extends contrastive learning with two innovations (see the sketch after this list):
    • Short-Tag Classification (STC) loss
    • Hard Negative Identification (HNI) mechanism
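
As a rough illustration of how these two signals could plug into a CLIP-style objective, the minimal PyTorch sketch below treats HNI as appending each image's own LVLM-generated negative description as an extra hard-negative logit, and STC as multi-label classification over a fixed tag vocabulary. The function and variable names are ours, and the exact loss formulation used in HQ-CLIP may differ.

    import torch
    import torch.nn.functional as F

    def hq_clip_style_losses(img_emb, pos_txt_emb, neg_txt_emb,
                             tag_logits, tag_targets, logit_scale):
        # img_emb, pos_txt_emb, neg_txt_emb: (B, D) L2-normalised embeddings of the
        # images, long positive descriptions, and long negative descriptions.
        # tag_logits: (B, num_tags) image-side tag-classifier outputs.
        # tag_targets: (B, num_tags) multi-hot labels built from the short positive tags.
        B = img_emb.size(0)
        labels = torch.arange(B, device=img_emb.device)

        # Standard CLIP contrastive term over the long positive descriptions.
        sim_i2t = logit_scale * (img_emb @ pos_txt_emb.t())   # (B, B)
        sim_t2i = sim_i2t.t()

        # Hard Negative Identification (HNI), as sketched here: each image's own
        # negative description becomes an extra hard-negative column for image-to-text.
        sim_hard = logit_scale * (img_emb * neg_txt_emb).sum(dim=-1, keepdim=True)  # (B, 1)
        sim_i2t = torch.cat([sim_i2t, sim_hard], dim=1)        # (B, B+1)

        loss_contrastive = 0.5 * (F.cross_entropy(sim_i2t, labels) +
                                  F.cross_entropy(sim_t2i, labels))

        # Short-Tag Classification (STC), sketched as multi-label BCE over the tag vocabulary.
        loss_stc = F.binary_cross_entropy_with_logits(tag_logits, tag_targets.float())

        return loss_contrastive, loss_stc

In such a setup the two terms would typically be combined as a weighted sum with the standard contrastive loss, with the STC weight treated as a tunable hyperparameter.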

Approach Overview

Our approach consists of two main stages: an LVLM-driven data refinement pipeline to generate high-quality, multi-grained annotations, and a novel training paradigm, HQ-CLIP, that effectively utilizes this enriched data. The entire framework is designed for efficiency and scalability.

Diagram of the HQ-CLIP framework, showing the data refinement pipeline and the training process.

The overall framework of HQ-CLIP. We first use a fine-tuned LVLM to process raw image-text pairs into four types of textual descriptions. Then, we train the CLIP model using these multi-grained signals, incorporating Hard Negative Identification (HNI) and Short-Tag Classification (STC).
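
As a rough sketch of the refinement step, one raw pair could be processed as below. The prompt wording, JSON field names, and the lvlm_generate callable are placeholders for this illustration, not the actual prompts or interface used in the paper.

    import json

    # Placeholder instruction for the fine-tuned compact LVLM (illustrative only).
    REFINE_PROMPT = (
        "Given the image and its raw alt-text, return a JSON object with the fields "
        "long_positive, long_negative, short_positive_tags, and short_negative_tags."
    )

    def refine_pair(image, alt_text, lvlm_generate):
        """Turn one noisy (image, alt-text) pair into four multi-grained annotations.

        `lvlm_generate` stands in for a call to the fine-tuned compact LVLM and is
        assumed to return a JSON string.
        """
        raw = lvlm_generate(image=image, prompt=f"{REFINE_PROMPT}\nAlt-text: {alt_text}")
        record = json.loads(raw)
        return {
            "long_positive": record["long_positive"],              # detailed, image-aligned caption
            "long_negative": record["long_negative"],              # contradictory caption used for HNI
            "short_positive_tags": record["short_positive_tags"],  # concise tags used for STC
            "short_negative_tags": record["short_negative_tags"],  # contrastive tags
        }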

Long Positive Descriptions

Detailed, contextual descriptions aligned with the image content, providing richer information than the raw alt-text.

Long Negative Descriptions

Contradictory descriptions used in Hard Negative Identification (HNI) to strengthen CLIP's ability to discern subtle discrepancies.

Short Positive Tags

Concise categorical tags that serve as discrete classification targets for Short-Tag Classification (STC).

Short Negative Tags

Contrastive tags providing fine-grained discriminative signals for visual-textual alignment.
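
Putting the four annotation types together, a single refined sample might look like the illustrative record below; the field names and contents are hypothetical and not taken from the released VLM-150M data.

    # Hypothetical example of one multi-grained VLM-150M-style record.
    sample = {
        "url": "https://example.com/photo.jpg",
        "raw_alt_text": "dog pic 2021",
        "long_positive": "A golden retriever lying on green grass in a sunny park, "
                         "with a red ball resting near its front paws.",
        "long_negative": "A black cat sitting on a wooden floor in a dimly lit room.",
        "short_positive_tags": ["golden retriever", "grass", "park", "ball"],
        "short_negative_tags": ["cat", "indoors", "night"],
    }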

Experimental Results

We evaluated HQ-CLIP on a comprehensive set of 38 benchmark datasets. Our method demonstrates state-of-the-art performance in zero-shot classification and cross-modal retrieval, significantly outperforming models trained on comparable or even much larger datasets. The benefits also extend to downstream tasks, where HQ-CLIP serves as a superior vision backbone for LVLMs like LLaVA-1.5.

Comparison with State-of-the-Art Models

HQ-CLIP achieves top performance on zero-shot ImageNet classification and cross-modal retrieval benchmarks (Flickr30K, MSCOCO), surpassing previous methods trained on similar data scales.

Table comparing HQ-CLIP with SoTA models on zero-shot and retrieval tasks.

Ablation Studies

Our ablation studies validate the contribution of both the VLM-150M dataset and each component of the HQ-CLIP training paradigm.

Table showing ablation study results for different components of HQ-CLIP.

Performance as an LVLM Visual Encoder

When used as the vision backbone for LLaVA-1.5, our pre-trained HQ-CLIP leads to improved performance compared to the other CLIP backbones.

Table showing the performance of HQ-CLIP as a visual encoder in LLaVA-1.5.

Key Contributions

  • We introduce an efficient and effective LVLM-driven data refinement pipeline and apply it to DFN-Large, creating VLM-150M, a high-quality dataset with multi-grained descriptions.
  • We propose HQ-CLIP, a specialized framework that combines Hard Negative Identification (HNI) for fine-grained understanding and Short-Tag Classification (STC) for categorical semantic recognition.
  • In large-scale experiments, HQ-CLIP demonstrates state-of-the-art zero-shot generalization and exceptional cross-modal retrieval capabilities, surpassing models trained on 10x more data.
  • When deployed as the visual backbone for LLaVA-1.5, HQ-CLIP outperforms other ViT-B architectures, showcasing its potential as a superior vision encoder for LVLMs.

Poster

Conference poster for the HQ-CLIP paper

Official conference poster for HQ-CLIP presented at ICCV 2025.

Citation

@misc{hqclip,
  title={HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models},
  author={Zhixiang Wei and Guangting Wang and Xiaoxiao Ma and Ke Mei and Huaian Chen and Yi Jin and Fengyun Rao},
  year={2025},
  eprint={2507.22431},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.22431},
}