HQ-CLIP

Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models

Zhixiang Wei¹,²,*
Guangting Wang²,*
Xiaoxiao Ma¹
Ke Mei²
Huaian Chen¹
Yi Jin¹
Fengyun Rao²
¹University of Science and Technology of China
²WeChat Vision, Tencent Inc.
*Equal Contribution
ICCV 2025

Abstract

Large-scale but noisy image-text pair data have paved the way for the success of Contrastive Language-Image Pretraining (CLIP). As the foundation vision encoder, CLIP in turn serves as the cornerstone for most large vision-language models (LVLMs). This interdependence naturally raises an interesting question: Can we reciprocally leverage LVLMs to enhance the quality of image-text pair data, thereby opening the possibility of a self-reinforcing cycle for continuous improvement? In this work, we take a significant step toward this vision by introducing an LVLM-driven data refinement pipeline.

Our framework leverages LVLMs to process images and their raw alt-text, generating four complementary types of textual annotations: long positive descriptions, long negative descriptions, short positive tags, and short negative tags. Applying this pipeline to the curated DFN-Large dataset yields VLM-150M, a refined dataset enriched with multi-grained annotations. Based on this dataset, we further propose a training paradigm that extends conventional contrastive learning by incorporating the negative descriptions and short tags as additional supervision signals.

The resulting model, HQ-CLIP, demonstrates remarkable improvements across diverse benchmarks. Within a comparable training data scale, our approach achieves state-of-the-art performance in zero-shot classification, cross-modal retrieval, and fine-grained visual understanding tasks. In retrieval benchmarks, HQ-CLIP even surpasses standard CLIP models trained on the DFN-2B dataset, which contains 10x more training data than ours.

A visual comparison showing how HQ-CLIP refines noisy web data into high-quality, multi-grained image-text pairs.

HQ-CLIP leverages advanced LVLMs to transform noisy, raw image-text pairs from the web into a high-quality dataset with rich, multi-grained annotations, significantly boosting model performance.

Research Challenge

Problem

Existing methods for enhancing CLIP with LLMs suffer from limitations:

  • Single-modality approaches (LaCLIP, WhatIf) neglect cross-modal correlations
  • Hybrid methods (CapsFusion, VeCLIP) introduce additional computational complexity
  • Cascade pipelines risk error propagation
  • Information asymmetry between the visual and textual modalities
  • High computational cost of running state-of-the-art closed-source LVLMs at scale

Our Solution

HQ-CLIP introduces a unified approach:

  • A single LVLM processes each image and its paired text jointly
  • Generates four complementary annotations: long/short positives and negatives
  • Cost-efficient pipeline: compact LVLMs are fine-tuned (SFT) on GPT-4o-curated data and then run at scale
  • Extends contrastive learning with two innovations (see the sketch after this list):
    • Short-Tag Classification (STC) loss
    • Hard Negative Identification (HNI) mechanism
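
As a rough illustration of how these two signals could plug into a CLIP-style objective, the minimal PyTorch sketch below treats HNI as appending each image's own LVLM-generated negative description as an extra hard-negative logit, and STC as multi-label classification over a fixed tag vocabulary. The function and variable names are ours, and the exact loss formulation used in HQ-CLIP may differ.

    import torch
    import torch.nn.functional as F

    def hq_clip_style_losses(img_emb, pos_txt_emb, neg_txt_emb,
                             tag_logits, tag_targets, logit_scale):
        # img_emb, pos_txt_emb, neg_txt_emb: (B, D) L2-normalised embeddings of the
        # images, long positive descriptions, and long negative descriptions.
        # tag_logits: (B, num_tags) image-side tag-classifier outputs.
        # tag_targets: (B, num_tags) multi-hot labels built from the short positive tags.
        B = img_emb.size(0)
        labels = torch.arange(B, device=img_emb.device)

        # Standard CLIP contrastive term over the long positive descriptions.
        sim_i2t = logit_scale * (img_emb @ pos_txt_emb.t())   # (B, B)
        sim_t2i = sim_i2t.t()

        # Hard Negative Identification (HNI), as sketched here: each image's own
        # negative description becomes an extra hard-negative column for image-to-text.
        sim_hard = logit_scale * (img_emb * neg_txt_emb).sum(dim=-1, keepdim=True)  # (B, 1)
        sim_i2t = torch.cat([sim_i2t, sim_hard], dim=1)        # (B, B+1)

        loss_contrastive = 0.5 * (F.cross_entropy(sim_i2t, labels) +
                                  F.cross_entropy(sim_t2i, labels))

        # Short-Tag Classification (STC), sketched as multi-label BCE over the tag vocabulary.
        loss_stc = F.binary_cross_entropy_with_logits(tag_logits, tag_targets.float())

        return loss_contrastive, loss_stc

In such a setup the two terms would typically be combined as a weighted sum with the standard contrastive loss, with the STC weight treated as a tunable hyperparameter.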

Approach Overview

Our approach consists of two main stages: an LVLM-driven data refinement pipeline to generate high-quality, multi-grained annotations, and a novel training paradigm, HQ-CLIP, that effectively utilizes this enriched data. The entire framework is designed for efficiency and scalability.

Diagram of the HQ-CLIP framework, showing the data refinement pipeline and the training process.

The overall framework of HQ-CLIP. We first use a fine-tuned LVLM to process raw image-text pairs into four types of textual descriptions. Then, we train the CLIP model using these multi-grained signals, incorporating Hard Negative Identification (HNI) and Short-Tag Classification (STC).
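
As a rough sketch of the refinement step, one raw pair could be processed as below. The prompt wording, JSON field names, and the lvlm_generate callable are placeholders for this illustration, not the actual prompts or interface used in the paper.

    import json

    # Placeholder instruction for the fine-tuned compact LVLM (illustrative only).
    REFINE_PROMPT = (
        "Given the image and its raw alt-text, return a JSON object with the fields "
        "long_positive, long_negative, short_positive_tags, and short_negative_tags."
    )

    def refine_pair(image, alt_text, lvlm_generate):
        """Turn one noisy (image, alt-text) pair into four multi-grained annotations.

        `lvlm_generate` stands in for a call to the fine-tuned compact LVLM and is
        assumed to return a JSON string.
        """
        raw = lvlm_generate(image=image, prompt=f"{REFINE_PROMPT}\nAlt-text: {alt_text}")
        record = json.loads(raw)
        return {
            "long_positive": record["long_positive"],              # detailed, image-aligned caption
            "long_negative": record["long_negative"],              # contradictory caption used for HNI
            "short_positive_tags": record["short_positive_tags"],  # concise tags used for STC
            "short_negative_tags": record["short_negative_tags"],  # contrastive tags
        }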

Long Positive Descriptions

Detailed, contextual descriptions aligned with the image content, providing richer information than the raw alt-text.

Long Negative Descriptions

Contradictory descriptions used in Hard Negative Identification (HNI) to strengthen CLIP's ability to discern subtle discrepancies.

Short Positive Tags

Concise categorical tags that serve as discrete classification targets for Short-Tag Classification (STC).

Short Negative Tags

Contrastive tags providing fine-grained discriminative signals for visual-textual alignment.
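
Putting the four annotation types together, a single refined sample might look like the illustrative record below; the field names and contents are hypothetical and not taken from the released VLM-150M data.

    # Hypothetical example of one multi-grained VLM-150M-style record.
    sample = {
        "url": "https://example.com/photo.jpg",
        "raw_alt_text": "dog pic 2021",
        "long_positive": "A golden retriever lying on green grass in a sunny park, "
                         "with a red ball resting near its front paws.",
        "long_negative": "A black cat sitting on a wooden floor in a dimly lit room.",
        "short_positive_tags": ["golden retriever", "grass", "park", "ball"],
        "short_negative_tags": ["cat", "indoors", "night"],
    }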

Experimental Results

We evaluated HQ-CLIP on a comprehensive set of 38 benchmark datasets. Our method demonstrates state-of-the-art performance in zero-shot classification and cross-modal retrieval, significantly outperforming models trained on comparable or even much larger datasets. The benefits also extend to downstream tasks, where HQ-CLIP serves as a superior vision backbone for LVLMs like LLaVA-1.5.

Comparison with State-of-the-Art Models

HQ-CLIP achieves top performance on zero-shot ImageNet classification and cross-modal retrieval benchmarks (Flickr30K, MSCOCO), surpassing previous methods trained on similar data scales.

Table comparing HQ-CLIP with SoTA models on zero-shot and retrieval tasks.

Ablation Studies

Our ablation studies validate the contribution of both the VLM-150M dataset and each component of the HQ-CLIP training paradigm.

Table showing ablation study results for different components of HQ-CLIP.

Performance as an LVLM Visual Encoder

When used as the vision backbone for LLaVA-1.5, our pre-trained HQ-CLIP leads to improved performance compared to the other CLIP backbones.

Table showing the performance of HQ-CLIP as a visual encoder in LLaVA-1.5.

Key Contributions

  • We introduce an efficient and effective LVLM-driven data refinement pipeline and apply it to DFN-Large, creating VLM-150M, a high-quality dataset with multi-grained descriptions.
  • We propose HQ-CLIP, a specialized framework that combines Hard Negative Identification (HNI) for fine-grained understanding and Short-Tag Classification (STC) for categorical semantic recognition.
  • In large-scale experiments, HQ-CLIP demonstrates state-of-the-art zero-shot generalization and exceptional cross-modal retrieval capabilities, surpassing models trained on 10x more data.
  • When deployed as the visual backbone for LLaVA-1.5, HQ-CLIP outperforms other ViT-B architectures, showcasing its potential as a superior vision encoder for LVLMs.

Poster

Conference poster for the HQ-CLIP paper

Official conference poster for HQ-CLIP presented at ICCV 2025.

Citation

@misc{hqclip,
  title={HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models},
  author={Zhixiang Wei and Guangting Wang and Xiaoxiao Ma and Ke Mei and Huaian Chen and Yi Jin and Fengyun Rao},
  year={2025},
  eprint={2507.22431},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.22431},
}