CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation (ICCV 2025)
This repo is the official PyTorch implementation of CLIPer.
- We have incorporated a sliding-window method into the code.
- A multi-GPU version has been released (run `sh_ovs_mgp.sh`), along with new results on several widely adopted benchmark datasets.
- We are pleased to announce that this work has been accepted by ICCV 2025.
- We have released a Gradio demo; you can find it on Hugging Face (CPU) and ModelScope (GPU), or run it on your device via `app.py`.
- We have released the source code of CLIPer (all required models are downloaded automatically when running the code).
- We introduce early-layer fusion, which feeds the patch embeddings and attention maps of an early layer into the final layer of CLIP to improve its spatial representation (see the sketch after this list).
- We further introduce fine-grained compensation, which exploits the detailed spatial information in the self-attention maps of Stable Diffusion to obtain precise segmentation results.
- The proposed CLIPer achieves state-of-the-art performance on multiple open-vocabulary semantic segmentation datasets. With the ViT-L/14 backbone, CLIPer obtains mIoU scores of 69.8% on VOC and 43.3% on COCO Object.
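To make the early-layer fusion idea concrete, here is a minimal sketch with random stand-in tensors. The layer selection and the simple averaging used here are illustrative assumptions, not the exact implementation in this repo:

```python
# Toy illustration of early-layer fusion (illustrative assumptions only):
# average patch embeddings and attention maps from the early layers, then
# combine them with the final layer of a CLIP ViT.
import torch

num_layers, num_patches, dim = 12, 196, 768  # ViT-B/16-like shapes

# Stand-ins for the per-layer outputs a CLIP image encoder would produce.
patch_embeds = [torch.randn(num_patches, dim) for _ in range(num_layers)]
attn_maps = [torch.softmax(torch.randn(num_patches, num_patches), dim=-1)
             for _ in range(num_layers)]

# Fuse the early layers into the final one.
early = slice(0, num_layers - 1)
fused_embed = torch.stack(patch_embeds[early]).mean(0) + patch_embeds[-1]
fused_attn = torch.stack(attn_maps[early]).mean(0)

# Re-aggregate the spatial features with the fused attention map.
output = fused_attn @ fused_embed
print(output.shape)  # torch.Size([196, 768])
```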
For further details, please check out our paper.
Please follow the commands below to create the environment:
```bash
conda create -n CLIPer python=3.9
conda activate CLIPer
pip install -r requirements.txt
```
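A quick sanity check of the installation could look like this (assuming PyTorch is among the pinned requirements):

```python
# Quick environment check (assumes PyTorch was installed via requirements.txt).
import torch

print(f"PyTorch {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"Device: {torch.cuda.get_device_name(0)}")
```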
Please structure the datasets as follows:
```
datasets
├── ADEchallengeData2016
│ ├── images
│ │ ├── training
│ │ │ ├── ADE_train_00000001.jpg
│ │ ├── validation
│ │ │ ├── ADE_val_00000001.jpg
│ ├── annotations
│ │ ├── training
│ │ │ ├── ADE_train_00000001.png
│ │ ├── validation
│ │ │ ├── ADE_val_00000001.png
├── CityScapes
│ ├── gtFine
│ │ ├── train
│ │ ├── test
│ │ ├── val
│ │ │ ├── frankfurt
│ │ │ │ ├── frankfurt_000000_000294_gtFine_labelTrainIds.png
│ │ │ │ ├── ...
│ │ │ ├── lindau
│ │ │ ├── munster
│ ├── leftImg8bit
│ │ ├── train
│ │ ├── test
│ │ ├── val
│ │ │ ├── frankfurt
│ │ │ │ ├── frankfurt_000000_000294_leftImg8bit.png
│ │ │ │ ├── ...
│ │ │ ├── lindau
│ │ │ ├── munster
├── coco2014
│ ├── train2014
│ │ ├── COCO_train2014_000000000009.jpg
│ ├── val2014
│ │ ├── COCO_val2014_000000000042.jpg
│ ├── coco_seg_anno
│ │ ├── 000000000009.png
├── coco2017
│ ├── train2017
│ │ ├── 000000000009.jpg
│ ├── val2017
│ │ ├── 000000000139.jpg
│ ├── stuff_anno164
│ │ ├── train2017
│ │ │ ├── 000000000009.png
│ │ ├── val2017
│ │ │ ├── 000000000139.png
├── VOCdevkit
│ ├── VOC2010
│ │ ├── JPEGImages
│ │ │ ├── 2007_000027.jpg
│ │ ├── SegmentationClassContext
│ │ │ ├── 2008_000002.png
│ ├── VOC2012
│ │ ├── JPEGImages
│ │ │ ├── 2007_000027.jpg
│ │ ├── SegmentationClassAug
│ │ │ ├── 2007_000032.png
```
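To verify everything is in place before running evaluation, a small layout check can help. This is a hypothetical helper, not part of the repo; the paths are taken from the tree above:

```python
# check_datasets.py -- hypothetical helper: verify the expected dataset layout.
import os

EXPECTED = [
    "datasets/ADEchallengeData2016/images/validation",
    "datasets/CityScapes/leftImg8bit/val",
    "datasets/coco2014/val2014",
    "datasets/coco2017/stuff_anno164/val2017",
    "datasets/VOCdevkit/VOC2012/SegmentationClassAug",
]

for path in EXPECTED:
    status = "ok" if os.path.isdir(path) else "MISSING"
    print(f"[{status}] {path}")
```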
To evaluate our CLIPer, please enter the `scripts` folder and run the commands below:
```bash
# select the config file to evaluate
# evaluate the VOC dataset with background
sh sh_ovs.sh ../scripts/config/vit-l-14/ovs_voc21.yaml
# evaluate the VOC dataset without background
sh sh_ovs.sh ../scripts/config/vit-l-14/ovs_voc20.yaml
# ...
```
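To sweep every config for one backbone, a small helper along these lines could work (hypothetical, not part of the repo; assumes it is launched from the `scripts` folder like the commands above):

```python
# run_all.py -- hypothetical helper: evaluate every ViT-L/14 config in turn.
import glob
import subprocess

for cfg in sorted(glob.glob("../scripts/config/vit-l-14/*.yaml")):
    print(f"=== Evaluating {cfg} ===")
    subprocess.run(["sh", "sh_ovs.sh", cfg], check=True)
```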
Running the code in this repo, you should get results similar to those in the following table (* means using a sliding window; values in parentheses indicate results obtained without fine-grained compensation, FGC):
| Encoder | VOC | Context | Object | VOC20 | Context59 | Stuff | ADE | City |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-B/16 | 66.4 (60.7) | 37.5 (34.7) | 39.2 (36.3) | 85.5 (84.3) | 41.6 (38.4) | 27.6 (25.4) | 21.5 (19.8) | 37.0 (35.9) |
| ViT-B/16* | 66.5 (62.1) | 38.3 (35.7) | 40.0 (37.5) | 86.0 (85.0) | 42.4 (39.6) | 28.6 (26.4) | 22.0 (20.6) | 38.7 (38.3) |
| ViT-L/14 | 71.6 (62.5) | 39.0 (34.8) | 44.4 (40.4) | 90.3 (88.7) | 44.0 (39.6) | 29.7 (26.3) | 24.5 (22.1) | 41.6 (37.9) |
| ViT-L/14* | 72.2 (64.0) | 39.5 (35.7) | 44.7 (41.4) | 89.8 (89.0) | 44.6 (40.9) | 30.4 (27.1) | 25.0 (22.9) | 42.5 (39.4) |
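For reference, the numbers above are mIoU scores: the per-class intersection-over-union averaged over classes. A generic implementation of this metric (the standard definition, not necessarily the exact evaluation code used in this repo) looks like:

```python
# miou.py -- reference mIoU computation from a confusion matrix.
import numpy as np

def mean_iou(conf: np.ndarray) -> float:
    """conf[i, j] = number of pixels with ground-truth class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    union = conf.sum(0) + conf.sum(1) - tp  # union = pred + gt - intersection
    iou = tp / np.maximum(union, 1)         # avoid division by zero
    # Note: classes absent from both gt and prediction contribute 0 here;
    # real evaluation code often masks such classes out before averaging.
    return float(iou.mean())
```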
```bibtex
@misc{Sun_2024_CLIPer,
      title={CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation},
      author={Lin Sun and Jiale Cao and Jin Xie and Xiaoheng Jiang and Yanwei Pang},
      year={2024},
      eprint={2411.13836},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.13836},
}
```