nielsr (HF Staff) committed
Commit 1df1507 · verified · 1 parent: 0b1e077

Improve model card: Add pipeline tag, paper info, and GitHub link for SPECS


This PR significantly enhances the model card for **SPECS: Specificity-Enhanced CLIP-Score for Long Image Caption Evaluation** by:

* Adding the `pipeline_tag: zero-shot-image-classification` to the metadata, improving discoverability on the Hub (e.g., https://huggingface.co/models?pipeline_tag=zero-shot-image-classification).
* Including the paper title and a direct link to the Hugging Face paper page: [SPECS: Specificity-Enhanced CLIP-Score for Long Image Caption Evaluation](https://huggingface.co/papers/2509.03897).
* Adding a link to the official GitHub repository: https://github.com/mbzuai-nlp/SPECS.
* Incorporating the paper abstract into the model card content for better understanding.
* Adding a comprehensive BibTeX citation for proper academic attribution.

The existing Python usage example has been retained, as it is directly applicable and sourced from the official GitHub repository. No `library_name` was added, since there was no evidence of compatibility with Hugging Face-specific libraries. The provided project page URL was deemed irrelevant to SPECS and therefore omitted.
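For context on what the retained usage example computes: SPECS belongs to the reference-free CLIPScore family, which scores a caption by a clipped cosine similarity between image and text embeddings. Below is a minimal sketch of that scoring rule using dummy NumPy vectors — the function name `clip_style_score`, the stand-in embeddings, and the 2.5 scale (from the original CLIPScore paper) are illustrative assumptions, not the actual SPECS API, which uses the fine-tuned Long-CLIP encoders from the official repository:

```python
import numpy as np

def clip_style_score(img_emb: np.ndarray, txt_emb: np.ndarray, scale: float = 2.5) -> float:
    """CLIPScore-style rule: scaled, clipped cosine similarity.

    Illustrative only -- real SPECS scoring runs the image and caption
    through the fine-tuned Long-CLIP encoders, not raw vectors.
    """
    img = img_emb / np.linalg.norm(img_emb)   # L2-normalize image embedding
    txt = txt_emb / np.linalg.norm(txt_emb)   # L2-normalize text embedding
    return scale * max(float(img @ txt), 0.0)  # clip negatives to zero, then scale

# Dummy embeddings standing in for encoder outputs.
rng = np.random.default_rng(0)
img_emb = rng.normal(size=512)
txt_emb = img_emb + 0.5 * rng.normal(size=512)  # a "related" caption embedding
print(clip_style_score(img_emb, txt_emb))
```

A perfectly aligned pair scores 2.5 under this rule; unrelated or opposed embeddings are clipped to 0.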

Files changed (1): README.md (+34 −12)
README.md CHANGED
@@ -1,16 +1,24 @@
 ---
-license: apache-2.0
+base_model:
+- BeichenZhang/LongCLIP-B
 datasets:
 - Lin-Chen/ShareGPT4V
 language:
 - en
-base_model:
-- BeichenZhang/LongCLIP-B
+license: apache-2.0
+pipeline_tag: zero-shot-image-classification
 ---
 
-
-You can compute SPECS scores for an image–caption pair using the following code:
-
+# SPECS: Specificity-Enhanced CLIP-Score for Long Image Caption Evaluation
+
+This model is presented in the paper [SPECS: Specificity-Enhanced CLIP-Score for Long Image Caption Evaluation](https://huggingface.co/papers/2509.03897).
+The official code repository is available at: https://github.com/mbzuai-nlp/SPECS.
+
+## Abstract
+As interest grows in generating long, detailed image captions, standard evaluation metrics become increasingly unreliable. N-gram-based metrics though efficient, fail to capture semantic correctness. Representational Similarity (RS) metrics, designed to address this, initially saw limited use due to high computational costs, while today, despite advances in hardware, they remain unpopular due to low correlation to human judgments. Meanwhile, metrics based on large language models (LLMs) show strong correlation with human judgments, but remain too expensive for iterative use during model development. We introduce SPECS (Specificity-Enhanced CLIPScore), a reference-free RS metric tailored to long image captioning. SPECS modifies CLIP with a new objective that emphasizes specificity: rewarding correct details and penalizing incorrect ones. We show that SPECS matches the performance of open-source LLM-based metrics in correlation to human judgments, while being far more efficient. This makes it a practical alternative for iterative checkpoint evaluation during image captioning model development.
+
+## Usage
+You can compute SPECS scores for an image–caption pair using the following code:
 
 ```python
 from PIL import Image
@@ -66,11 +74,25 @@ for i, score in enumerate(specs_scores.squeeze()):
 
 This shows that SPECS successfully assigns progressively higher scores to captions with more fine-grained and correct details:
 
-- **Text 1**: *"A British Shorthair cat with plush, bluish-gray fur is lounging on a deep green velvet sofa. The cat is partially tucked under a multi-colored woven jumper."*
-→ **Score: 0.4293**
-
-- **Text 2**: *"A British Shorthair cat with plush, bluish-gray fur is lounging on a deep green velvet sofa. The cat is partially tucked under a multi-colored woven blanket."*
-→ **Score: 0.4457**
-
-- **Text 3**: *"A British Shorthair cat with plush, bluish-gray fur is lounging on a deep green velvet sofa. The cat is partially tucked under a multi-colored woven blanket with fringed edges."*
-→ **Score: 0.4583**
+- **Text 1**: *"A British Shorthair cat with plush, bluish-gray fur is lounging on a deep green velvet sofa. The cat is partially tucked under a multi-colored woven jumper."*
+→ **Score: 0.4293**
+
+- **Text 2**: *"A British Shorthair cat with plush, bluish-gray fur is lounging on a deep green velvet sofa. The cat is partially tucked under a multi-colored woven blanket."*
+→ **Score: 0.4457**
+
+- **Text 3**: *"A British Shorthair cat with plush, bluish-gray fur is lounging on a deep green velvet sofa. The cat is partially tucked under a multi-colored woven blanket with fringed edges."*
+→ **Score: 0.4583**
+
+## Citation
+If you find our work helpful for your research, please consider giving a citation:
+```bibtex
+@misc{chen2025specs,
+      title={{SPECS}: Specificity-Enhanced CLIP-Score for Long Image Caption Evaluation},
+      author={Xiaofu Chen and Israfel Salazar and Yova Kementchedjhieva},
+      year={2025},
+      eprint={2509.03897},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2509.03897},
+}
+```
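The model card's worked example claims that SPECS rewards each added correct detail; the three reported scores bear this out, rising monotonically from the vaguest caption to the most specific. A trivial check over the values reported in the card (the values are copied from the card; the check itself is only illustrative):

```python
# SPECS scores reported in the model card for Texts 1-3
# (increasingly specific captions of the same image).
specs_scores = [0.4293, 0.4457, 0.4583]

# SPECS is designed to reward added correct detail, so scores should rise.
assert all(a < b for a, b in zip(specs_scores, specs_scores[1:]))
print("monotonically increasing:", specs_scores)
```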