XiaoFu666 committed on
Commit 5e9812a · verified · 1 Parent(s): 35f8550

Update README.md

Files changed (1)
  1. README.md +65 -2
README.md CHANGED
@@ -5,5 +5,68 @@ datasets:
  language:
  - en
  base_model:
- - openai/clip-vit-base-patch32
- ---
+ - BeichenZhang/LongCLIP-B
+ ---
+
+
+ The following code computes SPECS scores for one image against several candidate captions:
+
+
+ ```python
+ from PIL import Image
+ import torch
+ import torch.nn.functional as F
+ from model import longclip
+
+ # Device configuration
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ print(f"Using device: {device}")
+
+ # Load SPECS model
+ model, preprocess = longclip.load("spec.pt", device=device)
+ model.eval()
+
+ # Load image and add a batch dimension
+ image_path = "SPECS/images/cat.png"
+ image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
+
+ # Define text descriptions
+ texts = [
+     "A British Shorthair cat with plush, bluish-gray fur is lounging on a deep green velvet sofa. "
+     "The cat is partially tucked under a multi-colored woven jumper.",
+     "A British Shorthair cat with plush, bluish-gray fur is lounging on a deep green velvet sofa. "
+     "The cat is partially tucked under a multi-colored woven blanket.",
+     "A British Shorthair cat with plush, bluish-gray fur is lounging on a deep green velvet sofa. "
+     "The cat is partially tucked under a multi-colored woven blanket with fringed edges."
+ ]
+
+ # Process inputs
+ text_tokens = longclip.tokenize(texts).to(device)
+
+ # Get features and calculate SPECS
+ with torch.no_grad():
+     image_features = model.encode_image(image)
+     text_features = model.encode_text(text_tokens)
+
+     # Calculate cosine similarity: one row per image, one column per caption
+     similarity = F.cosine_similarity(image_features.unsqueeze(1), text_features.unsqueeze(0), dim=-1)
+
+     # SPECS: rescale cosine similarity from [-1, 1] to [0, 1]
+     specs_scores = torch.clamp((similarity + 1.0) / 2.0, min=0.0)
+
+ # Output results
+ print("SPECS")
+ for i, score in enumerate(specs_scores.squeeze()):
+     print(f" Text {i+1}: {score:.4f}")
+ ```
+
+ In this example, SPECS assigns progressively higher scores as the caption becomes more accurate and more detailed:
+
+ - **Text 1**: *"A British Shorthair cat with plush, bluish-gray fur is lounging on a deep green velvet sofa. The cat is partially tucked under a multi-colored woven jumper."*
+ → **Score: 0.4293**
+
+ - **Text 2**: *"A British Shorthair cat with plush, bluish-gray fur is lounging on a deep green velvet sofa. The cat is partially tucked under a multi-colored woven blanket."*
+ → **Score: 0.4457**
+
+ - **Text 3**: *"A British Shorthair cat with plush, bluish-gray fur is lounging on a deep green velvet sofa. The cat is partially tucked under a multi-colored woven blanket with fringed edges."*
+ → **Score: 0.4583**
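
The scoring step in the README code can be isolated into a small helper. The following is a minimal sketch, not the authors' implementation: the function name `specs_from_features`, the random stand-in embeddings, and the embedding size of 512 are assumptions chosen for illustration. It reproduces only the rescaling used in the README snippet, mapping cosine similarity from [-1, 1] to [0, 1], so it can be run without the `spec.pt` checkpoint.

```python
import torch
import torch.nn.functional as F


def specs_from_features(image_features: torch.Tensor, text_features: torch.Tensor) -> torch.Tensor:
    """Rescale image-text cosine similarity from [-1, 1] to [0, 1].

    image_features has shape (n_images, dim) and text_features has shape
    (n_texts, dim). The result has shape (n_images, n_texts).
    """
    similarity = F.cosine_similarity(
        image_features.unsqueeze(1),  # (n_images, 1, dim)
        text_features.unsqueeze(0),   # (1, n_texts, dim)
        dim=-1,
    )
    # Same rescaling as in the README snippet; the clamp only guards
    # against tiny negative values from floating-point rounding.
    return torch.clamp((similarity + 1.0) / 2.0, min=0.0)


# Hypothetical stand-in embeddings; real ones come from
# model.encode_image(...) and model.encode_text(...).
torch.manual_seed(0)
image_features = torch.randn(1, 512)  # 512 is an arbitrary size for the demo
text_features = torch.randn(3, 512)

scores = specs_from_features(image_features, text_features)
best_caption = int(scores.argmax(dim=1))  # index of the highest-scoring caption
print(scores.shape, best_caption)
```

Under this mapping a score of 0.5 corresponds to a cosine similarity of 0, so the reported scores of 0.4293 to 0.4583 correspond to slightly negative cosine similarities (about -0.14 to -0.08). The ordering of the captions is what carries the signal here, not the absolute values.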