---
datasets:
- wikimedia/structured-wikipedia
base_model:
- FacebookAI/roberta-large
---

# RoBERTa Large Entity Linking

## Model Description

**roberta-large-entity-linking** is a [RoBERTa large model](https://huggingface.co/FacebookAI/roberta-large) fine-tuned as a bi-encoder for entity linking tasks. The model separately embeds mentions-in-context and entity descriptions to enable semantic matching between text mentions and knowledge base entities.

## Intended Uses

### Primary Use Cases
- **Entity Linking:** Link Wikipedia concepts mentioned in text to their corresponding Wikipedia pages. [Wikimedia](https://huggingface.co/wikimedia) makes this easy with [this dataset](https://huggingface.co/datasets/wikimedia/structured-wikipedia): embed the entries in its "abstract" column to build a candidate index (you may need some cleanup to filter out irrelevant entries); see the sketch after this list.
- **Zero-shot Entity Linking:** Link entities to knowledge bases without task-specific training
- **Knowledge Base Construction:** Build and reference new knowledge bases using the model's strong generalization capabilities
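
The sketch below shows one way to build such a candidate index and link a marked mention against it. The dataset config name (`20240916.en`) is an assumption; the `abstract` field follows the note above, so check the dataset card for the exact schema before running this.

```python
import torch
import torch.nn.functional as F
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("GlassLewis/roberta-large-entity-linking")
model = AutoModel.from_pretrained("GlassLewis/roberta-large-entity-linking").to(device)

# Config name is an assumption -- check the dataset card for available snapshots.
wiki = load_dataset("wikimedia/structured-wikipedia", "20240916.en",
                    split="train", streaming=True)

@torch.no_grad()
def embed(texts):
    batch = tokenizer(texts, truncation=True, max_length=256,
                      padding="max_length", return_tensors="pt").to(device)
    out = model(**batch)
    # CLS-position embedding, L2-normalized so dot products are cosine similarities
    return F.normalize(out.last_hidden_state[:, 0, :], p=2, dim=1)

# Embed a small batch of abstracts as candidate entity descriptions.
# Assumes "abstract" is a plain-text field; filter out empty or noisy entries.
abstracts = []
for record in wiki.take(16):
    text = record.get("abstract")
    if text:
        abstracts.append(text)

candidate_index = embed(abstracts)          # shape: (num_candidates, hidden_size)

# Link a marked mention by nearest-neighbour search over the index.
mention = "The [ENT] theory of relativity [ENT] changed modern physics."
scores = embed([mention]) @ candidate_index.T
best = scores.argmax().item()
print(abstracts[best][:100], scores[0, best].item())
```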

### Recommended Preprocessing
- Use `[ENT]` tokens to mark entity mentions: `[ENT] mention [ENT]`
- Consider using NER models to identify candidate mentions
- For non-standard entities (e.g., "daytime"), extract noun phrases using NLTK or spaCy (see the sketch after this list)
- Clean and filter knowledge base entries to remove irrelevant concepts
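
A minimal sketch of the mention-marking step using spaCy noun chunks as candidate mentions (the `en_core_web_sm` pipeline is just an example choice; an NER pipeline could be substituted):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # any English pipeline with a parser works

def mark_mentions(text):
    """Return one copy of the text per noun-phrase candidate, wrapped in [ENT] markers."""
    doc = nlp(text)
    marked = []
    for chunk in doc.noun_chunks:
        marked.append(
            text[:chunk.start_char]
            + f"[ENT] {chunk.text} [ENT]"
            + text[chunk.end_char:]
        )
    return marked

for m in mark_mentions("Tim Cook announced new products during the daytime keynote."):
    print(m)
```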

## Model Details

### Training Data
- **Dataset:** 3 million pairs of Wikipedia anchor text links and Wikipedia page descriptions
- **Source:** Wikipedia anchor links paired with the first few hundred words of their target pages
- **Special Token:** `[ENT]` token added to mark entity mentions
- **Max Sequence Length:** 256 tokens (both mentions and descriptions)

### Training Details
- **Hardware:** Single 80GB H100 GPU
- **Batch Size:** 80
- **Learning Rate:** 1e-5 with a cosine scheduler
- **Loss Function:** Batch-hard triplet loss (margin = 0.4); see the sketch after this list
- **Inspiration:** Meta AI's BLINK and Google's "Learning Dense Representations for Entity Retrieval"
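
The training code is not published here, so the following is only an illustrative sketch of a batch-hard triplet loss over in-batch negatives with margin 0.4. It assumes each mention embedding at index `i` is paired with the entity-description embedding at the same index, with all other rows treated as negatives.

```python
import torch
import torch.nn.functional as F

def batch_hard_triplet_loss(mentions, entities, margin=0.4):
    """Batch-hard triplet loss on cosine similarity.

    mentions, entities: (B, D) tensors where row i of `entities` is the positive
    description for row i of `mentions`; every other row is an in-batch negative.
    """
    m = F.normalize(mentions, p=2, dim=1)
    e = F.normalize(entities, p=2, dim=1)
    sim = m @ e.t()                                   # (B, B) cosine similarities

    pos = sim.diagonal()                              # similarity to the true entity
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    hardest_neg = sim.masked_fill(eye, float("-inf")).max(dim=1).values

    # Push the positive above the hardest in-batch negative by at least `margin`.
    return F.relu(hardest_neg - pos + margin).mean()

# Example with random embeddings
loss = batch_hard_triplet_loss(torch.randn(8, 1024), torch.randn(8, 1024))
print(loss.item())
```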

## Performance

### Benchmark Results
- **Dataset:** Zero-Shot Entity Linking (Logeswaran et al., 2019)
- **Metric:** Recall@64 (see the sketch after this list)
- **Score:** 80.29%
- **Comparison:** Meta AI's BLINK reaches 82.06% on the same test set. That is slightly higher, but BLINK was trained on the benchmark's training set, whereas our model was not.
- **Conclusion:** Our model shows strong zero-shot performance.
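
For context, Recall@64 is the fraction of test mentions whose gold entity appears among the model's 64 highest-scoring candidate descriptions. A minimal way to compute it (a hypothetical helper, not the benchmark's official scorer):

```python
import torch

def recall_at_k(mention_embs, entity_embs, gold_ids, k=64):
    """Fraction of mentions whose gold entity is among the top-k scored candidates."""
    sims = mention_embs @ entity_embs.t()                 # (num_mentions, num_entities)
    topk = sims.topk(min(k, sims.size(1)), dim=1).indices # (num_mentions, k)
    gold = torch.as_tensor(gold_ids, device=topk.device).unsqueeze(1)
    return (topk == gold).any(dim=1).float().mean().item()
```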

### Usage Recommendations
- **Similarity Threshold:** 0.7 for positive matches (based on empirical testing)

## Code Example

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# Pick the best available device
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Using CUDA: {torch.cuda.get_device_name()}")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print("Using MPS (Apple Silicon)")
else:
    device = torch.device("cpu")
    print("Using CPU")

model_name = "GlassLewis/roberta-large-entity-linking"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

model.to(device)

# Verify the special token is there
print('[ENT]' in tokenizer.get_added_vocab())


context = "Tim Cook, [ENT] president [ENT] of Apple, is a guy who lives in California."

definitions = [
    "A president is a leader of an organization, company, community, club, trade union, university or other group.",
    "The president of the United States (POTUS) is the head of state and head of government of the United States.",
    "A class president, also known as a class representative, is usually the leader of a student body class, and presides over its class cabinet or organization within a student council."
]

tokenized_definition = tokenizer(
    definitions,
    truncation=True,
    max_length=256,
    padding='max_length',
    return_tensors='pt'
)

tokenized_context = tokenizer(
    context,
    truncation=True,
    max_length=256,
    padding='max_length',
    return_tensors='pt'
)

# Get embeddings (no gradients needed for inference)
with torch.no_grad():
    embedded_context = model(
        input_ids=tokenized_context["input_ids"].to(device),
        attention_mask=tokenized_context["attention_mask"].to(device)
    )
    embedded_definition = model(
        input_ids=tokenized_definition["input_ids"].to(device),
        attention_mask=tokenized_definition["attention_mask"].to(device)
    )

# Normalize the [CLS]-position embeddings for proper cosine similarity
context_norm = F.normalize(embedded_context.last_hidden_state[:, 0, :], p=2, dim=1)
definition_norm = F.normalize(embedded_definition.last_hidden_state[:, 0, :], p=2, dim=1)

# Calculate cosine similarities
similarities = torch.matmul(context_norm, definition_norm.t())

print("Cosine similarities:")
print(similarities)

print("\nClassification results:")
for i, definition in enumerate(definitions):
    sim_value = similarities[0, i].item()
    print(f"Definition {i+1}: {definition}")
    print(f"Similarity: {sim_value:.4f}\n")
```
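
Applying the 0.7 threshold recommended above, the example can finish with a simple accept/reject step (a minimal continuation of the snippet):

```python
# Continuation of the example above: apply the recommended 0.7 cutoff.
THRESHOLD = 0.7

best_idx = similarities[0].argmax().item()
best_score = similarities[0, best_idx].item()

if best_score >= THRESHOLD:
    print(f"Linked to definition {best_idx + 1} (similarity {best_score:.4f})")
else:
    print(f"No definition passed the {THRESHOLD} threshold (best: {best_score:.4f})")
```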

## Input Format

### Mention Context
- Mark target mentions with `[ENT]` tokens: `"Text with [ENT] entity mention [ENT] in context"`
- Maximum length: 256 tokens

### Entity Descriptions
- Provide entity descriptions (e.g., Wikipedia abstracts)
- Maximum length: 256 tokens

## Limitations and Biases

- **Language:** English only
- **Domain:** Primarily trained on Wikipedia data
- **Bias:** May inherit biases present in Wikipedia content
- **Performance:** Slightly lower than supervised models on in-domain tasks

## References

- Logeswaran et al. (2019). [Zero-Shot Entity Linking by Reading Entity Descriptions](https://arxiv.org/pdf/1906.07348)
- Meta AI BLINK: [GitHub Repository](https://github.com/facebookresearch/BLINK)
- Gillick et al. (2019). Learning Dense Representations for Entity Retrieval (Google)

## Citation

```bibtex
@misc{roberta-large-entity-linking,
  author = {GlassLewis},
  title = {RoBERTa Large Entity Linking},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/GlassLewis/roberta-large-entity-linking}
}
```