---
datasets:
- wikimedia/structured-wikipedia
base_model:
- FacebookAI/roberta-large
---

# RoBERTa Large Entity Linking

## Model Description

**roberta-large-entity-linking** is a [RoBERTa large model](https://huggingface.co/FacebookAI/roberta-large) fine-tuned as a bi-encoder for entity linking tasks. The model separately embeds mentions-in-context and entity descriptions to enable semantic matching between text mentions and knowledge base entities.

## Intended Uses

### Primary Use Cases
- **Entity Linking:** Link Wikipedia concepts mentioned in text to their corresponding Wikipedia pages. [Wikimedia](https://huggingface.co/wikimedia) makes this easy with [this dataset](https://huggingface.co/datasets/wikimedia/structured-wikipedia): embed the entries in its "abstract" column to build a candidate index (you may need some cleanup to filter out irrelevant entries); see the sketch after this list.
- **Zero-shot Entity Linking:** Link entities to knowledge bases without task-specific training
- **Knowledge Base Construction:** Build and reference new knowledge bases using the model's strong generalization capabilities
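
The sketch below shows one way to build such a candidate index and link a marked mention against it. The dataset config name (`20240916.en`) is an assumption; the `abstract` field follows the note above, so check the dataset card for the exact schema before running this.

```python
import torch
import torch.nn.functional as F
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("GlassLewis/roberta-large-entity-linking")
model = AutoModel.from_pretrained("GlassLewis/roberta-large-entity-linking").to(device)

# Config name is an assumption -- check the dataset card for available snapshots.
wiki = load_dataset("wikimedia/structured-wikipedia", "20240916.en",
                    split="train", streaming=True)

@torch.no_grad()
def embed(texts):
    batch = tokenizer(texts, truncation=True, max_length=256,
                      padding="max_length", return_tensors="pt").to(device)
    out = model(**batch)
    # CLS-position embedding, L2-normalized so dot products are cosine similarities
    return F.normalize(out.last_hidden_state[:, 0, :], p=2, dim=1)

# Embed a small batch of abstracts as candidate entity descriptions.
# Assumes "abstract" is a plain-text field; filter out empty or noisy entries.
abstracts = []
for record in wiki.take(16):
    text = record.get("abstract")
    if text:
        abstracts.append(text)

candidate_index = embed(abstracts)          # shape: (num_candidates, hidden_size)

# Link a marked mention by nearest-neighbour search over the index.
mention = "The [ENT] theory of relativity [ENT] changed modern physics."
scores = embed([mention]) @ candidate_index.T
best = scores.argmax().item()
print(abstracts[best][:100], scores[0, best].item())
```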

### Recommended Preprocessing
- Use `[ENT]` tokens to mark entity mentions: `[ENT] mention [ENT]`
- Consider using NER models to identify candidate mentions
- For non-standard entities (e.g., "daytime"), extract noun phrases using NLTK or spaCy (see the sketch after this list)
- Clean and filter knowledge base entries to remove irrelevant concepts
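
A minimal sketch of the mention-marking step using spaCy noun chunks as candidate mentions (the `en_core_web_sm` pipeline is just an example choice; an NER pipeline could be substituted):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # any English pipeline with a parser works

def mark_mentions(text):
    """Return one copy of the text per noun-phrase candidate, wrapped in [ENT] markers."""
    doc = nlp(text)
    marked = []
    for chunk in doc.noun_chunks:
        marked.append(
            text[:chunk.start_char]
            + f"[ENT] {chunk.text} [ENT]"
            + text[chunk.end_char:]
        )
    return marked

for m in mark_mentions("Tim Cook announced new products during the daytime keynote."):
    print(m)
```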

## Model Details

### Training Data
- **Dataset:** 3 million pairs of Wikipedia anchor text links and Wikipedia page descriptions
- **Source:** Wikipedia anchor links paired with the first few hundred words of their target pages
- **Special Token:** `[ENT]` token added to mark entity mentions
- **Max Sequence Length:** 256 tokens (both mentions and descriptions)

### Training Details
- **Hardware:** Single 80GB H100 GPU
- **Batch Size:** 80
- **Learning Rate:** 1e-5 with a cosine scheduler
- **Loss Function:** Batch-hard triplet loss (margin = 0.4); see the sketch after this list
- **Inspiration:** Meta AI's BLINK and Google's "Learning Dense Representations for Entity Retrieval"
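
The training code is not published here, so the following is only an illustrative sketch of a batch-hard triplet loss over in-batch negatives with margin 0.4. It assumes each mention embedding at index `i` is paired with the entity-description embedding at the same index, with all other rows treated as negatives.

```python
import torch
import torch.nn.functional as F

def batch_hard_triplet_loss(mentions, entities, margin=0.4):
    """Batch-hard triplet loss on cosine similarity.

    mentions, entities: (B, D) tensors where row i of `entities` is the positive
    description for row i of `mentions`; every other row is an in-batch negative.
    """
    m = F.normalize(mentions, p=2, dim=1)
    e = F.normalize(entities, p=2, dim=1)
    sim = m @ e.t()                                   # (B, B) cosine similarities

    pos = sim.diagonal()                              # similarity to the true entity
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    hardest_neg = sim.masked_fill(eye, float("-inf")).max(dim=1).values

    # Push the positive above the hardest in-batch negative by at least `margin`.
    return F.relu(hardest_neg - pos + margin).mean()

# Example with random embeddings
loss = batch_hard_triplet_loss(torch.randn(8, 1024), torch.randn(8, 1024))
print(loss.item())
```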

## Performance

### Benchmark Results
- **Dataset:** Zero-Shot Entity Linking (Logeswaran et al., 2019)
- **Metric:** Recall@64 (see the sketch after this list)
- **Score:** 80.29%
- **Comparison:** Meta AI's BLINK reaches 82.06% on the same test set. That is slightly higher, but BLINK was trained on the benchmark's training set, whereas our model was not.
- **Conclusion:** Our model shows strong zero-shot performance.
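
For context, Recall@64 is the fraction of test mentions whose gold entity appears among the model's 64 highest-scoring candidate descriptions. A minimal way to compute it (a hypothetical helper, not the benchmark's official scorer):

```python
import torch

def recall_at_k(mention_embs, entity_embs, gold_ids, k=64):
    """Fraction of mentions whose gold entity is among the top-k scored candidates."""
    sims = mention_embs @ entity_embs.t()                 # (num_mentions, num_entities)
    topk = sims.topk(min(k, sims.size(1)), dim=1).indices # (num_mentions, k)
    gold = torch.as_tensor(gold_ids, device=topk.device).unsqueeze(1)
    return (topk == gold).any(dim=1).float().mean().item()
```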

### Usage Recommendations
- **Similarity Threshold:** 0.7 for positive matches (based on empirical testing)

## Code Example

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# Pick the best available device
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Using CUDA: {torch.cuda.get_device_name()}")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print("Using MPS (Apple Silicon)")
else:
    device = torch.device("cpu")
    print("Using CPU")

model_name = "GlassLewis/roberta-large-entity-linking"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

model.to(device)

# Verify the special token is there
print('[ENT]' in tokenizer.get_added_vocab())


context = "Tim Cook, [ENT] president [ENT] of Apple, is a guy who lives in California."

definitions = [
    "A president is a leader of an organization, company, community, club, trade union, university or other group.",
    "The president of the United States (POTUS) is the head of state and head of government of the United States.",
    "A class president, also known as a class representative, is usually the leader of a student body class, and presides over its class cabinet or organization within a student council."
]

tokenized_definition = tokenizer(
    definitions,
    truncation=True,
    max_length=256,
    padding='max_length',
    return_tensors='pt'
)

tokenized_context = tokenizer(
    context,
    truncation=True,
    max_length=256,
    padding='max_length',
    return_tensors='pt'
)

# Get embeddings (no gradients needed for inference)
with torch.no_grad():
    embedded_context = model(
        input_ids=tokenized_context["input_ids"].to(device),
        attention_mask=tokenized_context["attention_mask"].to(device)
    )
    embedded_definition = model(
        input_ids=tokenized_definition["input_ids"].to(device),
        attention_mask=tokenized_definition["attention_mask"].to(device)
    )

# Normalize the [CLS]-position embeddings for proper cosine similarity
context_norm = F.normalize(embedded_context.last_hidden_state[:, 0, :], p=2, dim=1)
definition_norm = F.normalize(embedded_definition.last_hidden_state[:, 0, :], p=2, dim=1)

# Calculate cosine similarities
similarities = torch.matmul(context_norm, definition_norm.t())

print("Cosine similarities:")
print(similarities)

print("\nClassification results:")
for i, definition in enumerate(definitions):
    sim_value = similarities[0, i].item()
    print(f"Definition {i+1}: {definition}")
    print(f"Similarity: {sim_value:.4f}\n")
```
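
Applying the 0.7 threshold recommended above, the example can finish with a simple accept/reject step (a minimal continuation of the snippet):

```python
# Continuation of the example above: apply the recommended 0.7 cutoff.
THRESHOLD = 0.7

best_idx = similarities[0].argmax().item()
best_score = similarities[0, best_idx].item()

if best_score >= THRESHOLD:
    print(f"Linked to definition {best_idx + 1} (similarity {best_score:.4f})")
else:
    print(f"No definition passed the {THRESHOLD} threshold (best: {best_score:.4f})")
```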

## Input Format

### Mention Context
- Mark target mentions with `[ENT]` tokens: `"Text with [ENT] entity mention [ENT] in context"`
- Maximum length: 256 tokens

### Entity Descriptions
- Provide entity descriptions (e.g., Wikipedia abstracts)
- Maximum length: 256 tokens

## Limitations and Biases

- **Language:** English only
- **Domain:** Primarily trained on Wikipedia data
- **Bias:** May inherit biases present in Wikipedia content
- **Performance:** Slightly lower than supervised models on in-domain tasks

## References

- Logeswaran et al. (2019). [Zero-Shot Entity Linking by Reading Entity Descriptions](https://arxiv.org/pdf/1906.07348)
- Meta AI BLINK: [GitHub Repository](https://github.com/facebookresearch/BLINK)
- Gillick et al. (2019). Learning Dense Representations for Entity Retrieval (Google)

## Citation

```bibtex
@misc{roberta-large-entity-linking,
  author = {GlassLewis},
  title = {RoBERTa Large Entity Linking},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/GlassLewis/roberta-large-entity-linking}
}
```