---
license: apache-2.0
language:
- en
base_model:
- cisco-ai/SecureBERT2.0-base
pipeline_tag: sentence-similarity
library_name: sentence-transformers
tags:
- IR
- reranking
- securebert
- docembedding
---

# Model Card for cisco-ai/SecureBERT2.0-cross-encoder

The **SecureBERT 2.0 Cross-Encoder** is a cybersecurity domain-specific model fine-tuned from [SecureBERT 2.0](https://huggingface.co/cisco-ai/SecureBERT2.0-base). It computes **pairwise similarity scores** between two texts, enabling use in **text reranking, semantic search, and cybersecurity intelligence retrieval** tasks.

---

## Model Details

### Model Description

- **Developed by:** Cisco AI
- **Model type:** Cross Encoder (Sentence Similarity)
- **Architecture:** ModernBERT (fine-tuned via Sentence Transformers)
- **Max Sequence Length:** 1024 tokens
- **Output Labels:** 1 (similarity score)
- **Language:** English
- **License:** Apache-2.0
- **Finetuned from model:** [cisco-ai/SecureBERT2.0-base](https://huggingface.co/cisco-ai/SecureBERT2.0-base)

## Uses

### Direct Use

- Semantic text similarity in cybersecurity contexts
- Text and code reranking for information retrieval (IR)
- Threat intelligence question–answer relevance scoring
- Cybersecurity report and log correlation

### Downstream Use

Can be integrated into:

- Cyber threat intelligence search engines
- SOC automation pipelines
- Cybersecurity knowledge graph enrichment
- Threat hunting and incident response systems

### Out-of-Scope Use

- Generic text similarity outside the cybersecurity domain
- Tasks requiring generative reasoning or open-domain question answering

---

## Bias, Risks, and Limitations

The model reflects the distribution of cybersecurity-related data used during fine-tuning.
Potential risks include:

- Overrepresentation of specific malware, technologies, or threat actors
- Bias toward technical English sources
- Reduced performance on non-English or mixed technical/natural text

### Recommendations

Users should evaluate results for domain alignment and combine with other retrieval models or heuristic filters when applied to non-cybersecurity contexts.

---

## How to Get Started with the Model

### Using the Sentence Transformers API

#### Install Dependencies

```bash
pip install -U sentence-transformers
```

#### Run Inference

```python
from sentence_transformers import CrossEncoder

# Load the model
model = CrossEncoder("cisco-ai/SecureBERT2.0-cross-encoder")

# Example pairs
pairs = [
    ["How does Stealc malware extract browser data?",
     "Stealc uses Sqlite3 DLL to query browser databases and retrieve cookies, passwords, and history."],
    ["Best practices for post-acquisition cybersecurity integration?",
     "Conduct security assessment, align policies, integrate security technologies, and train employees."],
]

# Compute similarity scores
scores = model.predict(pairs)
print(scores)
```

#### Rank Candidate Responses

```python
query = "How to prevent Kerberoasting attacks?"

candidates = [
    "Implement MFA and privileged access management",
    "Monitor Kerberos tickets for anomalous activity",
    "Apply zero-trust network segmentation",
]

ranking = model.rank(query, candidates)
print(ranking)
```

## Framework Versions

* python: 3.10.10
* sentence_transformers: 5.0.0
* transformers: 4.52.4
* PyTorch: 2.7.0+cu128
* accelerate: 1.9.0
* datasets: 3.6.0

---

## Training Details

### Training Dataset

The model was fine-tuned on a **cybersecurity sentence-pair similarity dataset** for cross-encoder training.
- **Dataset Size:** 35,705 samples
- **Columns:** `sentence1`, `sentence2`, `label`

#### Average Lengths (first 1000 samples)

| Field | Mean Length |
|:------|:-----------:|
| sentence1 | 98.46 |
| sentence2 | 1468.34 |
| label | 1.0 |

#### Example Schema

| Field | Type | Description |
|:------|:------|:------------|
| sentence1 | string | Query or document text |
| sentence2 | string | Paired document or candidate response |
| label | float | Similarity score between the two inputs |

---

### Training Objective and Loss

The model was trained using a **contrastive ranking objective** to learn high-quality similarity scores between cybersecurity-related text pairs.

- **Loss Function:** [CachedMultipleNegativesRankingLoss](https://sbert.net/docs/package_reference/cross_encoder/losses.html#cachedmultiplenegativesrankingloss)

#### Loss Parameters

```json
{
    "scale": 10.0,
    "num_negatives": 10,
    "activation_fn": "torch.nn.modules.activation.Sigmoid",
    "mini_batch_size": 24
}
```

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

The evaluation was performed on a **held-out test set** of cybersecurity-related question–answer pairs and document retrieval tasks.
Data includes:

- Threat intelligence descriptions and related advisories
- Exploit procedure and mitigation text pairs
- Cybersecurity Q&A and incident analysis examples

#### Factors

Evaluation considered multiple aspects of similarity and relevance:

- **Domain diversity:** different cybersecurity subfields (malware, vulnerabilities, network defense)
- **Task diversity:** retrieval, reranking, and relevance scoring
- **Pair length:** from short queries to long technical documents

#### Metrics

The model was evaluated using standard information retrieval metrics:

- **Mean Average Precision (mAP):** measures ranking precision across all retrieved results
- **Recall@1 (R@1):** the proportion of queries whose top-1 result is correct
- **Normalized Discounted Cumulative Gain (NDCG@10):** evaluates ranking quality over the top 10 results
- **Mean Reciprocal Rank (MRR@10):** the average reciprocal rank of the first correct answer within the top 10

### Results

| Model | mAP | R@1 | NDCG@10 | MRR@10 |
|:------|:----:|:---:|:--------:|:--------:|
| **ms-marco-TinyBERT-L2** | 0.920 | 0.849 | 0.964 | 0.955 |
| **SecureBERT 2.0 Cross-Encoder** | **0.955** | **0.948** | **0.986** | **0.983** |

#### Summary

The **SecureBERT 2.0 Cross-Encoder** achieves **state-of-the-art retrieval and ranking performance** on cybersecurity text similarity tasks. Compared to the general-purpose `ms-marco-TinyBERT-L2` baseline, it:

- Improves **mAP** by +0.035
- Achieves nearly perfect **R@1** and **MRR@10**, indicating highly accurate top-1 retrieval
- Shows the strongest **NDCG@10**, reflecting excellent ranking quality across top results

These results confirm that **domain-specific pretraining and fine-tuning** substantially enhance semantic understanding and information retrieval in cybersecurity applications.
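For reference, the MRR@10 and NDCG@10 metrics reported above follow standard IR definitions. The sketch below (illustrative only; the function names and toy relevance lists are our own, not part of the evaluation code) computes both for binary relevance, where each inner list holds 1/0 relevance labels of candidates after sorting by cross-encoder score:

```python
import math

def mrr_at_k(ranked_relevance, k=10):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit per query."""
    total = 0.0
    for rels in ranked_relevance:
        for rank, rel in enumerate(rels[:k], start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)

def ndcg_at_k(ranked_relevance, k=10):
    """NDCG@k for binary relevance: DCG of the ranking divided by the ideal DCG."""
    total = 0.0
    for rels in ranked_relevance:
        dcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(rels[:k], start=1))
        ideal = sorted(rels, reverse=True)
        idcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(ideal[:k], start=1))
        total += dcg / idcg if idcg > 0 else 0.0
    return total / len(ranked_relevance)

# Toy example: two queries, three ranked candidates each
runs = [
    [0, 1, 0],  # correct answer ranked second -> reciprocal rank 0.5
    [1, 0, 0],  # correct answer ranked first  -> reciprocal rank 1.0
]
print(mrr_at_k(runs))   # -> 0.75
print(ndcg_at_k(runs))
```

In practice, the relevance lists would come from sorting each query's candidates by `model.predict` scores and checking which candidate is the labeled answer.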
---

## Citation

**BibTeX:**

```bibtex
@article{aghaei2025securebert,
  title={SecureBERT 2.0: Advanced Language Model for Cybersecurity Intelligence},
  author={Aghaei, Ehsan and Jain, Sarthak and Arun, Prashanth and Sambamoorthy, Arjun},
  journal={arXiv preprint arXiv:2510.00240},
  year={2025}
}
```

## Model Card Authors

Cisco AI

## Model Card Contact

For inquiries, please contact [ai-threat-intel@cisco.com](mailto:ai-threat-intel@cisco.com).