---
license: apache-2.0
language:
- en
base_model:
- cisco-ai/SecureBERT2.0-base
pipeline_tag: sentence-similarity
library_name: sentence-transformers
tags:
- IR
- reranking
- securebert
- docembedding
---

# Model Card for cisco-ai/SecureBERT2.0-cross-encoder

The **SecureBERT 2.0 Cross-Encoder** is a cybersecurity domain-specific model fine-tuned from [SecureBERT 2.0](https://huggingface.co/cisco-ai/SecureBERT2.0-base). It computes **pairwise similarity scores** between two texts, enabling use in **text reranking, semantic search, and cybersecurity intelligence retrieval** tasks.

---

## Model Details

### Model Description

- **Developed by:** Cisco AI
- **Model type:** Cross Encoder (Sentence Similarity)
- **Architecture:** ModernBERT (fine-tuned via Sentence Transformers)
- **Max Sequence Length:** 1024 tokens
- **Output Labels:** 1 (similarity score)
- **Language:** English
- **License:** Apache-2.0
- **Finetuned from model:** [cisco-ai/SecureBERT2.0-base](https://huggingface.co/cisco-ai/SecureBERT2.0-base)

## Uses

### Direct Use

- Semantic text similarity in cybersecurity contexts
- Text and code reranking for information retrieval (IR)
- Threat intelligence question–answer relevance scoring
- Cybersecurity report and log correlation

### Downstream Use

Can be integrated into:

- Cyber threat intelligence search engines
- SOC automation pipelines
- Cybersecurity knowledge graph enrichment
- Threat hunting and incident response systems

### Out-of-Scope Use

- Generic text similarity outside the cybersecurity domain
- Tasks requiring generative reasoning or open-domain question answering

---

## Bias, Risks, and Limitations

The model reflects the distribution of cybersecurity-related data used during fine-tuning.
Potential risks include:

- Overrepresentation of specific malware, technologies, or threat actors
- Bias toward technical English sources
- Reduced performance on non-English or mixed technical/natural text

### Recommendations

Users should evaluate results for domain alignment and combine with other retrieval models or heuristic filters when applied to non-cybersecurity contexts.

---

## How to Get Started with the Model

### Using the Sentence Transformers API

#### Install Dependencies

```bash
pip install -U sentence-transformers
```

#### Run Inference

```python
from sentence_transformers import CrossEncoder

# Load the model
model = CrossEncoder("cisco-ai/SecureBERT2.0-cross-encoder")

# Example pairs
pairs = [
    ["How does Stealc malware extract browser data?",
     "Stealc uses Sqlite3 DLL to query browser databases and retrieve cookies, passwords, and history."],
    ["Best practices for post-acquisition cybersecurity integration?",
     "Conduct security assessment, align policies, integrate security technologies, and train employees."],
]

# Compute similarity scores
scores = model.predict(pairs)
print(scores)
```

#### Rank Candidate Responses

```python
query = "How to prevent Kerberoasting attacks?"

candidates = [
    "Implement MFA and privileged access management",
    "Monitor Kerberos tickets for anomalous activity",
    "Apply zero-trust network segmentation",
]

ranking = model.rank(query, candidates)
print(ranking)
```

## Framework Versions

* python: 3.10.10
* sentence_transformers: 5.0.0
* transformers: 4.52.4
* PyTorch: 2.7.0+cu128
* accelerate: 1.9.0
* datasets: 3.6.0

---

## Training Details

### Training Dataset

The model was fine-tuned on a **cybersecurity sentence-pair similarity dataset** for cross-encoder training.
- **Dataset Size:** 35,705 samples
- **Columns:** `sentence1`, `sentence2`, `label`

#### Average Lengths (first 1000 samples)

| Field | Mean Length |
|:------|:-----------:|
| sentence1 | 98.46 |
| sentence2 | 1468.34 |
| label | 1.0 |

#### Example Schema

| Field | Type | Description |
|:------|:------|:------------|
| sentence1 | string | Query or document text |
| sentence2 | string | Paired document or candidate response |
| label | float | Similarity score between the two inputs |

---

### Training Objective and Loss

The model was trained using a **contrastive ranking objective** to learn high-quality similarity scores between cybersecurity-related text pairs.

- **Loss Function:** [CachedMultipleNegativesRankingLoss](https://sbert.net/docs/package_reference/cross_encoder/losses.html#cachedmultiplenegativesrankingloss)

#### Loss Parameters

```json
{
    "scale": 10.0,
    "num_negatives": 10,
    "activation_fn": "torch.nn.modules.activation.Sigmoid",
    "mini_batch_size": 24
}
```

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

The evaluation was performed on a **held-out test set** of cybersecurity-related question–answer pairs and document retrieval tasks.
Data includes:

- Threat intelligence descriptions and related advisories
- Exploit procedure and mitigation text pairs
- Cybersecurity Q&A and incident analysis examples

#### Factors

Evaluation considered multiple aspects of similarity and relevance:

- **Domain diversity:** different cybersecurity subfields (malware, vulnerabilities, network defense)
- **Task diversity:** retrieval, reranking, and relevance scoring
- **Pair length:** from short queries to long technical documents

#### Metrics

The model was evaluated using standard information retrieval metrics:

- **Mean Average Precision (mAP):** measures ranking precision across all retrieved results
- **Recall@1 (R@1):** the proportion of queries whose top-1 result is correct
- **Normalized Discounted Cumulative Gain (NDCG@10):** evaluates ranking quality over the top 10 results
- **Mean Reciprocal Rank (MRR@10):** the average reciprocal rank of the first correct answer within the top 10

### Results

| Model | mAP | R@1 | NDCG@10 | MRR@10 |
|:------|:----:|:---:|:--------:|:--------:|
| **ms-marco-TinyBERT-L2** | 0.920 | 0.849 | 0.964 | 0.955 |
| **SecureBERT 2.0 Cross-Encoder** | **0.955** | **0.948** | **0.986** | **0.983** |

#### Summary

The **SecureBERT 2.0 Cross-Encoder** achieves **state-of-the-art retrieval and ranking performance** on cybersecurity text similarity tasks. Compared to the general-purpose `ms-marco-TinyBERT-L2` baseline, it:

- Improves **mAP** by +0.035
- Achieves nearly perfect **R@1** and **MRR@10**, indicating highly accurate top-1 retrieval
- Shows the strongest **NDCG@10**, reflecting excellent ranking quality across top results

These results confirm that **domain-specific pretraining and fine-tuning** substantially enhance semantic understanding and information retrieval in cybersecurity applications.
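For reference, the MRR@10 and NDCG@10 metrics reported above follow standard IR definitions. The sketch below (illustrative only; the function names and toy relevance lists are our own, not part of the evaluation code) computes both for binary relevance, where each inner list holds 1/0 relevance labels of candidates after sorting by cross-encoder score:

```python
import math

def mrr_at_k(ranked_relevance, k=10):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit per query."""
    total = 0.0
    for rels in ranked_relevance:
        for rank, rel in enumerate(rels[:k], start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)

def ndcg_at_k(ranked_relevance, k=10):
    """NDCG@k for binary relevance: DCG of the ranking divided by the ideal DCG."""
    total = 0.0
    for rels in ranked_relevance:
        dcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(rels[:k], start=1))
        ideal = sorted(rels, reverse=True)
        idcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(ideal[:k], start=1))
        total += dcg / idcg if idcg > 0 else 0.0
    return total / len(ranked_relevance)

# Toy example: two queries, three ranked candidates each
runs = [
    [0, 1, 0],  # correct answer ranked second -> reciprocal rank 0.5
    [1, 0, 0],  # correct answer ranked first  -> reciprocal rank 1.0
]
print(mrr_at_k(runs))   # -> 0.75
print(ndcg_at_k(runs))
```

In practice, the relevance lists would come from sorting each query's candidates by `model.predict` scores and checking which candidate is the labeled answer.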
---

## Citation

**BibTeX:**

```bibtex
@article{aghaei2025securebert,
  title={SecureBERT 2.0: Advanced Language Model for Cybersecurity Intelligence},
  author={Aghaei, Ehsan and Jain, Sarthak and Arun, Prashanth and Sambamoorthy, Arjun},
  journal={arXiv preprint arXiv:2510.00240},
  year={2025}
}
```

## Model Card Authors

Cisco AI

## Model Card Contact

For inquiries, please contact [ai-threat-intel@cisco.com](mailto:ai-threat-intel@cisco.com).