SentenceTransformer based on sentence-transformers/all-MiniLM-L6-v2

This is a sentence-transformers model finetuned from sentence-transformers/all-MiniLM-L6-v2. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: sentence-transformers/all-MiniLM-L6-v2
  • Maximum Sequence Length: 256 tokens
  • Output Dimensionality: 384 dimensions
  • Similarity Function: Cosine Similarity

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
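
The Pooling and Normalize stages listed above can be illustrated with a small, dependency-free sketch. It is not the library's implementation: the helper names `mean_pool` and `l2_normalize` are hypothetical, and the toy vectors have 3 dimensions rather than the model's 384. It only shows what mean pooling over non-padding tokens followed by L2 normalization computes.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping padding positions (mask == 0)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    # Guard against an all-padding input to avoid dividing by zero.
    count = max(count, 1)
    return [value / count for value in summed]

def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit length, so dot products equal cosine similarity."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / max(norm, eps) for v in vector]

# Toy "token embeddings" (assumption: 3 dimensions for readability).
tokens = [[1.0, 2.0, 2.0], [3.0, 0.0, 2.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]  # the last position is padding and is ignored
pooled = mean_pool(tokens, mask)
sentence_embedding = l2_normalize(pooled)
```

Because the final stage normalizes every embedding to unit length, cosine similarity between two outputs reduces to a plain dot product.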

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("ayushexel/embed-all-MiniLM-L6-v2-squad-1-epochs")
# Run inference
sentences = [
    'What does this eliminate?',
    'The successful outcome of antimicrobial therapy with antibacterial compounds depends on several factors. These include host defense mechanisms, the location of infection, and the pharmacokinetic and pharmacodynamic properties of the antibacterial. A bactericidal activity of antibacterials may depend on the bacterial growth phase, and it often requires ongoing metabolic activity and division of bacterial cells. These findings are based on laboratory studies, and in clinical settings have also been shown to eliminate bacterial infection. Since the activity of antibacterials depends frequently on its concentration, in vitro characterization of antibacterial activity commonly includes the determination of the minimum inhibitory concentration and minimum bactericidal concentration of an antibacterial. To predict clinical outcome, the antimicrobial activity of an antibacterial is usually combined with its pharmacokinetic profile, and several pharmacological parameters are used as markers of drug efficacy.',
    'There were 13 finalists this season, but two were eliminated in the first result show of the finals. A new feature introduced was the "Judges\' Save", and Matt Giraud was saved from elimination at the top seven by the judges when he received the fewest votes. The next week, Lil Rounds and Anoop Desai were eliminated.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 384)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])

Evaluation

Metrics

Triplet

Metric Value
cosine_accuracy 0.3988
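
As a rough illustration of what a triplet cosine accuracy measures, the sketch below counts the fraction of (anchor, positive, negative) triplets in which the anchor is more similar to the positive than to the negative. It assumes a strict comparison with no margin; the function names and toy vectors are hypothetical and are not taken from the evaluator's source.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def triplet_cosine_accuracy(triplets):
    """Fraction of triplets where sim(anchor, positive) > sim(anchor, negative)."""
    correct = sum(
        1
        for anchor, positive, negative in triplets
        if cosine(anchor, positive) > cosine(anchor, negative)
    )
    return correct / len(triplets)

# Toy 2-dimensional embeddings (assumption: real embeddings have 384 dims).
toy_triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),  # positive is closer: counted
    ([1.0, 0.0], [0.0, 1.0], [1.0, 0.1]),  # negative is closer: not counted
]
accuracy = triplet_cosine_accuracy(toy_triplets)
```

Under this strict comparison, a triplet whose negative scores exactly the same as its positive is not counted as correct.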

Training Details

Training Dataset

Unnamed Dataset

  • Size: 44,288 training samples
  • Columns: question, context, and negative
  • Approximate statistics based on the first 1000 samples:
    • question (string): min 7 tokens, mean 14.66 tokens, max 48 tokens
    • context (string): min 28 tokens, mean 148.92 tokens, max 256 tokens
    • negative (string): min 32 tokens, mean 152.77 tokens, max 256 tokens
  • Samples:
    • question: What two people claim the title of Education Minister was often seen next to Tai Situ Changchub Gyaltsen's name in Tibetan texts?
      context: Wang and Nyima state that after the official title "Education Minister" was granted to Tai Situ Changchub Gyaltsen (1302–1364) by the Yuan court, this title appeared frequently with his name in various Tibetan texts, while his Tibetan title "Degsi" (sic properly sde-srid or desi) is seldom mentioned. Wang and Nyima take this to mean that "even in the later period of the Yuan dynasty, the Yuan imperial court and the Phagmodrupa Dynasty maintained a Central-local government relation." The Tai Situpa is even supposed to have written in his will: "In the past I received loving care from the emperor in the east. If the emperor continues to care for us, please follow his edicts and the imperial envoy should be well received."
      negative: The Ming court appointed three Princes of Dharma (法王) and five Princes (王), and granted many other titles, such as Grand State Tutors (大國師) and State Tutors (國師), to the important schools of Tibetan Buddhism, including the Karma Kagyu, Sakya, and Gelug. According to Wang Jiawei and Nyima Gyaincain, leading officials of these organs were all appointed by the central government and were subject to the rule of law. Yet Van Praag describes the distinct and long-lasting Tibetan law code established by the Phagmodru ruler Tai Situ Changchub Gyaltsen as one of many reforms to revive old Imperial Tibetan traditions.
    • question: From what dialect is Hindi descended?
      context: Sanskrit has greatly influenced the languages of India that grew from its vocabulary and grammatical base; for instance, Hindi is a "Sanskritised register" of the Khariboli dialect. All modern Indo-Aryan languages, as well as Munda and Dravidian languages, have borrowed many words either directly from Sanskrit (tatsama words), or indirectly via middle Indo-Aryan languages (tadbhava words). Words originating in Sanskrit are estimated at roughly fifty percent of the vocabulary of modern Indo-Aryan languages, as well as the literary forms of Malayalam and Kannada. Literary texts in Telugu are lexically Sanskrit or Sanskritised to an enormous extent, perhaps seventy percent or more.
      negative: The status of "language" is not solely determined by linguistic criteria, but it is also the result of a historical and political development. Romansh came to be a written language, and therefore it is recognized as a language, even though it is very close to the Lombardic alpine dialects. An opposite example is the case of Chinese, whose variations such as Mandarin and Cantonese are often called dialects and not languages, despite their mutual unintelligibility.
    • question: Through what language did bitumen pass to reach English?
      context: The expression "bitumen" originated in the Sanskrit, where we find the words jatu, meaning "pitch," and jatu-krit, meaning "pitch creating", "pitch producing" (referring to coniferous or resinous trees). The Latin equivalent is claimed by some to be originally gwitu-men (pertaining to pitch), and by others, pixtumens (exuding or bubbling pitch), which was subsequently shortened to bitumen, thence passing via French into English. From the same root is derived the Anglo Saxon word cwidu (mastix), the German word Kitt (cement or mastic) and the old Norse word kvada.
      negative: Old English contained a certain number of loanwords from Latin, which was the scholarly and diplomatic lingua franca of Western Europe. It is sometimes possible to give approximate dates for the borrowing of individual Latin words based on which patterns of sound change they have undergone. Some Latin words had already been borrowed into the Germanic languages before the ancestral Angles and Saxons left continental Europe for Britain. More entered the language when the Anglo-Saxons were converted to Christianity and Latin-speaking priests became influential. It was also through Irish Christian missionaries that the Latin alphabet was introduced and adapted for the writing of Old English, replacing the earlier runic system. Nonetheless, the largest transfer of Latin-based (mainly Old French) words into English occurred after the Norman Conquest of 1066, and thus in the Middle English rather than the Old English period.
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
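
To make the two parameters concrete, here is a minimal pure-Python sketch of a multiple-negatives ranking objective with `scale = 20.0` and cosine similarity. It is an illustration under stated assumptions, not the library's code: the function names are hypothetical, and it assumes that for each question the matching context is the correct candidate while every other context and every hard negative in the batch is treated as incorrect.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mnrl_loss(anchors, positives, negatives, scale=20.0):
    """Mean cross-entropy over scaled cosine similarities.

    For anchor i the target candidate is positives[i]; all other positives
    and all hard negatives act as in-batch negatives.
    """
    candidates = positives + negatives
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, c) for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

# Toy 2-dimensional batch (assumption: real embeddings have 384 dims).
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
negatives = [[-1.0, 0.0], [0.0, -1.0]]

loss_aligned = mnrl_loss(anchors, positives, negatives)
# Swapping the positives makes each anchor prefer the wrong candidate.
loss_swapped = mnrl_loss(anchors, positives[::-1], negatives)
```

The scale acts as an inverse temperature: cosine similarities lie in [-1, 1], so multiplying by 20 sharpens the softmax before the cross-entropy is taken.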
    

Evaluation Dataset

Unnamed Dataset

  • Size: 5,000 evaluation samples
  • Columns: question, context, and negative_1
  • Approximate statistics based on the first 1000 samples:
    • question (string): min 6 tokens, mean 14.53 tokens, max 36 tokens
    • context (string): min 28 tokens, mean 149.15 tokens, max 256 tokens
    • negative_1 (string): min 28 tokens, mean 147.07 tokens, max 256 tokens
  • Samples:
    • question: On what coast of Costa Rica is Limón Creole English spoken?
      context: There is no universally accepted criterion for distinguishing two different languages from two dialects (i.e. varieties) of the same language. A number of rough measures exist, sometimes leading to contradictory results. The distinction is therefore subjective and depends on the user's frame of reference. For example, there is discussion about if the Limón Creole English must be considered as "a kind" of English or a different language. This creole is spoken in the Caribbean coast of Costa Rica (Central America) by descendant of Jamaican people. The position that Costa Rican linguists support depends on the University they belong.
      negative_1: There is no universally accepted criterion for distinguishing two different languages from two dialects (i.e. varieties) of the same language. A number of rough measures exist, sometimes leading to contradictory results. The distinction is therefore subjective and depends on the user's frame of reference. For example, there is discussion about if the Limón Creole English must be considered as "a kind" of English or a different language. This creole is spoken in the Caribbean coast of Costa Rica (Central America) by descendant of Jamaican people. The position that Costa Rican linguists support depends on the University they belong.
    • question: How many companies did Apple promise were develping products for the new computer?
      context: Jobs stated during the Macintosh's introduction "we expect Macintosh to become the third industry standard", after the Apple II and IBM PC. Although outselling every other computer, it did not meet expectations during the first year, especially among business customers. Only about ten applications including MacWrite and MacPaint were widely available, although many non-Apple software developers participated in the introduction and Apple promised that 79 companies including Lotus, Digital Research, and Ashton-Tate were creating products for the new computer. After one year, it had less than one quarter of the software selection available compared to the IBM PC—including only one word processor, two databases, and one spreadsheet—although Apple had sold 280,000 Macintoshes compared to IBM's first year sales of fewer than 100,000 PCs.
      negative_1: Jobs stated during the Macintosh's introduction "we expect Macintosh to become the third industry standard", after the Apple II and IBM PC. Although outselling every other computer, it did not meet expectations during the first year, especially among business customers. Only about ten applications including MacWrite and MacPaint were widely available, although many non-Apple software developers participated in the introduction and Apple promised that 79 companies including Lotus, Digital Research, and Ashton-Tate were creating products for the new computer. After one year, it had less than one quarter of the software selection available compared to the IBM PC—including only one word processor, two databases, and one spreadsheet—although Apple had sold 280,000 Macintoshes compared to IBM's first year sales of fewer than 100,000 PCs.
    • question: What s Boston sometimes called?
      context: Boston is sometimes called a "city of neighborhoods" because of the profusion of diverse subsections; the city government's Office of Neighborhood Services has officially designated 23 neighborhoods.
      negative_1: In 1822, the citizens of Boston voted to change the official name from "the Town of Boston" to "the City of Boston", and on March 4, 1822, the people of Boston accepted the charter incorporating the City. At the time Boston was chartered as a city, the population was about 46,226, while the area of the city was only 4.7 square miles (12 km2).
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • num_train_epochs: 1
  • warmup_ratio: 0.1
  • fp16: True
  • batch_sampler: no_duplicates
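
The settings above, combined with the training-set size of 44,288 samples, imply a specific number of optimizer steps. The arithmetic below is a sketch that assumes a single training device and that warmup steps are obtained by rounding the ratio up; the variable names are illustrative only.

```python
import math

train_samples = 44_288   # size of the training dataset
batch_size = 128         # per_device_train_batch_size
num_epochs = 1           # num_train_epochs
warmup_ratio = 0.1       # warmup_ratio
num_devices = 1          # assumption: training ran on a single device

# One optimizer step per batch (gradient_accumulation_steps is 1).
steps_per_epoch = math.ceil(train_samples / (batch_size * num_devices))
total_steps = steps_per_epoch * num_epochs

# Assumption: the warmup length is the ratio times total steps, rounded up.
warmup_steps = math.ceil(total_steps * warmup_ratio)

# Fraction of the epoch completed after 100 steps.
epoch_at_step_100 = round(100 / steps_per_epoch, 4)
```

Under these assumptions the run has 346 steps, of which 35 are warmup, and step 100 corresponds to epoch 0.289, which agrees with the 0.2890 reported for step 100 in the training logs.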

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 1
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • tp_size: 0
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch    Step   Training Loss   Validation Loss   gooqa-dev_cosine_accuracy
-1       -1     -               -                 0.3266
0.2890   100    0.4117          0.7775            0.3962
0.5780   200    0.391           0.7681            0.3882
0.8671   300    0.3771          0.7492            0.4018
-1       -1     -               -                 0.3988

Framework Versions

  • Python: 3.11.0
  • Sentence Transformers: 4.0.1
  • Transformers: 4.50.3
  • PyTorch: 2.6.0+cu124
  • Accelerate: 1.5.2
  • Datasets: 3.5.0
  • Tokenizers: 0.21.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Model size: 22.7M params (Safetensors, tensor type F32)