SentenceTransformer based on sentence-transformers/all-MiniLM-L6-v2

This is a sentence-transformers model finetuned from sentence-transformers/all-MiniLM-L6-v2. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Type: Sentence Transformer
Base model: sentence-transformers/all-MiniLM-L6-v2
Maximum Sequence Length: 256 tokens
Output Dimensionality: 384 dimensions
Similarity Function: Cosine Similarity

Model Sources

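Documentation: Sentence Transformers Documentation (https://www.sbert.net)
Repository: Sentence Transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)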

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
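
The Pooling and Normalize modules listed above can be sketched in a few lines of NumPy. This is only an illustration of mean pooling followed by L2 normalization, not the library's implementation; the function name `mean_pool_and_normalize` and the random toy inputs are assumptions made for the example.

```python
import numpy as np

def mean_pool_and_normalize(token_embeddings, attention_mask):
    """Average token vectors over non-padding positions, then L2-normalize.

    token_embeddings: (batch, seq_len, dim) array of per-token vectors.
    attention_mask:   (batch, seq_len) array of 0/1 flags (1 = real token).
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)       # padding contributes 0
    counts = np.clip(mask.sum(axis=1), 1e-9, None)       # avoid division by zero
    pooled = summed / counts                             # mean over real tokens
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.clip(norms, 1e-12, None)          # unit-length vectors

# Toy stand-in for the transformer output: 2 sentences, 5 tokens, 384 dims.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 5, 384)).astype(np.float32)
mask = np.array([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
sentence_embeddings = mean_pool_and_normalize(tokens, mask)
print(sentence_embeddings.shape)  # (2, 384)
```

Because the output vectors have unit length, their dot product equals their cosine similarity.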

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("ayushexel/embed-all-MiniLM-L6-v2-squad-1-epochs")
# Run inference
sentences = [
    'What does this eliminate?',
    'The successful outcome of antimicrobial therapy with antibacterial compounds depends on several factors. These include host defense mechanisms, the location of infection, and the pharmacokinetic and pharmacodynamic properties of the antibacterial. A bactericidal activity of antibacterials may depend on the bacterial growth phase, and it often requires ongoing metabolic activity and division of bacterial cells. These findings are based on laboratory studies, and in clinical settings have also been shown to eliminate bacterial infection. Since the activity of antibacterials depends frequently on its concentration, in vitro characterization of antibacterial activity commonly includes the determination of the minimum inhibitory concentration and minimum bactericidal concentration of an antibacterial. To predict clinical outcome, the antimicrobial activity of an antibacterial is usually combined with its pharmacokinetic profile, and several pharmacological parameters are used as markers of drug efficacy.',
    'There were 13 finalists this season, but two were eliminated in the first result show of the finals. A new feature introduced was the "Judges\' Save", and Matt Giraud was saved from elimination at the top seven by the judges when he received the fewest votes. The next week, Lil Rounds and Anoop Desai were eliminated.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 384)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
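
For semantic search, the first sentence in the example above would act as the query and the remaining ones as candidate passages. The sketch below shows one way to rank passages by cosine similarity with plain NumPy; `rank_by_cosine` is a hypothetical helper and the three-dimensional vectors are toy values, not model output.

```python
import numpy as np

def rank_by_cosine(query_embedding, corpus_embeddings, top_k=3):
    """Return (index, score) pairs for the top_k most similar corpus rows."""
    q = query_embedding / np.linalg.norm(query_embedding)
    c = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
    scores = c @ q                          # cosine similarity per corpus row
    order = np.argsort(-scores)[:top_k]     # highest score first
    return [(int(i), float(scores[i])) for i in order]

# Toy vectors standing in for model.encode(...) output.
toy_query = np.array([1.0, 0.0, 0.0])
toy_corpus = np.array([[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]])
ranking = rank_by_cosine(toy_query, toy_corpus, top_k=2)
print(ranking)
```

With real embeddings, the call would look like `rank_by_cosine(embeddings[0], embeddings[1:])`.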

Evaluation

Metrics

Triplet

Dataset: gooqa-dev
Evaluated with TripletEvaluator

Metric	Value
cosine_accuracy	0.3988
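
As a rough illustration of what a triplet cosine accuracy measures: it is the share of (anchor, positive, negative) triplets in which the anchor is more similar to the positive than to the negative. The sketch below assumes a strict comparison and uses toy two-dimensional vectors; `cosine_accuracy` here is a hypothetical helper, not the TripletEvaluator code.

```python
import numpy as np

def cosine_accuracy(anchors, positives, negatives):
    """Fraction of triplets where cos(anchor, positive) > cos(anchor, negative)."""
    def row_cosine(a, b):
        a = a / np.linalg.norm(a, axis=1, keepdims=True)
        b = b / np.linalg.norm(b, axis=1, keepdims=True)
        return (a * b).sum(axis=1)
    wins = row_cosine(anchors, positives) > row_cosine(anchors, negatives)
    return float(np.mean(wins))

# Two toy triplets: the first is ranked correctly, the second is not.
toy_anchors = np.array([[1.0, 0.0], [0.0, 1.0]])
toy_positives = np.array([[0.9, 0.1], [1.0, 0.0]])
toy_negatives = np.array([[0.0, 1.0], [0.1, 0.9]])
print(cosine_accuracy(toy_anchors, toy_positives, toy_negatives))  # 0.5
```

In this sketch the comparison is strict, so a triplet whose negative is identical to its positive can never count as correct; whether TripletEvaluator behaves the same way is worth checking against the library source.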

Training Details

Training Dataset

Unnamed Dataset

Size: 44,288 training samples
Columns: question, context, and negative

Approximate statistics based on the first 1000 samples:

	question	context	negative
type	string	string	string
details	min: 7 tokens, mean: 14.66 tokens, max: 48 tokens	min: 28 tokens, mean: 148.92 tokens, max: 256 tokens	min: 32 tokens, mean: 152.77 tokens, max: 256 tokens

Samples:

question	context	negative
`What two people claim the title of Education Minister was often seen next to Tai Situ Changchub Gyaltsen's name in Tibetan texts?`	Wang and Nyima state that after the official title "Education Minister" was granted to Tai Situ Changchub Gyaltsen (1302–1364) by the Yuan court, this title appeared frequently with his name in various Tibetan texts, while his Tibetan title "Degsi" (sic properly sde-srid or desi) is seldom mentioned. Wang and Nyima take this to mean that "even in the later period of the Yuan dynasty, the Yuan imperial court and the Phagmodrupa Dynasty maintained a Central-local government relation." The Tai Situpa is even supposed to have written in his will: "In the past I received loving care from the emperor in the east. If the emperor continues to care for us, please follow his edicts and the imperial envoy should be well received."	The Ming court appointed three Princes of Dharma (法王) and five Princes (王), and granted many other titles, such as Grand State Tutors (大國師) and State Tutors (國師), to the important schools of Tibetan Buddhism, including the Karma Kagyu, Sakya, and Gelug. According to Wang Jiawei and Nyima Gyaincain, leading officials of these organs were all appointed by the central government and were subject to the rule of law. Yet Van Praag describes the distinct and long-lasting Tibetan law code established by the Phagmodru ruler Tai Situ Changchub Gyaltsen as one of many reforms to revive old Imperial Tibetan traditions.
`From what dialect is Hindi descended?`	Sanskrit has greatly influenced the languages of India that grew from its vocabulary and grammatical base; for instance, Hindi is a "Sanskritised register" of the Khariboli dialect. All modern Indo-Aryan languages, as well as Munda and Dravidian languages, have borrowed many words either directly from Sanskrit (tatsama words), or indirectly via middle Indo-Aryan languages (tadbhava words). Words originating in Sanskrit are estimated at roughly fifty percent of the vocabulary of modern Indo-Aryan languages, as well as the literary forms of Malayalam and Kannada. Literary texts in Telugu are lexically Sanskrit or Sanskritised to an enormous extent, perhaps seventy percent or more.	`The status of "language" is not solely determined by linguistic criteria, but it is also the result of a historical and political development. Romansh came to be a written language, and therefore it is recognized as a language, even though it is very close to the Lombardic alpine dialects. An opposite example is the case of Chinese, whose variations such as Mandarin and Cantonese are often called dialects and not languages, despite their mutual unintelligibility.`
`Through what language did bitumen pass to reach English?`	The expression "bitumen" originated in the Sanskrit, where we find the words jatu, meaning "pitch," and jatu-krit, meaning "pitch creating", "pitch producing" (referring to coniferous or resinous trees). The Latin equivalent is claimed by some to be originally gwitu-men (pertaining to pitch), and by others, pixtumens (exuding or bubbling pitch), which was subsequently shortened to bitumen, thence passing via French into English. From the same root is derived the Anglo Saxon word cwidu (mastix), the German word Kitt (cement or mastic) and the old Norse word kvada.	Old English contained a certain number of loanwords from Latin, which was the scholarly and diplomatic lingua franca of Western Europe. It is sometimes possible to give approximate dates for the borrowing of individual Latin words based on which patterns of sound change they have undergone. Some Latin words had already been borrowed into the Germanic languages before the ancestral Angles and Saxons left continental Europe for Britain. More entered the language when the Anglo-Saxons were converted to Christianity and Latin-speaking priests became influential. It was also through Irish Christian missionaries that the Latin alphabet was introduced and adapted for the writing of Old English, replacing the earlier runic system. Nonetheless, the largest transfer of Latin-based (mainly Old French) words into English occurred after the Norman Conquest of 1066, and thus in the Middle English rather than the Old English period.

Loss: MultipleNegativesRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}
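
To make the two parameters concrete, here is a minimal NumPy sketch of a multiple-negatives ranking objective: cosine similarities between each question and every candidate passage in the batch are multiplied by `scale`, and cross-entropy pushes each question toward its own context. The function name `mnr_loss` and the toy inputs are assumptions; the library's own loss operates on PyTorch tensors and may differ in details.

```python
import numpy as np

def mnr_loss(anchors, positives, negatives=None, scale=20.0):
    """Cross-entropy over scaled cosine similarities with in-batch negatives.

    Row i of `anchors` should match row i of `positives`; all other
    positives (and any explicit negatives) act as wrong candidates.
    """
    candidates = positives if negatives is None else np.concatenate(
        [positives, negatives], axis=0)
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    logits = scale * (a @ c.T)                       # (batch, num_candidates)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(anchors))
    return float(-log_probs[idx, idx].mean())        # correct class is column i

# Toy check: perfectly aligned pairs give a near-zero loss,
# while shuffled pairs give a loss close to `scale`.
eye = np.eye(3)
aligned_loss = mnr_loss(eye, eye)
shuffled_loss = mnr_loss(eye, np.roll(eye, 1, axis=0))
print(aligned_loss, shuffled_loss)
```

A larger `scale` sharpens the softmax, so small differences in cosine similarity produce larger gradients.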

Evaluation Dataset

Unnamed Dataset

Size: 5,000 evaluation samples
Columns: question, context, and negative_1

Approximate statistics based on the first 1000 samples:

	question	context	negative_1
type	string	string	string
details	min: 6 tokens, mean: 14.53 tokens, max: 36 tokens	min: 28 tokens, mean: 149.15 tokens, max: 256 tokens	min: 28 tokens, mean: 147.07 tokens, max: 256 tokens

Samples:

question	context	negative_1
`On what coast of Costa Rica is Limón Creole English spoken?`	There is no universally accepted criterion for distinguishing two different languages from two dialects (i.e. varieties) of the same language. A number of rough measures exist, sometimes leading to contradictory results. The distinction is therefore subjective and depends on the user's frame of reference. For example, there is discussion about if the Limón Creole English must be considered as "a kind" of English or a different language. This creole is spoken in the Caribbean coast of Costa Rica (Central America) by descendant of Jamaican people. The position that Costa Rican linguists support depends on the University they belong.	There is no universally accepted criterion for distinguishing two different languages from two dialects (i.e. varieties) of the same language. A number of rough measures exist, sometimes leading to contradictory results. The distinction is therefore subjective and depends on the user's frame of reference. For example, there is discussion about if the Limón Creole English must be considered as "a kind" of English or a different language. This creole is spoken in the Caribbean coast of Costa Rica (Central America) by descendant of Jamaican people. The position that Costa Rican linguists support depends on the University they belong.
`How many companies did Apple promise were develping products for the new computer?`	Jobs stated during the Macintosh's introduction "we expect Macintosh to become the third industry standard", after the Apple II and IBM PC. Although outselling every other computer, it did not meet expectations during the first year, especially among business customers. Only about ten applications including MacWrite and MacPaint were widely available, although many non-Apple software developers participated in the introduction and Apple promised that 79 companies including Lotus, Digital Research, and Ashton-Tate were creating products for the new computer. After one year, it had less than one quarter of the software selection available compared to the IBM PC—including only one word processor, two databases, and one spreadsheet—although Apple had sold 280,000 Macintoshes compared to IBM's first year sales of fewer than 100,000 PCs.	Jobs stated during the Macintosh's introduction "we expect Macintosh to become the third industry standard", after the Apple II and IBM PC. Although outselling every other computer, it did not meet expectations during the first year, especially among business customers. Only about ten applications including MacWrite and MacPaint were widely available, although many non-Apple software developers participated in the introduction and Apple promised that 79 companies including Lotus, Digital Research, and Ashton-Tate were creating products for the new computer. After one year, it had less than one quarter of the software selection available compared to the IBM PC—including only one word processor, two databases, and one spreadsheet—although Apple had sold 280,000 Macintoshes compared to IBM's first year sales of fewer than 100,000 PCs.
`What s Boston sometimes called?`	`Boston is sometimes called a "city of neighborhoods" because of the profusion of diverse subsections; the city government's Office of Neighborhood Services has officially designated 23 neighborhoods.`	`In 1822, the citizens of Boston voted to change the official name from "the Town of Boston" to "the City of Boston", and on March 4, 1822, the people of Boston accepted the charter incorporating the City. At the time Boston was chartered as a city, the population was about 46,226, while the area of the city was only 4.7 square miles (12 km2).`

Loss: MultipleNegativesRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}

Training Hyperparameters

Non-Default Hyperparameters

eval_strategy: steps
per_device_train_batch_size: 128
per_device_eval_batch_size: 128
num_train_epochs: 1
warmup_ratio: 0.1
fp16: True
batch_sampler: no_duplicates
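
The combination of `warmup_ratio: 0.1` with a linear scheduler can be pictured as a learning rate that ramps up over the first tenth of training and then decays linearly to zero. The sketch below assumes a single device, the base learning rate of 5e-05 from the full hyperparameter list, and 346 optimizer steps (44,288 samples at batch size 128, one epoch); the exact rounding of the warmup step count inside the trainer is an assumption.

```python
import math

def linear_schedule_lr(step, total_steps, base_lr=5e-5, warmup_ratio=0.1):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)  # assumed rounding
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

total_steps = 346  # 44,288 samples / batch size 128, 1 epoch, single device
for step in (0, 35, 100, 346):
    print(step, linear_schedule_lr(step, total_steps))
```

Under these assumptions the peak rate is reached at step 35 and the rate returns to zero at the final step.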

All Hyperparameters

overwrite_output_dir: False
do_predict: False
eval_strategy: steps
prediction_loss_only: True
per_device_train_batch_size: 128
per_device_eval_batch_size: 128
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 5e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1.0
num_train_epochs: 1
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.1
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: True
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
tp_size: 0
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: None
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
include_for_metrics: []
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
dispatch_batches: None
split_batches: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
eval_use_gather_object: False
average_tokens_across_devices: False
prompts: None
batch_sampler: no_duplicates
multi_dataset_batch_sampler: proportional

Training Logs

Epoch	Step	Training Loss	Validation Loss	gooqa-dev_cosine_accuracy
-1	-1	-	-	0.3266
0.2890	100	0.4117	0.7775	0.3962
0.5780	200	0.391	0.7681	0.3882
0.8671	300	0.3771	0.7492	0.4018
-1	-1	-	-	0.3988
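
The Epoch column can be reproduced from the training set size and batch size reported above, which is a quick sanity check on the log. This sketch assumes a single device and no gradient accumulation; `epoch_fraction` is a hypothetical helper. The rows with Epoch and Step of -1 appear to be evaluations run outside the training loop, before and after training.

```python
def epoch_fraction(step, num_samples=44_288, batch_size=128):
    """Fraction of one epoch completed after `step` optimizer steps."""
    steps_per_epoch = num_samples // batch_size   # 44,288 / 128 = 346 exactly
    return step / steps_per_epoch

for step in (100, 200, 300):
    print(step, f"{epoch_fraction(step):.4f}")
```

The printed values match the logged epochs 0.2890, 0.5780, and 0.8671.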

Framework Versions

Python: 3.11.0
Sentence Transformers: 4.0.1
Transformers: 4.50.3
PyTorch: 2.6.0+cu124
Accelerate: 1.5.2
Datasets: 3.5.0
Tokenizers: 0.21.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Safetensors

Model size: 22.7M params
Tensor type: F32
