tiffanychen committed on
Commit 9cea28a · verified · 1 Parent(s): 713532d

Update README.md

Files changed (1):
  1. README.md +290 -122

README.md CHANGED
@@ -20,45 +20,63 @@ tags:
  - ophthalmology
  - chest-x-ray
  ---
- # MedSigLIP Model Card

- **Model documentation**:
- [MedSigLIP](https://developers.google.com/health-ai-developer-foundations/medsiglip)

  **Resources:**

- * Model on Google Cloud [Model Garden](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/medsiglip)
- * [GitHub repository](https://github.com/google-health/medsiglip) (supporting code, Colab notebooks, discussions, and issues)
- * Quick start [notebook](https://colab.research.google.com/github/google-health/medsiglip/blob/main/notebooks/quick_start_with_hugging_face.ipynb)
- * Fine-tuning [notebook](https://colab.research.google.com/github/google-health/medsiglip/blob/main/notebooks/fine_tune_with_hugging_face.ipynb)
- * Support: See [Contact](https://developers.google.com/health-ai-developer-foundations/medsiglip/get-started.md#contact)
- * License: The use of MedSigLIP is governed by the [Health AI Developer Foundations terms of use](https://developers.google.com/health-ai-developer-foundations/terms).

  **Author:** Google

- **Model information**

  This section describes the MedSigLIP model and how to use it.

- **Description**

- MedSigLIP is a variant of [SigLIP](https://arxiv.org/abs/2303.15343) (Sigmoid Loss for Language Image Pre-training) that is trained to encode medical images and text into a common embedding space. Developers can use MedSigLIP to accelerate building healthcare-based AI applications. MedSigLIP contains a 400M parameter vision encoder and 400M parameter text encoder, it supports 448x448 image resolution with up to 64 text tokens.

- MedSigLIP was trained on a variety of de-identified medical image and text pairs, including chest X-rays, dermatology images, ophthalmology images, histopathology slides, and slices of CT and MRI volumes, along with associated descriptions or reports. This training data was combined with natural (non-medical) image and text pairs to retain MedSigLIP’s ability to parse natural images.

- MedSigLIP is recommended for medical image interpretation applications without a need for text generation, such as data-efficient classification, zero-shot classification, and semantic image retrieval. For medical applications that require text generation, [MedGemma](http://goo.gle/medgemma) is recommended.

- **How to use**

- * Get started using MedSigLIP via the Hugging Face and Google Cloud Model Garden links provided above, along with its accompanying documentation and demo notebooks.

- Below are some example code snippets to help you quickly get started running the MedSigLIP model locally. If you want to use the model at scale, we recommend that you create a production version using [Model Garden](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/medsiglip).
-
- ```
  import numpy as np
  from PIL import Image
  import requests
- from transformers import AutoProcessor, AutoModel, SiglipVisionModel
  from tensorflow.image import resize as tf_resize
  import torch

@@ -67,73 +85,90 @@ device = "cuda" if torch.cuda.is_available() else "cpu"
  model = AutoModel.from_pretrained("google/medsiglip-448").to(device)
  processor = AutoProcessor.from_pretrained("google/medsiglip-448")

-
  # Download sample image
- !wget -nc -q https://storage.googleapis.com/dx-scin-public-data/dataset/images/3445096909671059178.png
- !wget -nc -q https://storage.googleapis.com/dx-scin-public-data/dataset/images/-5669089898008966381.png
  imgs = [Image.open("3445096909671059178.png").convert("RGB"), Image.open("-5669089898008966381.png").convert("RGB")]

- # We recommend a resizing operation with tf.image.resize to match the
- # implementation with the Big Vision library
- # (https://github.com/google-research/big_vision/blob/0127fb6b337ee2a27bf4e54dea79cff176527356/big_vision/pp/ops_image.py#L84)
  def resize(image):
-     return Image.fromarray(tf_resize(images=image,
-                                      size=[448, 448],
-                                      method='bilinear',
-                                      antialias=False).numpy().astype(np.uint8))

  resized_imgs = [resize(img) for img in imgs]

- texts = ["a photo of an arm with no rash",
-          "a photo of an arm with a rash",
-          "a photo of a leg with no rash",
-          "a photo of a leg with a rash"]

  inputs = processor(text=texts, images=resized_imgs, padding="max_length", return_tensors="pt").to(device)

  with torch.no_grad():
-     outputs = model(**inputs)

  logits_per_image = outputs.logits_per_image
  probs = torch.softmax(logits_per_image, dim=1)

  for n_img, img in enumerate(imgs):
-     display(img)  # Note this is an IPython function that will only work in a Jupyter notebook environment
-     for i, label in enumerate(texts):
-         print(f"{probs[n_img][i]:.2%} that image is '{label}'")

- # We can also get the actual embeddings
- print(f"\nimage embeddings: {outputs.image_embeds}")
- print(f"\ntext embeddings: {outputs.text_embeds}")
  ```

- **Examples**

  See the following Colab notebooks for examples of how to use MedSigLIP:

- * To give the model a quick try running it locally with weights from Hugging Face, see [Quick start notebook in Colab](https://colab.research.google.com/github/google-health/medsiglip/blob/main/notebooks/quick_start_with_hugging_face.ipynb).
- * For an example of fine-tuning the model, see the [Fine-tuning notebook in Colab](https://colab.research.google.com/github/google-health/medsiglip/blob/main/notebooks/fine_tune_with_hugging_face.ipynb)

- **Model architecture overview**

- MedSigLIP is based on SigLIP-400M ([Zhai et al., 2023](https://openaccess.thecvf.com/content/ICCV2023/html/Zhai_Sigmoid_Loss_for_Language_Image_Pre-Training_ICCV_2023_paper.html)) and is the same encoder that powers image interpretation in the [MedGemma](http://goo.gle/medgemma) generative model. MedSigLIP’s image component is a 400M vision transformer and its text component is a 400M text transformer.

- ## **Technical Specifications**

- * Model type: Two tower encoder architecture comprised of a vision transformer and text transformer
- * Image resolution: 448 x 448
- * Context length: 64 tokens
- * Modalities: Image, text
- * Key publication: https://arxiv.org/abs/2507.05201
- * Model created: July 9, 2025
- * Model Version: 1.0.0

- **Citation**
- When using this model, please cite:
- Sellergren, Andrew, et al. "MedGemma Technical Report." *arXiv preprint arXiv:2507.05201* (2025).

- ```
  @article{sellergren2025medgemma,
    title={MedGemma Technical Report},
    author={Sellergren, Andrew and Kazemzadeh, Sahar and Jaroensri, Tiam and Kiraly, Atilla and Traverse, Madeleine and Kohlberger, Timo and Xu, Shawn and Jamil, Fayaz and Hughes, Cían and Lau, Charles and others},
@@ -142,27 +177,36 @@ Sellergren, Andrew, et al. "MedGemma Technical Report." *arXiv preprint arXiv:25
  }
  ```

- **Inputs and outputs**

- **Input:**

  MedSigLIP accepts images and text as inputs.

- * Images, normalized to values in the range (-1, 1\) and to 448 x 448 resolution
- * Text string, such as a caption or candidate classification label

- **Output:**

- * Image embedding if input image is provided
- * Text embedding if input text is provided
- * Similarity score between the image and text

- **Performance and validation**
- MedSigLIP was evaluated across a range of medical image modalities, focusing on chest X-ray, pathology, dermatology and ophthalmology.

- The following table summarizes zero-shot AUCs for Chest X-Ray Findings with Med-SigLIP and ELIXR ([Xu et al., 2023](https://arxiv.org/abs/2308.01317)), based on CXR evaluation data from ELIXR. In all cases, 518 examples were used for 2-class classification. Note that MedSigLIP accepts inputs of size 448x448 while ELIXR accepts inputs of size 1280x1280.

- | Finding | Med-SigLIP Zero-Shot | ELIXR Zero-Shot\* |
  | :---- | ----- | ----- |
  | Enlarged Cardiomediastinum | 0.858 | 0.800 |
  | Cardiomegaly | 0.904 | 0.891 |
@@ -179,9 +223,15 @@ The following table summarizes zero-shot AUCs for Chest X-Ray Findings with Med-
  | Support Devices | 0.852 | 0.894 |
  | **Average** | **0.844** | **0.824** |

- \*Prior reported results from ([Xu et al., 2023](https://arxiv.org/abs/2308.01317))

- The following table summarizes AUCs for Dermatology, Ophthalmology, and Pathology Findings with Med-SigLIP compared to existing HAI-DEF embeddingmodels (Derm Foundation and Path Foundation, [goo.gle/hai-def](http://goo.gle/hai-def)). Note that MedSigLIP accepts inputs of size 448x448 while Derm Foundation accepts inputs of size 448x448 and Path Foundation accepts inputs of size 224x224.

  | Domain | Finding | Size | Num Classes | Med-SigLIP Zero-Shot | Med-SigLIP Linear Probe | HAI-DEF Linear Probe\* |
  | :---- | :---- | ----- | ----- | ----- | ----- | ----- |
@@ -197,59 +247,136 @@ The following table summarizes AUCs for Dermatology, Ophthalmology, and Patholog
  | | Tissue Types | 5000 | 16 | 0.930 | 0.972 | 0.947 |
  | **Average** | | | | **0.870** | **0.878** | **0.897** |

- \* HAI-DEF pathology results are based on prior reported results from [Yang et al., 2024](https://arxiv.org/abs/2405.03162).

- **Data card**
- **Training**
- MedSigLIP was trained on a variety of de-identified medical image and text pairs, including chest X-rays, dermatology images, ophthalmology images, histopathology slides, and slices of CT and MRI volumes, along with associated descriptions or reports. This training data was combined with natural (non-medical) image and text pairs to retain MedSigLIP’s ability to parse natural images.

- **Evaluation**
- MedSigLIP has been evaluated on a comprehensive set of evaluation datasets on 23 tasks across 4 modalities and benchmarked against modality-specific HAI-DEF models from Google.
-
- **Source**
- MedSigLIP training utilized a combination of public and private datasets.

- This model was trained on diverse public datasets including MIMIC-CXR (chest X-rays and reports), Slake-VQA, PAD-UFES-20 (skin lesion images and data), SCIN (dermatology images), TCGA (cancer genomics data), CAMELYON (lymph node histopathology images), PMC-OA (biomedical literature with images), and Mendeley Digital Knee X-Ray (knee X-rays).

- Additionally, multiple diverse proprietary datasets were licensed and incorporated (described next).

- **Data Ownership and Documentation**

- * [MIMIC-CXR](https://physionet.org/content/mimic-cxr/2.1.0/): MIT Laboratory for Computational Physiology and Beth Israel Deaconess Medical Center (BIDMC).
- * [SLAKE](https://www.med-vqa.com/slake/): The Hong Kong Polytechnic University (PolyU), with collaborators including West China Hospital of Sichuan University and Sichuan Academy of Medical Sciences / Sichuan Provincial People's Hospital.
- * [PAD-UFES-20](https://pmc.ncbi.nlm.nih.gov/articles/PMC7479321/): Federal University of Espírito Santo (UFES), Brazil, through its Dermatological and Surgical Assistance Program (PAD).
- * [SCIN](https://github.com/google-research-datasets/scin): A collaboration between Google Health and Stanford Medicine.
- * [TCGA](https://portal.gdc.cancer.gov/) (The Cancer Genome Atlas): A joint effort of National Cancer Institute and National Human Genome Research Institute. Data from TCGA are available via the Genomic Data Commons (GDC).
- * [CAMELYON](https://camelyon17.grand-challenge.org/Data/): The data was collected from Radboud University Medical Center and University Medical Center Utrecht in the Netherlands.
- * [PMC-OA (PubMed Central Open Access Subset)](https://catalog.data.gov/dataset/pubmed-central-open-access-subset-pmc-oa): Maintained by the National Library of Medicine (NLM) and National Center for Biotechnology Information (NCBI), which are part of the NIH.
- * [Mendeley Digital Knee X-Ray](https://data.mendeley.com/datasets/t9ndx37v5h/1): This dataset is from Rani Channamma University, and is hosted on Mendeley Data.

- **In addition to the public datasets listed above, MedSigLIP was also trained on de-identified, licensed datasets or datasets collected internally at Google from consented participants.**

- * **Radiology dataset 1:** De-identified dataset of different CT and MRI studies across body parts from a US-based radiology outpatient diagnostic center network.
- * **Ophthalmology dataset 1 (EyePACS):** De-identified dataset of fundus images from diabetic retinopathy screening.
- * **Dermatology dataset 1:** De-identified dataset of teledermatology skin condition images (both clinical and dermatoscopic) from Colombia.
- * **Dermatology dataset 2:** De-identified dataset of skin cancer images (both clinical and dermatoscopic) from Australia.
- * **Dermatology dataset 3:** De-identified dataset of non-diseased skin images from an internal data collection effort.
- * **Pathology dataset 1:** De-identified dataset of histopathology H\&E whole slide images created in collaboration with an academic research hospital and biobank in Europe. Comprises de-identified colon, prostate, and lymph nodes.
- * **Pathology dataset 2:** De-identified dataset of lung histopathology H\&E and IHC whole slide images created by a commercial biobank in the United States.
- * **Pathology dataset 3:** De-identified dataset of prostate and lymph node H\&E and IHC histopathology whole slide images created by a contract research organization in the United States.
- * **Pathology dataset 4:** De-identified dataset of histopathology whole slide images created in collaboration with a large, tertiary teaching hospital in the United States. Comprises a diverse set of tissue and stain types, predominantly H\&E.

  ### Data citation

- * **MIMIC-CXR:** Johnson, A., Pollard, T., Mark, R., Berkowitz, S., & Horng, S. (2024). MIMIC-CXR Database (version 2.1.0). PhysioNet. https://physionet.org/content/mimic-cxr/2.1.0/ *and* Johnson, Alistair E. W., Tom J. Pollard, Seth J. Berkowitz, Nathaniel R. Greenbaum, Matthew P. Lungren, Chih-Ying Deng, Roger G. Mark, and Steven Horng. 2019. "MIMIC-CXR, a de-Identified Publicly Available Database of Chest Radiographs with Free-Text Reports." *Scientific Data 6* (1): 1–8.
- * **SLAKE:** Liu, Bo, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. 2021. "SLAKE: A Semantically-Labeled Knowledge-Enhanced Dataset for Medical Visual Question Answering." http://arxiv.org/abs/2102.09542.
- * **PAD-UFES-20:** Pacheco, Andre GC, et al. "PAD-UFES-20: A skin lesion dataset composed of patient data and clinical images collected from smartphones." Data in Brief 32 (2020): 106221.
- * **SCIN:** Ward, Abbi, Jimmy Li, Julie Wang, Sriram Lakshminarasimhan, Ashley Carrick, Bilson Campana, Jay Hartford, et al. 2024. "Creating an Empirical Dermatology Dataset Through Crowdsourcing With Web Search Advertisements." *JAMA Network Open 7* (11): e2446615–e2446615.
- * **TCGA:** The results shown here are in whole or part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga.
- * **CAMELYON16:** Ehteshami Bejnordi, Babak, Mitko Veta, Paul Johannes van Diest, Bram van Ginneken, Nico Karssemeijer, Geert Litjens, Jeroen A. W. M. van der Laak, et al. 2017. "Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer." *JAMA 318* (22): 2199–2210.
- * **Mendeley Digital Knee X-Ray:** Gornale, Shivanand; Patravali, Pooja (2020), "Digital Knee X-ray Images", Mendeley Data, V1, doi: 10.17632/t9ndx37v5h.1

  ### De-identification/anonymization:

- Google and partnerships utilize datasets that have been rigorously anonymized or de-identified to ensure the protection of individual research participants and patient privacy.

  ## Implementation information

@@ -259,29 +386,70 @@ Details about the model internals.

  Training was done using [JAX](https://github.com/jax-ml/jax).

- JAX allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models.

  ## Use and limitations

  ### Intended use

- MedSigLIP is a machine learning-based software development tool that generates numerical representations from input images and associated text. These representations are referred to as embeddings. MedSigLIP is designed for use by software developers and researchers to facilitate the creation and development of third-party healthcare applications that involve medical images and text. MedSigLIP itself does not provide any medical functionality, nor is it intended to process or interpret medical data for a medical purpose. MedSigLIP is a software development tool and is not a finished product. Developers are responsible for training, adapting, and making meaningful changes to MedSigLIP to accomplish their specific intended use.
-
- The embeddings that MedSigLIP generates can be used for downstream tasks such as classification, regression, and semantic search. Numerical scores based on calculations performed on the embeddings can be thresholded for classification or semantic search use cases, allowing developers to control for precision and recall. Embedding-based models enable developers to create solutions that can be more compute efficient for fine-tuning classification tasks, such as training classifiers. Thus, MedSigLIP is recommended for applications requiring strong classification performance without the need for text generation. MedSigLIP has been specifically pre-trained on a variety of de-identified pairs of medical images and text, including chest X-rays, CT slices, MRI slices, dermatology images, ophthalmology images, and histopathology patches. MedSigLIP is intended to be used by software developers, to be adapted for use in image-based applications in healthcare domains such as radiology, pathology, ophthalmology, and dermatology.

  ### Benefits

- * Provides strong baseline medical image and text encodings.
- * Lightweight model that can be used in settings with limited high-bandwidth memory accelerator access.
- * MedSigLIP’s strong performance makes it efficient to adapt for downstream healthcare-based use cases, compared to models of similar size without medical data pre-training.

  ### Limitations

- MedSigLIP is not intended to be used without appropriate validation, adaptation, and/or making meaningful modification by developers for their specific use case. Without the above, outputs generated by the MedSigLIP model are not intended to directly inform clinical diagnosis, patient management decisions, treatment recommendations, or any other direct clinical practice applications. Any software application developed using MedSigLIP that is intended for a medical purpose must be independently validated and is subject to its own regulatory requirements.
-
- MedSigLIP is not intended to be used without appropriate validation, adaptation, and/or making meaningful modification by developers for their specific use case. The outputs generated by MedSigLIP are not intended to directly inform clinical diagnosis, patient management decisions, treatment recommendations, or any other direct clinical practice applications. Performance benchmarks highlight baseline capabilities, but even for image and text domains that constitute a substantial portion of training data, inaccurate model output is possible. All outputs from MedSigLIP should be considered preliminary and require independent verification, clinical correlation, and further investigation through established research and development methodologies.

  When adapting MedSigLIP, developers should consider the following:

- * **Bias in validation data:** As with any research, developers should ensure that any downstream application is validated to understand performance using data that is appropriately representative of the intended use setting for the specific application (e.g., age, sex, gender, condition, imaging device, etc.).
- * **Data contamination concerns:** When evaluating the generalization capabilities of a model like MedSigLIP in a medical context, there is a risk of data contamination, where the model might have inadvertently seen related medical information during its pre-training, potentially overestimating its true ability to generalize to novel medical concepts. Developers should validate MedSigLIP on datasets not publicly available or otherwise made available to non-institutional researchers to mitigate this risk.

  - ophthalmology
  - chest-x-ray
  ---
+ # MedSigLIP model card

+ **Model documentation:** [MedSigLIP](https://developers.google.com/health-ai-developer-foundations/medsiglip)

  **Resources:**

+ * Model on Google Cloud Model Garden: [MedSigLIP](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/medsiglip)
+ * Model on Hugging Face: [MedSigLIP](https://huggingface.co/google/medsiglip-448)
+ * GitHub repository (supporting code, Colab notebooks, discussions, and
+   issues): [MedSigLIP](https://github.com/google-health/medsiglip)
+ * Quick start notebook:
+   [GitHub](https://github.com/google-health/medsiglip/blob/main/notebooks/quick_start_with_hugging_face.ipynb)
+ * Fine-tuning notebook: [GitHub](https://github.com/google-health/medsiglip/blob/main/notebooks/fine_tune_with_hugging_face.ipynb)
+ * Support: See [Contact](https://developers.google.com/health-ai-developer-foundations/medsiglip/get-started.md#contact)
+ * License: The use of MedSigLIP is governed by the [Health AI Developer
+   Foundations terms of
+   use](https://developers.google.com/health-ai-developer-foundations/terms).

  **Author:** Google

+ ## Model information

  This section describes the MedSigLIP model and how to use it.

+ ### Description

+ MedSigLIP is a variant of [SigLIP](https://arxiv.org/abs/2303.15343) (Sigmoid
+ Loss for Language Image Pre-training) that is trained to encode medical images
+ and text into a common embedding space. Developers can use MedSigLIP to
+ accelerate building healthcare-based AI applications. MedSigLIP contains a 400M
+ parameter vision encoder and a 400M parameter text encoder; it supports 448x448
+ image resolution with up to 64 text tokens.

+ MedSigLIP was trained on a variety of de-identified medical image and text
+ pairs, including chest X-rays, dermatology images, ophthalmology images,
+ histopathology slides, and slices of CT and MRI volumes, along with associated
+ descriptions or reports. This training data was combined with natural
+ (non-medical) image and text pairs to retain MedSigLIP's ability to parse
+ natural images.

+ MedSigLIP is recommended for medical image interpretation applications without a
+ need for text generation, such as data-efficient classification, zero-shot
+ classification, and semantic image retrieval. For medical applications that
+ require text generation, [MedGemma](http://goo.gle/medgemma) is recommended.

+ ### How to use

+ Below are some example code snippets to help you quickly get started running the
+ MedSigLIP model locally. If you want to use the model at scale, we recommend
+ that you create a production version using [Model
+ Garden](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/medsiglip).

+ ```python
  import numpy as np
  from PIL import Image
  import requests
+ from transformers import AutoProcessor, AutoModel
  from tensorflow.image import resize as tf_resize
  import torch

  model = AutoModel.from_pretrained("google/medsiglip-448").to(device)
  processor = AutoProcessor.from_pretrained("google/medsiglip-448")

  # Download sample image
+ ! wget -nc -q https://storage.googleapis.com/dx-scin-public-data/dataset/images/3445096909671059178.png
+ ! wget -nc -q https://storage.googleapis.com/dx-scin-public-data/dataset/images/-5669089898008966381.png
  imgs = [Image.open("3445096909671059178.png").convert("RGB"), Image.open("-5669089898008966381.png").convert("RGB")]

+ # If you want to reproduce the results from MedSigLIP evals, we recommend a
+ # resizing operation with `tf.image.resize` to match the implementation with the
+ # Big Vision library (https://github.com/google-research/big_vision/blob/0127fb6b337ee2a27bf4e54dea79cff176527356/big_vision/pp/ops_image.py#L84).
+ # Otherwise, you can rely on the Transformers image processor's built-in
+ # resizing (done automatically by default and uses `PIL.Image.resize`) or use
+ # another resizing method.
  def resize(image):
+     return Image.fromarray(
+         tf_resize(
+             images=image, size=[448, 448], method='bilinear', antialias=False
+         ).numpy().astype(np.uint8)
+     )

  resized_imgs = [resize(img) for img in imgs]

+ texts = [
+     "a photo of an arm with no rash",
+     "a photo of an arm with a rash",
+     "a photo of a leg with no rash",
+     "a photo of a leg with a rash"
+ ]

  inputs = processor(text=texts, images=resized_imgs, padding="max_length", return_tensors="pt").to(device)

  with torch.no_grad():
+     outputs = model(**inputs)

  logits_per_image = outputs.logits_per_image
  probs = torch.softmax(logits_per_image, dim=1)

  for n_img, img in enumerate(imgs):
+     display(img)  # Note this is an IPython function that will only work in a Jupyter notebook environment
+     for i, label in enumerate(texts):
+         print(f"{probs[n_img][i]:.2%} that image is '{label}'")

+ # Get the image and text embeddings
+ print(f"image embeddings: {outputs.image_embeds}")
+ print(f"text embeddings: {outputs.text_embeds}")
  ```
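The image embeddings printed at the end of the snippet above are the raw material for data-efficient classification. The following is a hedged sketch, not part of the model card's official snippets: it trains a minimal logistic-regression probe on synthetic stand-in vectors in place of real `outputs.image_embeds` (which you would obtain from the snippet above via `.cpu().numpy()`); the labels, dimensionality, and learning rate here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for MedSigLIP image embeddings; only the shapes matter.
# With real data, X would hold outputs.image_embeds.cpu().numpy().
dim = 64
pos = rng.normal(0.3, 1.0, size=(100, dim))   # e.g. images labeled "rash"
neg = rng.normal(-0.3, 1.0, size=(100, dim))  # e.g. images labeled "no rash"
X = np.concatenate([pos, neg])
y = np.concatenate([np.ones(100), np.zeros(100)])

# L2-normalize embeddings, as is conventional for contrastive encoders.
X = X / np.linalg.norm(X, axis=1, keepdims=True)

# Minimal logistic-regression probe trained with plain gradient descent.
w, b = np.zeros(dim), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
    w -= 0.5 * (X.T @ (p - y)) / len(y)     # gradient step on weights
    b -= 0.5 * np.mean(p - y)               # gradient step on bias

preds = 1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5
acc = np.mean(preds == y)
print(f"train accuracy: {acc:.2f}")
```

With real MedSigLIP embeddings, this kind of small frozen-embedding classifier is in the spirit of the linear-probe evaluations reported in the performance tables.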
+ ### Examples

  See the following Colab notebooks for examples of how to use MedSigLIP:

+ * To give the model a quick try by running it locally with weights from Hugging
+   Face, see [Quick start notebook in
+   Colab](https://colab.research.google.com/github/google-health/medsiglip/blob/main/notebooks/quick_start_with_hugging_face.ipynb).
+ * For an example of fine-tuning the model, see the [Fine-tuning notebook in
+   Colab](https://colab.research.google.com/github/google-health/medsiglip/blob/main/notebooks/fine_tune_with_hugging_face.ipynb).

+ ### Model architecture overview

+ MedSigLIP is based on SigLIP-400M ([Zhai et al.,
+ 2023](https://openaccess.thecvf.com/content/ICCV2023/html/Zhai_Sigmoid_Loss_for_Language_Image_Pre-Training_ICCV_2023_paper.html))
+ and is the same encoder that powers image interpretation in the
+ [MedGemma](http://goo.gle/medgemma) generative model. MedSigLIP's image
+ component is a 400M vision transformer and its text component is a 400M text
+ transformer.

+ ### Technical specifications

+ * **Model type**: Two-tower encoder architecture comprising a vision
+   transformer and a text transformer
+ * **Image resolution**: 448 x 448
+ * **Context length**: 64 tokens
+ * **Modalities**: Image, text
+ * **Key publication**: [https://arxiv.org/abs/2507.05201](https://arxiv.org/abs/2507.05201)
+ * **Model created**: July 9, 2025
+ * **Model version**: 1.0.0

+ ### Citation

+ When using this model, please cite: Sellergren, Andrew, et al. "MedGemma
+ Technical Report." *arXiv preprint arXiv:2507.05201* (2025).
+
+ ```none
  @article{sellergren2025medgemma,
    title={MedGemma Technical Report},
    author={Sellergren, Andrew and Kazemzadeh, Sahar and Jaroensri, Tiam and Kiraly, Atilla and Traverse, Madeleine and Kohlberger, Timo and Xu, Shawn and Jamil, Fayaz and Hughes, Cían and Lau, Charles and others},

  }
  ```

+ ### Inputs and outputs

+ **Input**:

  MedSigLIP accepts images and text as inputs.

+ * Images, normalized to values in the range (-1, 1) and to 448 x 448
+   resolution
+ * Text string, such as a caption or candidate classification label

+ **Output**:

+ * Image embedding if input image is provided
+ * Text embedding if input text is provided
+ * Similarity score between the image and text
196
+ ### Performance and validation

+ MedSigLIP was evaluated across a range of medical image modalities, focusing on
+ chest X-ray, pathology, dermatology, and ophthalmology.

+ ### Key performance metrics

+ The following table summarizes zero-shot AUCs for Chest X-Ray Findings with
+ Med-SigLIP and ELIXR ([Xu et al., 2023](https://arxiv.org/abs/2308.01317)),
+ based on CXR evaluation data from ELIXR. In all cases, 518 examples were used
+ for 2-class classification. Note that MedSigLIP accepts inputs of size 448x448
+ while ELIXR accepts inputs of size 1280x1280.

+ | Finding | Med-SigLIP Zero-Shot | ELIXR Zero-Shot* |
  | :---- | ----- | ----- |
  | Enlarged Cardiomediastinum | 0.858 | 0.800 |
  | Cardiomegaly | 0.904 | 0.891 |
  | Support Devices | 0.852 | 0.894 |
  | **Average** | **0.844** | **0.824** |

+ *Prior reported results from ([Xu et al.,
+ 2023](https://arxiv.org/abs/2308.01317))

+ The following table summarizes AUCs for Dermatology, Ophthalmology, and
+ Pathology Findings with Med-SigLIP compared to existing HAI-DEF embedding models
+ (Derm Foundation and Path Foundation,
+ [goo.gle/hai-def](http://goo.gle/hai-def)). Note that MedSigLIP accepts inputs
+ of size 448x448 while Derm Foundation accepts inputs of size 448x448 and Path
+ Foundation accepts inputs of size 224x224.

  | Domain | Finding | Size | Num Classes | Med-SigLIP Zero-Shot | Med-SigLIP Linear Probe | HAI-DEF Linear Probe\* |
  | :---- | :---- | ----- | ----- | ----- | ----- | ----- |

  | | Tissue Types | 5000 | 16 | 0.930 | 0.972 | 0.947 |
  | **Average** | | | | **0.870** | **0.878** | **0.897** |

+ *HAI-DEF pathology results are based on prior reported results from [Yang et
251
+ al., 2024](https://arxiv.org/abs/2405.03162).
252
 
253
+ ## Data card
 
 
254
 
255
+ ### Dataset overview
 
 
 
 
256
 
257
+ #### Training
258
 
MedSigLIP was trained on a variety of de-identified medical image and text
pairs, including chest X-rays, dermatology images, ophthalmology images,
histopathology slides, and slices of CT and MRI volumes, along with associated
descriptions or reports. This training data was combined with natural
(non-medical) image and text pairs to retain MedSigLIP's ability to parse
natural images.

#### Evaluation

MedSigLIP has been evaluated on a comprehensive set of datasets covering 23
tasks across 4 imaging modalities, and benchmarked against modality-specific
HAI-DEF models from Google.

#### Source

MedSigLIP training utilized a combination of public and private datasets.

This model was trained on diverse public datasets including MIMIC-CXR (chest
X-rays and reports), Slake-VQA, PAD-UFES-20 (skin lesion images and data), SCIN
(dermatology images), TCGA (cancer genomics data), CAMELYON (lymph node
histopathology images), PMC-OA (biomedical literature with images), and Mendeley
Digital Knee X-Ray (knee X-rays).

Additionally, multiple diverse proprietary datasets were licensed and
incorporated (described next).

### Data ownership and documentation

* [MIMIC-CXR](https://physionet.org/content/mimic-cxr/2.1.0/): MIT Laboratory
  for Computational Physiology and Beth Israel Deaconess Medical Center
  (BIDMC).
* [Slake-VQA](https://www.med-vqa.com/slake/): The Hong Kong Polytechnic
  University (PolyU), with collaborators including West China Hospital of
  Sichuan University and Sichuan Academy of Medical Sciences / Sichuan
  Provincial People's Hospital.
* [PAD-UFES-20](https://pmc.ncbi.nlm.nih.gov/articles/PMC7479321/): Federal
  University of Espírito Santo (UFES), Brazil, through its Dermatological and
  Surgical Assistance Program (PAD).
* [SCIN](https://github.com/google-research-datasets/scin): A collaboration
  between Google Health and Stanford Medicine.
* [TCGA](https://portal.gdc.cancer.gov/) (The Cancer Genome Atlas): A joint
  effort of the National Cancer Institute and the National Human Genome
  Research Institute. Data from TCGA are available via the Genomic Data
  Commons (GDC).
* [CAMELYON](https://camelyon17.grand-challenge.org/Data/): The data was
  collected from Radboud University Medical Center and University Medical
  Center Utrecht in the Netherlands.
* [PMC-OA (PubMed Central Open Access
  Subset)](https://catalog.data.gov/dataset/pubmed-central-open-access-subset-pmc-oa):
  Maintained by the National Library of Medicine (NLM) and the National Center
  for Biotechnology Information (NCBI), which are part of the NIH.
* [MedQA](https://arxiv.org/pdf/2009.13081): This dataset was created by a
  team of researchers led by Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung
  Weng, Hanyi Fang, and Peter Szolovits.
* [Mendeley Digital Knee
  X-Ray](https://data.mendeley.com/datasets/t9ndx37v5h/1): This dataset is
  from Rani Channamma University and is hosted on Mendeley Data.

In addition to the public datasets listed above, MedSigLIP was also trained on
de-identified, licensed datasets or datasets collected internally at Google from
consented participants.

* **Radiology dataset 1:** De-identified dataset of different CT and MRI
  studies across body parts from a US-based radiology outpatient diagnostic
  center network.
* **Ophthalmology dataset 1 (EyePACS):** De-identified dataset of fundus
  images from diabetic retinopathy screening.
* **Dermatology dataset 1:** De-identified dataset of teledermatology skin
  condition images (both clinical and dermatoscopic) from Colombia.
* **Dermatology dataset 2:** De-identified dataset of skin cancer images (both
  clinical and dermatoscopic) from Australia.
* **Dermatology dataset 3:** De-identified dataset of non-diseased skin images
  from an internal data collection effort.
* **Pathology dataset 1:** De-identified dataset of histopathology H&E whole
  slide images created in collaboration with an academic research hospital and
  biobank in Europe. Comprises de-identified colon, prostate, and lymph node
  specimens.
* **Pathology dataset 2:** De-identified dataset of lung histopathology H&E
  and IHC whole slide images created by a commercial biobank in the United
  States.
* **Pathology dataset 3:** De-identified dataset of prostate and lymph node
  H&E and IHC histopathology whole slide images created by a contract
  research organization in the United States.
* **Pathology dataset 4:** De-identified dataset of histopathology whole slide
  images created in collaboration with a large, tertiary teaching hospital in
  the United States. Comprises a diverse set of tissue and stain types,
  predominantly H&E.

### Data citation

* **MIMIC-CXR:** Johnson, A., Pollard, T., Mark, R., Berkowitz, S., & Horng,
  S. (2024). MIMIC-CXR Database (version 2.1.0). PhysioNet.
  https://physionet.org/content/mimic-cxr/2.1.0/ *and* Johnson, Alistair E.
  W., Tom J. Pollard, Seth J. Berkowitz, Nathaniel R. Greenbaum, Matthew P.
  Lungren, Chih-Ying Deng, Roger G. Mark, and Steven Horng. 2019. "MIMIC-CXR,
  a de-Identified Publicly Available Database of Chest Radiographs with
  Free-Text Reports." *Scientific Data* 6 (1): 1–8.
* **SLAKE:** Liu, Bo, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu.
  2021. "SLAKE: A Semantically-Labeled Knowledge-Enhanced Dataset for Medical
  Visual Question Answering." http://arxiv.org/abs/2102.09542.
* **PAD-UFES-20:** Pacheco, Andre G. C., et al. "PAD-UFES-20: A Skin Lesion
  Dataset Composed of Patient Data and Clinical Images Collected from
  Smartphones." *Data in Brief* 32 (2020): 106221.
* **SCIN:** Ward, Abbi, Jimmy Li, Julie Wang, Sriram Lakshminarasimhan, Ashley
  Carrick, Bilson Campana, Jay Hartford, et al. 2024. "Creating an Empirical
  Dermatology Dataset Through Crowdsourcing With Web Search Advertisements."
  *JAMA Network Open* 7 (11): e2446615.
* **TCGA:** The results shown here are in whole or part based upon data
  generated by the TCGA Research Network: https://www.cancer.gov/tcga.
* **CAMELYON16:** Ehteshami Bejnordi, Babak, Mitko Veta, Paul Johannes van
  Diest, Bram van Ginneken, Nico Karssemeijer, Geert Litjens, Jeroen A. W. M.
  van der Laak, et al. 2017. "Diagnostic Assessment of Deep Learning
  Algorithms for Detection of Lymph Node Metastases in Women With Breast
  Cancer." *JAMA* 318 (22): 2199–2210.
* **Mendeley Digital Knee X-Ray:** Gornale, Shivanand; Patravali, Pooja
  (2020), "Digital Knee X-ray Images", Mendeley Data, V1, doi:
  10.17632/t9ndx37v5h.1.

### De-identification/anonymization

Google and its partners utilize datasets that have been rigorously anonymized or
de-identified to ensure the protection of individual research participants and
patient privacy.

## Implementation information

Training was done using [JAX](https://github.com/jax-ml/jax).

JAX allows researchers to take advantage of the latest generation of hardware,
including TPUs, for faster and more efficient training of large models.

## Use and limitations

### Intended use

MedSigLIP is a machine learning-based software development tool that generates
numerical representations from input images and associated text. These
representations are referred to as embeddings. MedSigLIP is designed for use by
software developers and researchers to facilitate the creation and development
of third-party healthcare applications that involve medical images and text.
MedSigLIP itself does not provide any medical functionality, nor is it intended
to process or interpret medical data for a medical purpose. MedSigLIP is a
software development tool and is not a finished product. Developers are
responsible for training, adapting, and making meaningful changes to MedSigLIP
to accomplish their specific intended use.

The embeddings that MedSigLIP generates can be used for downstream tasks such as
classification, regression, and semantic search. Numerical scores based on
calculations performed on the embeddings can be thresholded for classification
or semantic search use cases, allowing developers to control for precision and
recall. Embedding-based models enable developers to create solutions that are
more compute-efficient for downstream tasks such as training classifiers. Thus,
MedSigLIP is recommended for applications requiring strong classification
performance without the need for text generation. MedSigLIP has been
specifically pre-trained on a variety of de-identified pairs of medical images
and text, including chest X-rays, CT slices, MRI slices, dermatology images,
ophthalmology images, and histopathology patches. MedSigLIP is intended to be
used by software developers and adapted for use in image-based applications in
healthcare domains such as radiology, pathology, ophthalmology, and
dermatology.
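
The semantic-search-with-thresholding pattern described above can be sketched as follows. All vectors here are random placeholders for real MedSigLIP outputs, and the embedding dimension and threshold value are illustrative assumptions; in practice the threshold would be tuned on a labeled validation set.

```python
import numpy as np

rng = np.random.default_rng(1)

def l2_normalize(x):
    # Unit-normalize so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Placeholder embeddings standing in for MedSigLIP outputs
# (dimension 1152 is an illustrative assumption).
gallery = l2_normalize(rng.normal(size=(100, 1152)))  # indexed images
query = l2_normalize(rng.normal(size=(1152,)))        # encoded text query

# Cosine similarity of the query against every indexed image.
scores = gallery @ query

# Retrieve the top-5 matches for semantic search...
top5 = np.argsort(scores)[::-1][:5]

# ...or threshold the scores: raising the threshold trades recall for
# precision; lowering it does the opposite.
threshold = 0.05  # illustrative value only
matches = np.flatnonzero(scores >= threshold)
print(len(top5), len(matches))
```

Because the gallery embeddings can be precomputed once and reused, queries reduce to a single matrix-vector product, which is what makes this approach compute-efficient at serving time.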

### Benefits

* Provides strong baseline medical image and text encodings.
* Lightweight model that can be used in settings with limited high-bandwidth
  memory accelerator access.
* MedSigLIP's strong performance makes it efficient to adapt for downstream
  healthcare-based use cases, compared to models of similar size without
  medical data pre-training.

### Limitations

MedSigLIP is not intended to be used without appropriate validation, adaptation,
and/or meaningful modification by developers for their specific use case.
Without the above, outputs generated by the MedSigLIP model are not intended to
directly inform clinical diagnosis, patient management decisions, treatment
recommendations, or any other direct clinical practice applications. Any
software application developed using MedSigLIP that is intended for a medical
purpose must be independently validated and is subject to its own regulatory
requirements.

When adapting MedSigLIP, developers should consider the following:

* **Bias in validation data:** As with any research, developers should ensure
  that any downstream application is validated to understand performance using
  data that is appropriately representative of the intended use setting for
  the specific application (e.g., age, sex, gender, condition, imaging device,
  etc.).
* **Data contamination concerns:** When evaluating the generalization
  capabilities of a model like MedSigLIP in a medical context, there is a risk
  of data contamination, where the model might have inadvertently seen related
  medical information during its pre-training, potentially overestimating its
  true ability to generalize to novel medical concepts. Developers should
  validate MedSigLIP on datasets that are not publicly available or otherwise
  accessible to non-institutional researchers to mitigate this risk.