embeddings usage / test-case
First, I want to tell your team how grateful and happy I am to see this sort of model being built and shared.
It can't be overstated how useful this could become in many countries for lost or abandoned cats and dogs.
I gave it a quick try on a small sample (234 pictures of 15 distinct, randomly chosen dogs).
There is no precise indication of how the embeddings are meant to be used, but as far as I understand, the model output isn't pixel-related in any way; it is rather a face / morphological / similarity ID, isn't it?
So I computed torch.cosine_similarity() on the output tensors of every image pair and, for a given reference image, selected the 5 best matches above 0.93. (OK indicates the match is the same individual as the reference in the first column; ERROR otherwise.)
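For readers who want to reproduce the procedure described above, here is a minimal, dependency-free sketch of it. It is not the exact script used for the tables below: the function names `cosine_similarity` and `best_matches` and the toy 2-D vectors are made up for illustration, and the embeddings are assumed to be non-zero vectors of equal length (plain Python lists here rather than torch tensors).

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two non-zero vectors of equal length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_matches(embeddings, ref_index, threshold=0.93, top_k=5):
    # Score every other image against the reference image.
    ref = embeddings[ref_index]
    scored = [
        (cosine_similarity(ref, emb), i)
        for i, emb in enumerate(embeddings)
        if i != ref_index
    ]
    # Keep only scores strictly above the threshold, best first.
    kept = sorted((pair for pair in scored if pair[0] > threshold), reverse=True)
    return [(i, score) for score, i in kept[:top_k]]

# Toy 2-D "embeddings" (hypothetical values, not real model output).
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
matches = best_matches(toy_embeddings, ref_index=0)
print(matches)
```

With real data, each returned index would then be compared against the known identity of the reference image to label the match OK or ERROR.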
I'm not even sure whether I:
- should be amazed that it even works out of the box, with nothing more involved than this, with a model as small as 85MB?
- should be disappointed by some of the blatant false positives observed (where very distinct individuals share a high torch.cosine_similarity()) and by the many unmatched pairs for some pictures?
- should be hopeful that there are easy ways to get far superior results?

But I'm mostly wondering whether I'm doing it right, and I'd welcome any guidance you could kindly provide to make the best of this model.
Thank you!
| ref | match_1 | match_2 | match_3 | match_4 | match_5 |
|---|---|---|---|---|---|
| ![]() | OK 0.977 | OK 0.945 | | | |
| ![]() | OK 0.954 | ERROR 0.939 | | | |
| ![]() | OK 0.953 | | | | |
| ![]() | | | | | |
| ![]() | ERROR 0.980 | ERROR 0.976 | ERROR 0.971 | ERROR 0.956 | ERROR 0.942 |
| ![]() | OK 0.979 | OK 0.974 | OK 0.949 | | |
| ![]() | | | | | |
| ![]() | | | | | |
| ![]() | | | | | |
| ![]() | ERROR 0.939 | | | | |
| ![]() | OK 0.970 | | | | |
| ![]() | ERROR 0.969 | ERROR 0.968 | OK 0.954 | | |
| ![]() | | | | | |
| ![]() | OK 0.956 | | | | |
| ![]() | OK 0.987 | ERROR 0.961 | OK 0.959 | ERROR 0.940 | ERROR 0.937 |
| ![]() | | | | | |
| ![]() | ERROR 0.956 | | | | |
| ![]() | ERROR 0.931 | | | | |
| ![]() | ERROR 0.968 | ERROR 0.947 | ERROR 0.946 | ERROR 0.940 | |
| ![]() | | | | | |
| ![]() | OK 0.953 | ERROR 0.949 | | | |
| ![]() | | | | | |
| ![]() | ERROR 0.968 | ERROR 0.935 | ERROR 0.933 | | |
| ![]() | | | | | |
| ![]() | ERROR 0.949 | | | | |
| ![]() | ERROR 0.932 | | | | |
| ![]() | ERROR 0.941 | | | | |
| ![]() | OK 1.000 | ERROR 0.959 | ERROR 0.947 | ERROR 0.947 | ERROR 0.940 |
| ![]() | OK 0.946 | OK 0.945 | OK 0.937 | | |
| ![]() | ERROR 0.964 | | | | |
| ![]() | ERROR 0.931 | | | | |
| ![]() | ERROR 0.981 | ERROR 0.980 | ERROR 0.966 | ERROR 0.941 | ERROR 0.938 |
| ![]() | ERROR 0.964 | OK 0.954 | OK 0.945 | | |
| ![]() | OK 1.000 | ERROR 0.959 | ERROR 0.947 | ERROR 0.947 | ERROR 0.940 |
| ![]() | ERROR 0.961 | ERROR 0.958 | ERROR 0.930 | | |
| ![]() | | | | | |
| ![]() | OK 0.984 | ERROR 0.956 | ERROR 0.956 | ERROR 0.955 | OK 0.954 |
| ![]() | ERROR 0.976 | OK 0.959 | ERROR 0.957 | ERROR 0.956 | ERROR 0.955 |
| ![]() | | | | | |
| ![]() | OK 0.995 | OK 0.972 | ERROR 0.968 | OK 0.946 | |
| ![]() | ERROR 0.956 | | | | |
| ![]() | | | | | |
| ![]() | OK 0.939 | | | | |
| ![]() | OK 0.956 | | | | |
| ![]() | ERROR 0.930 | | | | |
| ![]() | OK 0.987 | ERROR 0.958 | OK 0.950 | ERROR 0.947 | |
| ![]() | OK 0.974 | OK 0.974 | OK 0.971 | | |
| ![]() | OK 0.979 | OK 0.971 | OK 0.967 | | |
| ![]() | OK 0.970 | | | | |
| ![]() | ERROR 0.941 | | | | |
| ![]() | OK 0.995 | OK 0.975 | ERROR 0.969 | OK 0.945 | |
| ![]() | | | | | |
| ![]() | ERROR 0.932 | | | | |
| ![]() | ERROR 0.949 | | | | |
| ![]() | OK 0.939 | | | | |
| ![]() | | | | | |
| ![]() | OK 0.959 | | | | |
| ![]() | ERROR 0.948 | ERROR 0.946 | ERROR 0.935 | | |
| ![]() | ERROR 0.967 | ERROR 0.956 | ERROR 0.956 | | |
| ![]() | OK 0.980 | OK 0.977 | | | |
| ![]() | ERROR 0.948 | | | | |
| ![]() | | | | | |
| ![]() | ERROR 0.981 | ERROR 0.971 | ERROR 0.950 | ERROR 0.941 | ERROR 0.940 |
| ![]() | | | | | |
| ![]() | OK 0.959 | ERROR 0.930 | | | |
| ![]() | OK 0.975 | OK 0.972 | OK 0.937 | | |
| ![]() | OK 0.974 | OK 0.967 | OK 0.949 | | |
| ![]() | OK 0.984 | ERROR 0.967 | ERROR 0.967 | ERROR 0.964 | ERROR 0.957 |
| ![]() | | | | | |
| ![]() | ERROR 0.949 | ERROR 0.940 | ERROR 0.940 | ERROR 0.930 | |
| ![]() | ERROR 0.959 | ERROR 0.959 | ERROR 0.930 | | |
| ![]() | ERROR 0.967 | ERROR 0.956 | ERROR 0.950 | ERROR 0.950 | ERROR 0.938 |
| ![]() | ERROR 0.933 | | | | |

@drzraf
Thank you for your comment. We noticed the issue and fixed it. 🤝🔧
We've updated the description of how the model should be initialized in the README.
It should work better now, so, you may want to check this out – we expect better results!
Wow, it's way, way better. I could lower the similarity threshold to 0.85 and got, if not perfect results (a couple of misses), at least very decent ones!
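One way to compare thresholds such as 0.93 and 0.85 is to count how many retained matches are correct at each value. The sketch below illustrates that idea only: `precision_at_threshold` is a hypothetical helper, and the scored pairs are invented toy values, not results from this thread.

```python
def precision_at_threshold(scored_pairs, threshold):
    # scored_pairs: list of (similarity, same_individual) tuples.
    # Returns (fraction of kept pairs that are correct, number of kept pairs).
    kept = [same for similarity, same in scored_pairs if similarity > threshold]
    if not kept:
        return None, 0
    return sum(kept) / len(kept), len(kept)

# Invented toy data: (cosine similarity, True if both images show the same dog).
toy_pairs = [
    (0.98, True), (0.95, True), (0.94, False),
    (0.90, True), (0.86, True), (0.80, False),
]

for threshold in (0.93, 0.85):
    precision, kept_count = precision_at_threshold(toy_pairs, threshold)
    print(f"threshold={threshold}: kept={kept_count}, precision={precision:.2f}")
```

A lower threshold keeps more pairs, so it only helps when the extra pairs are mostly correct, which is why the sweep reports both numbers.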
Would you mind providing some more information about the model outputs: their structure, how best to use them, how to visualize them(?), and how to think about them from a high-level perspective?
Another question: the model takes a list of PIL Images, but passing multiple arguments only returns one tensor. Am I doing something wrong? (I haven't tried a Dataset yet, but I believe it should work with just a plain list to begin with.)
Big thanks to your team!
Keep up the good work!
| ref | match_1 | match_2 | match_3 | match_4 | match_5 |
|---|---|---|---|---|---|
| ![]() | OK 0.982 | OK 0.977 | OK 0.963 | OK 0.937 | OK 0.911 |
| ![]() | OK 0.951 | OK 0.912 | OK 0.908 | OK 0.893 | |
| ![]() | OK 0.907 | | | | |
| ![]() | OK 0.928 | OK 0.905 | OK 0.903 | OK 0.874 | |
| ![]() | | | | | |
| ![]() | OK 0.975 | OK 0.950 | OK 0.899 | | |
| ![]() | OK 0.889 | | | | |
| ![]() | OK 0.930 | OK 0.928 | OK 0.909 | OK 0.904 | |
| ![]() | OK 0.930 | OK 0.925 | OK 0.903 | OK 0.885 | |
| ![]() | OK 0.936 | OK 0.932 | OK 0.906 | OK 0.888 | |
| ![]() | OK 0.971 | OK 0.966 | OK 0.951 | OK 0.883 | |
| ![]() | OK 0.948 | OK 0.906 | OK 0.893 | OK 0.873 | |
| ![]() | OK 0.951 | OK 0.947 | OK 0.941 | OK 0.921 | |
| ![]() | OK 0.932 | OK 0.924 | OK 0.915 | OK 0.911 | OK 0.908 |
| ![]() | OK 0.934 | OK 0.876 | OK 0.868 | | |
| ![]() | OK 0.921 | OK 0.911 | OK 0.909 | OK 0.883 | |
| ![]() | OK 0.924 | OK 0.912 | OK 0.906 | OK 0.894 | |
| ![]() | OK 0.881 | | | | |
| ![]() | | | | | |
| ![]() | OK 0.931 | OK 0.894 | | | |
| ![]() | OK 0.907 | | | | |
| ![]() | | | | | |
| ![]() | | | | | |
| ![]() | OK 0.937 | OK 0.932 | OK 0.932 | OK 0.930 | OK 0.917 |
| ![]() | OK 0.931 | OK 0.916 | | | |
| ![]() | OK 0.932 | OK 0.925 | OK 0.864 | OK 0.862 | |
| ![]() | OK 0.909 | OK 0.902 | OK 0.885 | OK 0.874 | |
| ![]() | OK 1.000 | OK 0.909 | | | |
| ![]() | OK 0.883 | OK 0.871 | | | |
| ![]() | OK 0.951 | OK 0.924 | OK 0.908 | OK 0.873 | |
| ![]() | OK 0.932 | OK 0.912 | OK 0.900 | | |
| ![]() | | | | | |
| ![]() | OK 0.904 | | | | |
| ![]() | OK 1.000 | OK 0.909 | | | |
| ![]() | | | | | |
| ![]() | OK 0.877 | OK 0.853 | | | |
| ![]() | OK 0.903 | OK 0.853 | | | |
| ![]() | OK 0.934 | OK 0.916 | OK 0.873 | | |
| ![]() | OK 0.928 | OK 0.877 | | | |
| ![]() | OK 0.967 | OK 0.965 | OK 0.906 | OK 0.870 | |
| ![]() | | | | | |
| ![]() | OK 0.948 | OK 0.908 | OK 0.908 | OK 0.894 | |
| ![]() | | | | | |
| ![]() | OK 0.977 | OK 0.969 | OK 0.957 | OK 0.930 | OK 0.924 |
| ![]() | OK 0.916 | OK 0.894 | | | |
| ![]() | OK 0.934 | OK 0.916 | OK 0.868 | | |
| ![]() | OK 0.952 | OK 0.950 | | | |
| ![]() | OK 0.899 | OK 0.891 | | | |
| ![]() | OK 0.992 | OK 0.966 | OK 0.941 | OK 0.909 | |
| ![]() | | | | | |
| ![]() | OK 0.967 | OK 0.933 | OK 0.888 | | |
| ![]() | OK 0.904 | | | | |
| ![]() | OK 0.928 | OK 0.889 | OK 0.881 | OK 0.853 | |
| ![]() | | | | | |
| ![]() | OK 0.863 | | | | |
| ![]() | OK 0.932 | OK 0.900 | OK 0.883 | OK 0.870 | |
| ![]() | | | | | |
| ![]() | OK 0.937 | OK 0.853 | | | |
| ![]() | | | | | |
| ![]() | OK 0.969 | OK 0.963 | OK 0.935 | OK 0.917 | OK 0.908 |
| ![]() | OK 0.992 | OK 0.971 | OK 0.947 | OK 0.911 | |
| ![]() | OK 0.912 | OK 0.864 | OK 0.862 | | |
| ![]() | OK 0.909 | OK 0.909 | | | |
| ![]() | | | | | |
| ![]() | | | | | |
| ![]() | OK 0.965 | OK 0.936 | OK 0.933 | OK 0.900 | OK 0.871 |
| ![]() | OK 0.975 | OK 0.952 | OK 0.891 | OK 0.870 | |
| ![]() | OK 0.937 | OK 0.903 | | | |
| ![]() | OK 0.870 | | | | |
| ![]() | OK 0.925 | OK 0.905 | OK 0.904 | OK 0.902 | |
| ![]() | OK 0.936 | OK 0.864 | | | |
| ![]() | OK 0.934 | OK 0.876 | OK 0.873 | | |
| ![]() | OK 0.936 | OK 0.925 | OK 0.900 | OK 0.864 | |
| ![]() | OK 0.863 | | | | |
| ![]() | OK 0.982 | OK 0.957 | OK 0.935 | OK 0.932 | OK 0.915 |

Thanks a lot for the detailed feedback; it is great to hear that the updated version gives you very reasonable results.
Regarding the model outputs: we discuss their structure, interpretation, and recommended ways to use them (including some visualization ideas) in a paper that is currently under review. As soon as the paper is published, we will add a link to it in the repository so that all the details are documented in one place.
About your second question: the behavior you see is expected. The model takes a list of PIL.Image objects, internally stacks them into a batch, and returns a single tensor of shape [batch_size, embedding_dim], where batch_size is the number of images in your list; this is the usual convention for PyTorch vision models. You don't need a Dataset for simple experiments; passing a plain list is perfectly fine. Here is a minimal example you can use to process a set of images:
```python
import glob
from PIL import Image
import torch
import torch.nn.functional as F

# `model` is assumed to be already initialized as described in the README.
paths = glob.glob("*.jpeg")
images = [Image.open(p).convert("RGB") for p in paths]

with torch.no_grad():
    embeddings = model(images)  # shape: [len(images), embedding_dim]
    embeddings = F.normalize(embeddings, dim=1)  # L2-normalize each row

print(embeddings.shape)
```
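Once every row is L2-normalized, the cosine similarity between two images is simply the dot product of their rows, so all pairwise scores can be read from a single matrix. The sketch below shows this with plain Python lists so that it runs without the model; `l2_normalize` and `similarity_matrix` are illustrative names and the vectors are toy values. With the normalized tensor from the example above, the equivalent would be `embeddings @ embeddings.T`.

```python
import math

def l2_normalize(vec):
    # Scale a non-zero vector to unit length.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def similarity_matrix(embeddings):
    # Entry [i][j] is the cosine similarity between rows i and j.
    unit = [l2_normalize(e) for e in embeddings]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]

# Toy vectors: the first two point the same way, the third is orthogonal to them.
sim = similarity_matrix([[3.0, 4.0], [6.0, 8.0], [-4.0, 3.0]])
for row in sim:
    print([round(value, 3) for value in row])
```

Row i of the matrix holds the scores of image i against every image, itself included, so the diagonal should be skipped when picking the best matches.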
Thanks again for the kind words and for taking the time to test the model so thoroughly. This kind of feedback is very helpful for us.
