embeddings usage / test-case

#1
by drzraf - opened

First I want to share with your team how grateful and happy I am to see this sort of model being built and provided.

It can't be overstated how useful this could become in many countries for lost/abandoned cats and dogs.

I gave it a quick shot on a small sample (234 pictures of 15 distinct, randomly chosen dogs).

There is no precise indication of how the embeddings are actually meant to be used, but as far as I understand it, the model output isn't pixel-related in any way but rather acts as a face / morphological similarity ID, is that right?

So I went with computing torch.cosine_similarity() on the output tensors of every image pair and, for a given reference image, selected the 5 best matches scoring over 0.93 (OK indicates the match is the same individual as the reference in the first column, ERROR otherwise).
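
Roughly, this is what I did (a simplified sketch; `embeddings` stands in for a dict I built beforehand, mapping each image path to the model's output tensor for that image):

import itertools
import torch

# embeddings: {image_path: 1-D output tensor from the model}, built beforehand
pair_scores = {
    (a, b): torch.cosine_similarity(embeddings[a], embeddings[b], dim=0).item()
    for a, b in itertools.combinations(embeddings, 2)
}

def best_matches(ref, k=5, threshold=0.93):
    """Top-k images whose similarity with `ref` is above the threshold."""
    candidates = [(b if a == ref else a, score)
                  for (a, b), score in pair_scores.items()
                  if ref in (a, b) and score >= threshold]
    return sorted(candidates, key=lambda t: t[1], reverse=True)[:k]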

I'm not even sure whether I:

  • should be amazed that it even works out of the box, with nothing more involved than this and a model as small as 85 MB,
  • should be disappointed by some of the blatant false positives observed (where very distinct individuals share a high torch.cosine_similarity()) and by the many unmatched pairs for some pictures, or
  • should be hopeful that there are easy ways to get far superior results.

But I'm mostly wondering whether I'm doing it right, and whether there is any guidance you could kindly provide to make the best of this model.

Thank you!

ref match_1 match_2 match_3 match_4 match_5
OK 0.977 OK 0.945
OK 0.954 ERROR 0.939
OK 0.953
ERROR 0.980 ERROR 0.976 ERROR 0.971 ERROR 0.956 ERROR 0.942
OK 0.979 OK 0.974 OK 0.949
ERROR 0.939
OK 0.970
ERROR 0.969 ERROR 0.968 OK 0.954
OK 0.956
OK 0.987 ERROR 0.961 OK 0.959 ERROR 0.940 ERROR 0.937
ERROR 0.956
ERROR 0.931
ERROR 0.968 ERROR 0.947 ERROR 0.946 ERROR 0.940
OK 0.953 ERROR 0.949
ERROR 0.968 ERROR 0.935 ERROR 0.933
ERROR 0.949
ERROR 0.932
ERROR 0.941
OK 1.000 ERROR 0.959 ERROR 0.947 ERROR 0.947 ERROR 0.940
OK 0.946 OK 0.945 OK 0.937
ERROR 0.964
ERROR 0.931
ERROR 0.981 ERROR 0.980 ERROR 0.966 ERROR 0.941 ERROR 0.938
ERROR 0.964 OK 0.954 OK 0.945
OK 1.000 ERROR 0.959 ERROR 0.947 ERROR 0.947 ERROR 0.940
ERROR 0.961 ERROR 0.958 ERROR 0.930
OK 0.984 ERROR 0.956 ERROR 0.956 ERROR 0.955 OK 0.954
ERROR 0.976 OK 0.959 ERROR 0.957 ERROR 0.956 ERROR 0.955
OK 0.995 OK 0.972 ERROR 0.968 OK 0.946
ERROR 0.956
OK 0.939
OK 0.956
ERROR 0.930
OK 0.987 ERROR 0.958 OK 0.950 ERROR 0.947
OK 0.974 OK 0.974 OK 0.971
OK 0.979 OK 0.971 OK 0.967
OK 0.970
ERROR 0.941
OK 0.995 OK 0.975 ERROR 0.969 OK 0.945
ERROR 0.932
ERROR 0.949
OK 0.939
OK 0.959
ERROR 0.948 ERROR 0.946 ERROR 0.935
ERROR 0.967 ERROR 0.956 ERROR 0.956
OK 0.980 OK 0.977
ERROR 0.948
ERROR 0.981 ERROR 0.971 ERROR 0.950 ERROR 0.941 ERROR 0.940
OK 0.959 ERROR 0.930
OK 0.975 OK 0.972 OK 0.937
OK 0.974 OK 0.967 OK 0.949
OK 0.984 ERROR 0.967 ERROR 0.967 ERROR 0.964 ERROR 0.957
ERROR 0.949 ERROR 0.940 ERROR 0.940 ERROR 0.930
ERROR 0.959 ERROR 0.959 ERROR 0.930
ERROR 0.967 ERROR 0.956 ERROR 0.950 ERROR 0.950 ERROR 0.938
ERROR 0.933
AvitoTech org

@drzraf
Thank you for your comment. We noticed the issue and fixed it. 🤝🔧

We've updated the description of how the model should be initialized in the README.
It should work better now, so you may want to check this out – we expect better results!

Wow, it's way, way better. I could reduce the similarity threshold to 0.85 and got, if not perfect (a couple of misses), at least very decent results!

Would you mind providing some more information about the model outputs? Their structure, how best to use them, how to visualize them(?), and how to think about them from a high-level perspective?

Another question: the model takes a list of PIL Images, but passing multiple images only returns one tensor. Am I doing something wrong? (I haven't tried a Dataset yet, but I believe it ought to work with just a plain list to begin with.)

Big thanks to your team!
Keep up the good work!

ref match_1 match_2 match_3 match_4 match_5
OK 0.982 OK 0.977 OK 0.963 OK 0.937 OK 0.911
OK 0.951 OK 0.912 OK 0.908 OK 0.893
OK 0.907
OK 0.928 OK 0.905 OK 0.903 OK 0.874
OK 0.975 OK 0.950 OK 0.899
OK 0.889
OK 0.930 OK 0.928 OK 0.909 OK 0.904
OK 0.930 OK 0.925 OK 0.903 OK 0.885
OK 0.936 OK 0.932 OK 0.906 OK 0.888
OK 0.971 OK 0.966 OK 0.951 OK 0.883
OK 0.948 OK 0.906 OK 0.893 OK 0.873
OK 0.951 OK 0.947 OK 0.941 OK 0.921
OK 0.932 OK 0.924 OK 0.915 OK 0.911 OK 0.908
OK 0.934 OK 0.876 OK 0.868
OK 0.921 OK 0.911 OK 0.909 OK 0.883
OK 0.924 OK 0.912 OK 0.906 OK 0.894
OK 0.881
OK 0.931 OK 0.894
OK 0.907
OK 0.937 OK 0.932 OK 0.932 OK 0.930 OK 0.917
OK 0.931 OK 0.916
OK 0.932 OK 0.925 OK 0.864 OK 0.862
OK 0.909 OK 0.902 OK 0.885 OK 0.874
OK 1.000 OK 0.909
OK 0.883 OK 0.871
OK 0.951 OK 0.924 OK 0.908 OK 0.873
OK 0.932 OK 0.912 OK 0.900
OK 0.904
OK 1.000 OK 0.909
OK 0.877 OK 0.853
OK 0.903 OK 0.853
OK 0.934 OK 0.916 OK 0.873
OK 0.928 OK 0.877
OK 0.967 OK 0.965 OK 0.906 OK 0.870
OK 0.948 OK 0.908 OK 0.908 OK 0.894
OK 0.977 OK 0.969 OK 0.957 OK 0.930 OK 0.924
OK 0.916 OK 0.894
OK 0.934 OK 0.916 OK 0.868
OK 0.952 OK 0.950
OK 0.899 OK 0.891
OK 0.992 OK 0.966 OK 0.941 OK 0.909
OK 0.967 OK 0.933 OK 0.888
OK 0.904
OK 0.928 OK 0.889 OK 0.881 OK 0.853
OK 0.863
OK 0.932 OK 0.900 OK 0.883 OK 0.870
OK 0.937 OK 0.853
OK 0.969 OK 0.963 OK 0.935 OK 0.917 OK 0.908
OK 0.992 OK 0.971 OK 0.947 OK 0.911
OK 0.912 OK 0.864 OK 0.862
OK 0.909 OK 0.909
OK 0.965 OK 0.936 OK 0.933 OK 0.900 OK 0.871
OK 0.975 OK 0.952 OK 0.891 OK 0.870
OK 0.937 OK 0.903
OK 0.870
OK 0.925 OK 0.905 OK 0.904 OK 0.902
OK 0.936 OK 0.864
OK 0.934 OK 0.876 OK 0.873
OK 0.936 OK 0.925 OK 0.900 OK 0.864
OK 0.863
OK 0.982 OK 0.957 OK 0.935 OK 0.932 OK 0.915

@drzraf

Thanks a lot for the detailed feedback; it is great to hear that the updated version gives you very reasonable results.

Regarding the model outputs: we discuss their structure, interpretation, and recommended ways to use them (including some visualization ideas) in a paper that is currently under review. As soon as the paper is published, we will add a link to it in the repository so that all the details are documented in one place.

About your second question: the behavior you see is expected. The model takes a list of PIL.Image objects, internally stacks them into a batch, and returns a single tensor of shape [batch_size, embedding_dim], where batch_size is the number of images in your list; this is the usual convention for PyTorch vision models. You don't need a Dataset for simple experiments; passing a plain list is perfectly fine. Here is a minimal example you can use to process a set of images:

import glob
from PIL import Image
import torch
import torch.nn.functional as F

paths = glob.glob("*.jpeg")
images = [Image.open(p).convert("RGB") for p in paths]

# "model" is assumed to be initialized as described in the README
with torch.no_grad():
    embeddings = model(images)          # shape: [len(images), embedding_dim]
    embeddings = F.normalize(embeddings, dim=1)

print(embeddings.shape)
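
And, in case it helps with the matching step you described, here is one possible way to turn those normalized embeddings into top-k matches (just a sketch, not an official recommendation; the 0.85 threshold is simply the value from your experiment):

# Since the embeddings are L2-normalized, cosine similarity is a plain dot product
similarity = embeddings @ embeddings.T            # shape: [N, N]
similarity.fill_diagonal_(-1.0)                   # ignore self-matches

threshold = 0.85                                  # your threshold; tune as needed
k = min(5, similarity.size(0) - 1)
top_scores, top_idx = similarity.topk(k, dim=1)   # k best matches per reference image

for i, ref_path in enumerate(paths):
    matches = [(paths[j], round(score.item(), 3))
               for j, score in zip(top_idx[i].tolist(), top_scores[i])
               if score.item() >= threshold]
    print(ref_path, matches)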

Thanks again for the kind words and for taking the time to test the model so thoroughly. This kind of feedback is very helpful for us.
