license: cc-by-4.0
library_name: timm

# Model card for vit_base_patch16_1024_128.audiomae_as2m

A Vision Transformer (ViT) for audio, pretrained on AudioSet-2M with the self-supervised Masked Autoencoder (MAE) method.

- This is a port of the AudioMAE ViT-B/16 weights for use with `timm`. The naming convention follows that of other `timm` ViT models.
- See the original repo here: https://github.com/facebookresearch/AudioMAE
- For the AudioSet-20k fine-tuned checkpoint, see https://huggingface.co/gaunernst/vit_base_patch16_1024_128.audiomae_as2m_ft_as20k

NOTE: this model does not have a classification head.

## Model Details

- **Model Type:** Audio feature backbone
- **Papers:**
  - Masked Autoencoders that Listen: https://arxiv.org/abs/2207.06405
- **Pretrain Dataset:** AudioSet-2M
- **Original:** https://github.com/facebookresearch/AudioMAE

## Model Usage

### Audio Embeddings
```python
import timm
import torch
import torch.nn.functional as F
from torchaudio.compliance import kaldi

# for fine-tuning, you can pass `num_classes={your number of classes}`
model = timm.create_model("hf_hub:gaunernst/vit_base_patch16_1024_128.audiomae_as2m", pretrained=True)
model = model.eval()

MEAN = -4.2677393
STD = 4.5689974

audio = torch.randn(1, 10 * 16_000)  # make sure input is 16kHz
melspec = kaldi.fbank(audio, htk_compat=True, window_type="hanning", num_mel_bins=128)  # shape (n_frames, 128)

# AudioMAE only accepts 1024-frame input
if melspec.shape[0] < 1024:
    melspec = F.pad(melspec, (0, 0, 0, 1024 - melspec.shape[0]))
else:
    melspec = melspec[:1024]
melspec = (melspec - MEAN) / (STD * 2)

melspec = melspec.view(1, 1, 1024, 128)  # add batch dim and channel dim
output = model(melspec)  # embeddings with shape (1, 768)

# to get frame level embeddings
output = model.forward_features(melspec)  # shape (1, 513, 768)
output = output[:, 1:]  # remove [CLS] token
output = output.unflatten(1, (1024 // 16, 128 // 16))  # (1, 64, 8, 768) -> 2D patches
output = output.mean(2)  # (1, 64, 768) -> mean pooling across mel dimension
```
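The shapes quoted in the comments above can be checked with plain arithmetic. The sketch below is illustrative only and does not load the model. It assumes the `kaldi.fbank` defaults that the example relies on (16 kHz sample rate, 25 ms window, 10 ms shift, only full windows kept); the helper names `num_fbank_frames`, `patch_grid`, and `embedding_time_span` are hypothetical and not part of `timm` or `torchaudio`.

```python
# Shape arithmetic for the usage example (pure Python, no model required).
# Assumed fbank settings: 16 kHz input, 25 ms window, 10 ms shift, full windows only.
SAMPLE_RATE = 16_000
WINDOW = SAMPLE_RATE * 25 // 1000  # 400 samples per analysis window
SHIFT = SAMPLE_RATE * 10 // 1000   # 160 samples between window starts
TARGET_FRAMES, N_MELS, PATCH = 1024, 128, 16


def num_fbank_frames(num_samples: int) -> int:
    """Number of fbank frames when incomplete trailing windows are dropped."""
    if num_samples < WINDOW:
        return 0
    return 1 + (num_samples - WINDOW) // SHIFT


def patch_grid() -> tuple[int, int]:
    """(time steps, mel steps) after splitting the input into 16x16 patches."""
    return TARGET_FRAMES // PATCH, N_MELS // PATCH


def embedding_time_span(index: int) -> tuple[float, float]:
    """Approximate (start, end) in seconds of frame-level embedding `index`,
    measured by the start times of the 16 fbank frames it covers."""
    seconds_per_step = PATCH * SHIFT / SAMPLE_RATE
    return index * seconds_per_step, (index + 1) * seconds_per_step


frames = num_fbank_frames(10 * SAMPLE_RATE)
time_steps, mel_steps = patch_grid()

print(frames)                      # 998 frames for 10 s of audio
print(TARGET_FRAMES - frames)      # 26 frames of padding needed
print(time_steps * mel_steps + 1)  # 513 tokens: 64 * 8 patches plus [CLS]
print(embedding_time_span(1))      # time range of the second embedding
```

Under these assumptions, a 10-second clip yields slightly fewer than 1024 frames, which is why the usage example pads before normalizing, and each of the 64 frame-level embeddings corresponds to roughly 160 ms of audio.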
## Citation

```bibtex
@inproceedings{huang2022amae,
  title = {Masked Autoencoders that Listen},
  author = {Huang, Po-Yao and Xu, Hu and Li, Juncheng and Baevski, Alexei and Auli, Michael and Galuba, Wojciech and Metze, Florian and Feichtenhofer, Christoph},
  booktitle = {NeurIPS},
  year = {2022}
}
```

```bibtex
@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}
```