	---
	license: cc-by-4.0
	library_name: timm
	---

	# Model card for vit_base_patch16_1024_128.audiomae_as2m

	A Vision Transformer (ViT) for audio, pretrained on AudioSet-2M with the self-supervised Masked Autoencoder (MAE) method.

	- This is a port of the AudioMAE ViT-B/16 weights for use with `timm`. The naming convention follows that of other ViT models in `timm`.
	- See the original repo here: https://github.com/facebookresearch/AudioMAE
	- For the AudioSet-20k fine-tuned checkpoint, see https://huggingface.co/gaunernst/vit_base_patch16_1024_128.audiomae_as2m_ft_as20k

	NOTE: this model does not have a classification head.
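
Because the backbone returns embeddings rather than class scores, any classifier has to be added on top. The following is a minimal sketch of a linear probe, not part of the original AudioMAE release: `LinearProbe` and `NUM_CLASSES` are hypothetical names, and a random tensor stands in for the 768-dimensional embeddings so the snippet runs without downloading the weights.

```python
import torch
import torch.nn as nn

EMBED_DIM = 768   # embedding size returned by the backbone
NUM_CLASSES = 10  # hypothetical number of target classes


class LinearProbe(nn.Module):
    """Hypothetical linear classifier applied to frozen clip-level embeddings."""

    def __init__(self, embed_dim: int = EMBED_DIM, num_classes: int = NUM_CLASSES):
        super().__init__()
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # (batch, embed_dim) -> (batch, num_classes)
        return self.fc(embeddings)


probe = LinearProbe()
embeddings = torch.randn(4, EMBED_DIM)  # stand-in for backbone outputs on a batch of 4 clips
logits = probe(embeddings)
print(logits.shape)
```

In practice the stand-in tensor would be replaced by the backbone's output, computed under `torch.no_grad()` if the backbone is kept frozen.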

	## Model Details
	- Model Type: Audio feature backbone
	- Papers:
	  - Masked Autoencoders that Listen: https://arxiv.org/abs/2207.06405
	- Pretrain Dataset: AudioSet-2M
	- Original: https://github.com/facebookresearch/AudioMAE

	## Model Usage
	### Audio Embeddings

	```python
	import timm
	import torch
	import torch.nn.functional as F
	from torchaudio.compliance import kaldi

	# for fine-tuning, you can pass `num_classes={your number of classes}`
	model = timm.create_model("hf_hub:gaunernst/vit_base_patch16_1024_128.audiomae_as2m", pretrained=True)
	model = model.eval()

	MEAN = -4.2677393
	STD = 4.5689974

	audio = torch.randn(1, 10 * 16_000) # make sure input is 16kHz
	melspec = kaldi.fbank(audio, htk_compat=True, window_type="hanning", num_mel_bins=128) # shape (n_frames, 128)

	# AudioMAE only accepts 1024-frame input
	if melspec.shape[0] < 1024:
	    # zero-pad along the time axis up to 1024 frames
	    melspec = F.pad(melspec, (0, 0, 0, 1024 - melspec.shape[0]))
	else:
	    # crop to the first 1024 frames
	    melspec = melspec[:1024]
	melspec = (melspec - MEAN) / (STD * 2)

	melspec = melspec.view(1, 1, 1024, 128) # add batch dim and channel dim
	output = model(melspec) # embeddings with shape (1, 768)

	# to get frame level embeddings
	output = model.forward_features(melspec) # shape (1, 513, 768)
	output = output[:, 1:] # remove [CLS] token
	output = output.unflatten(1, (1024 // 16, 128 // 16)) # (1, 64, 8, 768) -> 2D patches
	output = output.mean(2) # (1, 64, 768) -> mean pooling across mel dimension
	```
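
The shape comments in the example above follow from simple arithmetic. The sketch below reproduces that bookkeeping in plain Python; it assumes the `kaldi.fbank` defaults of a 25 ms window, a 10 ms shift, and only full windows kept (`snip_edges=True`). The helper names `num_fbank_frames` and `patch_grid` are illustrative and not part of `timm` or `torchaudio`.

```python
SAMPLE_RATE = 16_000
FRAME_LENGTH_MS = 25.0  # assumed kaldi.fbank default
FRAME_SHIFT_MS = 10.0   # assumed kaldi.fbank default
TARGET_FRAMES = 1024    # input length expected by the model
NUM_MEL_BINS = 128
PATCH_SIZE = 16


def num_fbank_frames(num_samples: int) -> int:
    """Number of frames when only full windows are kept (assumed snip_edges=True)."""
    window = int(SAMPLE_RATE * FRAME_LENGTH_MS / 1000)  # 400 samples
    shift = int(SAMPLE_RATE * FRAME_SHIFT_MS / 1000)    # 160 samples
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // shift


def patch_grid(frames: int = TARGET_FRAMES, mel_bins: int = NUM_MEL_BINS,
               patch: int = PATCH_SIZE) -> tuple:
    """Patches along the time axis and along the mel axis."""
    return frames // patch, mel_bins // patch


frames = num_fbank_frames(10 * SAMPLE_RATE)  # a 10-second clip
time_patches, mel_patches = patch_grid()
num_tokens = time_patches * mel_patches + 1  # patch tokens plus the [CLS] token

print(frames, TARGET_FRAMES - frames)  # frames produced, frames of zero-padding
print(time_patches, mel_patches, num_tokens)
```

Under these assumptions a 10-second clip yields 998 frames, so 26 frames of padding are added, and the 64 x 8 patch grid plus the [CLS] token accounts for the 513 tokens returned by `forward_features`.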

	## Citation
	```bibtex
	@inproceedings{huang2022amae,
	title = {Masked Autoencoders that Listen},
	author = {Huang, Po-Yao and Xu, Hu and Li, Juncheng and Baevski, Alexei and Auli, Michael and Galuba, Wojciech and Metze, Florian and Feichtenhofer, Christoph},
	booktitle = {NeurIPS},
	year = {2022}
	}
	```
	```bibtex
	@misc{rw2019timm,
	author = {Ross Wightman},
	title = {PyTorch Image Models},
	year = {2019},
	publisher = {GitHub},
	journal = {GitHub repository},
	doi = {10.5281/zenodo.4414861},
	howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
	}
	```