---
library_name: tf-keras
license: apache-2.0
title: Video Vision Transformer on medmnist
emoji: 🧑‍⚕️
colorFrom: red
colorTo: green
sdk: gradio
app_file: app.py
pinned: false
---
## Keras Implementation of Video Vision Transformer on medmnist
This repo contains the model for [this Keras example on Video Vision Transformer](https://keras.io/examples/vision/vivit/).
## Background Information
This example implements [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Arnab et al., a pure Transformer-based model for video classification. The authors propose a novel embedding scheme and a number of Transformer variants to model video clips.
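The central embedding idea is the "tubelet" embedding: non-overlapping spatio-temporal volumes are extracted from the clip and linearly projected into tokens. Below is a minimal sketch of such a layer in tf-keras, following the approach of the linked Keras example; the layer and argument names are illustrative. A strided `Conv3D` performs both patch extraction and the linear projection in one step.

```
import tensorflow as tf
from tensorflow.keras import layers


class TubeletEmbedding(layers.Layer):
    """Extracts non-overlapping spatio-temporal patches (tubelets)
    and linearly projects them into token embeddings."""

    def __init__(self, embed_dim, patch_size, **kwargs):
        super().__init__(**kwargs)
        # A strided 3D convolution is equivalent to splitting the video
        # into tubelets and applying a shared linear projection.
        self.projection = layers.Conv3D(
            filters=embed_dim,
            kernel_size=patch_size,
            strides=patch_size,
            padding="VALID",
        )
        # Flatten the (T', H', W') grid of patches into a token sequence.
        self.flatten = layers.Reshape(target_shape=(-1, embed_dim))

    def call(self, videos):
        projected_patches = self.projection(videos)
        return self.flatten(projected_patches)
```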
## Datasets
We use the [MedMNIST v2: A Large-Scale Lightweight Benchmark for 2D and 3D Biomedical Image Classification](https://medmnist.com/) dataset.
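Each MedMNIST 3D dataset is distributed as a single `.npz` archive containing train/validation/test splits. The following is a minimal loading sketch, assuming the `organmnist3d.npz` archive has already been downloaded from medmnist.com and uses the standard MedMNIST split keys (`train_images`, `train_labels`, and so on).

```
import numpy as np

# Hypothetical local path to the archive downloaded from medmnist.com.
DATA_PATH = "organmnist3d.npz"

with np.load(DATA_PATH) as data:
    train_videos = data["train_images"]            # (N, 28, 28, 28) uint8 volumes
    train_labels = data["train_labels"].flatten()  # integer class labels
    valid_videos = data["val_images"]
    valid_labels = data["val_labels"].flatten()

# Add a channel axis so each sample matches INPUT_SHAPE = (28, 28, 28, 1).
train_videos = train_videos[..., None]
valid_videos = valid_videos[..., None]
```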
## Training Parameters
```
import tensorflow as tf

# DATA
DATASET_NAME = "organmnist3d"
BATCH_SIZE = 32
AUTO = tf.data.AUTOTUNE
INPUT_SHAPE = (28, 28, 28, 1)  # (frames, height, width, channels)
NUM_CLASSES = 11

# OPTIMIZER
LEARNING_RATE = 1e-4
WEIGHT_DECAY = 1e-5

# TRAINING
EPOCHS = 80

# TUBELET EMBEDDING
PATCH_SIZE = (8, 8, 8)
NUM_PATCHES = (INPUT_SHAPE[0] // PATCH_SIZE[0]) ** 2

# ViViT ARCHITECTURE
LAYER_NORM_EPS = 1e-6
PROJECTION_DIM = 128
NUM_HEADS = 8
NUM_LAYERS = 8
```
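These constants are consumed by the input pipeline and the model definition. As a short sketch, the data-related ones might be wired into a `tf.data` pipeline like the one below; the function name and shuffle-buffer size are illustrative and not taken from the original script, and the constants `BATCH_SIZE` and `AUTO` come from the block above.

```
import tensorflow as tf


def prepare_dataloader(videos, labels, loader_type="train"):
    """Builds a batched, prefetched tf.data pipeline from numpy arrays."""
    dataset = tf.data.Dataset.from_tensor_slices((videos, labels))
    if loader_type == "train":
        # Shuffle only the training split; buffer size is illustrative.
        dataset = dataset.shuffle(BATCH_SIZE * 100)
    dataset = (
        dataset.map(
            lambda frames, label: (tf.cast(frames, tf.float32), label),
            num_parallel_calls=AUTO,
        )
        .batch(BATCH_SIZE)
        .prefetch(AUTO)
    )
    return dataset


# Usage: trainloader = prepare_dataloader(train_videos, train_labels, "train")
```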