---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen2.5-VL-7B-Instruct
tags:
- axolotl
- transformers
datasets:
- diabolic6045/sanskrit-ocr-parallel-corpus-chat-template
- snskrt/Sanskrit_OCR_Parallel_Corpus
pipeline_tag: text-generation
model-index:
- name: qwen2-5-vl-sanskrit-ocr
  results:
  - task:
      type: image-to-text
    dataset:
      name: Sanskrit OCR Test Set
      type: sanskrit-ocr
    metrics:
    - name: Exact Match Accuracy
      type: exact_match
      value: 1.59
    - name: Character-level Accuracy
      type: character_accuracy
      value: 86.38
    - name: Token-level Jaccard Similarity
      type: jaccard_similarity
      value: 50.44
    - name: Success Rate
      type: success_rate
      value: 100
    source:
      name: Sanskrit OCR Evaluation
      url: https://huggingface.co/datasets/diabolic6045/sanskrit-ocr-parallel-corpus-chat-template/viewer/default/test
language:
- sa
---

# Sanskrit-Qwen2.5-VL-7B-Instruct-OCR

This model is a fine-tuned version of [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) on the [diabolic6045/sanskrit-ocr-parallel-corpus-chat-template](https://huggingface.co/datasets/diabolic6045/sanskrit-ocr-parallel-corpus-chat-template) dataset, which was converted from [snskrt/Sanskrit_OCR_Parallel_Corpus](https://huggingface.co/datasets/snskrt/Sanskrit_OCR_Parallel_Corpus) by [Sanskrit Datasets](https://huggingface.co/snskrt).

It achieves the following results on the evaluation set:
- Loss: 0.2660
- Memory/max mem active (GiB): 20.79
- Memory/max mem allocated (GiB): 20.79
- Memory/device mem reserved (GiB): 21.46

## Model description

This is a fine-tuned version of Qwen2.5-VL-7B-Instruct adapted for Sanskrit OCR (optical character recognition). The model was trained using LoRA (Low-Rank Adaptation) on a dataset of Sanskrit text images and their corresponding transcriptions.

**Key Features:**
- **Base Model**: Qwen/Qwen2.5-VL-7B-Instruct (7 billion parameters)
- **Task**: Sanskrit OCR - converting images of Sanskrit text to machine-readable text
- **Training Method**: LoRA fine-tuning of the vision-language model
- **Dataset**: Sanskrit OCR Parallel Corpus with chat-template formatting
- **Architecture**: Vision-language model with multimodal understanding

**Capabilities:**
- Read and transcribe Sanskrit text from images
- Handle various Sanskrit scripts and fonts
- Process text and visual inputs simultaneously
- Generate machine-readable Sanskrit transcriptions

The model retains the original Qwen2.5-VL vision-language capabilities while being specialized for Sanskrit text recognition. A minimal inference sketch is given at the end of this card.

## Training and evaluation data

### Training Dataset

The model was trained on the [diabolic6045/sanskrit-ocr-parallel-corpus-chat-template](https://huggingface.co/datasets/diabolic6045/sanskrit-ocr-parallel-corpus-chat-template) dataset, which contains Sanskrit text images paired with their corresponding transcriptions. The dataset was converted from the original [snskrt/Sanskrit_OCR_Parallel_Corpus](https://huggingface.co/datasets/snskrt/Sanskrit_OCR_Parallel_Corpus) and formatted with chat templates for vision-language training, as sketched below.
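For illustration, a single training record in the chat-template format might look like the following. This is a hedged sketch: the image field name, file name, and prompt wording are assumptions based on typical multimodal chat-template records, not taken from the dataset itself.

```python
# Illustrative sketch of one chat-template training record (field names for the
# image reference and the exact prompt/transcription text are assumptions).
record = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "page_0001.png"},  # hypothetical image reference
                {"type": "text", "text": "Transcribe the Sanskrit text in this image."},
            ],
        },
        {
            "role": "assistant",
            "content": [
                # Ground-truth transcription (placeholder verse for illustration).
                {"type": "text", "text": "धर्मक्षेत्रे कुरुक्षेत्रे समवेता युयुत्सवः"},
            ],
        },
    ]
}
```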
### Evaluation Results

The model was evaluated on a test set of 314 Sanskrit text samples:

| Metric | Value |
|:------:|:-----:|
| **Total Samples** | 314 |
| **Successful Samples** | 314 |
| **Failed Samples** | 0 |
| **Success Rate** | 100.00% |
| **Exact Match Accuracy** | 1.59% |
| **Character-level Accuracy** | 86.38% |
| **Token-level Jaccard Similarity** | 50.44% |

**Key Insights:**
- All 314 test samples were processed without failures
- High character-level accuracy (86.38%) indicates good recognition of individual Sanskrit characters
- Very low exact-match accuracy (1.59%) shows that full transcriptions rarely match the reference verbatim, leaving substantial room for improvement on complete-text output
- Moderate token-level Jaccard similarity (50.44%) indicates reasonable word-level overlap with the references

(A sketch of how these metrics can be computed follows the training results below.)

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0002
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- distributed_type: multi-GPU
- num_devices: 2
- gradient_accumulation_steps: 4
- total_train_batch_size: 16
- total_eval_batch_size: 4
- optimizer: ADAMW_BNB (8-bit) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 110
- training_steps: 1105

### Training results

| Training Loss | Epoch | Step | Validation Loss | Mem Active (GiB) | Mem Allocated (GiB) | Mem Reserved (GiB) |
|:-------------:|:-----:|:----:|:---------------:|:----------------:|:------------------:|:------------------:|
| No log        | 0     | 0    | 3.3372          | 17.59            | 17.59              | 17.66              |
| 0.2428        | 1.0   | 369  | 0.3075          | 20.69            | 20.69              | 21.27              |
| 0.2057        | 2.0   | 738  | 0.2660          | 20.79            | 20.79              | 21.46              |
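As referenced above, here is a minimal sketch of how the three accuracy metrics could be computed. The exact evaluation script is not included with this card, so the definitions are assumptions: character-level accuracy as 1 minus the length-normalized Levenshtein distance, and Jaccard similarity over whitespace-separated tokens.

```python
# Hedged sketch of the evaluation metrics (assumed definitions, not the exact
# script used for this card).
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (0 if chars match)
            ))
        prev = curr
    return prev[-1]

def exact_match(pred: str, ref: str) -> float:
    """1.0 if the prediction matches the reference verbatim (after stripping)."""
    return float(pred.strip() == ref.strip())

def char_accuracy(pred: str, ref: str) -> float:
    """1 - normalized edit distance; 1.0 when both strings are empty."""
    denom = max(len(pred), len(ref))
    return 1.0 if denom == 0 else 1.0 - levenshtein(pred, ref) / denom

def jaccard(pred: str, ref: str) -> float:
    """Jaccard similarity over whitespace-separated token sets."""
    p, r = set(pred.split()), set(ref.split())
    return len(p & r) / len(p | r) if (p | r) else 1.0
```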
This model was trained using [Axolotl](https://github.com/axolotl-ai-cloud/axolotl).
<details><summary>See axolotl config</summary>

axolotl version: `0.12.2`
```yaml
base_model: Qwen/Qwen2.5-VL-7B-Instruct
processor_type: AutoProcessor

# these 3 lines are needed for now to handle vision chat templates w images
skip_prepare_dataset: true
remove_unused_columns: false
sample_packing: false

chat_template: qwen2_vl
datasets:
  - path: sanskrit_multimodal_train.json
    type: chat_template
    field_messages: messages
dataset_prepared_path: last_run_prepared
val_set_size: 0.01
output_dir: ./outputs/out-qwen2-5-vl

adapter: lora
lora_model_dir:

sequence_len: 2048
pad_to_sequence_len: false

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: 'model.language_model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'

wandb_project: Sanskrit-OCR
wandb_entity:
wandb_watch:
wandb_name: qwen2-5-vl-sanskrit-ocr
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 3
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

bf16: true
fp16:
tf32: true

gradient_checkpointing: true
logging_steps: 1
flash_attention: true
eager_attention:

warmup_ratio: 0.1
evals_per_epoch: 1
saves_per_epoch: 1
weight_decay: 0.0

# Automatically upload checkpoint and final model to HF
hub_model_id: diabolic6045/qwen2-5-vl-sanskrit-ocr-lora
# save_first_step: true  # uncomment this to validate checkpoint saving works with your config
```

</details>
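The config above trains a LoRA adapter, pushed to `diabolic6045/qwen2-5-vl-sanskrit-ocr-lora` per `hub_model_id`. A minimal sketch of attaching that adapter to the base model with PEFT, assuming the adapter repo is publicly accessible:

```python
import torch
from peft import PeftModel
from transformers import Qwen2_5_VLForConditionalGeneration

# Load the base model, then attach the trained LoRA adapter from the Hub.
# The adapter repo id comes from `hub_model_id` in the config above.
base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "diabolic6045/qwen2-5-vl-sanskrit-ocr-lora")
model = model.merge_and_unload()  # optional: fold the LoRA weights into the base model
```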

### Framework versions

- PEFT 0.17.0
- Transformers 4.55.2
- PyTorch 2.7.1+cu128
- Datasets 4.0.0
- Tokenizers 0.21.2
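### Example usage

As referenced in the model description, here is a minimal inference sketch. The image path and prompt wording are illustrative assumptions; any Qwen2.5-VL-compatible chat prompt should work.

```python
import torch
from PIL import Image
from peft import PeftModel
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Load the base model with the LoRA adapter (see the loading sketch above)
# and the processor from the base model repo.
base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "diabolic6045/qwen2-5-vl-sanskrit-ocr-lora")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Build a single-image chat prompt; the prompt text is an assumption.
image = Image.open("sanskrit_page.png")  # hypothetical input image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Transcribe the Sanskrit text in this image."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, i.e. the transcription.
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```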