QTuneVL1.5-2B, developed by the Reconova AI Lab and BDAA-Lab

Introduction

We’re excited to introduce QTuneVL1.5-2B, the latest in Reconova AI Lab’s series of multimodal large language models. Building on QTuneVL1-2B, it incorporates key features from both InternVL and Mini-Monkey to deliver even greater performance.

Like QTuneVL1-2B, QTuneVL1.5-2B is a lightweight MLLM that incorporates cropping and padding strategies from Mini-Monkey/UReader/InternVL, and it has been fine-tuned from InternVL3-2B.
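
The exact preprocessing is not spelled out in this card, so the sketch below is only a rough illustration of the dynamic-tiling idea used by InternVL/Mini-Monkey-style models: pick a grid whose aspect ratio matches the input, resize to that grid, cut fixed-size crops, and append a downsampled global view. The function name and the 448-pixel tile size are assumptions, not the model's actual pipeline.

```python
from PIL import Image

TILE = 448  # assumed tile size; InternVL-style models commonly use 448x448 crops


def dynamic_tile(image: Image.Image, max_tiles: int = 12):
    """Illustrative dynamic cropping: choose a cols x rows grid whose aspect
    ratio is closest to the image's, resize to that grid, and slice into tiles."""
    w, h = image.size
    aspect = w / h

    # Enumerate candidate grids with at most max_tiles tiles and keep the one
    # whose aspect ratio best matches the input image.
    best, best_diff = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            diff = abs(aspect - cols / rows)
            if diff < best_diff:
                best, best_diff = (cols, rows), diff
    cols, rows = best

    # Resize to the grid resolution, then cut TILE x TILE crops.
    resized = image.resize((cols * TILE, rows * TILE))
    tiles = [
        resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
        for r in range(rows)
        for c in range(cols)
    ]

    # Append a global thumbnail so the model also sees the whole image at once.
    tiles.append(image.resize((TILE, TILE)))
    return tiles
```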

Evaluation

By evaluating our model on eight benchmarks from the OpenCompass leaderboard using VLMEvalKit, we found that it outperforms its predecessor (QTuneVL1-2B) in average score, particularly on the MMStar, MMMU_DEV_VAL, and OCRBench benchmarks. The eight benchmarks and the detailed results are as follows:

The eight benchmarks: MMBench_DEV_EN_V11, MMStar, MMMU_DEV_VAL, MathVista_MINI, HallusionBench, AI2D_TEST, OCRBench, MMVet.

| Index | Model | AVG | MMBench_DEV_EN_V11 | MMStar | MMMU_DEV_VAL | MathVista_MINI | HallusionBench | AI2D_TEST | OCRBench | MMVet |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Mini-Monkey | 54.3 | 71.4 | 50.3 | 35.6 | 46.3 | 38.6 | 74.8 | 802 | 37.2 |
| 2 | InternVL2-2B | 54.2 | 71.4 | 50.3 | 34.6 | 47.2 | 38.2 | 74.2 | 783 | 39.8 |
| 3 | InternVL2_5-2B | 59.4 | 74.6 | 53.7 | 40.1 | 49.7 | 42.2 | 74.9 | 802 | 59.5 |
| 4 | InternVL3-2B | 63.5 | 79.6 | 61.1 | 48.6 | 51.1 | 42.0 | 78.4 | 835 | 64.08 |
| 5 | QTuneVL1-2B | 59.7 | 74.9 | 53.9 | 41.5 | 48.8 | 43.0 | 75.2 | 806 | 59.6 |
| 6 | QTuneVL1.5-2B | 64.2 (+4.5) | 79.6 (+4.7) | 61.4 (+7.5) | 51.1 (+9.6) | 51.8 (+3.0) | 43.0 | 78.8 (+3.6) | 858 (+52) | 62.1 (+2.5) |
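
The parenthesized gains are relative to QTuneVL1-2B. The AVG column is consistent with a simple OpenCompass-style average in which OCRBench (reported on a 0-1000 scale) is first divided by 10 and the eight normalized scores are then averaged; the snippet below reproduces the QTuneVL1.5-2B row under that assumption, which is our reading of the leaderboard convention rather than something stated in this card.

```python
# Reproduce the AVG column for QTuneVL1.5-2B from the per-benchmark scores above.
scores = {
    "MMBench_DEV_EN_V11": 79.6,
    "MMStar": 61.4,
    "MMMU_DEV_VAL": 51.1,
    "MathVista_MINI": 51.8,
    "HallusionBench": 43.0,
    "AI2D_TEST": 78.8,
    "OCRBench": 858,   # reported on a 0-1000 scale
    "MMVet": 62.1,
}

# OCRBench is rescaled to 0-100 before averaging (assumed OpenCompass convention).
normalized = [v / 10 if k == "OCRBench" else v for k, v in scores.items()]
avg = sum(normalized) / len(normalized)
print(round(avg, 1))  # 64.2
```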

Note that when evaluating with VLMEvalKit, the GPT judge models that get called differ slightly from the official ones. In the code (vlmeval/dataset/utils/judge_util.py), the mapping is:

  • 'gpt-4o-mini': 'gpt-4o-mini' instead of 'gpt-4o-mini': 'gpt-4o-mini-2024-07-18'
  • 'gpt-4-turbo': 'gpt-4-turbo' instead of 'gpt-4-turbo': 'gpt-4-1106-preview'

Because of this, our evaluation results may differ slightly from the official ones.
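
For reference, a typical VLMEvalKit run over these eight benchmarks looks roughly like the call below, issued from the VLMEvalKit repository root. The model key "QTuneVL1_5-2B" is a placeholder for however the model is registered in vlmeval/config.py, and flag names may vary across VLMEvalKit versions.

```python
import subprocess

BENCHMARKS = [
    "MMBench_DEV_EN_V11", "MMStar", "MMMU_DEV_VAL", "MathVista_MINI",
    "HallusionBench", "AI2D_TEST", "OCRBench", "MMVet",
]

# Run VLMEvalKit's entry script once over all eight benchmarks.
# "QTuneVL1_5-2B" is a placeholder model key; it must match an entry
# registered in vlmeval/config.py.
subprocess.run(
    ["python", "run.py", "--data", *BENCHMARKS, "--model", "QTuneVL1_5-2B", "--verbose"],
    check=True,
)
```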

Copyright

We welcome suggestions to help us improve QTuneVL. For any queries, please contact HanChao Wang: [email protected]. If you find something interesting, please also feel free to share it with us by email or by opening an issue.

