---
language:
- en
license: apache-2.0
base_model: allenai/MolmoAct-7B-D-0812
tags:
- awq
- quantized
- 4-bit
- vision-language
- robotics
- molmo
- qwen2.5
- siglip
- llm-compressor
library_name: transformers
pipeline_tag: image-text-to-text
---
# MolmoAct-7B-D AWQ 4-bit (Text-Only Quantization)
This is a 4-bit AWQ quantized version of [allenai/MolmoAct-7B-D-0812](https://huggingface.co/allenai/MolmoAct-7B-D-0812) using [LLM Compressor](https://github.com/vllm-project/llm-compressor).
## Key Features
- ✅ **Qwen2.5 text decoder quantized** (4-bit AWQ) - ~56% overall size reduction
- ✅ **SigLip2 vision encoder preserved** (FP16) - maintains visual quality
- ✅ **Robotic manipulation action reasoning** - trained on 10k robot trajectories
- ✅ **Selective quantization** - only the LLM decoder layers are quantized; vision components are untouched
- ✅ **93 unique manipulation tasks** supported
## Model Details
- **Base Model:** allenai/MolmoAct-7B-D-0812 (7B parameters)
- **Architecture:** MolmoAct (Qwen2.5-7B decoder + SigLip2 vision encoder)
- **Quantization Method:** AWQ (Activation-aware Weight Quantization)
- **Quantization Scheme:** W4A16 (4-bit weights, 16-bit activations)
- **Calibration Dataset:** Flickr30k (512 samples)
## Size Comparison
| Metric | Value |
|--------|-------|
| **Original (FP16)** | ~14.0 GB |
| **Quantized (W4A16)** | ~6.12 GB |
| **Reduction** | ~56.3% |
| **Memory Saved** | ~7.9 GB |
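If you want to verify the on-disk footprint yourself, here is a minimal sketch using `huggingface_hub` (the exact number depends on which files end up in the snapshot):

```python
from pathlib import Path
from huggingface_hub import snapshot_download

# Download (or reuse the cached copy of) the quantized checkpoint
local_dir = snapshot_download("ronantakizawa/molmoact-7b-d-awq-w4a16")

# Sum the sizes of all files in the snapshot; expect roughly 6 GB for W4A16
total_bytes = sum(p.stat().st_size for p in Path(local_dir).rglob("*") if p.is_file())
print(f"Snapshot size: {total_bytes / 1e9:.2f} GB")
```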
## What Was Quantized
**Quantized (4-bit):**
- Qwen2.5 decoder layers (text/language model)
- Text processing linear layers in the decoder
**Preserved (FP16):**
- SigLip2 vision encoder (maintains visual understanding quality)
- Vision-text connectors
- Embeddings
- Language model head
This selective quantization ensures that vision understanding quality remains nearly identical to the original model while significantly reducing size.
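As a rough illustration, a selective AWQ recipe in LLM Compressor looks something like the sketch below. The ignore patterns are hypothetical placeholders, not the exact module names used for this checkpoint, which depend on MolmoAct's remote code.

```python
from llmcompressor.modifiers.awq import AWQModifier

# Quantize only the Linear layers of the Qwen2.5 decoder to W4A16,
# leaving the vision tower, connectors, embeddings, and lm_head in FP16.
# The ignore patterns below are illustrative.
recipe = AWQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=[
        "re:.*vision.*",        # SigLip2 vision encoder
        "re:.*image_proj.*",    # vision-text connector (hypothetical name)
        "re:.*embed_tokens.*",  # token embeddings
        "lm_head",              # language model head
    ],
)
```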
## About MolmoAct-7B-D
MolmoAct-7B-D is an open-source action reasoning model for robotic manipulation developed by the Allen Institute for AI:
- **Training Data:** 10k high-quality trajectories of a single-arm Franka robot
- **Text Decoder:** Qwen2.5-7B (state-of-the-art open LLM)
- **Vision Encoder:** SigLip2 (proven vision backbone)
- **Capabilities:** 93 unique manipulation tasks
- **Use Case:** Robotic manipulation and action reasoning
## Usage
```python
from transformers import AutoModelForImageTextToText, AutoProcessor, GenerationConfig
from PIL import Image
import requests
# Load model and processor
processor = AutoProcessor.from_pretrained(
"ronantakizawa/molmoact-7b-d-awq-w4a16",
trust_remote_code=True,
torch_dtype='auto',
device_map='auto'
)
model = AutoModelForImageTextToText.from_pretrained(
"ronantakizawa/molmoact-7b-d-awq-w4a16",
trust_remote_code=True,
torch_dtype='auto',
device_map='auto'
)
# Process the image and text
inputs = processor.process(
images=[Image.open(requests.get("https://picsum.photos/id/237/536/354", stream=True).raw)],
text="What actions can be performed with the objects in this image?"
)
# Move inputs to the correct device and make a batch of size 1
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}
# Generate output
output = model.generate_from_batch(
inputs,
GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
tokenizer=processor.tokenizer
)
# Decode the generated tokens
generated_tokens = output[0, inputs['input_ids'].size(1):]
generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(generated_text)
```
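To confirm the memory savings after loading, you can check the weight footprint on the GPU (rough numbers; actual usage also includes activations and the KV cache):

```python
import torch

# Weight memory only; expect roughly 6-7 GB for W4A16 vs ~14 GB for the FP16 original.
if torch.cuda.is_available():
    print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
```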
## Quantization Details
- **Method:** AWQ (Activation-aware Weight Quantization)
- **Pipeline:** BasicPipeline, applied layer by layer
- **Calibration:** 512 Flickr30k image-text pairs
- **Max Sequence Length:** 2048 tokens
- **Why AWQ:** Activation-aware quantization preserves the weights most important to output quality
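Putting the pieces together, the calibration step with LLM Compressor's `oneshot` might look roughly like this (a sketch only; `model`, `processor`, and `calibration_ds` are assumed to be prepared beforehand, and the multimodal data collator that feeds images through the processor is omitted):

```python
from llmcompressor import oneshot

# Calibrate and quantize in one pass: 512 Flickr30k samples, 2048-token max length.
oneshot(
    model=model,
    dataset=calibration_ds,
    recipe=recipe,                 # the AWQModifier recipe sketched above
    max_seq_length=2048,
    num_calibration_samples=512,
)

# Save in compressed form for downstream loading
model.save_pretrained("molmoact-7b-d-awq-w4a16", save_compressed=True)
```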
## Limitations
- May have slight quality degradation in complex action reasoning compared to FP16
- Vision encoder is NOT quantized (intentional for quality)
- Requires transformers with AWQ support
- Designed for robotic manipulation tasks, not general conversation
## Important Notes
### Image Requirements
Ensure images are in RGB format:
```python
from PIL import Image
image = Image.open(...)
if image.mode != "RGB":
image = image.convert("RGB")
```
## License
Apache 2.0 (same as base model)
## Citation
```bibtex
@misc{molmoact-7b-d-awq,
title={MolmoAct-7B-D AWQ 4-bit},
author={Quantized by ronantakizawa},
year={2025},
url={https://huggingface.co/ronantakizawa/molmoact-7b-d-awq-w4a16}
}
```
## Acknowledgements
- Base model by [Allen Institute for AI](https://allenai.org/)
- Quantization using [LLM Compressor](https://github.com/vllm-project/llm-compressor)
---
Generated with [LLM Compressor](https://github.com/vllm-project/llm-compressor)