---
language:
- en
license: apache-2.0
base_model: allenai/MolmoAct-7B-D-0812
tags:
- awq
- quantized
- 4-bit
- vision-language
- robotics
- molmo
- qwen2.5
- siglip
- llm-compressor
library_name: transformers
pipeline_tag: image-text-to-text
---

# MolmoAct-7B-D AWQ 4-bit (Text-Only Quantization)

This is a 4-bit AWQ quantized version of [allenai/MolmoAct-7B-D-0812](https://huggingface.co/allenai/MolmoAct-7B-D-0812) using [LLM Compressor](https://github.com/vllm-project/llm-compressor).

## Key Features

- ✅ **Qwen2.5 text decoder quantized** (4-bit AWQ) - ~56% overall size reduction
- ✅ **SigLip2 vision encoder preserved** (FP16) - maintains visual quality
- ✅ **Robotic manipulation action reasoning** - trained on 10k robot trajectories
- ✅ **Smart quantization** - only the LLM layers are quantized; vision components are untouched
- ✅ **93 unique manipulation tasks** supported

## Model Details

- **Base Model:** allenai/MolmoAct-7B-D-0812 (7B parameters)
- **Architecture:** MolmoAct (Qwen2.5-7B decoder + SigLip2 vision encoder)
- **Quantization Method:** AWQ (Activation-aware Weight Quantization)
- **Quantization Scheme:** W4A16 (4-bit weights, 16-bit activations)
- **Calibration Dataset:** Flickr30k (512 samples)

## Size Comparison

| Metric | Value |
|--------|-------|
| **Original (FP16)** | ~14.0 GB |
| **Quantized (W4A16)** | ~6.12 GB |
| **Reduction** | ~56.3% |
| **Memory Saved** | ~7.9 GB |

## What Was Quantized

**Quantized (4-bit):**
- Qwen2.5 decoder layers (text/language model)
- Text processing linear layers in the decoder

**Preserved (FP16):**
- SigLip2 vision encoder (maintains visual understanding quality)
- Vision-text connectors
- Embeddings
- Language model head

This selective quantization ensures that vision understanding quality remains nearly identical to the original model while significantly reducing size.
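For readers who want to reproduce this scoping, LLM Compressor typically expresses it through the modifier's `targets` and `ignore` arguments. The sketch below is a minimal, hypothetical recipe: the regex patterns for the vision encoder and connector are assumptions (the actual MolmoAct module names may differ), so inspect `model.named_modules()` before reusing them.

```python
# Hypothetical recipe sketch for quantizing only the Qwen2.5 decoder Linear layers.
# The ignore patterns below are illustrative; verify the real module names with
# model.named_modules() before using this on MolmoAct.
from llmcompressor.modifiers.awq import AWQModifier

recipe = AWQModifier(
    targets=["Linear"],            # quantize Linear layers to 4-bit weights...
    scheme="W4A16",                # ...with 16-bit activations
    ignore=[
        "lm_head",                 # keep the language-model head in FP16
        "re:.*vision.*",           # keep the SigLip2 vision encoder in FP16 (assumed pattern)
        "re:.*image_projector.*",  # keep the vision-text connector in FP16 (assumed pattern)
    ],
)
```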

## About MolmoAct-7B-D

MolmoAct-7B-D is an open-source action reasoning model for robotic manipulation developed by the Allen Institute for AI:

- **Training Data:** 10k high-quality trajectories of a single-arm Franka robot
- **Text Decoder:** Qwen2.5-7B (state-of-the-art open LLM)
- **Vision Encoder:** SigLip2 (proven vision backbone)
- **Capabilities:** 93 unique manipulation tasks
- **Use Case:** Robotic manipulation and action reasoning

## Usage

```python
from transformers import AutoModelForImageTextToText, AutoProcessor, GenerationConfig
from PIL import Image
import requests

# Load model and processor
processor = AutoProcessor.from_pretrained(
    "ronantakizawa/molmoact-7b-d-awq-w4a16",
    trust_remote_code=True,
    torch_dtype='auto',
    device_map='auto'
)

model = AutoModelForImageTextToText.from_pretrained(
    "ronantakizawa/molmoact-7b-d-awq-w4a16",
    trust_remote_code=True,
    torch_dtype='auto',
    device_map='auto'
)

# Process the image and text
inputs = processor.process(
    images=[Image.open(requests.get("https://picsum.photos/id/237/536/354", stream=True).raw)],
    text="What actions can be performed with the objects in this image?"
)

# Move inputs to the correct device and make a batch of size 1
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

# Generate output
output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer
)

# Decode the generated tokens
generated_tokens = output[0, inputs['input_ids'].size(1):]
generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(generated_text)
```

## Quantization Details

- **Method:** AWQ (Activation-aware Weight Quantization)
- **Pipeline:** LLM Compressor's independent BasicPipeline, applied layer by layer during calibration
- **Calibration:** 512 Flickr30k image-text pairs
- **Max Sequence Length:** 2048 tokens
- **Why AWQ:** Activation-aware scaling protects the weights that matter most for output quality
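
As a rough illustration of how these settings fit together, here is a minimal, hypothetical sketch of the one-shot calibration call. The Flickr30k dataset ID and preprocessing are assumptions, and real multimodal calibration usually also needs processor-based preprocessing plus a custom data collator, so treat this as a starting point rather than the exact script used for this checkpoint.

```python
# Hypothetical one-shot AWQ calibration sketch; assumes `recipe` is the
# AWQModifier from the earlier sketch. The dataset ID and the absence of a
# multimodal data collator are simplifications.
from datasets import load_dataset
from transformers import AutoModelForImageTextToText
from llmcompressor import oneshot

NUM_CALIBRATION_SAMPLES = 512  # per the details above
MAX_SEQUENCE_LENGTH = 2048     # per the details above

# Load the FP16 base model that will be quantized
model = AutoModelForImageTextToText.from_pretrained(
    "allenai/MolmoAct-7B-D-0812",
    trust_remote_code=True,
    torch_dtype="auto",
)

# 512 calibration samples from Flickr30k (dataset ID is an assumption)
ds = load_dataset("nlphuji/flickr30k", split=f"test[:{NUM_CALIBRATION_SAMPLES}]")

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,  # AWQModifier recipe sketched in "What Was Quantized"
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Save the compressed checkpoint
model.save_pretrained("molmoact-7b-d-awq-w4a16", save_compressed=True)
```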

## Limitations

- May have slight quality degradation in complex action reasoning compared to FP16
- Vision encoder is NOT quantized (intentional for quality)
- Requires transformers with AWQ support
- Designed for robotic manipulation tasks, not general conversation

## Important Notes

### Image Requirements
Ensure images are in RGB format:
```python
from PIL import Image
image = Image.open(...)
if image.mode != "RGB":
    image = image.convert("RGB")
```

## License

Apache 2.0 (same as base model)

## Citation

```bibtex
@misc{molmoact-7b-d-awq,
  title={MolmoAct-7B-D AWQ 4-bit},
  author={ronantakizawa},
  year={2025},
  url={https://huggingface.co/ronantakizawa/molmoact-7b-d-awq-w4a16}
}
```

## Acknowledgements

- Base model by [Allen Institute for AI](https://allenai.org/)
- Quantization using [LLM Compressor](https://github.com/vllm-project/llm-compressor)

---

🤖 Generated with [LLM Compressor](https://github.com/vllm-project/llm-compressor)