# Octen-Embedding-0.6B – FP32 ONNX (dynamo export)
ONNX export of Octen/Octen-Embedding-0.6B, the Qwen3-0.6B fine-tune for semantic search and retrieval.
This is the full-precision (FP32) reference export. For production use, prefer the INT8 or INT4 variants, which are 2–4× smaller with negligible quality loss.
## Export method
Exported with torch.onnx.export(dynamo=True) (PyTorch 2.9, opset 20).
The dynamo exporter traces at the FX-graph / symbolic level rather than via eager execution. This means all internal tensor shapes, including the Qwen3 causal attention mask, carry symbolic batch and sequence dimensions throughout. The legacy torch.onnx.export produced a static batch=1 inside the causal-mask BitAnd node, breaking inference for batch > 1.
Dynamic batch verified: batch = 1, 2, 4, 8 all produce correct output shapes.
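One way to reproduce this check yourself. This is a minimal sketch: `check_dynamic_batch` is an illustrative helper, not a script shipped in this repo, and it assumes `model.onnx` is in the working directory.

```python
import numpy as np

def check_dynamic_batch(session, batch_sizes=(1, 2, 4, 8), seq_len=8, hidden=1024):
    """Feed dummy token ids at several batch sizes and assert the output
    rank and dimensions track the input batch size (i.e. the exported
    graph really has a symbolic batch axis)."""
    for b in batch_sizes:
        ids = np.ones((b, seq_len), dtype=np.int64)   # dummy token ids
        mask = np.ones((b, seq_len), dtype=np.int64)  # no padding
        out = session.run(None, {"input_ids": ids, "attention_mask": mask})[0]
        assert out.shape == (b, seq_len, hidden), f"bad shape {out.shape} at batch {b}"
    return True
```

Usage: `check_dynamic_batch(ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"]))`.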
## Model details
| Property | Value |
|---|---|
| Base model | Qwen3-Embedding-0.6B (Qwen/Qwen3-Embedding-0.6B) |
| Fine-tune | Octen/Octen-Embedding-0.6B |
| Parameters | 596 M |
| Embedding dim | 1024 |
| Max context | 32 768 tokens |
| Inputs | input_ids [batch, seq], attention_mask [batch, seq] |
| Output | last_hidden_state [batch, seq, 1024] |
| Pooling | Last-token pooling + L2 normalisation (applied by the inference runtime) |
| File size | ~2.4 GB (model.onnx + model.onnx.data) |
## Inference
```python
import onnxruntime as ort
import numpy as np
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")
tokenizer.enable_padding(pad_id=0)
tokenizer.enable_truncation(max_length=512)

session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

texts = ["semantic search example", "another sentence"]
enc = tokenizer.encode_batch(texts)
ids = np.array([e.ids for e in enc], dtype=np.int64)
mask = np.array([e.attention_mask for e in enc], dtype=np.int64)

lhs = session.run(None, {"input_ids": ids, "attention_mask": mask})[0]  # [batch, seq, 1024]

# Last-token pooling: take embedding at last non-padding position
seq_lens = mask.sum(axis=1) - 1
embeddings = lhs[np.arange(len(texts)), seq_lens]

# L2 normalise
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
embeddings = embeddings / np.maximum(norms, 1e-8)

print(embeddings.shape)  # (2, 1024)
```
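Because the embeddings above are already L2-normalised, cosine similarity reduces to a plain dot product. A small retrieval sketch building on that; `rank_by_similarity` is an illustrative helper, not part of this repo:

```python
import numpy as np

def rank_by_similarity(query_emb, doc_embs):
    """query_emb: (dim,), doc_embs: (n_docs, dim); both L2-normalised,
    so the dot product equals cosine similarity.
    Returns document indices sorted best-first, plus the raw scores."""
    scores = doc_embs @ query_emb
    order = np.argsort(-scores)  # highest similarity first
    return order, scores
```

For example, `order, scores = rank_by_similarity(embeddings[0], embeddings)` ranks all texts against the first one.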
## Files
| File | Size | Description |
|---|---|---|
| `model.onnx` | ~4 MB | ONNX graph (opset 20, no weights) |
| `model.onnx.data` | ~2.38 GB | External weight data |
| `tokenizer.json` | 11 MB | HuggingFace fast tokenizer |
| `config.json` | – | Model config |
| `export_octen_onnx_dynamo.py` | – | Reproduction script (PyTorch ≥ 2.1) |
## Quantized variants
| Repo | Precision | Size | Notes |
|---|---|---|---|
| cstr/octen-embedding-0.6b-onnx | FP32 | 2.4 GB | This repo (reference) |
| cstr/octen-embedding-0.6b-onnx-int8 | INT8 per-channel | 1.1 GB | Recommended for most use |
| cstr/octen-embedding-0.6b-onnx-int4 | INT4 MatMulNBits | 0.9 GB | Minimum RAM |
## License
Apache 2.0, same as Octen/Octen-Embedding-0.6B and Qwen/Qwen3-Embedding-0.6B.