Text Generation
GGUF
Rust
English
ruvllm
agent-routing
claude-code
recursive-language-model
embeddings
llm-inference
sona
hnsw
simd
imatrix
conversational
Instructions to use ruv/ruvltra with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use ruv/ruvltra with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="ruv/ruvltra", filename="ruvltra-claude-code-0.5b-q4_k_m.gguf", )
llm.create_chat_completion( messages = [ { "role": "user", "content": "What is the capital of France?" } ] ) - Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use ruv/ruvltra with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf ruv/ruvltra:Q4_K_M # Run inference directly in the terminal: llama-cli -hf ruv/ruvltra:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf ruv/ruvltra:Q4_K_M # Run inference directly in the terminal: llama-cli -hf ruv/ruvltra:Q4_K_M
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf ruv/ruvltra:Q4_K_M # Run inference directly in the terminal: ./llama-cli -hf ruv/ruvltra:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf ruv/ruvltra:Q4_K_M # Run inference directly in the terminal: ./build/bin/llama-cli -hf ruv/ruvltra:Q4_K_M
Use Docker
docker model run hf.co/ruv/ruvltra:Q4_K_M
- LM Studio
- Jan
- vLLM
How to use ruv/ruvltra with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "ruv/ruvltra" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "ruv/ruvltra", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker
docker model run hf.co/ruv/ruvltra:Q4_K_M
- Ollama
How to use ruv/ruvltra with Ollama:
ollama run hf.co/ruv/ruvltra:Q4_K_M
- Unsloth Studio new
How to use ruv/ruvltra with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for ruv/ruvltra to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for ruv/ruvltra to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for ruv/ruvltra to start chatting
- Docker Model Runner
How to use ruv/ruvltra with Docker Model Runner:
docker model run hf.co/ruv/ruvltra:Q4_K_M
- Lemonade
How to use ruv/ruvltra with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull ruv/ruvltra:Q4_K_M
Run and chat with the model
lemonade run user.ruvltra-Q4_K_M
List all available models
lemonade list
Add TurboQuant compatibility, v2.1.0 ecosystem tags
Browse files
README.md
CHANGED
|
@@ -16,6 +16,17 @@ tags:
|
|
| 16 |
- simd
|
| 17 |
datasets:
|
| 18 |
- ruvnet/claude-flow-routing
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 19 |
pipeline_tag: text-generation
|
| 20 |
---
|
| 21 |
|
|
@@ -431,3 +442,48 @@ Apache-2.0 / MIT dual license.
|
|
| 431 |
[Get Started](#quick-start) | [View on GitHub](https://github.com/ruvnet/ruvector)
|
| 432 |
|
| 433 |
</div>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 16 |
- simd
|
| 17 |
datasets:
|
| 18 |
- ruvnet/claude-flow-routing
|
| 19 |
+
- turboquant
|
| 20 |
+
- kv-cache-compression
|
| 21 |
+
- flash-attention
|
| 22 |
+
- speculative-decoding
|
| 23 |
+
- graph-rag
|
| 24 |
+
- hybrid-search
|
| 25 |
+
- vector-database
|
| 26 |
+
- ruvector
|
| 27 |
+
- diskann
|
| 28 |
+
- mamba-ssm
|
| 29 |
+
- colbert
|
| 30 |
pipeline_tag: text-generation
|
| 31 |
---
|
| 32 |
|
|
|
|
| 442 |
[Get Started](#quick-start) | [View on GitHub](https://github.com/ruvnet/ruvector)
|
| 443 |
|
| 444 |
</div>
|
| 445 |
+
|
| 446 |
+
|
| 447 |
+
---
|
| 448 |
+
|
| 449 |
+
## âš¡ TurboQuant KV-Cache Compression
|
| 450 |
+
|
| 451 |
+
RuvLTRA models are fully compatible with **TurboQuant** — 2-4 bit KV-cache quantization that reduces inference memory by 6-8x with <0.5% quality loss.
|
| 452 |
+
|
| 453 |
+
| Quantization | Compression | Quality Loss | Best For |
|
| 454 |
+
|-------------|-------------|--------------|----------|
|
| 455 |
+
| 3-bit | 10.7x | <1% | **Recommended** — best balance |
|
| 456 |
+
| 4-bit | 8x | <0.5% | High quality, long context |
|
| 457 |
+
| 2-bit | 32x | ~2% | Edge devices, max savings |
|
| 458 |
+
|
| 459 |
+
### Usage with RuvLLM
|
| 460 |
+
|
| 461 |
+
```bash
|
| 462 |
+
cargo add ruvllm # Rust
|
| 463 |
+
npm install @ruvector/ruvllm # Node.js
|
| 464 |
+
```
|
| 465 |
+
|
| 466 |
+
```rust
|
| 467 |
+
use ruvllm::quantize::turbo_quant::{TurboQuantCompressor, TurboQuantConfig, TurboQuantBits};
|
| 468 |
+
|
| 469 |
+
let config = TurboQuantConfig {
|
| 470 |
+
bits: TurboQuantBits::Bit3_5, // 10.7x compression
|
| 471 |
+
use_qjl: true,
|
| 472 |
+
..Default::default()
|
| 473 |
+
};
|
| 474 |
+
let compressor = TurboQuantCompressor::new(config)?;
|
| 475 |
+
let compressed = compressor.compress_batch(&kv_vectors)?;
|
| 476 |
+
let scores = compressor.inner_product_batch_optimized(&query, &compressed)?;
|
| 477 |
+
```
|
| 478 |
+
|
| 479 |
+
### v2.1.0 Ecosystem
|
| 480 |
+
|
| 481 |
+
- **Hybrid Search** — Sparse + dense vectors with RRF fusion (20-49% better retrieval)
|
| 482 |
+
- **Graph RAG** — Knowledge graph + community detection for multi-hop queries
|
| 483 |
+
- **DiskANN** — Billion-scale SSD-backed ANN with <10ms latency
|
| 484 |
+
- **FlashAttention-3** — IO-aware tiled attention, O(N) memory
|
| 485 |
+
- **MLA** — Multi-Head Latent Attention (~93% KV-cache compression)
|
| 486 |
+
- **Mamba SSM** — Linear-time selective state space models
|
| 487 |
+
- **Speculative Decoding** — 2-3x generation speedup
|
| 488 |
+
|
| 489 |
+
[RuVector GitHub](https://github.com/ruvnet/ruvector) | [ruvllm crate](https://crates.io/crates/ruvllm) | [@ruvector/ruvllm npm](https://www.npmjs.com/package/@ruvector/ruvllm)
|