# GGUF Conversion Setup Complete ✅
## What Was Created
- `scripts/convert_to_gguf.py` - Main conversion script
- `scripts/README_GGUF.md` - Detailed usage instructions
- Dependencies installed - transformers, torch, sentencepiece, etc.
## Quick Start

```bash
cd /Users/jeanbapt/simple-llm-pro-finance
source venv/bin/activate

# Convert default model (Qwen-Pro-Finance-R-32B)
python3 scripts/convert_to_gguf.py

# Or specify a different 32B model
python3 scripts/convert_to_gguf.py 2  # qwen3-32b-fin-v1.0
```
## Available 32B Models
The script found these 32B models in DragonLLM:
- DragonLLM/Qwen-Pro-Finance-R-32B ✅ (Recommended - latest)
- DragonLLM/qwen3-32b-fin-v1.0
- DragonLLM/qwen3-32b-fin-v0.3
- DragonLLM/qwen3-32b-fin-v1.0-fp8 (Pre-quantized)
- DragonLLM/Qwen-Pro-Finance-R-32B-FP8 (Pre-quantized)
## What the Script Does
- ✅ Checks for llama.cpp (clones if needed)
- ✅ Installs required Python dependencies
- ✅ Converts model to base GGUF (FP16, ~64GB)
- ✅ Quantizes to multiple levels (see the command sketch after this list):
  - Q5_K_M (~20GB) - Best balance ✅
  - Q6_K (~24GB) - Higher quality
  - Q4_K_M (~16GB) - Smaller size
  - Q8_0 (~32GB) - Highest quality
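If you ever want to run the steps by hand, the script roughly wraps the standard llama.cpp workflow. A minimal sketch, assuming llama.cpp is cloned and built locally; the model path, output file names, and quantize binary location are illustrative and depend on your setup:

```bash
# Convert the downloaded HF model to a base FP16 GGUF
# (convert_hf_to_gguf.py ships in the llama.cpp repo root)
python3 llama.cpp/convert_hf_to_gguf.py /path/to/Qwen-Pro-Finance-R-32B \
  --outtype f16 \
  --outfile gguf_models/qwen-pro-finance-r-32b-f16.gguf

# Quantize the FP16 base to Q5_K_M (repeat for Q6_K, Q4_K_M, Q8_0)
# Binary path depends on how you built llama.cpp (cmake shown here)
./llama.cpp/build/bin/llama-quantize \
  gguf_models/qwen-pro-finance-r-32b-f16.gguf \
  gguf_models/qwen-pro-finance-r-32b-Q5_K_M.gguf \
  Q5_K_M
```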
## Memory Requirements
- Base conversion: ~64GB RAM (takes 30-60 min)
- Quantization: ~32GB RAM (10-20 min per level)
- Disk space: ~200GB recommended
## Output Location
All GGUF files will be saved to `/Users/jeanbapt/simple-llm-pro-finance/gguf_models/`.
## Recommended Quantization for Mac
Based on your Mac's RAM:
| Mac RAM | Recommended | Alternative |
|---|---|---|
| 32GB | Q5_K_M | Q4_K_M |
| 64GB+ | Q6_K | Q8_0 |
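If you are not sure how much RAM your Mac has, a quick check (macOS only; `hw.memsize` is reported in bytes):

```bash
# Print physical RAM in GB
echo "$(( $(sysctl -n hw.memsize) / 1024 / 1024 / 1024 )) GB"
```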
## Tool Calling Support
- ✅ GGUF models maintain full tool calling capabilities
- ✅ Ollama supports OpenAI-compatible function calling (see the smoke test below)
- ✅ Works with your existing PydanticAI agents
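As a quick smoke test of function calling through Ollama's OpenAI-compatible endpoint once the model is created - a sketch only: the model name `qwen-finance-32b` assumes the `ollama create` step below, and the `get_stock_price` tool is purely illustrative:

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-finance-32b",
    "messages": [{"role": "user", "content": "What is the price of ACME?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_stock_price",
        "description": "Get the latest price for a ticker",
        "parameters": {
          "type": "object",
          "properties": {"ticker": {"type": "string"}},
          "required": ["ticker"]
        }
      }
    }]
  }'
```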
## Next Steps
1. Run the conversion (when ready - it takes time):
   `python3 scripts/convert_to_gguf.py`
2. Create the Ollama model (after conversion; a sketch follows this list):
   `ollama create qwen-finance-32b -f Modelfile`
3. Use with your agents - update your endpoint config to point to the local Ollama server.
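A minimal sketch of step 2; the `.gguf` file name is illustrative - point `FROM` at whichever quantization the script actually produced:

```bash
# Write a minimal Modelfile referencing the quantized GGUF
cat > Modelfile <<'EOF'
FROM ./gguf_models/qwen-pro-finance-r-32b-Q5_K_M.gguf
EOF

# Register the model with Ollama, then try it out
ollama create qwen-finance-32b -f Modelfile
ollama run qwen-finance-32b "Summarize the outlook for EU banks."
```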
## Notes
- The script uses `HF_TOKEN_LC2` from your `.env` file automatically (a manual fallback follows this list)
- llama.cpp is cloned to `simple-llm-pro-finance/llama.cpp/`
- You can stop and resume - the script checks for existing files
- The base FP16 file is created first, then the quantizations run
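If the automatic `.env` loading ever fails, you can export the token yourself before running the script. A sketch, assuming a plain `KEY=value` line in `.env`; `HF_TOKEN` is the standard variable huggingface_hub picks up:

```bash
# Pull HF_TOKEN_LC2 out of .env and expose it as HF_TOKEN
export HF_TOKEN="$(grep '^HF_TOKEN_LC2=' .env | cut -d= -f2-)"
```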
## Troubleshooting
If you encounter issues:
- Out of memory: use Q4_K_M instead
- Conversion fails: check that the HF token has access to the model
- Dependencies missing: the script auto-installs them, but you can manually run:
  `pip install transformers torch sentencepiece protobuf gguf`
Ready to convert! Run `python3 scripts/convert_to_gguf.py` whenever you're ready; the base conversion alone takes 30-60 minutes.