Spaces: jeanbaptdzd/open-finance-llm-8b

jeanbaptdzd committed 1 day ago

Commit e3724fa (1 parent: 659e232)

Add GGUF conversion script for DragonLLM 32B models

- Add convert_to_gguf.py script to convert HF models to GGUF format
- Support for multiple 32B models (Qwen-Pro-Finance-R-32B, etc.)
- Automatic quantization to Q4_K_M, Q5_K_M, Q6_K, Q8_0
- Auto-install llama.cpp and dependencies
- Documentation with usage instructions and memory requirements
- Ready for Ollama integration with tool calling support

Files changed (3)

scripts/GGUF_CONVERSION_SUMMARY.md +106 -0
scripts/README_GGUF.md +136 -0
scripts/convert_to_gguf.py +279 -0

scripts/GGUF_CONVERSION_SUMMARY.md ADDED

	@@ -0,0 +1,106 @@

+# GGUF Conversion Setup Complete ✅
+## What Was Created
+1. **`scripts/convert_to_gguf.py`** - Main conversion script
+2. **`scripts/README_GGUF.md`** - Detailed usage instructions
+3. **Dependencies installed** - transformers, torch, sentencepiece, etc.
+## Quick Start
+```bash
+cd /Users/jeanbapt/simple-llm-pro-finance
+source venv/bin/activate
+# Convert default model (Qwen-Pro-Finance-R-32B)
+python3 scripts/convert_to_gguf.py
+# Or specify a different 32B model
+python3 scripts/convert_to_gguf.py 2  # qwen3-32b-fin-v1.0
+```
+## Available 32B Models
+The script found these 32B models in DragonLLM:
+1. **DragonLLM/Qwen-Pro-Finance-R-32B** ⭐ (Recommended - latest)
+2. DragonLLM/qwen3-32b-fin-v1.0
+3. DragonLLM/qwen3-32b-fin-v0.3
+4. DragonLLM/qwen3-32b-fin-v1.0-fp8 (Pre-quantized)
+5. DragonLLM/Qwen-Pro-Finance-R-32B-FP8 (Pre-quantized)
+## What the Script Does
+1. ✅ Checks for llama.cpp (clones if needed)
+2. ✅ Installs required Python dependencies
+3. ✅ Converts model to base GGUF (FP16, ~64GB)
+4. ✅ Quantizes to multiple levels:
+   - **Q5_K_M** (~20GB) - **Best balance** ⭐
+   - Q6_K (~24GB) - Higher quality
+   - Q4_K_M (~16GB) - Smaller size
+   - Q8_0 (~32GB) - Highest quality
+## Memory Requirements
+- **Base conversion**: ~64GB RAM (takes 30-60 min)
+- **Quantization**: ~32GB RAM (10-20 min per level)
+- **Disk space**: ~200GB recommended
+## Output Location
+All GGUF files will be saved to:
+```
+/Users/jeanbapt/simple-llm-pro-finance/gguf_models/
+```
+## Recommended Quantization for Mac
+Based on your Mac's RAM:
+| Mac RAM | Recommended | Alternative |
+|---------|-------------|------------|
+| 32GB    | Q5_K_M      | Q4_K_M     |
+| 64GB+   | Q6_K        | Q8_0       |
+## Tool Calling Support
+✅ GGUF models maintain full tool calling capabilities
+✅ Ollama supports OpenAI-compatible function calling
+✅ Works with your existing PydanticAI agents
+## Next Steps
+1. **Run the conversion** (when ready - it takes time):
+   ```bash
+   python3 scripts/convert_to_gguf.py
+   ```
+2. **Create the Ollama model** (after conversion, using the Modelfile shown in `scripts/README_GGUF.md`):
+   ```bash
+   ollama create qwen-finance-32b -f Modelfile
+   ```
+3. **Use with your agents** - Update your endpoint config to point to your local Ollama instance
+## Notes
+- The script reads `HF_TOKEN_LC2` from your `.env` file automatically, falling back to `HF_TOKEN` or `HUGGING_FACE_HUB_TOKEN`
+- llama.cpp is cloned to `simple-llm-pro-finance/llama.cpp/`
+- You can stop and resume - the script checks for existing files
+- Base FP16 file is created first, then quantizations run
+## Troubleshooting
+If you encounter issues:
+1. **Out of memory**: Use Q4_K_M instead
+2. **Conversion fails**: Check HF token has access to model
+3. **Dependencies missing**: Script auto-installs, but you can manually run:
+   ```bash
+   pip install transformers torch sentencepiece protobuf gguf
+   ```
+---
+**Ready to convert!** Run `python3 scripts/convert_to_gguf.py` when you're ready (the base conversion takes 30-60 minutes, plus 10-20 minutes per quantization level).
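
The size figures quoted in the summary above follow from simple arithmetic: parameter count times bits per weight. The sketch below reproduces them, assuming roughly 32 billion parameters and treating Q4_K_M, Q5_K_M, Q6_K and Q8_0 as a nominal 4, 5, 6 and 8 bits per weight. That mapping is an approximation introduced here, not something the script computes; real quantized files also store per-block scales and usually come out somewhat larger. The name `estimate_size_gb` is hypothetical.

```python
# Nominal bits per weight (assumption for illustration; actual GGUF files
# carry extra per-block scale data, so they tend to be somewhat larger).
NOMINAL_BITS = {"F16": 16, "Q8_0": 8, "Q6_K": 6, "Q5_K_M": 5, "Q4_K_M": 4}


def estimate_size_gb(n_params: float, qtype: str) -> float:
    """Rough file size in GB (10^9 bytes) for a model with n_params weights."""
    return n_params * NOMINAL_BITS[qtype] / 8 / 1e9


if __name__ == "__main__":
    # 32e9 parameters is an assumed round figure for a "32B" model
    for qtype in NOMINAL_BITS:
        print(f"{qtype:7s} ~{estimate_size_gb(32e9, qtype):.0f} GB")
```

Treat the result as a lower bound when budgeting disk space or choosing a quantization level for a given amount of RAM.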

scripts/README_GGUF.md ADDED

	@@ -0,0 +1,136 @@

+# GGUF Conversion Script
+This script converts DragonLLM models from Hugging Face to GGUF format for use with Ollama on Mac.
+## Quick Start
+```bash
+# Activate virtual environment
+cd /Users/jeanbapt/simple-llm-pro-finance
+source venv/bin/activate
+# Run conversion (uses default: Qwen-Pro-Finance-R-32B)
+python3 scripts/convert_to_gguf.py
+# Or specify a model by number (1-5) or name
+python3 scripts/convert_to_gguf.py 1  # Qwen-Pro-Finance-R-32B
+python3 scripts/convert_to_gguf.py 2  # qwen3-32b-fin-v1.0
+python3 scripts/convert_to_gguf.py "DragonLLM/qwen3-32b-fin-v1.0"
+```
+## Available 32B Models
+1. **DragonLLM/Qwen-Pro-Finance-R-32B** (Recommended - latest)
+2. DragonLLM/qwen3-32b-fin-v1.0
+3. DragonLLM/qwen3-32b-fin-v0.3
+4. DragonLLM/qwen3-32b-fin-v1.0-fp8 (Already quantized to FP8)
+5. DragonLLM/Qwen-Pro-Finance-R-32B-FP8 (Already quantized to FP8)
+## What It Does
+1. **Downloads llama.cpp** (if not already present)
+2. **Converts model to base GGUF** (FP16, ~64GB)
+3. **Quantizes to multiple levels**:
+   - Q5_K_M (~20GB) - **Best balance** ⭐
+   - Q6_K (~24GB) - Higher quality
+   - Q4_K_M (~16GB) - Smaller size
+   - Q8_0 (~32GB) - Highest quality
+## Memory Requirements
+- **Base conversion (FP16)**: ~64GB RAM
+- **Quantization**: ~32GB RAM (can be done separately)
+## Output
+Files are saved to: `simple-llm-pro-finance/gguf_models/`
+```
+gguf_models/
+├── Qwen-Pro-Finance-R-32B-f16.gguf      (~64GB)
+├── Qwen-Pro-Finance-R-32B-q5_k_m.gguf  (~20GB) ⭐ Recommended
+├── Qwen-Pro-Finance-R-32B-q6_k.gguf    (~24GB)
+├── Qwen-Pro-Finance-R-32B-q4_k_m.gguf  (~16GB)
+└── Qwen-Pro-Finance-R-32B-q8_0.gguf    (~32GB)
+```
+## Using with Ollama
+After conversion, create an Ollama model:
+```bash
+# Create Modelfile
+cat > Modelfile << EOF
+FROM ./gguf_models/Qwen-Pro-Finance-R-32B-q5_k_m.gguf
+TEMPLATE """{{ if .System }}<|im_start|>system
+{{ .System }}<|im_end|>
+{{ end }}{{ if .Prompt }}<|im_start|>user
+{{ .Prompt }}<|im_end|>
+{{ end }}<|im_start|>assistant
+{{ .Response }}<|im_end|>
+"""
+PARAMETER num_ctx 8192
+PARAMETER temperature 0.7
+EOF
+# Create model
+ollama create qwen-finance-32b -f Modelfile
+# Use it
+ollama run qwen-finance-32b "What is compound interest?"
+```
+## Tool Calling Support
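+GGUF models maintain tool calling capabilities. Ollama supports OpenAI-compatible function calling, provided the model's template handles tools (the minimal `TEMPLATE` above does not reference them, so it may need extending):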
+```python
+from openai import OpenAI
+client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
+response = client.chat.completions.create(
+    model="qwen-finance-32b",
+    messages=[{"role": "user", "content": "Calculate future value of 10000 at 5% for 10 years"}],
+    tools=[{
+        "type": "function",
+        "function": {
+            "name": "calculate_fv",
+            "description": "Calculate future value",
+            "parameters": {
+                "type": "object",
+                "properties": {
+                    "pv": {"type": "number"},
+                    "rate": {"type": "number"},
+                    "nper": {"type": "number"}
+                }
+            }
+        }
+    }],
+    tool_choice="auto"
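+)
+# Any calls the model requested are returned on the message (None if it answered directly)
+print(response.choices[0].message.tool_calls)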
+```
+## Troubleshooting
+### Out of Memory
+- Use Q4_K_M instead of Q5_K_M
+- Close other applications
+- Reduce the context window in Ollama (`num_ctx 4096`)
+### Conversion Fails
+- Ensure HF_TOKEN_LC2 is set in .env
+- Check you have access to the model on Hugging Face
+- Verify you have enough disk space (~200GB recommended)
+### Quantization Fails
+- The base FP16 file is still usable
+- Try quantizing manually: `./llama.cpp/llama-quantize input.gguf output.gguf Q5_K_M` (with a CMake build, the binary is at `./llama.cpp/build/bin/llama-quantize`)
+## Notes
+- **FP8 models** (models 4 and 5) are already quantized, but converting to GGUF still provides benefits for Ollama
+- **Q5_K_M is recommended** for best quality/size trade-off on Mac
+- Conversion takes 30-60 minutes depending on your system
+- Quantization takes 10-20 minutes per level
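
The tool calling example in this README sends the request but never acts on the reply. A minimal sketch of the missing half follows, assuming the reply contains a tool call whose name and JSON-encoded arguments match the `calculate_fv` schema (in the OpenAI-compatible response these live in `tool_call.function.name` and `tool_call.function.arguments`). The `TOOLS` registry and `run_tool_call` helper are hypothetical names, and the sample payload is hand-written rather than real model output.

```python
import json


def calculate_fv(pv: float, rate: float, nper: float) -> float:
    """Future value with compound interest: pv * (1 + rate) ** nper."""
    return pv * (1 + rate) ** nper


# Hypothetical registry mapping tool names to local Python functions
TOOLS = {"calculate_fv": calculate_fv}


def run_tool_call(name: str, arguments_json: str) -> float:
    """Execute one tool call given its name and JSON-encoded arguments."""
    args = json.loads(arguments_json)
    return TOOLS[name](**args)


if __name__ == "__main__":
    # Hand-written stand-in for a tool call: a function name plus a JSON string
    result = run_tool_call("calculate_fv", '{"pv": 10000, "rate": 0.05, "nper": 10}')
    print(f"Future value: {result:.2f}")
```

The schema does not say whether `rate` is a fraction or a percentage, so a model may send `5` instead of `0.05`; stating the unit in the parameter description is one way to reduce that ambiguity.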

scripts/convert_to_gguf.py ADDED

	@@ -0,0 +1,279 @@

+#!/usr/bin/env python3
+"""
+Convert DragonLLM models from Hugging Face to GGUF format.
+This script:
+1. Downloads the model from Hugging Face
+2. Converts it to GGUF format using llama.cpp
+3. Quantizes to multiple levels (Q4_K_M, Q5_K_M, Q6_K, Q8_0)
+Requirements:
+- llama.cpp installed (git clone https://github.com/ggerganov/llama.cpp.git)
+- Python packages: huggingface_hub, python-dotenv
+"""
+import os
+import sys
+import subprocess
+import shutil
+from pathlib import Path
+from typing import Optional
+from dotenv import load_dotenv
+# Load environment variables
+ENV_FILE = Path(__file__).parent.parent / ".env"
+if ENV_FILE.exists():
+    load_dotenv(ENV_FILE)
+HF_TOKEN = os.getenv("HF_TOKEN_LC2") or os.getenv("HF_TOKEN") or os.getenv("HUGGING_FACE_HUB_TOKEN")
+# Available 32B models found
+AVAILABLE_32B_MODELS = [
+    "DragonLLM/Qwen-Pro-Finance-R-32B",
+    "DragonLLM/qwen3-32b-fin-v1.0",
+    "DragonLLM/qwen3-32b-fin-v0.3",
+    "DragonLLM/qwen3-32b-fin-v1.0-fp8",
+    "DragonLLM/Qwen-Pro-Finance-R-32B-FP8",
+]
+# Quantization levels (best trade-off first)
+QUANTIZATIONS = [
+    ("Q5_K_M", "~20GB", "Best balance of quality and size"),
+    ("Q6_K", "~24GB", "Higher quality"),
+    ("Q4_K_M", "~16GB", "Smaller size, good quality"),
+    ("Q8_0", "~32GB", "Highest quality, larger size"),
+]
+def check_llama_cpp() -> Optional[Path]:
+    """Check if llama.cpp is available."""
+    # Check common locations
+    possible_paths = [
+        Path.home() / "llama.cpp",
+        Path(__file__).parent.parent / "llama.cpp",
+        Path("/usr/local/llama.cpp"),
+    ]
+    for path in possible_paths:
+        # Try both naming conventions
+        convert_script = path / "convert_hf_to_gguf.py"
+        if not convert_script.exists():
+            convert_script = path / "convert-hf-to-gguf.py"
+        quantize_bin = path / "llama-quantize"
+        if convert_script.exists() and (quantize_bin.exists() or (path / "llama-quantize.exe").exists()):
+            return path
+    return None
+def install_llama_cpp(target_dir: Path) -> Path:
+    """Clone and set up llama.cpp."""
+    print(f"📦 Cloning llama.cpp to {target_dir}...")
+    if target_dir.exists():
+        print(f"   {target_dir} already exists, using existing installation")
+        return target_dir
+    try:
+        subprocess.run(
+            ["git", "clone", "https://github.com/ggerganov/llama.cpp.git", str(target_dir)],
+            check=True,
+            capture_output=True,
+        )
+        print("✅ llama.cpp cloned successfully")
+        # Install Python requirements for conversion
+        requirements = target_dir / "requirements" / "requirements-convert_hf_to_gguf.txt"
+        if not requirements.exists():
+            requirements = target_dir / "requirements.txt"
+        if requirements.exists():
+            print("📦 Installing Python requirements for llama.cpp conversion...")
+            subprocess.run(
+                [sys.executable, "-m", "pip", "install", "-r", str(requirements), "--quiet"],
+                check=False,  # Best effort: keep going even if pip reports an error
+            )
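+        # Try to build llama-quantize (optional). Recent llama.cpp versions build with CMake, not make.
+        print("🔨 Attempting to build llama-quantize (optional)...")
+        try:
+            build_dir = target_dir / "build"
+            subprocess.run(
+                ["cmake", "-S", str(target_dir), "-B", str(build_dir), "-DBUILD_SHARED_LIBS=OFF"],
+                check=True, capture_output=True,
+            )
+            subprocess.run(
+                ["cmake", "--build", str(build_dir), "--config", "Release", "--target", "llama-quantize"],
+                check=True, capture_output=True,
+            )
+            # Copy the binary to the repo root, where the rest of this script looks for it
+            shutil.copy2(build_dir / "bin" / "llama-quantize", target_dir / "llama-quantize")
+            print("✅ Build successful")
+        except (subprocess.CalledProcessError, OSError):
+            print("⚠️  Build failed or cmake not available, quantization may be skipped")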
+        return target_dir
+    except subprocess.CalledProcessError as e:
+        print(f"❌ Error cloning llama.cpp: {e}")
+        sys.exit(1)
+def convert_to_gguf(
+    model_name: str,
+    output_dir: Path,
+    llama_cpp_dir: Path,
+    hf_token: str,
+) -> Path:
+    """Convert Hugging Face model to GGUF format."""
+    output_dir.mkdir(parents=True, exist_ok=True)
+    base_name = model_name.split("/")[-1].replace(".", "-")
+    output_file = output_dir / f"{base_name}-f16.gguf"
+    if output_file.exists():
+        print(f"✅ Base GGUF file already exists: {output_file}")
+        return output_file
+    print(f"🔄 Converting {model_name} to GGUF (FP16)...")
+    print(f"   This may take 30-60 minutes and requires ~64GB RAM...")
+    # Try both naming conventions
+    convert_script = llama_cpp_dir / "convert_hf_to_gguf.py"
+    if not convert_script.exists():
+        convert_script = llama_cpp_dir / "convert-hf-to-gguf.py"
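+    try:
+        # convert_hf_to_gguf.py expects a local model directory, so download the repo first
+        from huggingface_hub import snapshot_download
+        model_dir = snapshot_download(repo_id=model_name, token=hf_token)
+        subprocess.run(
+            [
+                sys.executable,
+                str(convert_script),
+                model_dir,
+                "--outfile", str(output_file),
+                "--outtype", "f16",
+            ],
+            check=True,
+        )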
+        print(f"✅ Conversion complete: {output_file}")
+        return output_file
+    except subprocess.CalledProcessError as e:
+        print(f"❌ Conversion failed: {e}")
+        sys.exit(1)
+def quantize_gguf(
+    input_file: Path,
+    output_dir: Path,
+    llama_cpp_dir: Path,
+    quantizations: list,
+) -> list[Path]:
+    """Quantize GGUF file to different levels."""
+    quantized_files = []
+    # Try binary quantize first, fallback to Python
+    quantize_bin = llama_cpp_dir / "llama-quantize"
+    if not quantize_bin.exists():
+        quantize_bin = llama_cpp_dir / "llama-quantize.exe"
+    use_binary = quantize_bin.exists()
+    if not use_binary:
+        print("⚠️  Binary quantize not found, will use Python quantize (slower)")
+        quantize_script = llama_cpp_dir / "quantize.py"
+        if not quantize_script.exists():
+            print("❌ No quantize tool found!")
+            return []
+    for qtype, size, description in quantizations:
+        output_file = output_dir / input_file.name.replace("-f16.gguf", f"-{qtype.lower()}.gguf")
+        if output_file.exists():
+            print(f"✅ {qtype} already exists: {output_file}")
+            quantized_files.append(output_file)
+            continue
+        print(f"🔄 Quantizing to {qtype} ({size}, {description})...")
+        try:
+            if use_binary:
+                subprocess.run(
+                    [str(quantize_bin), str(input_file), str(output_file), qtype],
+                    check=True,
+                )
+            else:
+                subprocess.run(
+                    [
+                        sys.executable,
+                        str(quantize_script),
+                        str(input_file),
+                        str(output_file),
+                        qtype,
+                    ],
+                    check=True,
+                )
+            print(f"✅ {qtype} complete: {output_file}")
+            quantized_files.append(output_file)
+        except subprocess.CalledProcessError as e:
+            print(f"⚠️  Quantization to {qtype} failed: {e}")
+            continue
+    return quantized_files
+def main():
+    """Main conversion script."""
+    if not HF_TOKEN:
+        print("❌ Error: no Hugging Face token found (HF_TOKEN_LC2, HF_TOKEN, or HUGGING_FACE_HUB_TOKEN)")
+        print("   Please set it in .env file or environment variables")
+        sys.exit(1)
+    # Select model
+    print("Available 32B models:")
+    for i, model in enumerate(AVAILABLE_32B_MODELS, 1):
+        print(f"  {i}. {model}")
+    if len(sys.argv) > 1:
+        try:
+            model_idx = int(sys.argv[1]) - 1
+            if 0 <= model_idx < len(AVAILABLE_32B_MODELS):
+                model_name = AVAILABLE_32B_MODELS[model_idx]
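+            else:
+                # A number outside the list is almost certainly a typo, not a model name
+                print(f"❌ Model number must be between 1 and {len(AVAILABLE_32B_MODELS)}")
+                sys.exit(1)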
+        except ValueError:
+            model_name = sys.argv[1]  # Use as model name directly
+    else:
+        # Default to best model
+        model_name = AVAILABLE_32B_MODELS[0]
+        print(f"\nUsing default model: {model_name}")
+        print("   (Pass model number or name as argument to use different model)")
+    print(f"\n🎯 Target model: {model_name}")
+    # Setup directories
+    project_root = Path(__file__).parent.parent
+    output_dir = project_root / "gguf_models"
+    llama_cpp_dir = project_root / "llama.cpp"
+    # Check/install llama.cpp
+    llama_cpp_path = check_llama_cpp()
+    if not llama_cpp_path:
+        print("📦 llama.cpp not found, installing...")
+        llama_cpp_path = install_llama_cpp(llama_cpp_dir)
+    else:
+        print(f"✅ Found llama.cpp at: {llama_cpp_path}")
+    # Convert to GGUF
+    base_gguf = convert_to_gguf(model_name, output_dir, llama_cpp_path, HF_TOKEN)
+    # Quantize
+    print(f"\n📊 Quantizing to multiple levels...")
+    quantized = quantize_gguf(base_gguf, output_dir, llama_cpp_path, QUANTIZATIONS)
+    # Summary
+    print(f"\n✅ Conversion complete!")
+    print(f"\n📁 Output directory: {output_dir}")
+    print(f"\n📦 Generated files:")
+    print(f"   - {base_gguf.name} ({base_gguf.stat().st_size / (1024**3):.1f} GB)")
+    for qfile in quantized:
+        size_gb = qfile.stat().st_size / (1024**3)
+        print(f"   - {qfile.name} ({size_gb:.1f} GB)")
+    print(f"\n💡 Recommended for Mac:")
+    print(f"   - 32GB RAM: Use Q5_K_M or Q4_K_M")
+    print(f"   - 64GB+ RAM: Use Q6_K or Q8_0")
+    print(f"\n🚀 To use with Ollama:")
+    print(f"   ollama create {model_name.split('/')[-1].lower()} -f <(echo 'FROM {quantized[0] if quantized else base_gguf}')")
+if __name__ == "__main__":
+    main()
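
The documentation in this commit recommends about 200GB of free disk space, but the script starts converting without checking. A minimal sketch of a preflight check is shown below, using the standard library's `shutil.disk_usage`. The helper name `has_enough_disk` and the idea of calling it before conversion are assumptions, not part of the committed script.

```python
import shutil
from pathlib import Path


def has_enough_disk(path: Path, required_gb: float) -> bool:
    """Return True if the filesystem holding `path` has at least required_gb GiB free."""
    free_gb = shutil.disk_usage(path).free / (1024 ** 3)
    return free_gb >= required_gb


if __name__ == "__main__":
    # The current directory stands in for the output directory (assumption)
    target = Path.cwd()
    if has_enough_disk(target, 200):
        print("Enough free disk space for the FP16 file and all quantizations")
    else:
        print("Less than 200 GB free; consider converting fewer quantization levels")
```

Running the check against the output directory's parent before the FP16 conversion would surface a shortage before the long conversion step rather than partway through it.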