---
title: Open Finance LLM 8B
emoji: π
colorFrom: red
colorTo: red
sdk: docker
pinned: false
app_port: 7860
suggested_hardware: l4x1
---

# Open Finance LLM 8B

OpenAI-compatible API powered by DragonLLM/Qwen-Open-Finance-R-8B.

## Deployment

| Platform | Backend | Dockerfile | Use Case |
|----------|---------|------------|----------|
| Hugging Face Spaces | Transformers | `Dockerfile` | Development, L4 GPU |
| Koyeb | vLLM | `Dockerfile.koyeb` | Production, L40s GPU |

## Features

- OpenAI-compatible API
- Tool/function calling support
- Streaming responses
- Rate limiting (30 req/min, 500 req/hour)
- Statistics tracking via `/v1/stats`

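To stay under the rate limits listed above, a client can throttle itself before sending requests. The sketch below is one possible client-side approach, not the server's implementation: the `SlidingWindowLimiter` class and the `seconds_until_allowed` helper are hypothetical names, and the sliding-window strategy is an assumption (the server may count requests differently).

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Hypothetical client-side throttle: at most `limit` calls per `window` seconds."""

    def __init__(self, limit, window, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock      # injectable so the limiter can be tested without sleeping
        self.calls = deque()    # timestamps of recent calls, oldest first

    def wait_time(self):
        """Return how many seconds to wait before the next call is allowed."""
        now = self.clock()
        # Drop timestamps that have left the window.
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) < self.limit:
            return 0.0
        return self.window - (now - self.calls[0])

    def record(self):
        """Register that a request was just sent."""
        self.calls.append(self.clock())


def seconds_until_allowed(limiters):
    """A request must satisfy every limiter, so wait for the slowest one."""
    return max(limiter.wait_time() for limiter in limiters)


# Limits taken from the feature list: 30 req/min and 500 req/hour.
per_minute = SlidingWindowLimiter(limit=30, window=60)
per_hour = SlidingWindowLimiter(limit=500, window=3600)
```

Before each request, sleep for `seconds_until_allowed([per_minute, per_hour])` and then call `record()` on both limiters.
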
## Quick Start

```bash
curl -X POST "https://your-endpoint/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DragonLLM/Qwen-Open-Finance-R-8B",
    "messages": [{"role": "user", "content": "What is compound interest?"}],
    "max_tokens": 500
  }'
```

```python
from openai import OpenAI

# The placeholder api_key satisfies the client, which requires a non-empty value.
client = OpenAI(base_url="https://your-endpoint/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="DragonLLM/Qwen-Open-Finance-R-8B",
    messages=[{"role": "user", "content": "What is compound interest?"}],
    max_tokens=500,
)
# Print the text of the first (and only) choice.
print(response.choices[0].message.content)
```

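Tool calling uses the same endpoint with an extra `tools` field. The sketch below shows one way a client might declare a tool and execute the calls the model returns. The `compound_interest` tool, the `run_tool_call` dispatcher, and `ask_with_tools` are hypothetical names invented for illustration; only the OpenAI-style `tools` / `tool_calls` request and response fields are standard.

```python
import json


def compound_interest(principal, annual_rate, years, compounds_per_year=1):
    """Hypothetical tool: final balance after periodic compounding."""
    periods = compounds_per_year * years
    return principal * (1 + annual_rate / compounds_per_year) ** periods


# OpenAI-style tool declaration; the schema describes the hypothetical tool above.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "compound_interest",
        "description": "Compute the final balance after compounding.",
        "parameters": {
            "type": "object",
            "properties": {
                "principal": {"type": "number"},
                "annual_rate": {"type": "number", "description": "0.05 means 5%"},
                "years": {"type": "integer"},
                "compounds_per_year": {"type": "integer"},
            },
            "required": ["principal", "annual_rate", "years"],
        },
    },
}]


def run_tool_call(name, arguments_json):
    """Execute one tool call and return a JSON string for the follow-up 'tool' message."""
    if name != "compound_interest":
        raise ValueError(f"unknown tool: {name}")
    args = json.loads(arguments_json)
    return json.dumps({"balance": round(compound_interest(**args), 2)})


def ask_with_tools(base_url, question):
    """Send one request with tools enabled and run any tool calls the model makes."""
    from openai import OpenAI  # imported lazily so the helpers above work without it

    client = OpenAI(base_url=base_url, api_key="not-needed")
    response = client.chat.completions.create(
        model="DragonLLM/Qwen-Open-Finance-R-8B",
        messages=[{"role": "user", "content": question}],
        tools=TOOLS,
        tool_choice="auto",
    )
    calls = response.choices[0].message.tool_calls or []
    return [run_tool_call(c.function.name, c.function.arguments) for c in calls]
```

A call such as `ask_with_tools("https://your-endpoint/v1", "What is 1000 at 5% after 2 years?")` returns the tool results as JSON strings. Sending them back to the model as `tool` messages is left out for brevity.
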
## Configuration

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `HF_TOKEN_LC2` | Yes | - | Hugging Face token |
| `MODEL` | No | `DragonLLM/Qwen-Open-Finance-R-8B` | Model name |
| `PORT` | No | `8000` (vLLM) / `7860` (Transformers) | Server port |

**vLLM-specific (Koyeb):**
- `ENABLE_AUTO_TOOL_CHOICE=true` - Enable tool calling
- `TOOL_CALL_PARSER=hermes` - Parser for Qwen models
- `MAX_MODEL_LEN=8192` - Max context length
- `GPU_MEMORY_UTILIZATION=0.90` - GPU memory fraction

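As an illustration of how these variables could map onto vLLM server flags, here is a minimal sketch. It is an assumption about the wiring, not a copy of what `Dockerfile.koyeb` does: the `vllm_args` function is hypothetical, and treating the values listed above as fallbacks is a guess. The flags themselves (`--max-model-len`, `--gpu-memory-utilization`, `--enable-auto-tool-choice`, `--tool-call-parser`) are standard vLLM server options.

```python
import os


def vllm_args(env=None):
    """Hypothetical helper: translate the documented variables into vLLM CLI flags."""
    env = os.environ if env is None else env

    # HF_TOKEN_LC2 is marked as required, so fail early if it is missing.
    if not env.get("HF_TOKEN_LC2"):
        raise KeyError("HF_TOKEN_LC2 must be set")

    args = [
        "--model", env.get("MODEL", "DragonLLM/Qwen-Open-Finance-R-8B"),
        "--port", env.get("PORT", "8000"),
        "--max-model-len", env.get("MAX_MODEL_LEN", "8192"),
        "--gpu-memory-utilization", env.get("GPU_MEMORY_UTILIZATION", "0.90"),
    ]
    # Tool calling needs both the switch and a parser name.
    if env.get("ENABLE_AUTO_TOOL_CHOICE", "true").lower() == "true":
        args += [
            "--enable-auto-tool-choice",
            "--tool-call-parser", env.get("TOOL_CALL_PARSER", "hermes"),
        ]
    return args
```

With only `HF_TOKEN_LC2` set, `vllm_args()` yields the values from the list above, including the `hermes` parser.
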
## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | Chat completion |
| `/v1/stats` | GET | Usage statistics |
| `/health` | GET | Health check |

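The GET endpoints in the table can be probed with the standard library alone. The following is a minimal smoke-check sketch: `endpoint_url` and `fetch` are hypothetical helpers, and because the response bodies of `/health` and `/v1/stats` are not documented here, the code returns the raw status and text rather than assuming a JSON shape.

```python
import urllib.request

# GET endpoints from the table above.
ENDPOINTS = {
    "models": "/v1/models",
    "stats": "/v1/stats",
    "health": "/health",
}


def endpoint_url(base_url, name):
    """Join the deployment's base URL with one of the documented paths."""
    return base_url.rstrip("/") + ENDPOINTS[name]


def fetch(base_url, name, timeout=10.0):
    """Return (HTTP status, body text) for one GET endpoint."""
    with urllib.request.urlopen(endpoint_url(base_url, name), timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8")
```

For example, `fetch("https://your-endpoint", "health")` is a quick way to confirm a deployment is up before sending chat requests.
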
## Technical Specs

- **Model**: DragonLLM/Qwen-Open-Finance-R-8B (8B parameters)
- **vLLM Backend**: vllm-openai:latest with hermes tool parser
- **Transformers Backend**: 4.45.0+ with PyTorch 2.5.0+ (CUDA 12.4)
- **Minimum VRAM**: 20GB (L4), recommended 48GB (L40s)

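A rough back-of-the-envelope check on the 20GB minimum, under two assumptions that this README does not state: the weights are stored in a 16-bit dtype (2 bytes per parameter), and the model has exactly 8e9 parameters. The `weight_memory_gb` helper is hypothetical.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Memory taken by the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


weights_gb = weight_memory_gb(8e9)   # assumed 16-bit weights
headroom_gb = 20 - weights_gb        # what the 20GB minimum leaves for everything else
print(f"weights: {weights_gb:.0f} GB, headroom: {headroom_gb:.0f} GB")
```

Under these assumptions the weights take about 16 GB, and the remainder has to cover the KV cache and activations, which is consistent with the larger 48GB card being the recommended option.
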
## Development

```bash
# Install dependencies
pip install -r requirements.txt
# Start the API locally with auto-reload (this command blocks the terminal)
uvicorn app.main:app --reload --port 8080
# Run the test suite (from a separate terminal)
pytest tests/ -v
```

## License

MIT License