jeanbaptdzd committed
Commit 7239fe3 · 1 Parent(s): c495666

fix: vLLM tool calling - enable by default with hermes parser


- Fix "--enable-auto-tool-choice requires --tool-call-parser" error
- Default TOOL_CALL_PARSER=hermes for Qwen models
- Default ENABLE_AUTO_TOOL_CHOICE=true
- Update Dockerfile.koyeb with vLLM backend
- Clean up deprecated files
- Update README with deployment options

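With --enable-auto-tool-choice and --tool-call-parser hermes now on by default, the vLLM endpoint accepts OpenAI-style `tools` requests. A minimal sketch, assuming a deployed endpoint at `https://your-endpoint` and an illustrative `get_stock_price` tool that is not part of this repo:

```bash
curl -X POST "https://your-endpoint/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DragonLLM/Qwen-Open-Finance-R-8B",
    "messages": [{"role": "user", "content": "What is the latest price of ACME stock?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_stock_price",
        "description": "Look up the latest price for a ticker symbol",
        "parameters": {
          "type": "object",
          "properties": {"ticker": {"type": "string"}},
          "required": ["ticker"]
        }
      }
    }],
    "tool_choice": "auto"
  }'
```

If the hermes parser detects a tool call in the model output, the response carries it in `choices[0].message.tool_calls` instead of plain `content`.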
Dockerfile CHANGED
@@ -68,23 +68,14 @@ RUN test -f /app/app/providers/transformers_provider.py && \
68
  grep -q "def initialize_model" /app/app/providers/transformers_provider.py || \
69
  (echo "ERROR: transformers_provider.py not found or invalid!" && exit 1)
70
 
71
- # Copy startup script
72
- COPY start.sh /app/start.sh
73
-
74
  # Create non-root user and cache directories in single layer
75
  # Use ${HF_HOME} variable (defaults to /tmp/huggingface if not set)
76
  RUN useradd -m -u 1000 user && \
77
  mkdir -p ${HF_HOME:-/tmp/huggingface} /tmp/torch/inductor /tmp/triton && \
78
- chmod +x /app/start.sh && \
79
- chown -R user:user /app ${HF_HOME:-/tmp/huggingface} /tmp/torch /tmp/triton && \
80
- # Verify startup script is executable and has correct shebang
81
- test -x /app/start.sh && head -1 /app/start.sh | grep -q "^#!/bin/bash" || (echo "ERROR: start.sh not executable or wrong shebang!" && exit 1)
82
 
83
  USER user
84
 
85
- # Expose ports for both HF Spaces (7860) and Koyeb (8000)
86
- # PORT environment variable controls which port the app actually uses
87
- EXPOSE 7860 8000
88
 
89
- # Use startup script for more reliable execution
90
- CMD ["/app/start.sh"]
 
68
  grep -q "def initialize_model" /app/app/providers/transformers_provider.py || \
69
  (echo "ERROR: transformers_provider.py not found or invalid!" && exit 1)
70
 
 
 
 
71
  # Create non-root user and cache directories in single layer
72
  # Use ${HF_HOME} variable (defaults to /tmp/huggingface if not set)
73
  RUN useradd -m -u 1000 user && \
74
  mkdir -p ${HF_HOME:-/tmp/huggingface} /tmp/torch/inductor /tmp/triton && \
75
+ chown -R user:user /app ${HF_HOME:-/tmp/huggingface} /tmp/torch /tmp/triton
 
 
 
76
 
77
  USER user
78
 
79
+ EXPOSE 7860
 
 
80
 
81
+ CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "7860"]
 
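With start.sh removed, the Transformers image now launches uvicorn directly on port 7860. A quick local sanity check, assuming a hypothetical image tag and a placeholder token:

```bash
# Build the HF Spaces (Transformers) image; the tag name is illustrative
docker build -t dragon-llm-inference:transformers .

# Run locally; replace the HF_TOKEN_LC2 placeholder with a real token
docker run --rm -p 7860:7860 \
  -e HF_TOKEN_LC2=hf_xxx \
  dragon-llm-inference:transformers

# Probe readiness
curl http://localhost:7860/health
```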
Dockerfile.koyeb CHANGED
@@ -1,4 +1,5 @@
1
  # Koyeb-optimized Dockerfile using official vLLM OpenAI image
 
2
  # Uses ENTRYPOINT to ensure args aren't overridden by Koyeb
3
 
4
  FROM vllm/vllm-openai:latest
@@ -19,3 +20,4 @@ EXPOSE 8000
19
 
20
  # Use ENTRYPOINT so it can't be overridden by empty Koyeb args
21
  ENTRYPOINT ["/start-vllm.sh"]
 
 
1
  # Koyeb-optimized Dockerfile using official vLLM OpenAI image
2
+ # Compatible with Koyeb's one-click deployment patterns for Qwen + vLLM
3
  # Uses ENTRYPOINT to ensure args aren't overridden by Koyeb
4
 
5
  FROM vllm/vllm-openai:latest
 
20
 
21
  # Use ENTRYPOINT so it can't be overridden by empty Koyeb args
22
  ENTRYPOINT ["/start-vllm.sh"]
23
+
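For a local smoke test of the image built from Dockerfile.koyeb, a sketch assuming an NVIDIA GPU host and an illustrative image tag; the env vars mirror the defaults documented in start-vllm.sh and the README:

```bash
docker build --platform linux/amd64 -f Dockerfile.koyeb -t dragon-llm-inference:vllm-amd64 .

# HF_TOKEN_LC2 is a placeholder; ENABLE_AUTO_TOOL_CHOICE and TOOL_CALL_PARSER
# are the script defaults, shown explicitly for clarity
docker run --rm --gpus all -p 8000:8000 \
  -e HF_TOKEN_LC2=hf_xxx \
  -e MODEL=DragonLLM/Qwen-Open-Finance-R-8B \
  -e ENABLE_AUTO_TOOL_CHOICE=true \
  -e TOOL_CALL_PARSER=hermes \
  dragon-llm-inference:vllm-amd64
```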
KOYEB_VLLM_DEPLOYMENT.md DELETED
@@ -1,93 +0,0 @@
1
- # Koyeb vLLM Deployment
2
-
3
- ## Overview
4
-
5
- The Koyeb deployment uses **vLLM's native OpenAI-compatible API server** with full CUDA optimizations.
6
-
7
- ## Docker Image
8
-
9
- **Public image on Docker Hub:**
10
- ```
11
- jeanbapt/dragon-llm-inference:vllm-amd64
12
- ```
13
-
14
- **Important:** Must be built with `--platform linux/amd64` for Koyeb GPU instances.
15
-
16
- Built from `Dockerfile.koyeb` with:
17
- - Base: `vllm/vllm-openai:latest`
18
- - Custom startup script for env var configuration
19
- - Flash Attention 2, PagedAttention, continuous batching
20
-
21
- ## Koyeb Configuration
22
-
23
- ### Environment Variables
24
-
25
- | Variable | Value | Description |
26
- |----------|-------|-------------|
27
- | `HF_TOKEN_LC2` | (secret) | Hugging Face token for model access |
28
- | `MODEL` | `DragonLLM/Qwen-Open-Finance-R-8B` | Model to load |
29
- | `PORT` | `8000` | Server port |
30
- | `MAX_MODEL_LEN` | `8192` | Max context length |
31
- | `GPU_MEMORY_UTILIZATION` | `0.90` | GPU memory usage |
32
-
33
- ### Instance Type
34
-
35
- - **Recommended**: `gpu-nvidia-l40s` (48GB VRAM) in Iowa (`dsm`)
36
- - **Alternative**: `gpu-nvidia-rtx-4000-sff-ada` (20GB VRAM) in Frankfurt (`fra`)
37
-
38
- ### Health Check
39
-
40
- - **Type**: TCP
41
- - **Port**: 8000
42
- - **Grace Period**: 900 seconds (15 minutes for model loading)
43
-
44
- ## API Endpoints (vLLM Native)
45
-
46
- ```
47
- POST /v1/chat/completions - Chat completions (OpenAI compatible)
48
- POST /v1/completions - Text completions
49
- GET /v1/models - List models
50
- GET /health - Health check
51
- ```
52
-
53
- ## Usage Example
54
-
55
- ```python
56
- from openai import OpenAI
57
-
58
- client = OpenAI(
59
- base_url="https://dragon-llm-open-finance-inference.koyeb.app/v1",
60
- api_key="not-needed"
61
- )
62
-
63
- response = client.chat.completions.create(
64
- model="DragonLLM/Qwen-Open-Finance-R-8B",
65
- messages=[
66
- {"role": "user", "content": "Analyze the impact of rising interest rates"}
67
- ],
68
- temperature=0.7,
69
- max_tokens=1024
70
- )
71
- ```
72
-
73
- ## Build & Push
74
-
75
- ```bash
76
- # Build for linux/amd64 (required for Koyeb GPU)
77
- docker buildx build --platform linux/amd64 \
78
- -f Dockerfile.koyeb \
79
- -t jeanbapt/dragon-llm-inference:vllm-amd64 \
80
- --push .
81
- ```
82
-
83
- ## Troubleshooting
84
-
85
- ### "Application exited with code 8" with no logs
86
-
87
- 1. **Wrong architecture**: Ensure image is built for `linux/amd64`, not ARM
88
- 2. **GPU allocation failed**: Try different region or GPU type
89
- 3. **Container crash**: Check if `python3` is used (not `python`)
90
-
91
- ### Model download issues
92
-
93
- Ensure `HF_TOKEN_LC2` is set with access to the model.
 
README.md CHANGED
@@ -15,93 +15,38 @@ OpenAI-compatible API powered by DragonLLM/Qwen-Open-Finance-R-8B.
15
 
16
  ## Deployment Options
17
 
18
- | Platform | Backend | Docker Image | Port |
19
- |----------|---------|--------------|------|
20
- | **HF Spaces** | Transformers | Default (builds from `Dockerfile`) | 7860 |
21
- | **Koyeb** | vLLM (optimized) | `jeanbapt/dragon-llm-inference:vllm` | 8000 |
22
-
23
- ### Docker Hub Public Images
24
-
25
- ```
26
- jeanbapt/dragon-llm-inference:vllm-amd64 # Koyeb - vLLM with CUDA optimizations (linux/amd64)
27
- jeanbapt/dragon-llm-inference:latest # HF Spaces - Transformers backend
28
- ```
29
 
30
  ## Features
31
 
32
- - **OpenAI-compatible API** - Drop-in replacement for OpenAI SDK
33
- - **French and English support** - Automatic language detection
34
- - **Rate limiting** - Built-in protection (30 req/min, 500 req/hour)
35
- - **Statistics tracking** - Token usage and request metrics via `/v1/stats`
36
- - **Health monitoring** - Model readiness status in `/health` endpoint
37
- - **Streaming support** - Real-time response streaming
38
- - **Tool calls support** - OpenAI-compatible tool/function calling
39
- - **Structured outputs** - JSON format support via `response_format`
40
 
41
- ## API Endpoints
42
 
43
- ### Chat Completions
44
  ```bash
45
  curl -X POST "https://your-endpoint/v1/chat/completions" \
46
  -H "Content-Type: application/json" \
47
  -d '{
48
  "model": "DragonLLM/Qwen-Open-Finance-R-8B",
49
  "messages": [{"role": "user", "content": "What is compound interest?"}],
50
- "temperature": 0.7,
51
  "max_tokens": 500
52
  }'
53
  ```
54
 
55
- ### List Models
56
- ```bash
57
- curl -X GET "https://your-endpoint/v1/models"
58
- ```
59
-
60
- ### Streaming
61
- ```bash
62
- curl -X POST "https://your-endpoint/v1/chat/completions" \
63
- -H "Content-Type: application/json" \
64
- -d '{
65
- "model": "DragonLLM/Qwen-Open-Finance-R-8B",
66
- "messages": [{"role": "user", "content": "Explain Value at Risk"}],
67
- "stream": true
68
- }'
69
- ```
70
-
71
- ### Health Check
72
- ```bash
73
- curl -X GET "https://your-endpoint/health"
74
- ```
75
-
76
- ## Configuration
77
-
78
- ### Environment Variables
79
-
80
- **Required:**
81
- - `HF_TOKEN_LC2` - Hugging Face token with access to DragonLLM models
82
-
83
- **Optional:**
84
- - `MODEL` - Model name (default: `DragonLLM/Qwen-Open-Finance-R-8B`)
85
- - `PORT` - Server port (default: 7860 for HF, 8000 for Koyeb)
86
- - `SERVICE_API_KEY` - API key for authentication
87
- - `LOG_LEVEL` - Logging level (default: `info`)
88
-
89
- Token priority: `HF_TOKEN_LC2` > `HF_TOKEN_LC` > `HF_TOKEN` > `HUGGING_FACE_HUB_TOKEN`
90
-
91
- **Note:** Accept model terms at https://huggingface.co/DragonLLM/Qwen-Open-Finance-R-8B before use.
92
-
93
- ## Integration
94
-
95
  ### OpenAI SDK
96
-
97
  ```python
98
  from openai import OpenAI
99
 
100
- client = OpenAI(
101
- base_url="https://your-endpoint/v1",
102
- api_key="not-needed" # or your SERVICE_API_KEY
103
- )
104
-
105
  response = client.chat.completions.create(
106
  model="DragonLLM/Qwen-Open-Finance-R-8B",
107
  messages=[{"role": "user", "content": "What is compound interest?"}],
@@ -109,86 +54,64 @@ response = client.chat.completions.create(
109
  )
110
  ```
111
 
112
- ## Koyeb Deployment (vLLM)
113
-
114
- The Koyeb deployment uses vLLM's native OpenAI-compatible server with full CUDA optimizations:
115
 
116
- - **Flash Attention 2** - Faster attention computation
117
- - **PagedAttention** - Efficient GPU memory management
118
- - **Continuous batching** - High throughput inference
119
- - **Prefix caching** - Reuse KV cache for common prefixes
120
 
121
- See [KOYEB_VLLM_DEPLOYMENT.md](KOYEB_VLLM_DEPLOYMENT.md) for detailed setup.
 
 
 
 
122
 
123
- ### Quick Deploy to Koyeb
124
 
125
- 1. Create app in Koyeb dashboard
126
- 2. Set Docker image: `jeanbapt/dragon-llm-inference:vllm`
127
- 3. Add environment variables:
128
- - `MODEL`: `DragonLLM/Qwen-Open-Finance-R-8B`
129
- - `HF_TOKEN_LC2`: (your HF token as secret)
130
- - `PORT`: `8000`
131
- 4. Select GPU instance (L40s recommended)
132
- 5. Set health check: `GET /health` on port 8000
133
 
134
- ## Technical Specifications
135
 
136
- **Model:**
137
- - DragonLLM/Qwen-Open-Finance-R-8B (8B parameters)
138
- - Fine-tuned on financial data
139
- - English and French support
 
140
 
141
- **HF Spaces Backend:**
142
- - Transformers 4.45.0+
143
- - PyTorch 2.5.0+ (CUDA 12.4)
144
 
145
- **Koyeb Backend:**
146
- - vLLM 0.6.0+
147
- - Flash Attention 2
148
- - CUDA 12.4
149
 
150
- **Hardware:**
151
- - Minimum: L4 GPU (24GB VRAM)
152
- - Recommended: L40s GPU (48GB VRAM)
 
 
 
153
 
154
- ## Project Structure
155
 
156
- ```
157
- .
158
- ├── app/ # Main API application
159
- │ ├── main.py # FastAPI app (HF Spaces)
160
- │ ├── routers/ # API routes
161
- │ ├── providers/ # Model providers (Transformers)
162
- │ ├── middleware/ # Rate limiting, auth
163
- │ └── utils/ # Utilities, stats tracking
164
- ├── Dockerfile # HF Spaces (Transformers)
165
- ├── Dockerfile.koyeb # Koyeb (vLLM)
166
- ├── start.sh # HF Spaces startup
167
- ├── start-vllm.sh # Koyeb vLLM startup
168
- ├── docs/ # Technical documentation
169
- └── tests/ # Test suite
170
- ```
171
 
172
  ## Development
173
 
174
- ### Local Setup
175
-
176
  ```bash
177
  pip install -r requirements.txt
178
  uvicorn app.main:app --reload --port 8080
179
  ```
180
 
181
  ### Testing
182
-
183
  ```bash
184
- # Unit tests
185
  pytest tests/ -v
186
-
187
- # Integration tests
188
- python tests/integration/test_space_basic.py
189
  python tests/integration/test_tool_calls.py
190
  ```
191
 
192
  ## License
193
 
194
- MIT License - see [LICENSE](LICENSE) file.
 
15
 
16
  ## Deployment Options
17
 
18
+ | Platform | Backend | Dockerfile | Use Case |
19
+ |----------|---------|------------|----------|
20
+ | Hugging Face Spaces | Transformers | `Dockerfile` | Development, L4 GPU |
21
+ | Koyeb | vLLM | `Dockerfile.koyeb` | Production, L40s GPU |
 
 
 
 
 
 
 
22
 
23
  ## Features
24
 
25
+ - OpenAI-compatible API
26
+ - Tool/function calling support
27
+ - Streaming responses
28
+ - French and English financial terminology
29
+ - Rate limiting (30 req/min, 500 req/hour)
30
+ - Statistics tracking via `/v1/stats`
 
 
31
 
32
+ ## Quick Start
33
 
34
+ ### Chat Completion
35
  ```bash
36
  curl -X POST "https://your-endpoint/v1/chat/completions" \
37
  -H "Content-Type: application/json" \
38
  -d '{
39
  "model": "DragonLLM/Qwen-Open-Finance-R-8B",
40
  "messages": [{"role": "user", "content": "What is compound interest?"}],
 
41
  "max_tokens": 500
42
  }'
43
  ```
44
 
45
  ### OpenAI SDK
 
46
  ```python
47
  from openai import OpenAI
48
 
49
+ client = OpenAI(base_url="https://your-endpoint/v1", api_key="not-needed")
 
 
 
 
50
  response = client.chat.completions.create(
51
  model="DragonLLM/Qwen-Open-Finance-R-8B",
52
  messages=[{"role": "user", "content": "What is compound interest?"}],
 
54
  )
55
  ```
56
 
57
+ ## Configuration
 
 
58
 
59
+ ### Environment Variables
 
 
 
60
 
61
+ | Variable | Required | Default | Description |
62
+ |----------|----------|---------|-------------|
63
+ | `HF_TOKEN_LC2` | Yes | - | Hugging Face token |
64
+ | `MODEL` | No | `DragonLLM/Qwen-Open-Finance-R-8B` | Model name |
65
+ | `PORT` | No | `8000` (vLLM) / `7860` (Transformers) | Server port |
66
 
67
+ ### vLLM-specific (Koyeb)
68
 
69
+ | Variable | Default | Description |
70
+ |----------|---------|-------------|
71
+ | `ENABLE_AUTO_TOOL_CHOICE` | `true` | Enable tool calling |
72
+ | `TOOL_CALL_PARSER` | `hermes` | Parser for Qwen models |
73
+ | `MAX_MODEL_LEN` | `8192` | Max context length |
74
+ | `GPU_MEMORY_UTILIZATION` | `0.90` | GPU memory fraction |
 
 
75
 
76
+ ## Koyeb Deployment
77
 
78
+ Build and push the vLLM image:
79
+ ```bash
80
+ docker build --platform linux/amd64 -f Dockerfile.koyeb -t your-registry/dragon-llm-inference:vllm-amd64 .
81
+ docker push your-registry/dragon-llm-inference:vllm-amd64
82
+ ```
83
 
84
+ Recommended instance: `gpu-nvidia-l40s` (48GB VRAM)
 
 
85
 
86
+ ## API Endpoints
 
 
 
87
 
88
+ | Endpoint | Method | Description |
89
+ |----------|--------|-------------|
90
+ | `/v1/models` | GET | List available models |
91
+ | `/v1/chat/completions` | POST | Chat completion |
92
+ | `/v1/stats` | GET | Usage statistics |
93
+ | `/health` | GET | Health check |
94
 
95
+ ## Technical Specifications
96
 
97
+ - **Model**: DragonLLM/Qwen-Open-Finance-R-8B (8B parameters)
98
+ - **vLLM Backend**: vllm-openai:latest with hermes tool parser
99
+ - **Transformers Backend**: 4.45.0+ with PyTorch 2.5.0+ (CUDA 12.4)
100
+ - **Minimum VRAM**: 24GB (L4), recommended 48GB (L40s)
 
 
 
 
 
 
 
 
 
 
 
101
 
102
  ## Development
103
 
 
 
104
  ```bash
105
  pip install -r requirements.txt
106
  uvicorn app.main:app --reload --port 8080
107
  ```
108
 
109
  ### Testing
 
110
  ```bash
 
111
  pytest tests/ -v
 
 
 
112
  python tests/integration/test_tool_calls.py
113
  ```
114
 
115
  ## License
116
 
117
+ MIT License
app/providers/transformers_provider.py CHANGED
@@ -183,7 +183,7 @@ class TransformersProvider:
183
  pass
184
 
185
  async def list_models(self) -> Dict[str, Any]:
186
- """List available models (matching vLLM format)."""
187
  return {
188
  "object": "list",
189
  "data": [
@@ -192,25 +192,9 @@ class TransformersProvider:
192
  "object": "model",
193
  "created": 1677610602,
194
  "owned_by": "DragonLLM",
 
195
  "root": MODEL_NAME,
196
  "parent": None,
197
- "max_model_len": 32768, # Qwen-3 8B base context window
198
- "permission": [
199
- {
200
- "id": f"modelperm-{os.urandom(12).hex()}",
201
- "object": "model_permission",
202
- "created": 1677610602,
203
- "allow_create_engine": False,
204
- "allow_sampling": True,
205
- "allow_logprobs": True,
206
- "allow_search_indices": False,
207
- "allow_view": True,
208
- "allow_fine_tuning": False,
209
- "organization": "*",
210
- "group": None,
211
- "is_blocking": False,
212
- }
213
- ],
214
  }
215
  ]
216
  }
@@ -366,14 +350,11 @@ class TransformersProvider:
366
 
367
  # Extract token counts using tokenizer for accuracy
368
  # Count prompt tokens (more accurate than shape[1] as it handles special tokens correctly)
369
- prompt_tokens = len(inputs["input_ids"][0])
370
- generated_ids = outputs[0][inputs["input_ids"].shape[1]:]
371
  generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
372
  completion_tokens = len(generated_ids)
373
 
374
- # ✅ Remove reasoning tags from all responses (Qwen reasoning models include these)
375
- generated_text = self._remove_reasoning_tags(generated_text)
376
-
377
  # ✅ If JSON output is required, try to extract JSON from the response
378
  if json_output_required:
379
  generated_text = self._extract_json_from_text(generated_text)
@@ -402,18 +383,10 @@ class TransformersProvider:
402
  finish_reason=finish_reason,
403
  ))
404
 
405
- # Build message with optional tool_calls (matching vLLM format)
406
- message = {
407
- "role": "assistant",
408
- "content": generated_text if generated_text.strip() else None,
409
- "refusal": None,
410
- "annotations": None,
411
- "audio": None,
412
- "function_call": None,
413
- "tool_calls": tool_calls if tool_calls else [],
414
- "reasoning": None,
415
- "reasoning_content": None,
416
- }
417
 
418
  return {
419
  "id": f"chatcmpl-{os.urandom(12).hex()}",
@@ -424,23 +397,14 @@ class TransformersProvider:
424
  {
425
  "index": 0,
426
  "message": message,
427
- "logprobs": None,
428
  "finish_reason": finish_reason,
429
- "stop_reason": None,
430
- "token_ids": None,
431
  }
432
  ],
433
- "service_tier": None,
434
- "system_fingerprint": None,
435
  "usage": {
436
  "prompt_tokens": prompt_tokens,
437
- "total_tokens": prompt_tokens + completion_tokens,
438
  "completion_tokens": completion_tokens,
439
- "prompt_tokens_details": None,
440
  },
441
- "prompt_logprobs": None,
442
- "prompt_token_ids": None,
443
- "kv_transfer_params": None,
444
  }
445
 
446
  async def _chat_stream(
@@ -451,7 +415,7 @@ class TransformersProvider:
451
  created = int(time.time())
452
 
453
  # Count prompt tokens
454
- prompt_tokens = len(inputs["input_ids"][0])
455
  completion_tokens = 0
456
  generated_text = ""
457
 
@@ -491,13 +455,9 @@ class TransformersProvider:
491
  {
492
  "index": 0,
493
  "delta": {"content": token},
494
- "logprobs": None,
495
  "finish_reason": None,
496
- "stop_reason": None,
497
  }
498
  ],
499
- "service_tier": None,
500
- "system_fingerprint": None,
501
  }
502
  yield f"data: {json.dumps(chunk, ensure_ascii=False)}\n\n"
503
  await asyncio.sleep(0)
@@ -523,23 +483,13 @@ class TransformersProvider:
523
  finish_reason=finish_reason,
524
  ))
525
 
526
- # Send final chunk (matching vLLM format)
527
  final_chunk = {
528
  "id": completion_id,
529
  "object": "chat.completion.chunk",
530
  "created": created,
531
  "model": model_id,
532
- "choices": [
533
- {
534
- "index": 0,
535
- "delta": {},
536
- "logprobs": None,
537
- "finish_reason": "stop",
538
- "stop_reason": None,
539
- }
540
- ],
541
- "service_tier": None,
542
- "system_fingerprint": None,
543
  }
544
  yield f"data: {json.dumps(final_chunk, ensure_ascii=False)}\n\n"
545
  yield "data: [DONE]\n\n"
@@ -561,120 +511,60 @@ class TransformersProvider:
561
 
562
  def _remove_reasoning_tags(self, text: str) -> str:
563
  """Remove Qwen reasoning tags from text."""
564
- cleaned_text = text
565
-
566
- # Remove closed reasoning tags - matches <think>...</think>
567
  cleaned_text = re.sub(
568
  r'<think>.*?</think>',
569
  '',
570
- cleaned_text,
571
  flags=re.DOTALL | re.IGNORECASE
572
  )
573
 
574
- # Handle unclosed reasoning tags - find closing tag and keep everything after
575
- closing_tag = "</think>"
576
- if closing_tag in cleaned_text:
577
- # Find the last closing tag position
578
- last_closing = cleaned_text.rfind(closing_tag)
579
- if last_closing != -1:
580
- # Get everything after the closing tag
581
- cleaned_text = cleaned_text[last_closing + len(closing_tag):].strip()
582
-
583
- # If still has opening tag but no closing tag, remove everything up to and including the tag
584
- opening_tag = "<think>"
585
- opening_pos = cleaned_text.lower().find(opening_tag.lower())
586
- if opening_pos != -1:
587
- # Find the end of the opening tag
588
- tag_end = cleaned_text.find(">", opening_pos)
589
- if tag_end != -1:
590
- # Get everything after the tag
591
- after_tag = cleaned_text[tag_end + 1:].strip()
592
-
593
- # The content after the tag is often still reasoning
594
- # Look for patterns that indicate the start of the actual answer
595
- # Strategy: Find the last sentence that doesn't contain reasoning indicators
596
-
597
- # Split into sentences
598
- sentences = re.split(r'([.!?]\s+)', after_tag)
599
- # Recombine sentences with their punctuation
600
- sentence_pairs = []
601
- for i in range(0, len(sentences) - 1, 2):
602
- if i + 1 < len(sentences):
603
- sentence_pairs.append(sentences[i] + sentences[i + 1])
604
- else:
605
- sentence_pairs.append(sentences[i])
606
-
607
- # Reasoning indicators - sentences starting with these are likely reasoning
608
- reasoning_starters = [
609
- 'okay', 'let me', 'i need to', 'first', 'let\'s see', 'the user',
610
- 'i should', 'i must', 'i have to', 'let me check', 'i\'ll',
611
- 'i will', 'i can', 'i want to', 'i think', 'i believe'
612
- ]
613
-
614
- # Find the last sentence that doesn't start with reasoning indicators
615
- answer_sentence = None
616
- for sentence in reversed(sentence_pairs):
617
- sentence_clean = sentence.strip()
618
- if len(sentence_clean) < 10: # Too short, skip
619
- continue
620
- # Check if sentence starts with reasoning indicators
621
- first_words = ' '.join(sentence_clean.split()[:3]).lower()
622
- if not any(starter in first_words for starter in reasoning_starters):
623
- # This looks like an actual answer
624
- answer_sentence = sentence_clean
625
- break
626
-
627
- if answer_sentence:
628
- cleaned_text = answer_sentence
629
- else:
630
- # Fallback: remove the tag and take everything after, but clean it up
631
- # Remove common reasoning phrases at the start
632
- cleaned = after_tag
633
- for phrase in reasoning_starters:
634
- if cleaned.lower().startswith(phrase):
635
- # Find the end of this phrase and take what comes after
636
- words = cleaned.split()
637
- # Skip first few words that match the phrase
638
- for i, word in enumerate(words):
639
- if phrase not in ' '.join(words[:i+1]).lower():
640
- cleaned = ' '.join(words[i:])
641
- break
642
- cleaned_text = cleaned.strip()
643
 
644
- return cleaned_text.strip()
 
 
 
 
 
 
645
 
646
- def _extract_json_by_brace_matching(self, text: str, start_pos: int = 0) -> Optional[str]:
647
- """Extract JSON object by matching braces starting at given position."""
648
- brace_start = text.find('{', start_pos)
649
- if brace_start == -1:
650
- return None
651
-
652
- brace_count = 0
653
- in_string = False
654
- escape_next = False
655
- for i in range(brace_start, len(text)):
656
- if escape_next:
657
- escape_next = False
658
- continue
659
- if text[i] == '\\':
660
- escape_next = True
661
- elif text[i] == '"' and not in_string:
662
- in_string = True
663
- elif text[i] == '"' and in_string:
664
- in_string = False
665
- elif text[i] == '{' and not in_string:
666
- brace_count += 1
667
- elif text[i] == '}' and not in_string:
668
- brace_count -= 1
669
- if brace_count == 0:
670
- json_candidate = text[brace_start:i+1]
671
- try:
672
- json.loads(json_candidate)
673
- return json_candidate
674
- except json.JSONDecodeError:
675
- return None
676
  return None
677
 
678
  def _format_tools_for_prompt(self, tools: List[Dict[str, Any]]) -> str:
679
  """Format tools for inclusion in system prompt."""
680
  tools_text = (
 
183
  pass
184
 
185
  async def list_models(self) -> Dict[str, Any]:
186
+ """List available models."""
187
  return {
188
  "object": "list",
189
  "data": [
 
192
  "object": "model",
193
  "created": 1677610602,
194
  "owned_by": "DragonLLM",
195
+ "permission": [],
196
  "root": MODEL_NAME,
197
  "parent": None,
 
198
  }
199
  ]
200
  }
 
350
 
351
  # Extract token counts using tokenizer for accuracy
352
  # Count prompt tokens (more accurate than shape[1] as it handles special tokens correctly)
353
+ prompt_tokens = len(inputs.input_ids[0])
354
+ generated_ids = outputs[0][inputs.input_ids.shape[1]:]
355
  generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
356
  completion_tokens = len(generated_ids)
357
 
 
 
 
358
  # ✅ If JSON output is required, try to extract JSON from the response
359
  if json_output_required:
360
  generated_text = self._extract_json_from_text(generated_text)
 
383
  finish_reason=finish_reason,
384
  ))
385
 
386
+ # Build message with optional tool_calls
387
+ message = {"role": "assistant", "content": generated_text if generated_text.strip() else None}
388
+ if tool_calls:
389
+ message["tool_calls"] = tool_calls
 
 
 
 
 
 
 
 
390
 
391
  return {
392
  "id": f"chatcmpl-{os.urandom(12).hex()}",
 
397
  {
398
  "index": 0,
399
  "message": message,
 
400
  "finish_reason": finish_reason,
 
 
401
  }
402
  ],
 
 
403
  "usage": {
404
  "prompt_tokens": prompt_tokens,
 
405
  "completion_tokens": completion_tokens,
406
+ "total_tokens": prompt_tokens + completion_tokens,
407
  },
 
 
 
408
  }
409
 
410
  async def _chat_stream(
 
415
  created = int(time.time())
416
 
417
  # Count prompt tokens
418
+ prompt_tokens = len(inputs.input_ids[0])
419
  completion_tokens = 0
420
  generated_text = ""
421
 
 
455
  {
456
  "index": 0,
457
  "delta": {"content": token},
 
458
  "finish_reason": None,
 
459
  }
460
  ],
 
 
461
  }
462
  yield f"data: {json.dumps(chunk, ensure_ascii=False)}\n\n"
463
  await asyncio.sleep(0)
 
483
  finish_reason=finish_reason,
484
  ))
485
 
486
+ # Send final chunk
487
  final_chunk = {
488
  "id": completion_id,
489
  "object": "chat.completion.chunk",
490
  "created": created,
491
  "model": model_id,
492
+ "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}],
 
 
 
 
 
 
 
 
 
 
493
  }
494
  yield f"data: {json.dumps(final_chunk, ensure_ascii=False)}\n\n"
495
  yield "data: [DONE]\n\n"
 
511
 
512
  def _remove_reasoning_tags(self, text: str) -> str:
513
  """Remove Qwen reasoning tags from text."""
514
+ # Remove reasoning tags - matches <think>...</think>
 
 
515
  cleaned_text = re.sub(
516
  r'<think>.*?</think>',
517
  '',
518
+ text,
519
  flags=re.DOTALL | re.IGNORECASE
520
  )
521
 
522
+ # Handle unclosed reasoning tags (split on closing tag)
523
+ if "</think>" in cleaned_text:
524
+ parts = cleaned_text.split("</think>", 1)
525
+ if len(parts) > 1:
526
+ cleaned_text = parts[1].strip()
 
 
527
 
528
+ # If still has opening tag but no closing, remove everything before first {
529
+ if "<think>" in cleaned_text.lower() and "{" in cleaned_text:
530
+ brace_pos = cleaned_text.find('{')
531
+ if brace_pos != -1:
532
+ cleaned_text = cleaned_text[brace_pos:]
533
+
534
+ return cleaned_text
535
 
536
+ def _extract_json_by_brace_matching(self, text: str, start_pos: int = 0) -> Optional[str]:
537
+ """Extract JSON object by matching braces starting at given position."""
538
+ brace_start = text.find('{', start_pos)
539
+ if brace_start == -1:
 
 
540
  return None
541
 
542
+ brace_count = 0
543
+ in_string = False
544
+ escape_next = False
545
+ for i in range(brace_start, len(text)):
546
+ if escape_next:
547
+ escape_next = False
548
+ continue
549
+ if text[i] == '\\':
550
+ escape_next = True
551
+ elif text[i] == '"' and not in_string:
552
+ in_string = True
553
+ elif text[i] == '"' and in_string:
554
+ in_string = False
555
+ elif text[i] == '{' and not in_string:
556
+ brace_count += 1
557
+ elif text[i] == '}' and not in_string:
558
+ brace_count -= 1
559
+ if brace_count == 0:
560
+ json_candidate = text[brace_start:i+1]
561
+ try:
562
+ json.loads(json_candidate)
563
+ return json_candidate
564
+ except json.JSONDecodeError:
565
+ return None
566
+ return None
567
+
568
  def _format_tools_for_prompt(self, tools: List[Dict[str, Any]]) -> str:
569
  """Format tools for inclusion in system prompt."""
570
  tools_text = (
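The json_output_required path above still routes responses through `_extract_json_from_text`, so the Transformers backend keeps its OpenAI-style JSON mode. A hedged curl sketch (endpoint and prompt are placeholders):

```bash
curl -X POST "https://your-endpoint/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DragonLLM/Qwen-Open-Finance-R-8B",
    "messages": [{"role": "user", "content": "Return a JSON object with fields ticker and sector for Apple."}],
    "response_format": {"type": "json_object"},
    "max_tokens": 200
  }'
```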
docs/STRUCTURED_OUTPUTS_COMPARISON.md DELETED
@@ -1,132 +0,0 @@
1
- # Structured Outputs: vLLM vs PydanticAI Comparison
2
-
3
- ## Overview
4
-
5
- This document compares how vLLM and PydanticAI handle structured outputs, and why they may not be fully compatible.
6
-
7
- ## vLLM Structured Outputs
8
-
9
- ### Method
10
- vLLM uses **`extra_body`** parameter with `structured_outputs` key (NOT standard OpenAI `response_format`):
11
-
12
- ```python
13
- completion = client.chat.completions.create(
14
- model="DragonLLM/Qwen-Open-Finance-R-8B",
15
- messages=[{"role": "user", "content": "Generate JSON..."}],
16
- extra_body={
17
- "structured_outputs": {
18
- "json": json_schema # Pydantic model.model_json_schema()
19
- }
20
- }
21
- )
22
- ```
23
-
24
- ### Supported Formats
25
- 1. **JSON Schema**: `{"json": json_schema}`
26
- 2. **Regex**: `{"regex": r"pattern"}`
27
- 3. **Choice**: `{"choice": ["option1", "option2"]}`
28
- 4. **Grammar**: `{"grammar": "CFG definition"}`
29
-
30
- ### Response Format
31
- - Returns JSON string in `message.content`
32
- - No tool calls involved
33
- - Direct JSON in content field
34
-
35
- ## PydanticAI Structured Outputs
36
-
37
- ### Method
38
- PydanticAI uses **tool calling** with `tool_choice="required"`:
39
-
40
- ```python
41
- agent = Agent(model, system_prompt="...")
42
- result = await agent.run(prompt, output_type=Portfolio)
43
- ```
44
-
45
- ### How It Works
46
- 1. PydanticAI converts `output_type` (Pydantic model) to a tool definition
47
- 2. Sends request with:
48
- - `tools`: [function definition matching the schema]
49
- - `tool_choice`: `"required"` (forces tool call)
50
- 3. Expects response with `tool_calls` array
51
- 4. Extracts JSON from `tool_calls[0].function.arguments`
52
-
53
- ### Expected Response Format
54
- ```json
55
- {
56
- "choices": [{
57
- "message": {
58
- "tool_calls": [{
59
- "function": {
60
- "name": "...",
61
- "arguments": "{\"field\": \"value\"}" // JSON string
62
- }
63
- }]
64
- }
65
- }]
66
- }
67
- ```
68
-
69
- ## Compatibility Issue
70
-
71
- ### Problem
72
- - **vLLM**: Uses `extra_body.structured_outputs` → Returns JSON in `message.content`
73
- - **PydanticAI**: Uses `tools` + `tool_choice="required"` → Expects JSON in `tool_calls[].function.arguments`
74
-
75
- ### Current Status
76
- - ✅ **HF Space**: Works because it implements tool calling support
77
- - ❌ **vLLM**: Fails because vLLM's structured outputs return JSON in `content`, not `tool_calls`
78
-
79
- ## Solutions
80
-
81
- ### Option 1: Use vLLM's `extra_body` (Recommended)
82
- Modify PydanticAI's OpenAI provider to detect vLLM and use `extra_body` instead of tools:
83
-
84
- ```python
85
- # In PydanticAI OpenAI provider
86
- if output_type:
87
- json_schema = output_type.model_json_schema()
88
- # Use vLLM structured_outputs instead of tools
89
- extra_body = {
90
- "structured_outputs": {"json": json_schema}
91
- }
92
- ```
93
-
94
- ### Option 2: Add Tool Call Support to vLLM Response
95
- When vLLM receives `tools` + `tool_choice="required"`, wrap the structured output in a tool call format.
96
-
97
- ### Option 3: Use `response_format` (Limited)
98
- Standard OpenAI `response_format={"type": "json_object"}` works but:
99
- - Only enforces JSON, not schema validation
100
- - PydanticAI would need to parse and validate manually
101
- - Less reliable than schema-based approaches
102
-
103
- ## Current Implementation Status
104
-
105
- ### HF Space (Transformers)
106
- - ✅ Supports tool calling (text-based parsing)
107
- - ✅ Supports `response_format`
108
- - ✅ Works with PydanticAI's tool-based approach
109
-
110
- ### vLLM
111
- - ✅ Supports `extra_body.structured_outputs` (JSON schema)
112
- - ❌ Does NOT support tool calling for structured outputs
113
- - ✅ Supports `response_format` (basic JSON mode only)
114
-
115
- ## Recommendation
116
-
117
- For full compatibility with PydanticAI, we need to:
118
-
119
- 1. **Detect vLLM endpoint** in PydanticAI provider
120
- 2. **Use `extra_body.structured_outputs`** instead of tools when using vLLM
121
- 3. **Parse `message.content`** instead of `tool_calls` for vLLM responses
122
-
123
- Alternatively, implement a middleware in the HF Space API that:
124
- - Detects `tools` + `tool_choice="required"` requests
125
- - Converts to `extra_body.structured_outputs` for vLLM
126
- - Wraps response in tool call format for PydanticAI compatibility
127
-
128
- ## References
129
-
130
- - [vLLM Structured Outputs Docs](https://docs.vllm.ai/en/stable/features/structured_outputs/)
131
- - [PydanticAI Documentation](https://ai.pydantic.dev/)
132
-
 
start-vllm.sh CHANGED
@@ -1,6 +1,7 @@
1
  #!/bin/bash
2
  # vLLM OpenAI-compatible API server startup script
3
- # This script ensures args are always passed, even if Koyeb clears CMD
 
4
 
5
  set -e
6
 
@@ -10,6 +11,7 @@ PORT="${PORT:-8000}"
10
  MAX_MODEL_LEN="${MAX_MODEL_LEN:-8192}"
11
  GPU_MEMORY_UTILIZATION="${GPU_MEMORY_UTILIZATION:-0.90}"
12
  DTYPE="${DTYPE:-bfloat16}"
 
13
 
14
  # HF Token - HF_TOKEN_LC2 is the model access token (priority)
15
  export HF_TOKEN="${HF_TOKEN_LC2:-${HF_TOKEN:-${HUGGING_FACE_HUB_TOKEN:-}}}"
@@ -22,31 +24,37 @@ echo "Model: $MODEL"
22
  echo "Port: $PORT"
23
  echo "Max Model Len: $MAX_MODEL_LEN"
24
  echo "GPU Memory Utilization: $GPU_MEMORY_UTILIZATION"
 
25
  echo "HF Token: ${HF_TOKEN:+set (${#HF_TOKEN} chars)}"
26
  echo "=========================================="
27
 
28
- # Execute vLLM server (use python3, not python)
29
- # Enable tool calling support for OpenAI-compatible API
30
- # For Qwen3 models, valid parsers are: qwen3_coder, qwen3_xml
31
- # If TOOL_CALL_PARSER is not set, use --enable-auto-tool-choice only
32
  VLLM_ARGS=(
33
  --model "$MODEL"
34
  --trust-remote-code
35
  --dtype "$DTYPE"
36
  --max-model-len "$MAX_MODEL_LEN"
37
  --gpu-memory-utilization "$GPU_MEMORY_UTILIZATION"
 
38
  --port "$PORT"
39
  --host 0.0.0.0
40
- --enable-auto-tool-choice
41
  )
42
 
43
- # Add tool-call-parser only if explicitly specified
44
- # For Qwen3 models, use: qwen3_xml or qwen3_coder
45
- if [ -n "${TOOL_CALL_PARSER:-}" ]; then
46
- VLLM_ARGS+=(--tool-call-parser "$TOOL_CALL_PARSER")
47
- echo "Tool Calling: ENABLED (auto-tool-choice, parser: $TOOL_CALL_PARSER)"
 
 
 
 
 
48
  else
49
- echo "Tool Calling: ENABLED (auto-tool-choice only, no parser)"
50
  fi
51
 
 
 
 
52
  exec python3 -m vllm.entrypoints.openai.api_server "${VLLM_ARGS[@]}"
 
1
  #!/bin/bash
2
  # vLLM OpenAI-compatible API server startup script
3
+ # Compatible with Koyeb GPU deployment patterns
4
+ # Based on Koyeb's one-click vLLM + Qwen deployment templates
5
 
6
  set -e
7
 
 
11
  MAX_MODEL_LEN="${MAX_MODEL_LEN:-8192}"
12
  GPU_MEMORY_UTILIZATION="${GPU_MEMORY_UTILIZATION:-0.90}"
13
  DTYPE="${DTYPE:-bfloat16}"
14
+ TENSOR_PARALLEL_SIZE="${TENSOR_PARALLEL_SIZE:-${KOYEB_GPU_COUNT:-1}}"
15
 
16
  # HF Token - HF_TOKEN_LC2 is the model access token (priority)
17
  export HF_TOKEN="${HF_TOKEN_LC2:-${HF_TOKEN:-${HUGGING_FACE_HUB_TOKEN:-}}}"
 
24
  echo "Port: $PORT"
25
  echo "Max Model Len: $MAX_MODEL_LEN"
26
  echo "GPU Memory Utilization: $GPU_MEMORY_UTILIZATION"
27
+ echo "Tensor Parallel Size: $TENSOR_PARALLEL_SIZE"
28
  echo "HF Token: ${HF_TOKEN:+set (${#HF_TOKEN} chars)}"
29
  echo "=========================================="
30
 
31
+ # Build vLLM arguments
 
 
 
32
  VLLM_ARGS=(
33
  --model "$MODEL"
34
  --trust-remote-code
35
  --dtype "$DTYPE"
36
  --max-model-len "$MAX_MODEL_LEN"
37
  --gpu-memory-utilization "$GPU_MEMORY_UTILIZATION"
38
+ --tensor-parallel-size "$TENSOR_PARALLEL_SIZE"
39
  --port "$PORT"
40
  --host 0.0.0.0
 
41
  )
42
 
43
+ # Tool Calling Support
44
+ # ENABLED BY DEFAULT for Qwen models (using hermes parser)
45
+ # Set ENABLE_AUTO_TOOL_CHOICE=false to disable
46
+ # For Qwen models, the default parser is 'hermes'
47
+ ENABLE_AUTO_TOOL_CHOICE="${ENABLE_AUTO_TOOL_CHOICE:-true}"
48
+ TOOL_CALL_PARSER="${TOOL_CALL_PARSER:-hermes}"
49
+
50
+ if [ "${ENABLE_AUTO_TOOL_CHOICE}" = "true" ]; then
51
+ VLLM_ARGS+=(--enable-auto-tool-choice --tool-call-parser "$TOOL_CALL_PARSER")
52
+ echo "Tool Calling: ENABLED (parser: $TOOL_CALL_PARSER)"
53
  else
54
+ echo "Tool Calling: DISABLED"
55
  fi
56
 
57
+ echo "=========================================="
58
+
59
+ # Execute vLLM server
60
  exec python3 -m vllm.entrypoints.openai.api_server "${VLLM_ARGS[@]}"
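With the defaults above, the exec line resolves to roughly the following command (values shown are the documented defaults; each maps to an env var in the script):

```bash
python3 -m vllm.entrypoints.openai.api_server \
  --model DragonLLM/Qwen-Open-Finance-R-8B \
  --trust-remote-code \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --tensor-parallel-size 1 \
  --port 8000 \
  --host 0.0.0.0 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
```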
start.sh DELETED
@@ -1,10 +0,0 @@
1
- #!/bin/bash
2
- # Get port from environment variable, default to 7860
3
- PORT=${PORT:-7860}
4
-
5
- # Redirect all output to stderr so it shows in logs
6
- exec >&2
7
-
8
- # Start uvicorn with the specified port
9
- exec python -m uvicorn app.main:app --host 0.0.0.0 --port "$PORT"
10
-