jeanbaptdzd committed
Commit 7239fe3 · 1 Parent(s): c495666

fix: vLLM tool calling - enable by default with hermes parser


- Fix "--enable-auto-tool-choice requires --tool-call-parser" error
- Default TOOL_CALL_PARSER=hermes for Qwen models
- Default ENABLE_AUTO_TOOL_CHOICE=true
- Update Dockerfile.koyeb with vLLM backend
- Clean up deprecated files
- Update README with deployment options

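With --enable-auto-tool-choice and --tool-call-parser hermes now on by default, the vLLM endpoint accepts OpenAI-style `tools` requests. A minimal sketch, assuming a deployed endpoint at `https://your-endpoint` and an illustrative `get_stock_price` tool that is not part of this repo:

```bash
curl -X POST "https://your-endpoint/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DragonLLM/Qwen-Open-Finance-R-8B",
    "messages": [{"role": "user", "content": "What is the latest price of ACME stock?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_stock_price",
        "description": "Look up the latest price for a ticker symbol",
        "parameters": {
          "type": "object",
          "properties": {"ticker": {"type": "string"}},
          "required": ["ticker"]
        }
      }
    }],
    "tool_choice": "auto"
  }'
```

If the hermes parser detects a tool call in the model output, the response carries it in `choices[0].message.tool_calls` instead of plain `content`.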
Dockerfile CHANGED
@@ -68,23 +68,14 @@ RUN test -f /app/app/providers/transformers_provider.py && \
68
  grep -q "def initialize_model" /app/app/providers/transformers_provider.py || \
69
  (echo "ERROR: transformers_provider.py not found or invalid!" && exit 1)
70
 
71
- # Copy startup script
72
- COPY start.sh /app/start.sh
73
-
74
  # Create non-root user and cache directories in single layer
75
  # Use ${HF_HOME} variable (defaults to /tmp/huggingface if not set)
76
  RUN useradd -m -u 1000 user && \
77
  mkdir -p ${HF_HOME:-/tmp/huggingface} /tmp/torch/inductor /tmp/triton && \
78
- chmod +x /app/start.sh && \
79
- chown -R user:user /app ${HF_HOME:-/tmp/huggingface} /tmp/torch /tmp/triton && \
80
- # Verify startup script is executable and has correct shebang
81
- test -x /app/start.sh && head -1 /app/start.sh | grep -q "^#!/bin/bash" || (echo "ERROR: start.sh not executable or wrong shebang!" && exit 1)
82
 
83
  USER user
84
 
85
- # Expose ports for both HF Spaces (7860) and Koyeb (8000)
86
- # PORT environment variable controls which port the app actually uses
87
- EXPOSE 7860 8000
88
 
89
- # Use startup script for more reliable execution
90
- CMD ["/app/start.sh"]
 
68
  grep -q "def initialize_model" /app/app/providers/transformers_provider.py || \
69
  (echo "ERROR: transformers_provider.py not found or invalid!" && exit 1)
70
 
 
 
 
71
  # Create non-root user and cache directories in single layer
72
  # Use ${HF_HOME} variable (defaults to /tmp/huggingface if not set)
73
  RUN useradd -m -u 1000 user && \
74
  mkdir -p ${HF_HOME:-/tmp/huggingface} /tmp/torch/inductor /tmp/triton && \
75
+ chown -R user:user /app ${HF_HOME:-/tmp/huggingface} /tmp/torch /tmp/triton
 
 
 
76
 
77
  USER user
78
 
79
+ EXPOSE 7860
 
 
80
 
81
+ CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "7860"]
 
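With start.sh removed, the Transformers image now launches uvicorn directly on port 7860. A quick local sanity check, assuming a hypothetical image tag and a placeholder token:

```bash
# Build the HF Spaces (Transformers) image; the tag name is illustrative
docker build -t dragon-llm-inference:transformers .

# Run locally; replace the HF_TOKEN_LC2 placeholder with a real token
docker run --rm -p 7860:7860 \
  -e HF_TOKEN_LC2=hf_xxx \
  dragon-llm-inference:transformers

# Probe readiness
curl http://localhost:7860/health
```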
Dockerfile.koyeb CHANGED
@@ -1,4 +1,5 @@
1
  # Koyeb-optimized Dockerfile using official vLLM OpenAI image
 
2
  # Uses ENTRYPOINT to ensure args aren't overridden by Koyeb
3
 
4
  FROM vllm/vllm-openai:latest
@@ -19,3 +20,4 @@ EXPOSE 8000
19
 
20
  # Use ENTRYPOINT so it can't be overridden by empty Koyeb args
21
  ENTRYPOINT ["/start-vllm.sh"]
 
 
1
  # Koyeb-optimized Dockerfile using official vLLM OpenAI image
2
+ # Compatible with Koyeb's one-click deployment patterns for Qwen + vLLM
3
  # Uses ENTRYPOINT to ensure args aren't overridden by Koyeb
4
 
5
  FROM vllm/vllm-openai:latest
 
20
 
21
  # Use ENTRYPOINT so it can't be overridden by empty Koyeb args
22
  ENTRYPOINT ["/start-vllm.sh"]
23
+
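For a local smoke test of the image built from Dockerfile.koyeb, a sketch assuming an NVIDIA GPU host and an illustrative image tag; the env vars mirror the defaults documented in start-vllm.sh and the README:

```bash
docker build --platform linux/amd64 -f Dockerfile.koyeb -t dragon-llm-inference:vllm-amd64 .

# HF_TOKEN_LC2 is a placeholder; ENABLE_AUTO_TOOL_CHOICE and TOOL_CALL_PARSER
# are the script defaults, shown explicitly for clarity
docker run --rm --gpus all -p 8000:8000 \
  -e HF_TOKEN_LC2=hf_xxx \
  -e MODEL=DragonLLM/Qwen-Open-Finance-R-8B \
  -e ENABLE_AUTO_TOOL_CHOICE=true \
  -e TOOL_CALL_PARSER=hermes \
  dragon-llm-inference:vllm-amd64
```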
KOYEB_VLLM_DEPLOYMENT.md DELETED
@@ -1,93 +0,0 @@
1
- # Koyeb vLLM Deployment
2
-
3
- ## Overview
4
-
5
- The Koyeb deployment uses **vLLM's native OpenAI-compatible API server** with full CUDA optimizations.
6
-
7
- ## Docker Image
8
-
9
- **Public image on Docker Hub:**
10
- ```
11
- jeanbapt/dragon-llm-inference:vllm-amd64
12
- ```
13
-
14
- **Important:** Must be built with `--platform linux/amd64` for Koyeb GPU instances.
15
-
16
- Built from `Dockerfile.koyeb` with:
17
- - Base: `vllm/vllm-openai:latest`
18
- - Custom startup script for env var configuration
19
- - Flash Attention 2, PagedAttention, continuous batching
20
-
21
- ## Koyeb Configuration
22
-
23
- ### Environment Variables
24
-
25
- | Variable | Value | Description |
26
- |----------|-------|-------------|
27
- | `HF_TOKEN_LC2` | (secret) | Hugging Face token for model access |
28
- | `MODEL` | `DragonLLM/Qwen-Open-Finance-R-8B` | Model to load |
29
- | `PORT` | `8000` | Server port |
30
- | `MAX_MODEL_LEN` | `8192` | Max context length |
31
- | `GPU_MEMORY_UTILIZATION` | `0.90` | GPU memory usage |
32
-
33
- ### Instance Type
34
-
35
- - **Recommended**: `gpu-nvidia-l40s` (48GB VRAM) in Iowa (`dsm`)
36
- - **Alternative**: `gpu-nvidia-rtx-4000-sff-ada` (20GB VRAM) in Frankfurt (`fra`)
37
-
38
- ### Health Check
39
-
40
- - **Type**: TCP
41
- - **Port**: 8000
42
- - **Grace Period**: 900 seconds (15 minutes for model loading)
43
-
44
- ## API Endpoints (vLLM Native)
45
-
46
- ```
47
- POST /v1/chat/completions - Chat completions (OpenAI compatible)
48
- POST /v1/completions - Text completions
49
- GET /v1/models - List models
50
- GET /health - Health check
51
- ```
52
-
53
- ## Usage Example
54
-
55
- ```python
56
- from openai import OpenAI
57
-
58
- client = OpenAI(
59
- base_url="https://dragon-llm-open-finance-inference.koyeb.app/v1",
60
- api_key="not-needed"
61
- )
62
-
63
- response = client.chat.completions.create(
64
- model="DragonLLM/Qwen-Open-Finance-R-8B",
65
- messages=[
66
- {"role": "user", "content": "Analyze the impact of rising interest rates"}
67
- ],
68
- temperature=0.7,
69
- max_tokens=1024
70
- )
71
- ```
72
-
73
- ## Build & Push
74
-
75
- ```bash
76
- # Build for linux/amd64 (required for Koyeb GPU)
77
- docker buildx build --platform linux/amd64 \
78
- -f Dockerfile.koyeb \
79
- -t jeanbapt/dragon-llm-inference:vllm-amd64 \
80
- --push .
81
- ```
82
-
83
- ## Troubleshooting
84
-
85
- ### "Application exited with code 8" with no logs
86
-
87
- 1. **Wrong architecture**: Ensure image is built for `linux/amd64`, not ARM
88
- 2. **GPU allocation failed**: Try different region or GPU type
89
- 3. **Container crash**: Check if `python3` is used (not `python`)
90
-
91
- ### Model download issues
92
-
93
- Ensure `HF_TOKEN_LC2` is set with access to the model.
 
README.md CHANGED
@@ -15,93 +15,38 @@ OpenAI-compatible API powered by DragonLLM/Qwen-Open-Finance-R-8B.
15
 
16
  ## Deployment Options
17
 
18
- | Platform | Backend | Docker Image | Port |
19
- |----------|---------|--------------|------|
20
- | **HF Spaces** | Transformers | Default (builds from `Dockerfile`) | 7860 |
21
- | **Koyeb** | vLLM (optimized) | `jeanbapt/dragon-llm-inference:vllm` | 8000 |
22
-
23
- ### Docker Hub Public Images
24
-
25
- ```
26
- jeanbapt/dragon-llm-inference:vllm-amd64 # Koyeb - vLLM with CUDA optimizations (linux/amd64)
27
- jeanbapt/dragon-llm-inference:latest # HF Spaces - Transformers backend
28
- ```
29
 
30
  ## Features
31
 
32
- - **OpenAI-compatible API** - Drop-in replacement for OpenAI SDK
33
- - **French and English support** - Automatic language detection
34
- - **Rate limiting** - Built-in protection (30 req/min, 500 req/hour)
35
- - **Statistics tracking** - Token usage and request metrics via `/v1/stats`
36
- - **Health monitoring** - Model readiness status in `/health` endpoint
37
- - **Streaming support** - Real-time response streaming
38
- - **Tool calls support** - OpenAI-compatible tool/function calling
39
- - **Structured outputs** - JSON format support via `response_format`
40
 
41
- ## API Endpoints
42
 
43
- ### Chat Completions
44
  ```bash
45
  curl -X POST "https://your-endpoint/v1/chat/completions" \
46
  -H "Content-Type: application/json" \
47
  -d '{
48
  "model": "DragonLLM/Qwen-Open-Finance-R-8B",
49
  "messages": [{"role": "user", "content": "What is compound interest?"}],
50
- "temperature": 0.7,
51
  "max_tokens": 500
52
  }'
53
  ```
54
 
55
- ### List Models
56
- ```bash
57
- curl -X GET "https://your-endpoint/v1/models"
58
- ```
59
-
60
- ### Streaming
61
- ```bash
62
- curl -X POST "https://your-endpoint/v1/chat/completions" \
63
- -H "Content-Type: application/json" \
64
- -d '{
65
- "model": "DragonLLM/Qwen-Open-Finance-R-8B",
66
- "messages": [{"role": "user", "content": "Explain Value at Risk"}],
67
- "stream": true
68
- }'
69
- ```
70
-
71
- ### Health Check
72
- ```bash
73
- curl -X GET "https://your-endpoint/health"
74
- ```
75
-
76
- ## Configuration
77
-
78
- ### Environment Variables
79
-
80
- **Required:**
81
- - `HF_TOKEN_LC2` - Hugging Face token with access to DragonLLM models
82
-
83
- **Optional:**
84
- - `MODEL` - Model name (default: `DragonLLM/Qwen-Open-Finance-R-8B`)
85
- - `PORT` - Server port (default: 7860 for HF, 8000 for Koyeb)
86
- - `SERVICE_API_KEY` - API key for authentication
87
- - `LOG_LEVEL` - Logging level (default: `info`)
88
-
89
- Token priority: `HF_TOKEN_LC2` > `HF_TOKEN_LC` > `HF_TOKEN` > `HUGGING_FACE_HUB_TOKEN`
90
-
91
- **Note:** Accept model terms at https://huggingface.co/DragonLLM/Qwen-Open-Finance-R-8B before use.
92
-
93
- ## Integration
94
-
95
  ### OpenAI SDK
96
-
97
  ```python
98
  from openai import OpenAI
99
 
100
- client = OpenAI(
101
- base_url="https://your-endpoint/v1",
102
- api_key="not-needed" # or your SERVICE_API_KEY
103
- )
104
-
105
  response = client.chat.completions.create(
106
  model="DragonLLM/Qwen-Open-Finance-R-8B",
107
  messages=[{"role": "user", "content": "What is compound interest?"}],
@@ -109,86 +54,64 @@ response = client.chat.completions.create(
109
  )
110
  ```
111
 
112
- ## Koyeb Deployment (vLLM)
113
-
114
- The Koyeb deployment uses vLLM's native OpenAI-compatible server with full CUDA optimizations:
115
 
116
- - **Flash Attention 2** - Faster attention computation
117
- - **PagedAttention** - Efficient GPU memory management
118
- - **Continuous batching** - High throughput inference
119
- - **Prefix caching** - Reuse KV cache for common prefixes
120
 
121
- See [KOYEB_VLLM_DEPLOYMENT.md](KOYEB_VLLM_DEPLOYMENT.md) for detailed setup.
 
 
 
 
122
 
123
- ### Quick Deploy to Koyeb
124
 
125
- 1. Create app in Koyeb dashboard
126
- 2. Set Docker image: `jeanbapt/dragon-llm-inference:vllm`
127
- 3. Add environment variables:
128
- - `MODEL`: `DragonLLM/Qwen-Open-Finance-R-8B`
129
- - `HF_TOKEN_LC2`: (your HF token as secret)
130
- - `PORT`: `8000`
131
- 4. Select GPU instance (L40s recommended)
132
- 5. Set health check: `GET /health` on port 8000
133
 
134
- ## Technical Specifications
135
 
136
- **Model:**
137
- - DragonLLM/Qwen-Open-Finance-R-8B (8B parameters)
138
- - Fine-tuned on financial data
139
- - English and French support
 
140
 
141
- **HF Spaces Backend:**
142
- - Transformers 4.45.0+
143
- - PyTorch 2.5.0+ (CUDA 12.4)
144
 
145
- **Koyeb Backend:**
146
- - vLLM 0.6.0+
147
- - Flash Attention 2
148
- - CUDA 12.4
149
 
150
- **Hardware:**
151
- - Minimum: L4 GPU (24GB VRAM)
152
- - Recommended: L40s GPU (48GB VRAM)
 
 
 
153
 
154
- ## Project Structure
155
 
156
- ```
157
- .
158
- ├── app/ # Main API application
159
- │ ├── main.py # FastAPI app (HF Spaces)
160
- │ ├── routers/ # API routes
161
- │ ├── providers/ # Model providers (Transformers)
162
- │ ├── middleware/ # Rate limiting, auth
163
- │ └── utils/ # Utilities, stats tracking
164
- ├── Dockerfile # HF Spaces (Transformers)
165
- ├── Dockerfile.koyeb # Koyeb (vLLM)
166
- ├── start.sh # HF Spaces startup
167
- ├── start-vllm.sh # Koyeb vLLM startup
168
- ├── docs/ # Technical documentation
169
- └── tests/ # Test suite
170
- ```
171
 
172
  ## Development
173
 
174
- ### Local Setup
175
-
176
  ```bash
177
  pip install -r requirements.txt
178
  uvicorn app.main:app --reload --port 8080
179
  ```
180
 
181
  ### Testing
182
-
183
  ```bash
184
- # Unit tests
185
  pytest tests/ -v
186
-
187
- # Integration tests
188
- python tests/integration/test_space_basic.py
189
  python tests/integration/test_tool_calls.py
190
  ```
191
 
192
  ## License
193
 
194
- MIT License - see [LICENSE](LICENSE) file.
 
15
 
16
  ## Deployment Options
17
 
18
+ | Platform | Backend | Dockerfile | Use Case |
19
+ |----------|---------|------------|----------|
20
+ | Hugging Face Spaces | Transformers | `Dockerfile` | Development, L4 GPU |
21
+ | Koyeb | vLLM | `Dockerfile.koyeb` | Production, L40s GPU |
 
 
 
 
 
 
 
22
 
23
  ## Features
24
 
25
+ - OpenAI-compatible API
26
+ - Tool/function calling support
27
+ - Streaming responses
28
+ - French and English financial terminology
29
+ - Rate limiting (30 req/min, 500 req/hour)
30
+ - Statistics tracking via `/v1/stats`
 
 
31
 
32
+ ## Quick Start
33
 
34
+ ### Chat Completion
35
  ```bash
36
  curl -X POST "https://your-endpoint/v1/chat/completions" \
37
  -H "Content-Type: application/json" \
38
  -d '{
39
  "model": "DragonLLM/Qwen-Open-Finance-R-8B",
40
  "messages": [{"role": "user", "content": "What is compound interest?"}],
 
41
  "max_tokens": 500
42
  }'
43
  ```
44
 
45
  ### OpenAI SDK
 
46
  ```python
47
  from openai import OpenAI
48
 
49
+ client = OpenAI(base_url="https://your-endpoint/v1", api_key="not-needed")
 
 
 
 
50
  response = client.chat.completions.create(
51
  model="DragonLLM/Qwen-Open-Finance-R-8B",
52
  messages=[{"role": "user", "content": "What is compound interest?"}],
 
54
  )
55
  ```
56
 
57
+ ## Configuration
 
 
58
 
59
+ ### Environment Variables
 
 
 
60
 
61
+ | Variable | Required | Default | Description |
62
+ |----------|----------|---------|-------------|
63
+ | `HF_TOKEN_LC2` | Yes | - | Hugging Face token |
64
+ | `MODEL` | No | `DragonLLM/Qwen-Open-Finance-R-8B` | Model name |
65
+ | `PORT` | No | `8000` (vLLM) / `7860` (Transformers) | Server port |
66
 
67
+ ### vLLM-specific (Koyeb)
68
 
69
+ | Variable | Default | Description |
70
+ |----------|---------|-------------|
71
+ | `ENABLE_AUTO_TOOL_CHOICE` | `true` | Enable tool calling |
72
+ | `TOOL_CALL_PARSER` | `hermes` | Parser for Qwen models |
73
+ | `MAX_MODEL_LEN` | `8192` | Max context length |
74
+ | `GPU_MEMORY_UTILIZATION` | `0.90` | GPU memory fraction |
 
 
75
 
76
+ ## Koyeb Deployment
77
 
78
+ Build and push the vLLM image:
79
+ ```bash
80
+ docker build --platform linux/amd64 -f Dockerfile.koyeb -t your-registry/dragon-llm-inference:vllm-amd64 .
81
+ docker push your-registry/dragon-llm-inference:vllm-amd64
82
+ ```
83
 
84
+ Recommended instance: `gpu-nvidia-l40s` (48GB VRAM)
 
 
85
 
86
+ ## API Endpoints
 
 
 
87
 
88
+ | Endpoint | Method | Description |
89
+ |----------|--------|-------------|
90
+ | `/v1/models` | GET | List available models |
91
+ | `/v1/chat/completions` | POST | Chat completion |
92
+ | `/v1/stats` | GET | Usage statistics |
93
+ | `/health` | GET | Health check |
94
 
95
+ ## Technical Specifications
96
 
97
+ - **Model**: DragonLLM/Qwen-Open-Finance-R-8B (8B parameters)
98
+ - **vLLM Backend**: vllm-openai:latest with hermes tool parser
99
+ - **Transformers Backend**: 4.45.0+ with PyTorch 2.5.0+ (CUDA 12.4)
100
+ - **Minimum VRAM**: 24GB (L4), recommended 48GB (L40s)
 
 
 
 
 
 
 
 
 
 
 
101
 
102
  ## Development
103
 
 
 
104
  ```bash
105
  pip install -r requirements.txt
106
  uvicorn app.main:app --reload --port 8080
107
  ```
108
 
109
  ### Testing
 
110
  ```bash
 
111
  pytest tests/ -v
 
 
 
112
  python tests/integration/test_tool_calls.py
113
  ```
114
 
115
  ## License
116
 
117
+ MIT License
app/providers/transformers_provider.py CHANGED
@@ -183,7 +183,7 @@ class TransformersProvider:
183
  pass
184
 
185
  async def list_models(self) -> Dict[str, Any]:
186
- """List available models (matching vLLM format)."""
187
  return {
188
  "object": "list",
189
  "data": [
@@ -192,25 +192,9 @@ class TransformersProvider:
192
  "object": "model",
193
  "created": 1677610602,
194
  "owned_by": "DragonLLM",
 
195
  "root": MODEL_NAME,
196
  "parent": None,
197
- "max_model_len": 32768, # Qwen-3 8B base context window
198
- "permission": [
199
- {
200
- "id": f"modelperm-{os.urandom(12).hex()}",
201
- "object": "model_permission",
202
- "created": 1677610602,
203
- "allow_create_engine": False,
204
- "allow_sampling": True,
205
- "allow_logprobs": True,
206
- "allow_search_indices": False,
207
- "allow_view": True,
208
- "allow_fine_tuning": False,
209
- "organization": "*",
210
- "group": None,
211
- "is_blocking": False,
212
- }
213
- ],
214
  }
215
  ]
216
  }
@@ -366,14 +350,11 @@ class TransformersProvider:
366
 
367
  # Extract token counts using tokenizer for accuracy
368
  # Count prompt tokens (more accurate than shape[1] as it handles special tokens correctly)
369
- prompt_tokens = len(inputs["input_ids"][0])
370
- generated_ids = outputs[0][inputs["input_ids"].shape[1]:]
371
  generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
372
  completion_tokens = len(generated_ids)
373
 
374
- # ✅ Remove reasoning tags from all responses (Qwen reasoning models include these)
375
- generated_text = self._remove_reasoning_tags(generated_text)
376
-
377
  # ✅ If JSON output is required, try to extract JSON from the response
378
  if json_output_required:
379
  generated_text = self._extract_json_from_text(generated_text)
@@ -402,18 +383,10 @@ class TransformersProvider:
402
  finish_reason=finish_reason,
403
  ))
404
 
405
- # Build message with optional tool_calls (matching vLLM format)
406
- message = {
407
- "role": "assistant",
408
- "content": generated_text if generated_text.strip() else None,
409
- "refusal": None,
410
- "annotations": None,
411
- "audio": None,
412
- "function_call": None,
413
- "tool_calls": tool_calls if tool_calls else [],
414
- "reasoning": None,
415
- "reasoning_content": None,
416
- }
417
 
418
  return {
419
  "id": f"chatcmpl-{os.urandom(12).hex()}",
@@ -424,23 +397,14 @@ class TransformersProvider:
424
  {
425
  "index": 0,
426
  "message": message,
427
- "logprobs": None,
428
  "finish_reason": finish_reason,
429
- "stop_reason": None,
430
- "token_ids": None,
431
  }
432
  ],
433
- "service_tier": None,
434
- "system_fingerprint": None,
435
  "usage": {
436
  "prompt_tokens": prompt_tokens,
437
- "total_tokens": prompt_tokens + completion_tokens,
438
  "completion_tokens": completion_tokens,
439
- "prompt_tokens_details": None,
440
  },
441
- "prompt_logprobs": None,
442
- "prompt_token_ids": None,
443
- "kv_transfer_params": None,
444
  }
445
 
446
  async def _chat_stream(
@@ -451,7 +415,7 @@ class TransformersProvider:
451
  created = int(time.time())
452
 
453
  # Count prompt tokens
454
- prompt_tokens = len(inputs["input_ids"][0])
455
  completion_tokens = 0
456
  generated_text = ""
457
 
@@ -491,13 +455,9 @@ class TransformersProvider:
491
  {
492
  "index": 0,
493
  "delta": {"content": token},
494
- "logprobs": None,
495
  "finish_reason": None,
496
- "stop_reason": None,
497
  }
498
  ],
499
- "service_tier": None,
500
- "system_fingerprint": None,
501
  }
502
  yield f"data: {json.dumps(chunk, ensure_ascii=False)}\n\n"
503
  await asyncio.sleep(0)
@@ -523,23 +483,13 @@ class TransformersProvider:
523
  finish_reason=finish_reason,
524
  ))
525
 
526
- # Send final chunk (matching vLLM format)
527
  final_chunk = {
528
  "id": completion_id,
529
  "object": "chat.completion.chunk",
530
  "created": created,
531
  "model": model_id,
532
- "choices": [
533
- {
534
- "index": 0,
535
- "delta": {},
536
- "logprobs": None,
537
- "finish_reason": "stop",
538
- "stop_reason": None,
539
- }
540
- ],
541
- "service_tier": None,
542
- "system_fingerprint": None,
543
  }
544
  yield f"data: {json.dumps(final_chunk, ensure_ascii=False)}\n\n"
545
  yield "data: [DONE]\n\n"
@@ -561,120 +511,60 @@ class TransformersProvider:
561
 
562
  def _remove_reasoning_tags(self, text: str) -> str:
563
  """Remove Qwen reasoning tags from text."""
564
- cleaned_text = text
565
-
566
- # Remove closed reasoning tags - matches <think>...</think>
567
  cleaned_text = re.sub(
568
  r'<think>.*?</think>',
569
  '',
570
- cleaned_text,
571
  flags=re.DOTALL | re.IGNORECASE
572
  )
573
 
574
- # Handle unclosed reasoning tags - find closing tag and keep everything after
575
- closing_tag = "</think>"
576
- if closing_tag in cleaned_text:
577
- # Find the last closing tag position
578
- last_closing = cleaned_text.rfind(closing_tag)
579
- if last_closing != -1:
580
- # Get everything after the closing tag
581
- cleaned_text = cleaned_text[last_closing + len(closing_tag):].strip()
582
-
583
- # If still has opening tag but no closing tag, remove everything up to and including the tag
584
- opening_tag = "<think>"
585
- opening_pos = cleaned_text.lower().find(opening_tag.lower())
586
- if opening_pos != -1:
587
- # Find the end of the opening tag
588
- tag_end = cleaned_text.find(">", opening_pos)
589
- if tag_end != -1:
590
- # Get everything after the tag
591
- after_tag = cleaned_text[tag_end + 1:].strip()
592
-
593
- # The content after the tag is often still reasoning
594
- # Look for patterns that indicate the start of the actual answer
595
- # Strategy: Find the last sentence that doesn't contain reasoning indicators
596
-
597
- # Split into sentences
598
- sentences = re.split(r'([.!?]\s+)', after_tag)
599
- # Recombine sentences with their punctuation
600
- sentence_pairs = []
601
- for i in range(0, len(sentences) - 1, 2):
602
- if i + 1 < len(sentences):
603
- sentence_pairs.append(sentences[i] + sentences[i + 1])
604
- else:
605
- sentence_pairs.append(sentences[i])
606
-
607
- # Reasoning indicators - sentences starting with these are likely reasoning
608
- reasoning_starters = [
609
- 'okay', 'let me', 'i need to', 'first', 'let\'s see', 'the user',
610
- 'i should', 'i must', 'i have to', 'let me check', 'i\'ll',
611
- 'i will', 'i can', 'i want to', 'i think', 'i believe'
612
- ]
613
-
614
- # Find the last sentence that doesn't start with reasoning indicators
615
- answer_sentence = None
616
- for sentence in reversed(sentence_pairs):
617
- sentence_clean = sentence.strip()
618
- if len(sentence_clean) < 10: # Too short, skip
619
- continue
620
- # Check if sentence starts with reasoning indicators
621
- first_words = ' '.join(sentence_clean.split()[:3]).lower()
622
- if not any(starter in first_words for starter in reasoning_starters):
623
- # This looks like an actual answer
624
- answer_sentence = sentence_clean
625
- break
626
-
627
- if answer_sentence:
628
- cleaned_text = answer_sentence
629
- else:
630
- # Fallback: remove the tag and take everything after, but clean it up
631
- # Remove common reasoning phrases at the start
632
- cleaned = after_tag
633
- for phrase in reasoning_starters:
634
- if cleaned.lower().startswith(phrase):
635
- # Find the end of this phrase and take what comes after
636
- words = cleaned.split()
637
- # Skip first few words that match the phrase
638
- for i, word in enumerate(words):
639
- if phrase not in ' '.join(words[:i+1]).lower():
640
- cleaned = ' '.join(words[i:])
641
- break
642
- cleaned_text = cleaned.strip()
643
 
644
- return cleaned_text.strip()
 
 
 
 
 
 
645
 
646
- def _extract_json_by_brace_matching(self, text: str, start_pos: int = 0) -> Optional[str]:
647
- """Extract JSON object by matching braces starting at given position."""
648
- brace_start = text.find('{', start_pos)
649
- if brace_start == -1:
650
- return None
651
-
652
- brace_count = 0
653
- in_string = False
654
- escape_next = False
655
- for i in range(brace_start, len(text)):
656
- if escape_next:
657
- escape_next = False
658
- continue
659
- if text[i] == '\\':
660
- escape_next = True
661
- elif text[i] == '"' and not in_string:
662
- in_string = True
663
- elif text[i] == '"' and in_string:
664
- in_string = False
665
- elif text[i] == '{' and not in_string:
666
- brace_count += 1
667
- elif text[i] == '}' and not in_string:
668
- brace_count -= 1
669
- if brace_count == 0:
670
- json_candidate = text[brace_start:i+1]
671
- try:
672
- json.loads(json_candidate)
673
- return json_candidate
674
- except json.JSONDecodeError:
675
- return None
676
  return None
677
 
678
  def _format_tools_for_prompt(self, tools: List[Dict[str, Any]]) -> str:
679
  """Format tools for inclusion in system prompt."""
680
  tools_text = (
 
183
  pass
184
 
185
  async def list_models(self) -> Dict[str, Any]:
186
+ """List available models."""
187
  return {
188
  "object": "list",
189
  "data": [
 
192
  "object": "model",
193
  "created": 1677610602,
194
  "owned_by": "DragonLLM",
195
+ "permission": [],
196
  "root": MODEL_NAME,
197
  "parent": None,
 
198
  }
199
  ]
200
  }
 
350
 
351
  # Extract token counts using tokenizer for accuracy
352
  # Count prompt tokens (more accurate than shape[1] as it handles special tokens correctly)
353
+ prompt_tokens = len(inputs.input_ids[0])
354
+ generated_ids = outputs[0][inputs.input_ids.shape[1]:]
355
  generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
356
  completion_tokens = len(generated_ids)
357
 
 
 
 
358
  # ✅ If JSON output is required, try to extract JSON from the response
359
  if json_output_required:
360
  generated_text = self._extract_json_from_text(generated_text)
 
383
  finish_reason=finish_reason,
384
  ))
385
 
386
+ # Build message with optional tool_calls
387
+ message = {"role": "assistant", "content": generated_text if generated_text.strip() else None}
388
+ if tool_calls:
389
+ message["tool_calls"] = tool_calls
 
 
 
 
 
 
 
 
390
 
391
  return {
392
  "id": f"chatcmpl-{os.urandom(12).hex()}",
 
397
  {
398
  "index": 0,
399
  "message": message,
 
400
  "finish_reason": finish_reason,
 
 
401
  }
402
  ],
 
 
403
  "usage": {
404
  "prompt_tokens": prompt_tokens,
 
405
  "completion_tokens": completion_tokens,
406
+ "total_tokens": prompt_tokens + completion_tokens,
407
  },
 
 
 
408
  }
409
 
410
  async def _chat_stream(
 
415
  created = int(time.time())
416
 
417
  # Count prompt tokens
418
+ prompt_tokens = len(inputs.input_ids[0])
419
  completion_tokens = 0
420
  generated_text = ""
421
 
 
455
  {
456
  "index": 0,
457
  "delta": {"content": token},
 
458
  "finish_reason": None,
 
459
  }
460
  ],
 
 
461
  }
462
  yield f"data: {json.dumps(chunk, ensure_ascii=False)}\n\n"
463
  await asyncio.sleep(0)
 
483
  finish_reason=finish_reason,
484
  ))
485
 
486
+ # Send final chunk
487
  final_chunk = {
488
  "id": completion_id,
489
  "object": "chat.completion.chunk",
490
  "created": created,
491
  "model": model_id,
492
+ "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}],
 
 
 
 
 
 
 
 
 
 
493
  }
494
  yield f"data: {json.dumps(final_chunk, ensure_ascii=False)}\n\n"
495
  yield "data: [DONE]\n\n"
 
511
 
512
  def _remove_reasoning_tags(self, text: str) -> str:
513
  """Remove Qwen reasoning tags from text."""
514
+ # Remove reasoning tags - matches <think>...</think>
 
 
515
  cleaned_text = re.sub(
516
  r'<think>.*?</think>',
517
  '',
518
+ text,
519
  flags=re.DOTALL | re.IGNORECASE
520
  )
521
 
522
+ # Handle unclosed reasoning tags (split on closing tag)
523
+ if "</think>" in cleaned_text:
524
+ parts = cleaned_text.split("</think>", 1)
525
+ if len(parts) > 1:
526
+ cleaned_text = parts[1].strip()
 
 
527
 
528
+ # If still has opening tag but no closing, remove everything before first {
529
+ if "<think>" in cleaned_text.lower() and "{" in cleaned_text:
530
+ brace_pos = cleaned_text.find('{')
531
+ if brace_pos != -1:
532
+ cleaned_text = cleaned_text[brace_pos:]
533
+
534
+ return cleaned_text
535
 
536
+ def _extract_json_by_brace_matching(self, text: str, start_pos: int = 0) -> Optional[str]:
537
+ """Extract JSON object by matching braces starting at given position."""
538
+ brace_start = text.find('{', start_pos)
539
+ if brace_start == -1:
 
 
540
  return None
541
 
542
+ brace_count = 0
543
+ in_string = False
544
+ escape_next = False
545
+ for i in range(brace_start, len(text)):
546
+ if escape_next:
547
+ escape_next = False
548
+ continue
549
+ if text[i] == '\\':
550
+ escape_next = True
551
+ elif text[i] == '"' and not in_string:
552
+ in_string = True
553
+ elif text[i] == '"' and in_string:
554
+ in_string = False
555
+ elif text[i] == '{' and not in_string:
556
+ brace_count += 1
557
+ elif text[i] == '}' and not in_string:
558
+ brace_count -= 1
559
+ if brace_count == 0:
560
+ json_candidate = text[brace_start:i+1]
561
+ try:
562
+ json.loads(json_candidate)
563
+ return json_candidate
564
+ except json.JSONDecodeError:
565
+ return None
566
+ return None
567
+
568
  def _format_tools_for_prompt(self, tools: List[Dict[str, Any]]) -> str:
569
  """Format tools for inclusion in system prompt."""
570
  tools_text = (
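The json_output_required path above still routes responses through `_extract_json_from_text`, so the Transformers backend keeps its OpenAI-style JSON mode. A hedged curl sketch (endpoint and prompt are placeholders):

```bash
curl -X POST "https://your-endpoint/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DragonLLM/Qwen-Open-Finance-R-8B",
    "messages": [{"role": "user", "content": "Return a JSON object with fields ticker and sector for Apple."}],
    "response_format": {"type": "json_object"},
    "max_tokens": 200
  }'
```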
docs/STRUCTURED_OUTPUTS_COMPARISON.md DELETED
@@ -1,132 +0,0 @@
1
- # Structured Outputs: vLLM vs PydanticAI Comparison
2
-
3
- ## Overview
4
-
5
- This document compares how vLLM and PydanticAI handle structured outputs, and why they may not be fully compatible.
6
-
7
- ## vLLM Structured Outputs
8
-
9
- ### Method
10
- vLLM uses **`extra_body`** parameter with `structured_outputs` key (NOT standard OpenAI `response_format`):
11
-
12
- ```python
13
- completion = client.chat.completions.create(
14
- model="DragonLLM/Qwen-Open-Finance-R-8B",
15
- messages=[{"role": "user", "content": "Generate JSON..."}],
16
- extra_body={
17
- "structured_outputs": {
18
- "json": json_schema # Pydantic model.model_json_schema()
19
- }
20
- }
21
- )
22
- ```
23
-
24
- ### Supported Formats
25
- 1. **JSON Schema**: `{"json": json_schema}`
26
- 2. **Regex**: `{"regex": r"pattern"}`
27
- 3. **Choice**: `{"choice": ["option1", "option2"]}`
28
- 4. **Grammar**: `{"grammar": "CFG definition"}`
29
-
30
- ### Response Format
31
- - Returns JSON string in `message.content`
32
- - No tool calls involved
33
- - Direct JSON in content field
34
-
35
- ## PydanticAI Structured Outputs
36
-
37
- ### Method
38
- PydanticAI uses **tool calling** with `tool_choice="required"`:
39
-
40
- ```python
41
- agent = Agent(model, system_prompt="...")
42
- result = await agent.run(prompt, output_type=Portfolio)
43
- ```
44
-
45
- ### How It Works
46
- 1. PydanticAI converts `output_type` (Pydantic model) to a tool definition
47
- 2. Sends request with:
48
- - `tools`: [function definition matching the schema]
49
- - `tool_choice`: `"required"` (forces tool call)
50
- 3. Expects response with `tool_calls` array
51
- 4. Extracts JSON from `tool_calls[0].function.arguments`
52
-
53
- ### Expected Response Format
54
- ```json
55
- {
56
- "choices": [{
57
- "message": {
58
- "tool_calls": [{
59
- "function": {
60
- "name": "...",
61
- "arguments": "{\"field\": \"value\"}" // JSON string
62
- }
63
- }]
64
- }
65
- }]
66
- }
67
- ```
68
-
69
- ## Compatibility Issue
70
-
71
- ### Problem
72
- - **vLLM**: Uses `extra_body.structured_outputs` → Returns JSON in `message.content`
73
- - **PydanticAI**: Uses `tools` + `tool_choice="required"` → Expects JSON in `tool_calls[].function.arguments`
74
-
75
- ### Current Status
76
- - ✅ **HF Space**: Works because it implements tool calling support
77
- - ❌ **vLLM**: Fails because vLLM's structured outputs return JSON in `content`, not `tool_calls`
78
-
79
- ## Solutions
80
-
81
- ### Option 1: Use vLLM's `extra_body` (Recommended)
82
- Modify PydanticAI's OpenAI provider to detect vLLM and use `extra_body` instead of tools:
83
-
84
- ```python
85
- # In PydanticAI OpenAI provider
86
- if output_type:
87
- json_schema = output_type.model_json_schema()
88
- # Use vLLM structured_outputs instead of tools
89
- extra_body = {
90
- "structured_outputs": {"json": json_schema}
91
- }
92
- ```
93
-
94
- ### Option 2: Add Tool Call Support to vLLM Response
95
- When vLLM receives `tools` + `tool_choice="required"`, wrap the structured output in a tool call format.
96
-
97
- ### Option 3: Use `response_format` (Limited)
98
- Standard OpenAI `response_format={"type": "json_object"}` works but:
99
- - Only enforces JSON, not schema validation
100
- - PydanticAI would need to parse and validate manually
101
- - Less reliable than schema-based approaches
102
-
103
- ## Current Implementation Status
104
-
105
- ### HF Space (Transformers)
106
- - ✅ Supports tool calling (text-based parsing)
107
- - ✅ Supports `response_format`
108
- - ✅ Works with PydanticAI's tool-based approach
109
-
110
- ### vLLM
111
- - ✅ Supports `extra_body.structured_outputs` (JSON schema)
112
- - ❌ Does NOT support tool calling for structured outputs
113
- - ✅ Supports `response_format` (basic JSON mode only)
114
-
115
- ## Recommendation
116
-
117
- For full compatibility with PydanticAI, we need to:
118
-
119
- 1. **Detect vLLM endpoint** in PydanticAI provider
120
- 2. **Use `extra_body.structured_outputs`** instead of tools when using vLLM
121
- 3. **Parse `message.content`** instead of `tool_calls` for vLLM responses
122
-
123
- Alternatively, implement a middleware in the HF Space API that:
124
- - Detects `tools` + `tool_choice="required"` requests
125
- - Converts to `extra_body.structured_outputs` for vLLM
126
- - Wraps response in tool call format for PydanticAI compatibility
127
-
128
- ## References
129
-
130
- - [vLLM Structured Outputs Docs](https://docs.vllm.ai/en/stable/features/structured_outputs/)
131
- - [PydanticAI Documentation](https://ai.pydantic.dev/)
132
-
 
start-vllm.sh CHANGED
@@ -1,6 +1,7 @@
1
  #!/bin/bash
2
  # vLLM OpenAI-compatible API server startup script
3
- # This script ensures args are always passed, even if Koyeb clears CMD
 
4
 
5
  set -e
6
 
@@ -10,6 +11,7 @@ PORT="${PORT:-8000}"
10
  MAX_MODEL_LEN="${MAX_MODEL_LEN:-8192}"
11
  GPU_MEMORY_UTILIZATION="${GPU_MEMORY_UTILIZATION:-0.90}"
12
  DTYPE="${DTYPE:-bfloat16}"
 
13
 
14
  # HF Token - HF_TOKEN_LC2 is the model access token (priority)
15
  export HF_TOKEN="${HF_TOKEN_LC2:-${HF_TOKEN:-${HUGGING_FACE_HUB_TOKEN:-}}}"
@@ -22,31 +24,37 @@ echo "Model: $MODEL"
22
  echo "Port: $PORT"
23
  echo "Max Model Len: $MAX_MODEL_LEN"
24
  echo "GPU Memory Utilization: $GPU_MEMORY_UTILIZATION"
 
25
  echo "HF Token: ${HF_TOKEN:+set (${#HF_TOKEN} chars)}"
26
  echo "=========================================="
27
 
28
- # Execute vLLM server (use python3, not python)
29
- # Enable tool calling support for OpenAI-compatible API
30
- # For Qwen3 models, valid parsers are: qwen3_coder, qwen3_xml
31
- # If TOOL_CALL_PARSER is not set, use --enable-auto-tool-choice only
32
  VLLM_ARGS=(
33
  --model "$MODEL"
34
  --trust-remote-code
35
  --dtype "$DTYPE"
36
  --max-model-len "$MAX_MODEL_LEN"
37
  --gpu-memory-utilization "$GPU_MEMORY_UTILIZATION"
 
38
  --port "$PORT"
39
  --host 0.0.0.0
40
- --enable-auto-tool-choice
41
  )
42
 
43
- # Add tool-call-parser only if explicitly specified
44
- # For Qwen3 models, use: qwen3_xml or qwen3_coder
45
- if [ -n "${TOOL_CALL_PARSER:-}" ]; then
46
- VLLM_ARGS+=(--tool-call-parser "$TOOL_CALL_PARSER")
47
- echo "Tool Calling: ENABLED (auto-tool-choice, parser: $TOOL_CALL_PARSER)"
 
 
 
 
 
48
  else
49
- echo "Tool Calling: ENABLED (auto-tool-choice only, no parser)"
50
  fi
51
 
 
 
 
52
  exec python3 -m vllm.entrypoints.openai.api_server "${VLLM_ARGS[@]}"
 
1
  #!/bin/bash
2
  # vLLM OpenAI-compatible API server startup script
3
+ # Compatible with Koyeb GPU deployment patterns
4
+ # Based on Koyeb's one-click vLLM + Qwen deployment templates
5
 
6
  set -e
7
 
 
11
  MAX_MODEL_LEN="${MAX_MODEL_LEN:-8192}"
12
  GPU_MEMORY_UTILIZATION="${GPU_MEMORY_UTILIZATION:-0.90}"
13
  DTYPE="${DTYPE:-bfloat16}"
14
+ TENSOR_PARALLEL_SIZE="${TENSOR_PARALLEL_SIZE:-${KOYEB_GPU_COUNT:-1}}"
15
 
16
  # HF Token - HF_TOKEN_LC2 is the model access token (priority)
17
  export HF_TOKEN="${HF_TOKEN_LC2:-${HF_TOKEN:-${HUGGING_FACE_HUB_TOKEN:-}}}"
 
24
  echo "Port: $PORT"
25
  echo "Max Model Len: $MAX_MODEL_LEN"
26
  echo "GPU Memory Utilization: $GPU_MEMORY_UTILIZATION"
27
+ echo "Tensor Parallel Size: $TENSOR_PARALLEL_SIZE"
28
  echo "HF Token: ${HF_TOKEN:+set (${#HF_TOKEN} chars)}"
29
  echo "=========================================="
30
 
31
+ # Build vLLM arguments
 
 
 
32
  VLLM_ARGS=(
33
  --model "$MODEL"
34
  --trust-remote-code
35
  --dtype "$DTYPE"
36
  --max-model-len "$MAX_MODEL_LEN"
37
  --gpu-memory-utilization "$GPU_MEMORY_UTILIZATION"
38
+ --tensor-parallel-size "$TENSOR_PARALLEL_SIZE"
39
  --port "$PORT"
40
  --host 0.0.0.0
 
41
  )
42
 
43
+ # Tool Calling Support
44
+ # ENABLED BY DEFAULT for Qwen models (using hermes parser)
45
+ # Set ENABLE_AUTO_TOOL_CHOICE=false to disable
46
+ # For Qwen models, the default parser is 'hermes'
47
+ ENABLE_AUTO_TOOL_CHOICE="${ENABLE_AUTO_TOOL_CHOICE:-true}"
48
+ TOOL_CALL_PARSER="${TOOL_CALL_PARSER:-hermes}"
49
+
50
+ if [ "${ENABLE_AUTO_TOOL_CHOICE}" = "true" ]; then
51
+ VLLM_ARGS+=(--enable-auto-tool-choice --tool-call-parser "$TOOL_CALL_PARSER")
52
+ echo "Tool Calling: ENABLED (parser: $TOOL_CALL_PARSER)"
53
  else
54
+ echo "Tool Calling: DISABLED"
55
  fi
56
 
57
+ echo "=========================================="
58
+
59
+ # Execute vLLM server
60
  exec python3 -m vllm.entrypoints.openai.api_server "${VLLM_ARGS[@]}"
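With the defaults above, the exec line resolves to roughly the following command (values shown are the documented defaults; each maps to an env var in the script):

```bash
python3 -m vllm.entrypoints.openai.api_server \
  --model DragonLLM/Qwen-Open-Finance-R-8B \
  --trust-remote-code \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --tensor-parallel-size 1 \
  --port 8000 \
  --host 0.0.0.0 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
```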
start.sh DELETED
@@ -1,10 +0,0 @@
1
- #!/bin/bash
2
- # Get port from environment variable, default to 7860
3
- PORT=${PORT:-7860}
4
-
5
- # Redirect all output to stderr so it shows in logs
6
- exec >&2
7
-
8
- # Start uvicorn with the specified port
9
- exec python -m uvicorn app.main:app --host 0.0.0.0 --port "$PORT"
10
-