---
license: cc-by-nc-4.0
library_name: transformers
tags:
- reinforcement-learning
- llm-routing
- cost-optimization
- tool-calling
- multi-model
pipeline_tag: text-generation
---

# xRouter: Training Cost-Aware LLMs Orchestration System via Reinforcement Learning

<div align="center">

[![arXiv](https://img.shields.io/badge/arXiv-2510.08439-b31b1b.svg)](https://arxiv.org/abs/2510.08439)
[![GitHub](https://img.shields.io/badge/GitHub-SalesforceAIResearch%2FxRouter-blue?logo=github)](https://github.com/SalesforceAIResearch/xRouter)
[![License](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)

</div>

Welcome to **xRouter**, Salesforce AI Research's intelligent LLM routing system trained with reinforcement learning to dynamically select optimal models from 20+ available LLMs while optimizing for both performance and cost.

Modern LLM deployments face a widening cost-performance spectrum: premium models deliver strong reasoning but are expensive, while lightweight models are economical yet brittle on complex tasks. **xRouter** learns end-to-end routing policies that balance quality and cost through explicit cost-aware reward shaping, eliminating the need for hand-engineered routing rules.

## ⭐ Highlights

- **Cost-Aware Optimization**: RL-trained policies minimize costs (up to 60% reduction) while maintaining quality
- **Adaptive Routing**: Selects models dynamically by query complexity, routing simple queries to budget models and complex ones to premium models
- **Tool-Calling Architecture**: Learns to effectively invoke 20+ models (GPT-5, o3/o4, DeepSeek R1, Qwen3, Kimi K2, etc.) and select best responses
- **Multi-Model Orchestration**: Coordinates responses from multiple LLMs for complex reasoning tasks
- **Learned Prompt Engineering**: Automatically generates optimized system prompts for target models

---

## πŸ“Š Model Details

- **Developed by**: Salesforce AI Research
- **Base Model**: Qwen/Qwen2.5-7B-Instruct
- **Model Type**: Instruction-tuned language model with tool-calling capabilities
- **Training Algorithm**: DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) with cost-aware reward shaping
- **Training Data**: Derived from [Reasoning360](https://github.com/LLM360/Reasoning360) - math, code, reasoning, and STEM tasks
- **License**: CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International)

## πŸ“ˆ Key Results

- **Substantial cost reductions** (up to 60%) at comparable task completion rates
- Evaluated on **17 diverse benchmarks** spanning math, coding, reasoning, and OOD tasks
- **Adaptive behavior**: Learns when to use premium vs. budget models without explicit rules
- **Multi-turn reasoning**: Effectively coordinates multiple model calls for complex tasks

For detailed results, see our [paper](https://arxiv.org/abs/2510.08439).

## πŸ› οΈ Usage

### Installation

```bash
# Clone the repository
git clone https://github.com/SalesforceAIResearch/xRouter.git
cd xRouter

# Set up environment
conda create -n xrouter python=3.12
conda activate xrouter

pip install uv
uv pip install torch==2.6.0
uv pip install flash-attn==2.7.3 --no-build-isolation
uv pip install -e ".[gpu,math,vllm,test]"  # quoted so zsh doesn't expand the brackets
pip install litellm rich python-dotenv
```

### Configure API Keys

```bash
export OPENAI_API_KEY="your_openai_key"
export TOGETHER_API_KEY="your_together_key"
export GEMINI_API_KEY="your_gemini_key"  # optional
```
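Before launching the services, it can help to fail fast on missing keys. The helper below is a small sketch, not part of the repository; it simply checks the environment variables exported above:

```python
import os

# GEMINI_API_KEY is optional, so it is not listed here.
REQUIRED = ["OPENAI_API_KEY", "TOGETHER_API_KEY"]

def missing_keys(required=REQUIRED) -> list[str]:
    """Return the names of required environment variables that are unset or empty."""
    return [k for k in required if not os.environ.get(k)]

for key in missing_keys():
    print(f"warning: {key} is not set")
```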

### πŸš€ Deployment

```bash
# Host the router model
cd evaluation
bash host_router.sh  # Serves on port 8000

# Launch the router API (in another terminal)
bash serve_router.sh  # Serves on port 8800
```

### πŸ’¬ Usage Example

```python
import openai

# Initialize client
client = openai.OpenAI(
    base_url="http://localhost:8800/v1",
    api_key="dummy"
)

# Send request
response = client.chat.completions.create(
    model="router-tool-rl",
    messages=[
        {"role": "user", "content": "Solve: If x^2 + 2x + 1 = 0, what are the values of x?"}
    ],
    max_tokens=1000
)

print(response.choices[0].message.content)

# Access routing metadata
metadata = response.router_metadata
print(f"Model used: {metadata['model_used']}")
print(f"Total cost: ${metadata['total_cost']:.6f}")
```

## πŸŽ“ Training Methodology

xRouter uses **DAPO** (Decoupled Clip and Dynamic Sampling Policy Optimization) with cost-aware reward shaping:

```
reward = quality - Ξ» Γ— normalized_cost
```

**Training Features**:
- Cost-aware rewards penalize expensive routing decisions
- Multi-turn credit assignment across conversation turns
- Tool augmentation with 20+ model tools + response selection
- Curriculum learning from simple to complex tasks
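As a toy illustration of the shaped reward above (not the training code; the quality scale, the `max_cost_usd` normalization bound, and the λ value are assumptions):

```python
def shaped_reward(quality: float, cost_usd: float,
                  max_cost_usd: float = 1.0, lam: float = 0.5) -> float:
    """Cost-aware reward: quality score minus a scaled, normalized cost.

    quality      -- task score in [0, 1] (e.g., exact match or pass rate)
    cost_usd     -- dollar cost of all model calls in the episode
    max_cost_usd -- normalization bound (assumed value)
    lam          -- cost penalty weight (the lambda in the formula above)
    """
    normalized_cost = min(cost_usd / max_cost_usd, 1.0)
    return quality - lam * normalized_cost

# A correct answer from a cheap model outscores the same answer from a pricey one:
cheap = shaped_reward(quality=1.0, cost_usd=0.02)   # 1.0 - 0.5 * 0.02 = 0.99
pricey = shaped_reward(quality=1.0, cost_usd=0.80)  # 1.0 - 0.5 * 0.80 = 0.60
```

Under this shaping, the policy is only rewarded for spending when the extra cost buys a quality gain larger than λ times the normalized cost increase.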

**Supported Model Tiers**:

| Tier | Models | Best For |
|------|--------|----------|
| **Premium** | GPT-5, GPT-4.1, o3, Qwen3-235B-Instruct, Kimi K2 | Mission-critical tasks |
| **Standard** | GPT-5-Mini, GPT-4.1-Mini, o4-Mini, GPT-OSS-120B | Balanced performance |
| **Budget** | GPT-5-Nano, GPT-4.1-Nano, GPT-4o-Mini, GPT-OSS-20B | High-volume tasks |
| **Specialized** | o3, DeepSeek-R1, Qwen3-235B-Thinking, Qwen3-Coder-480B | Domain-specific |
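For intuition only, a hand-written heuristic over these tiers might look like the sketch below. The learned router replaces any such static rule, and the model identifiers and thresholds here are illustrative assumptions:

```python
# Hypothetical static tier map; the RL-trained router learns this mapping instead.
TIERS = {
    "premium":  ["gpt-5", "gpt-4.1", "o3", "qwen3-235b-instruct", "kimi-k2"],
    "standard": ["gpt-5-mini", "gpt-4.1-mini", "o4-mini", "gpt-oss-120b"],
    "budget":   ["gpt-5-nano", "gpt-4.1-nano", "gpt-4o-mini", "gpt-oss-20b"],
}

def pick_tier(difficulty: float) -> str:
    """Map an estimated query difficulty in [0, 1] to a tier (toy heuristic)."""
    if difficulty < 0.3:
        return "budget"
    if difficulty < 0.7:
        return "standard"
    return "premium"

print(pick_tier(0.1))  # budget
```

The point of training end-to-end is precisely that such thresholds need not be chosen by hand: the policy learns when the quality gain from a premium model justifies its cost.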

## πŸ“š Citation

```bibtex
@article{qian2025xrouter,
  title={xRouter: Training Cost-Aware LLMs Orchestration System via Reinforcement Learning},
  author={Qian, Cheng and Liu, Zuxin and Kokane, Shirley and Prabhakar, Akshara and Qiu, Jielin and Chen, Haolin and Liu, Zhiwei and Ji, Heng and Yao, Weiran and Heinecke, Shelby and Savarese, Silvio and Xiong, Caiming and Wang, Huan},
  journal={arXiv preprint arXiv:2510.08439},
  year={2025}
}
```

## πŸ”— Resources

- πŸ“„ **Paper**: [arXiv:2510.08439](https://arxiv.org/abs/2510.08439)
- πŸ’» **Code Repository**: [github.com/SalesforceAIResearch/xRouter](https://github.com/SalesforceAIResearch/xRouter)
- πŸ€— **Model Hub**: [Salesforce/xRouter](https://huggingface.co/Salesforce/xRouter)

## πŸ™ Acknowledgements

This project builds upon exceptional work from the open-source community:
- **[Reasoning360](https://github.com/LLM360/Reasoning360)**: Foundational RL training framework
- **[VERL](https://github.com/volcengine/verl)**: RL infrastructure for distributed LLM training
- **[SGLang](https://github.com/sgl-project/sglang)**: High-performance LLM serving backend
- **[LiteLLM](https://github.com/BerriAI/litellm)**: Unified API interface for 20+ LLM providers

---

_🏒 Developed by Salesforce AI Research_