Qwen2.5-Coder-0.5B-Instruct-Gensyn-Swarm Agent-ID (tall_tame_panther)

Gensyn RL-Swarm: Training & GGUF Quantized LLMs for Inference

Model Overview

This model is an experimental, continuously trained variant of Qwen2.5-Coder-0.5B-Instruct, fine-tuned with the Gensyn RL-Swarm framework using GRPO (Group Relative Policy Optimization) and published in GGUF (llama.cpp) format for enhanced code generation. Note: current training focuses on programming challenges with adaptive weighted sampling.

  • Agent ID: tall_tame_panther
  • Training Status: 🟢 LIVE - Model updates automatically every 5-10 minutes
  • Auto-Sync GGUF Pipeline Status: 🟢 LIVE - Commits update automatically every hour
  • Current Progress: Round 13,533+ / 100,000 (13.53%)
  • Framework Version: Gensyn RL-Swarm v0.7.0
  • Contract: SwarmCoordinator v0.4.2

Key Features

  • Real-time Training: Continuous learning with distributed RL across Gensyn swarm network
  • Adaptive System: Dynamic quality enhancement and dataset weighting for optimal learning
  • Multi-domain Coding: Trained on MBPP and CodeContests datasets with adaptive sampling
  • GGUF Support: Multiple quantized formats available (F16, Q3_K_M, Q4_K_M, Q5_K_M, Q6_K)
  • llama.cpp Compatible: Ready for edge deployment and local inference
  • BF16 Precision: Trained with bfloat16 for optimal performance
  • TGI Compatible: Supports Text Generation Inference for production deployment
  • Chat Format Support: Inherits Qwen2.5 chat template for conversational use

Training Data

The model is trained on a composite dataset with adaptive weighted sampling strategy:

| Dataset | Initial Weight | Adaptive Range | Focus Area |
|---|---|---|---|
| MBPP | 5 | 4-6 | Basic Python programming problems with test cases |
| CodeContests | 5 | 4-6 | Competitive programming challenges |

Total Dataset Size: Streaming datasets with infinite iteration
Training Samples per Round: 2
Evaluation: Real-time via swarm coordination with an Ollama-based evaluator, falling back to the Judge API

Adaptive Sampling Strategy

"When the solvers perform well, the proposer automatically increases the difficulty to keep challenging solvers to get better over time." - CodeZero-blog

The implementation features an adaptive sampling system that adjusts dataset weights based on performance. Every 5 rounds the system monitors performance metrics and rebalances the dataset weights to maintain an optimal learning balance:
- Calculate the recent average performance for each dataset
- Adjust the weighted sampling based on the performance difference between datasets
- If recent performance is better on MBPP (Mostly Basic Python Problems), shift weight toward CodeContests, and vice versa, to keep the model challenged
- Update the weights periodically and keep them within the adaptive range so sampling stays balanced

A minimal sketch of this weight-update logic is shown below.
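The following is an illustrative sketch of the idea only; the class name AdaptiveSampler, the 5-round window, and the 0.5 weight step are assumptions drawn from the description above, not the actual RL-Swarm implementation.

from collections import deque, defaultdict

class AdaptiveSampler:
    """Illustrative sketch: rebalance dataset weights from recent rewards."""

    def __init__(self, datasets=("mbpp", "code_contests"), window=5,
                 min_w=4.0, max_w=6.0, init_w=5.0):
        self.weights = {d: init_w for d in datasets}
        self.history = defaultdict(lambda: deque(maxlen=window))
        self.min_w, self.max_w = min_w, max_w

    def record(self, dataset, reward):
        self.history[dataset].append(reward)

    def update_weights(self):
        # Recent average reward per dataset
        avgs = {d: (sum(h) / len(h) if h else 0.0) for d, h in self.history.items()}
        if len(avgs) < 2:
            return self.weights
        ranked = sorted(avgs, key=avgs.get, reverse=True)
        best, worst = ranked[0], ranked[-1]
        # If the model does better on one dataset, shift weight toward the other
        # (keep challenging the solver), clamped to the adaptive range 4-6.
        delta = 0.5
        self.weights[best] = max(self.min_w, self.weights[best] - delta)
        self.weights[worst] = min(self.max_w, self.weights[worst] + delta)
        return self.weights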

Adaptive Reward System

Quality Enhancement Implementation

"Rewards are derived from multiple lightweight checks, ranging from code validity and formatting to alignment with the problem statement, combined into a single interpretable score." - CodeZero-blog

The reward system includes a quality-enhancement mechanism that evaluates code structure and documentation:
- Compute a quality bonus for well-structured code
- Documentation bonus (docstrings and comments)
- Structure bonus (clear function decomposition)
- Algorithmic-efficiency bonus (simple heuristic)
- Scale the bonus with the base reward to avoid reward inflation

A minimal sketch is shown below.
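An illustrative sketch of such a quality bonus; the specific checks, bonus values, and scaling here are assumptions based on the description above, not the exact RL-Swarm reward code.

import ast

def quality_bonus(code: str, base_reward: float) -> float:
    """Illustrative sketch: small bonus for well-structured, documented code."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return 0.0  # invalid code earns no bonus

    bonus = 0.0
    funcs = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    # Documentation: docstrings present on every defined function
    if funcs and all(ast.get_docstring(f) for f in funcs):
        bonus += 0.1
    # Structure: logic wrapped in functions rather than top-level statements
    if funcs:
        bonus += 0.1
    # Simple efficiency heuristic: avoid deeply nested loops
    max_depth = 0
    def walk(node, depth=0):
        nonlocal max_depth
        if isinstance(node, (ast.For, ast.While)):
            depth += 1
            max_depth = max(max_depth, depth)
        for child in ast.iter_child_nodes(node):
            walk(child, depth)
    walk(tree)
    if max_depth <= 2:
        bonus += 0.05
    # Scale with the base reward so the bonus cannot inflate scores
    return bonus * max(base_reward, 0.0)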

Adaptive Threshold System

The system also includes an adaptive threshold mechanism that adjusts based on recent performance:
- The threshold is a function of recent performance
- When quality scores are consistently high, the threshold is raised accordingly

A minimal sketch is shown below.
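A minimal sketch of one way to implement such a threshold; the window size, bounds, and blending factor are assumptions, not the actual implementation.

from collections import deque

class AdaptiveThreshold:
    """Illustrative sketch: raise the bar when recent quality is consistently high."""

    def __init__(self, base=0.5, window=20, lo=0.3, hi=0.8):
        self.base, self.lo, self.hi = base, lo, hi
        self.recent = deque(maxlen=window)

    def update(self, score: float) -> float:
        self.recent.append(score)
        avg = sum(self.recent) / len(self.recent)
        # Move the threshold toward the recent average quality, clamped to [lo, hi]
        return min(self.hi, max(self.lo, 0.5 * self.base + 0.5 * avg))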

Quick Performance Simulation

Reward Comparison

Based on our simulation with 1,000 samples, the adaptive reward system shows a significant improvement:

| System | MBPP Avg Reward | CodeContests Avg Reward | Overall Avg Reward | Improvement |
|---|---|---|---|---|
| Original | 0.234 | -0.156 | 0.039 | - |
| Adaptive | 0.312 | -0.098 | 0.107 | ~174% |

Training Progress

Based on the logs provided, the model shows consistent progress:

Train/loss metrics visualized with Weights & Biases (WandB):

  • Dashboard: coming soon!
[2025-11-14 04:22:50,632][genrl.logging_utils.global_defs][INFO] - __ Joining round: 13053
[2025-11-14 04:23:50,633][genrl.logging_utils.global_defs][INFO] - Starting round: 13053/100000.
Map: 100%|______________________________________| 1/1 [00:00<00:00, 158.65 examples/s]
Map: 100%|______________________________________| 1/1 [00:00<00:00, 191.92 examples/s]
[2025-11-14 04:25:12,646][genrl.logging_utils.global_defs][INFO] - pushing model to huggingface
Processing Files (1 / 1)      : 100%|___|  988MB /  988MB, 94.3MB/s
New Data Upload               : 100%|___|  983MB /  983MB, 94.3MB/s  
.....kpb5lid/model.safetensors: 100%|___|  988MB /  988MB, 94.3MB/s
[2025-11-14 04:27:01,877][genrl.logging_utils.global_defs][INFO] - Already finished round: 13053. Next check in 160.0s.

Quick Start: Inference

Standard Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
    "0xgr3y/Qwen2.5-Coder-0.5B-Instruct-Gensyn-Swarm-tall_tame_panther",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("0xgr3y/Qwen2.5-Coder-0.5B-Instruct-Gensyn-Swarm-tall_tame_panther")
prompt = "Write a function to calculate the factorial of a number."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=256, temperature=0.7, top_p=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Chat Format (Conversational)

from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("0xgr3y/Qwen2.5-Coder-0.5B-Instruct-Gensyn-Swarm-tall_tame_panther")
tokenizer = AutoTokenizer.from_pretrained("0xgr3y/Qwen2.5-Coder-0.5B-Instruct-Gensyn-Swarm-tall_tame_panther")
messages = [
    {"role": "system", "content": "You are an expert Python programmer."},
    {"role": "user", "content": "Write a function to check if a string is a palindrome."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=512)
print(tokenizer.decode(outputs[0]))

Text Generation Inference (TGI)

docker run -d --gpus all \
  -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id 0xgr3y/Qwen2.5-Coder-0.5B-Instruct-Gensyn-Swarm-tall_tame_panther \
  --max-input-length 4096 \
  --max-total-tokens 8192

GGUF with LLAMA.CPP

# Download quantized model (recommended: Q4_K_M)
wget https://huggingface.co/0xgr3y/Qwen2.5-Coder-0.5B-Instruct-Gensyn-Swarm-tall_tame_panther/resolve/main/Qwen2.5-Coder-0.5B-Q4_K_M.gguf
# Run inference
./llama-cli -m Qwen2.5-Coder-0.5B-Q4_K_M.gguf \
  -p "Write a function to implement binary search in Python." \
  --temp 0.7 --top-p 0.8

Ollama

# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./0xgr3y/Qwen2.5-Coder-0.5B-Instruct-Gensyn-Swarm-tall_tame_panther/Qwen2.5-Coder-0.5B-Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.8
PARAMETER top_k 20
SYSTEM "You are an expert Python programmer who writes clean, documented code."
EOF
# Create and run
ollama create qwen2.5-coder-swarm -f Modelfile
ollama run qwen2.5-coder-swarm "Write a function to calculate the factorial of a number."

Available GGUF Quantization

| Format | Size | Precision | Use Case | Download |
|---|---|---|---|---|
| Safetensors (BF16) | 988 MB | BF16 | Full-precision training/fine-tuning | model.safetensors |
| GGUF F16 | 994 MB | FP16 | High-quality inference | Qwen2.5-Coder-0.5B-F16.gguf |
| GGUF Q6_K | 506 MB | 6-bit | High-quality compression | Qwen2.5-Coder-0.5B-Q6_K.gguf |
| GGUF Q5_K_M | 420 MB | 5-bit | Balanced quality/size | Qwen2.5-Coder-0.5B-Q5_K_M.gguf |
| GGUF Q4_K_M | 398 MB | 4-bit | Recommended for production | Qwen2.5-Coder-0.5B-Q4_K_M.gguf |
| GGUF Q3_K_M | 355 MB | 3-bit | Smallest, fastest | Qwen2.5-Coder-0.5B-Q3_K_M.gguf |

All GGUF formats are llama.cpp-compatible, ready for inference and chat, and are auto-updated hourly.

Chat Format & Conversational

This model inherits Qwen2.5's chat template for structured conversations.

Format Structure

<|im_start|>system
{system_message}
<|im_end|>
<|im_start|>user
{user_message}
<|im_end|>
<|im_start|>assistant
{assistant_response}
<|im_end|>
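The inherited Qwen2.5 chat template should render messages into this structure directly; a small check (using this repo's tokenizer as in the examples above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "0xgr3y/Qwen2.5-Coder-0.5B-Instruct-Gensyn-Swarm-tall_tame_panther"
)
messages = [
    {"role": "system", "content": "You are an expert Python programmer."},
    {"role": "user", "content": "Reverse a linked list."},
]
# Prints the prompt with the <|im_start|>/<|im_end|> markers shown above,
# ending with an opened assistant turn for generation.
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))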

Chat Template Features

  • System Instructions: Guide model behavior with system messages
  • Multi-turn Dialogue: Maintains conversation context
  • Tool Calling: Supports function calling (if enabled in training)
  • Code Generation: Optimized for generating Python code

Note: While the model supports the chat format structurally, optimal conversational performance depends on whether the training data included formatted dialogues. Current training focuses on programming challenges.

Gensyn RL-Swarm Architecture at a Glance

Training Framework:
- Method: GRPO (Group Relative Policy Optimization)
- Base Model: Qwen/Qwen2.5-Coder-0.5B-Instruct
- Training Regime: bfloat16 mixed precision
- Max Rounds: 100000
- Update Frequency: Every 5-10 minutes
- Generations per Round: 2
- Batch size: Combine
- Tree-based Model: 2 trees
- Seed: 42
Blockchain Integration:
- Network: Gensyn Testnet
- Chain ID: 685685
- Contract: SwarmCoordinator v0.4.2
Swarm Communication:
- Framework: Hivemind P2P Backend
- Initial Peers: 3 bootnodes
- Beam Size: 10
Reward System:
- Manager: RewardManager (SwarmGameManager/CodeGenerationRewards)
- Reward Function: Adaptive with quality enhancement
- Evaluator: Ollama (qwen2.5-coder:1.5b-instruct)
- Judge API: https://codezero-judge.gensyn.ai
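Since the stack uses Hydra + OmegaConf for configuration (see Software Stack below), the settings above can be represented roughly as follows; the field names and nesting here are illustrative, not the actual RL-Swarm config schema.

from omegaconf import OmegaConf

# Illustrative config mirroring the values listed above; the real RL-Swarm
# Hydra config uses its own keys and structure.
cfg = OmegaConf.create({
    "training": {
        "method": "GRPO",
        "base_model": "Qwen/Qwen2.5-Coder-0.5B-Instruct",
        "dtype": "bfloat16",
        "max_rounds": 100_000,
        "generations_per_round": 2,
        "seed": 42,
    },
    "blockchain": {"network": "Gensyn Testnet", "chain_id": 685685,
                   "contract": "SwarmCoordinator v0.4.2"},
    "swarm": {"backend": "hivemind", "initial_peers": 3, "beam_size": 10},
    "rewards": {"evaluator": "ollama:qwen2.5-coder:1.5b-instruct",
                "judge_api": "https://codezero-judge.gensyn.ai"},
})
print(OmegaConf.to_yaml(cfg))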

Model Capabilities

This model excels at:

  1. Basic Python Programming: Functions, loops, conditionals, data structures
  2. Algorithm Implementation: Sorting, searching, graph algorithms
  3. String Manipulation: Pattern matching, parsing, formatting
  4. Mathematical Functions: Calculations, conversions, formulas
  5. Code Documentation: Writing clear, commented functions
  6. Problem Solving: Breaking down complex problems into manageable steps

Limitations

  • Specialized Domain: Optimized for programming challenges; may underperform on creative writing
  • Training in Progress: Weights update every 5-10 minutes; performance varies
  • Scale: 0.5B parameters - suitable for edge but not SOTA for complex programming
  • Experimental: Decentralized RL training; behavior less predictable than supervised models
  • Context: Best performance within 4K tokens (full 32K supported)

Update Schedule

| Format | Frequency | Trigger |
|---|---|---|
| Safetensors (BF16) | Every 5-10 min | Automatic via RL-Swarm |
| GGUF (all formats) | Every 3 hours | Auto-conversion pipeline |

Auto-Conversion Pipeline:

  1. Monitors repo for new training commits
  2. Downloads latest model.safetensors
  3. Converts to F16 GGUF base
  4. Quantizes to Q3_K_M, Q4_K_M, Q5_K_M, Q6_K
  5. Uploads the standard formats back to this repository

Check commit history for exact timestamps.
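A rough sketch of such a pipeline follows; the llama.cpp script and binary paths, file patterns, and repo handling are assumptions about one way to do it, not the actual conversion code.

import subprocess
from huggingface_hub import snapshot_download, upload_file

REPO = "0xgr3y/Qwen2.5-Coder-0.5B-Instruct-Gensyn-Swarm-tall_tame_panther"
QUANTS = ["Q3_K_M", "Q4_K_M", "Q5_K_M", "Q6_K"]

# Steps 1-2: grab the latest safetensors checkpoint and tokenizer files
local_dir = snapshot_download(REPO, allow_patterns=["*.safetensors", "*.json", "*.txt"])

# Step 3: convert to an F16 GGUF base with llama.cpp's converter (path assumed)
subprocess.run(["python", "llama.cpp/convert_hf_to_gguf.py", local_dir,
                "--outfile", "Qwen2.5-Coder-0.5B-F16.gguf", "--outtype", "f16"], check=True)

# Step 4: quantize to the published formats
for q in QUANTS:
    out = f"Qwen2.5-Coder-0.5B-{q}.gguf"
    subprocess.run(["llama.cpp/build/bin/llama-quantize",
                    "Qwen2.5-Coder-0.5B-F16.gguf", out, q], check=True)
    # Step 5: push each quantized file back to the repo
    upload_file(path_or_fileobj=out, path_in_repo=out, repo_id=REPO)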

Architecture Components

  1. Game Manager: Orchestrates training rounds and swarm coordination
  2. Trainer: GRPO implementation for policy optimization
  3. Data Manager: Dataset loading with adaptive weighted sampling
  4. Reward Manager: Computes rewards via Ollama evaluator with quality enhanced
  5. Coordinator: Blockchain integration for swarm state
  6. P2P Backend: Hivemind DHT for model sharing

Training Process

1. Agent joins swarm via P2P network
2. Coordinator assigns round via smart contract
3. Agent samples data from adaptive weighted datasets
4. Model generates 2 responses
5. Ollama evaluator assesses and assigns rewards with quality enhanced
6. GRPO updates policy based on rewards
7. Updated model shared via DHT
8. Best checkpoint saved to HuggingFace
9. Repeat
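To make the GRPO step (6) concrete: GRPO scores each generation relative to the other generations for the same prompt. A minimal sketch of the group-relative advantage (not the full RL-Swarm trainer):

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO core idea: normalize each generation's reward against its own group.

    rewards: shape (num_prompts, group_size), e.g. the 2 generations per round
             configured above.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    # Generations better than their group's average get a positive advantage,
    # which then reweights the policy-gradient update.
    return (rewards - mean) / (std + eps)

# Example: two prompts, two generations each
adv = group_relative_advantages(torch.tensor([[0.8, 0.2], [0.1, 0.5]]))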

Decentralization Benefits

  • Fault Tolerance: Multiple agents; no single point of failure
  • Diverse Exploration: Different agents explore different strategies
  • Collective Intelligence: Agents learn from each other
  • Transparent: All rounds verified on-chain

Software Stack

  • Framework: Gensyn RL-Swarm v0.7.0
  • Library: transformers v4.57.1
  • P2P: hivemind
  • Blockchain: Gensyn testnet
  • Config: Hydra + OmegaConf
  • Logging: WandB integration

Hardware Requirements

Training (GPU):

  • GPU: NVIDIA 4090 24GB+ (BF16 training)
  • RAM: 16GB+
  • Cores: 10+
  • Storage: 50GB SSD
  • Network: High bandwidth for P2P

Training (CPU-optimized):

  • CPU: Intel or AMD
  • Cores: 10+
  • RAM: 16GB+
  • Storage: 50GB SSD
  • Network: High bandwidth for P2P

Inference:

  • Safetensors: 8GB VRAM (GPU) / 16GB RAM (CPU)
  • GGUF Q4_K_M: 2GB VRAM (GPU) / 4GB RAM (CPU)
  • GGUF Q3_K_M: 3GB RAM (CPU-only)

Training Progress Metrics

| Metric | Value | Target |
|---|---|---|
| Completed Rounds | 13,533+ | 100,000 |
| Training Progress | 13.53% | 100% |
| Update Frequency | 5-10 min | Continuous |

Note: average@k measures the average performance across k attempts (consistency); pass@k measures the probability of at least one correct solution in k attempts (capability). The current metrics track training rounds completed in the decentralized swarm; a standard pass@k estimator is sketched below for reference.
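For reference, the widely used unbiased pass@k estimator and a simple average@k can be computed like this; the example numbers are illustrative, since the card reports round counts rather than pass@k results.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k sampled
    completions is correct, given n total samples with c correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def average_at_k(scores: list[float], k: int) -> float:
    """average@k as described above: mean score over the first k attempts."""
    return sum(scores[:k]) / k

print(pass_at_k(n=10, c=3, k=2))  # ~0.53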

Adaptive Reward Performance

Our adaptive reward system has shown approximately a 174% improvement in reward scores compared to the baseline system:

Original:
  Overall Avg Reward: 0.039
  MBPP Avg Reward: 0.234
  CodeContests Avg Reward: -0.156
Adaptive:
  Overall Avg Reward: 0.107
  MBPP Avg Reward: 0.312
  CodeContests Avg Reward: -0.098
Improvement: 0.068 (~174% increase)

Citation

@misc{qwen2.5-coder-gensyn-swarm-2025,
  author = {0xgrey},
  title = {Qwen2.5-Coder-0.5B-Instruct-Gensyn-Swarm: Continuous RL Training on Distributed Swarm with Adaptive Rewards},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/0xgr3y/Qwen2.5-Coder-0.5B-Instruct-Gensyn-Swarm-tall_tame_panther}},
  note = {Agent ID: tall\_tame\_panther}
}
@misc{gensyn-rl-swarm-2025,
  title = {Gensyn RL-Swarm: Decentralized Reinforcement Learning Framework},
  author = {Gensyn AI},
  year = {2025},
  url = {https://gensyn.ai}
}
@misc{codezero-2025,
  title = {CodeZero: A Collaborative Coding Environment for Distributed RL},
  author = {Gensyn AI},
  year = {2025},
  url = {https://docs.gensyn.ai/testnet/rl-swarm/how-it-works/codezero}
}

Contact

  • Developer: 0xgrey
  • Agent ID: tall_tame_panther
  • Community: Gensyn Discord

โš ๏ธ Important: This is a continuously trained model. For reproducibility, specify commit hash:

git clone https://huggingface.co/0xgr3y/Qwen2.5-Coder-0.5B-Instruct-Gensyn-Swarm-tall_tame_panther
cd Qwen2.5-Coder-0.5B-Instruct-Gensyn-Swarm-tall_tame_panther
git checkout <commit-hash>

Trained with 🩷 using Gensyn RL-Swarm
