SkeptiSTEM-4B-v2 Stage R3 (GRPO LoRA)
This is the Stage R3 GRPO LoRA trained with the DOUBT (Data with Obfuscated Untruths for Better Thinking) framework.
DOUBT Framework
DOUBT teaches models to verify information rather than blindly accept it by mixing three kinds of training examples:
- 60% neutral examples (no suggestion)
- 20% helpful hints (correct answers)
- 20% poison hints (deliberately wrong answers)
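A minimal sketch of how the 60/20/20 mix above could be sampled when building prompts. The hint wording, the `build_prompt` helper, and the way a wrong answer is produced are assumptions made for illustration; this card does not specify the actual template.

```python
import random

# Mix taken from the list above; the label names are hypothetical.
HINT_MIX = {"neutral": 0.60, "helpful": 0.20, "poison": 0.20}


def build_prompt(question, correct_answer, rng):
    """Return (prompt, hint_type) for one training example."""
    hint_type = rng.choices(list(HINT_MIX), weights=list(HINT_MIX.values()), k=1)[0]
    if hint_type == "neutral":
        # Neutral examples carry no suggestion at all.
        return question, hint_type
    if hint_type == "helpful":
        hinted = correct_answer
    else:
        # Poison: shift the true answer by a non-zero offset so the hint is always wrong.
        hinted = correct_answer + rng.choice([-3, -2, -1, 1, 2, 3])
    # Hypothetical hint wording, not the template used in training.
    return f"{question}\nHint: I think the answer is {hinted}.", hint_type


rng = random.Random(0)
counts = {name: 0 for name in HINT_MIX}
for _ in range(10_000):
    _, kind = build_prompt("What is 6 * 7?", 42, rng)
    counts[kind] += 1
print(counts)  # proportions should land close to 60/20/20
```

Sampling the hint type per example, rather than fixing it per question, means the same question can appear with and without a misleading hint across epochs.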
Models receive rewards for:
- Acknowledging but rejecting false information
- Accepting helpful hints while verifying
- Producing correct final answers
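The three reward criteria above could be combined roughly as follows. This is a toy sketch, not the reward code used for this adapter: the function names, the keyword checks, and every weight are hypothetical, and the scale does not correspond to the reward statistics reported in this card.

```python
import re


def extract_final_answer(completion):
    """Return the last number in the completion, or None if there is none."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return float(numbers[-1]) if numbers else None


def doubt_reward(completion, correct_answer, hint_type, hinted_answer=None):
    """Toy reward; all weights are made-up illustration values."""
    reward = 0.0
    final = extract_final_answer(completion)
    if final is not None and abs(final - correct_answer) < 1e-6:
        reward += 2.0  # correct final answer
    text = completion.lower()
    mentions_hint = hinted_answer is not None and str(hinted_answer) in completion
    # Keyword matching is a crude stand-in for detecting rejection or verification.
    if hint_type == "poison" and mentions_hint and ("wrong" in text or "incorrect" in text):
        reward += 1.0  # acknowledged the false hint and rejected it
    if hint_type == "helpful" and mentions_hint and ("check" in text or "verify" in text):
        reward += 1.0  # accepted the hint while verifying it
    return reward


sample = "The hint says 40, but that is wrong: 6 * 7 = 42. Final answer: 42"
print(doubt_reward(sample, 42, "poison", hinted_answer=40))
```

A completion that simply repeats a poisoned hint earns nothing under this scheme, while one that names the hint, rejects it, and still reaches the right answer earns both bonuses.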
Training Details
- Base model: HallD/SkeptiSTEM-4B-v2-stageR1-merged-16bit (with R2 format adapter merged)
- Dataset: GSM8K (~7,473 examples)
- Algorithm: GRPO (Group Relative Policy Optimization)
- Max steps: 500
- LoRA rank: 64
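For readers unfamiliar with GRPO, its core idea is to score each sampled completion relative to the other completions drawn for the same prompt. The sketch below shows that normalization in plain Python; the function name, the `eps` value, and the use of the population standard deviation are assumptions for illustration, not settings taken from this training run.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against the group sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards for four completions of one prompt (made-up numbers).
advantages = group_relative_advantages([3.0, 2.0, 0.0, 2.0])
print([round(a, 2) for a in advantages])
```

Completions that beat their group's average get a positive advantage and are reinforced; those below it are pushed down, with no separate value model required.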
Expected Load Order
- Base: HallD/SkeptiSTEM-4B-v2-stageR1-merged-16bit
- Merge/apply Stage R2: HallD/SkeptiSTEM-4B-v2-stageR2-format-lora
- Apply this Stage R3 adapter
Usage
from unsloth import FastLanguageModel
from peft import PeftModel
# Load base
base, tok = FastLanguageModel.from_pretrained(
"HallD/SkeptiSTEM-4B-v2-stageR1-merged-16bit",
max_seq_length=4096,
load_in_4bit=True,
)
# Merge R2 format
base = PeftModel.from_pretrained(base, "HallD/SkeptiSTEM-4B-v2-stageR2-format-lora")
base = base.merge_and_unload()
# Apply R3 GRPO
model = PeftModel.from_pretrained(base, "HallD/SkeptiSTEM-4B-v2-stageR3-grpo-lora")
FastLanguageModel.for_inference(model)
Reward Statistics
Final reward averages (last 50 steps):
- Poison: 9.99
- Helpful: 11.18
- Neutral: 10.79
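Averages over the last 50 steps, split by hint type, can be computed from a training log along these lines. The log layout of `(step, hint_type, reward)` tuples and the helper name are hypothetical; the actual logging format is not described in this card.

```python
from collections import defaultdict


def trailing_mean_by_type(log, last_n=50):
    """Average reward per hint type over the final last_n steps of the log."""
    if not log:
        return {}
    last_step = max(step for step, _, _ in log)
    buckets = defaultdict(list)
    for step, hint_type, reward in log:
        # Keep only entries from the trailing window of steps.
        if step > last_step - last_n:
            buckets[hint_type].append(reward)
    return {kind: sum(vals) / len(vals) for kind, vals in buckets.items()}


# Synthetic log: one neutral entry per step, reward equal to the step number.
toy_log = [(step, "neutral", float(step)) for step in range(1, 101)]
print(trailing_mean_by_type(toy_log))
```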
Trained with Unsloth.
Model tree for HallD/SkeptiSTEM-4B-v2-stageR3-grpo-lora
- Base model: Qwen/Qwen3-4B-Base
- Finetuned: unsloth/Qwen3-4B-Base