Arko007/Zenyx_114M-Tiny-Edu-Instruct πŸŽ“

Note: This is an experimental research artifact. While the model achieved excellent training convergence (validation loss ~1.04), inference at such a small scale (~114M parameters) is highly sensitive to sampling parameters and may fall into repetition loops.

Zenyx_114M-Tiny-Edu-Instruct is a ~114-million parameter language model designed to test the limits of instruction tuning on extremely small architectures. It is based on the TinyEdu-50M foundation model, which was pre-trained on the FineWeb-Edu dataset (100B+ tokens), and subsequently fine-tuned on a high-quality mix of instruction and code data.

πŸ“Š Model Details

Architecture: Custom GPT-2 style Transformer with Rotary Positional Embeddings (RoPE) and Grouped Query Attention (GQA).

Parameters: ~114 million (see the parameter-count check after this list)

Context Length: 4096 tokens

Vocabulary Size: 50,257 (GPT-2 Tokenizer)

Precision: Float16 (Training), Float32 (Recommended for Inference)
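The parameter count can be sanity-checked directly from the architecture definition. The sketch below is ours, not part of the release, and assumes the Config and TinyEduGPT classes from the "How to Use" section further down have already been defined:

```python
# Counts unique parameters (the tied wte/lm_head weight is counted once).
model = TinyEduGPT(Config())
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # roughly 114M for this config
```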

πŸ“‰ Training Data & Metrics

Datasets used for SFT (Supervised Fine-Tuning):

OpenHermes-2.5: 60% of the mix. High-quality general instruction following.

CodeFeedback-Filtered: 40% of the mix. Code generation and logic.

Training Dynamics

Hardware: Single NVIDIA Tesla P100 (16GB VRAM)

Optimization: DeepSpeed ZeRO-2, Gradient Checkpointing

Batch Size: 1 (Effective Batch Size: 64 via Gradient Accumulation; see the sketch after this list)

Start Loss: ~11.0

Final Validation Loss: 1.0391 (Converged)
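For readers unfamiliar with how a micro-batch of 1 reaches an effective batch of 64, the sketch below shows the general gradient-accumulation pattern on a toy model; the actual run used DeepSpeed ZeRO-2 rather than this plain loop:

```python
import torch
import torch.nn as nn

# Illustrative only: a toy linear model stands in for the real network so the
# accumulation pattern itself is runnable end to end.
ACCUM_STEPS = 64                          # effective batch = micro-batch (1) x 64
model = nn.Linear(8, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

optimizer.zero_grad()
for step in range(256):                   # pretend each step yields one training sample
    x = torch.randn(1, 8)                 # micro-batch of size 1
    y = torch.randint(0, 2, (1,))
    loss = loss_fn(model(x), y)
    (loss / ACCUM_STEPS).backward()       # scale so gradients average over 64 samples
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()                  # one optimizer update per 64 micro-batches
        optimizer.zero_grad()
```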

⚠️ Known Limitations

Due to the extremely small parameter count (114M), this model exhibits the following behaviors:

Repetition Loops: Without a high repetition_penalty (>= 1.2), the model tends to get stuck repeating tokens (e.g., )))) or ......).

Prompt Sensitivity: It requires strict Alpaca formatting (### Instruction: ... ### Response:); a prompt-building sketch follows this list.

Hallucination: As with all tiny models, factuality is low. It captures the structure of language and code well but may invent facts.
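To make the expected format concrete, here is a minimal prompt-building helper; the function name is ours and not part of the release:

```python
# Hypothetical helper: wraps a raw instruction in the strict Alpaca-style
# template this model was fine-tuned on.
def build_alpaca_prompt(instruction: str) -> str:
    return f"### Instruction:\n{instruction}\n\n### Response:\n"

print(build_alpaca_prompt("Write a python function to add two numbers."))
```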

πŸš€ How to Use

Since this model uses a custom architecture implemented in raw PyTorch, you cannot load it with AutoModelForCausalLM directly; you must define the class structure first.

  1. Install Dependencies

```bash
pip install torch transformers huggingface_hub
```

  2. Run Inference

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download

# --- 1. Define Model Architecture ---

class Config:
    vocab_size = 50257
    n_layer = 12
    n_head = 12
    n_embd = 768
    n_kv_heads = 4
    block_size = 4096
    dropout = 0.0
    bias = False
    rope_base = 10000

class RotaryEmbedding(nn.Module):
    def __init__(self, dim, max_position_embeddings=2048, base=10000):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)
        self.max_seq_len_cached = max_position_embeddings
        t = torch.arange(self.max_seq_len_cached, dtype=self.inv_freq.dtype)
        freqs = torch.einsum("i,j->ij", t, self.inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        self.register_buffer("cos_cached", emb.cos()[None, None, :, :], persistent=False)
        self.register_buffer("sin_cached", emb.sin()[None, None, :, :], persistent=False)

    def forward(self, x, seq_len=None):
        # Grow the cached cos/sin tables if the requested sequence is longer than the cache.
        if seq_len > self.max_seq_len_cached:
            self.max_seq_len_cached = seq_len
            t = torch.arange(self.max_seq_len_cached, dtype=self.inv_freq.dtype, device=x.device)
            freqs = torch.einsum("i,j->ij", t, self.inv_freq)
            emb = torch.cat((freqs, freqs), dim=-1)
            self.register_buffer("cos_cached", emb.cos()[None, None, :, :], persistent=False)
            self.register_buffer("sin_cached", emb.sin()[None, None, :, :], persistent=False)
        return self.cos_cached[:, :, :seq_len, ...], self.sin_cached[:, :, :seq_len, ...]

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin):
    cos = cos.to(q.device)
    sin = sin.to(q.device)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed

class GroupedQueryAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.n_head = config.n_head
        self.n_kv_heads = config.n_kv_heads
        self.head_dim = config.n_embd // config.n_head
        self.n_rep = self.n_head // self.n_kv_heads
        self.q_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
        self.k_proj = nn.Linear(config.n_embd, self.n_kv_heads * self.head_dim, bias=config.bias)
        self.v_proj = nn.Linear(config.n_embd, self.n_kv_heads * self.head_dim, bias=config.bias)
        self.out_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
        self.rotary_emb = RotaryEmbedding(self.head_dim, max_position_embeddings=config.block_size, base=config.rope_base)

    def forward(self, x):
        B, T, C = x.size()
        q = self.q_proj(x).view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        cos, sin = self.rotary_emb(v, seq_len=T)
        q, k = apply_rotary_pos_emb(q, k, cos, sin)
        # Expand the KV heads so each query head has a matching key/value head.
        k = k.repeat_interleave(self.n_rep, dim=1)
        v = v.repeat_interleave(self.n_rep, dim=1)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.out_proj(y)

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)

    def forward(self, x):
        return self.c_proj(F.gelu(self.c_fc(x), approximate='tanh'))

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = GroupedQueryAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        return x + self.mlp(self.ln_2(x))

class TinyEduGPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.transformer = nn.ModuleDict(dict(
            wte=nn.Embedding(config.vocab_size, config.n_embd),
            drop=nn.Dropout(config.dropout),
            h=nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f=nn.LayerNorm(config.n_embd),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.transformer.wte.weight = self.lm_head.weight  # weight tying

    def forward(self, idx):
        x = self.transformer.wte(idx)
        for block in self.transformer.h:
            x = block(x)
        return self.lm_head(self.transformer.ln_f(x))

# --- 2. Load Weights ---

config = Config()
device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyEduGPT(config).to(device)

model_path = hf_hub_download("Arko007/tiny-edu-50m-instruct", "pytorch_model.bin")
state_dict = torch.load(model_path, map_location=device)

# Key mapping fix: strip torch.compile / DDP prefixes from checkpoint keys.
clean_sd = {}
for k, v in state_dict.items():
    new_k = k.replace("_orig_mod.", "").replace("module.", "")
    clean_sd[new_k] = v
model.load_state_dict(clean_sd, strict=False)
model.eval()

# --- 3. Generate ---

tokenizer = AutoTokenizer.from_pretrained("gpt2")
prompt = "### Instruction:\nWrite a python function to add two numbers.\n\n### Response:\n"
inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)

# Greedy decoding with a repetition penalty is recommended for this model size.
with torch.no_grad():
    outputs = model(inputs)
    # Note: a full generation loop is required (see the generation scripts in the discussions).
```
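The forward pass above only produces logits for the prompt. As a stand-in for the generation scripts referenced in the discussions, here is a minimal greedy decoding loop with a simple repetition penalty; it is an illustrative sketch, not the official script, and reuses the model, tokenizer, prompt, and device objects from the example above:

```python
# Minimal greedy decoding loop with a simple repetition penalty (sketch only).
def generate(model, tokenizer, prompt, max_new_tokens=128, repetition_penalty=1.2):
    ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
    eos_id = tokenizer.eos_token_id
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(ids)[:, -1, :]            # next-token logits
            # Penalize tokens that already appear in the context.
            for token_id in set(ids[0].tolist()):
                if logits[0, token_id] > 0:
                    logits[0, token_id] /= repetition_penalty
                else:
                    logits[0, token_id] *= repetition_penalty
            next_id = torch.argmax(logits, dim=-1, keepdim=True)
            ids = torch.cat([ids, next_id], dim=1)
            if eos_id is not None and next_id.item() == eos_id:
                break
    return tokenizer.decode(ids[0], skip_special_tokens=True)

print(generate(model, tokenizer, prompt))
```

With repetition_penalty around 1.2, this should help avoid the )))) and ...... loops described in the limitations section.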

πŸ› οΈ Intended Uses

Educational analysis of Transformer internals.

Research into training dynamics of small language models.

Low-latency inference testing on edge devices (requires conversion to ONNX/GGUF; an export sketch follows below).
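As a starting point for such a conversion, here is a rough sketch of an ONNX export via torch.onnx.export. It assumes the model and config objects from the inference example above; the scaled_dot_product_attention path may need adjustments (for example an explicit attention mask) to trace cleanly on some PyTorch versions, and the output file name is arbitrary:

```python
import torch

# Export on CPU in float32, the precision recommended for inference.
model = model.to("cpu").float().eval()
dummy_ids = torch.randint(0, config.vocab_size, (1, 16), dtype=torch.long)

torch.onnx.export(
    model,
    (dummy_ids,),
    "zenyx_114m.onnx",                    # hypothetical output path
    input_names=["input_ids"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "logits": {0: "batch", 1: "seq"}},
    opset_version=17,
)
```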

⚠️ Licensing

This model is released under the Apache 2.0 License.
