NeuroLex v3: Morpheme-Aware Creative Name Generator

A novel, domain-specific AI architecture that generates truly creative, pronounceable brand names, YouTube channel names, and social media handles across 25+ languages.

~2.7M parameters | Trains in ~30 min on free Colab T4 | 25+ languages

πŸš€ Quick Start

Open neurolex_train.ipynb in Google Colab (free tier T4 GPU) and run all cells. Training completes in ~30 minutes. No authentication needed β€” all datasets are public.

Why LLMs Fail at Creative Naming

Failure Mode Paper Impact
BPE vocabulary trap β€” can only recombine known tokens Wug Test (arxiv:2310.15113) Can't create truly novel morphemes
RLHF kills diversity β€” alignment creates attractor states Creativity Has Left the Chat (arxiv:2406.05587) Outputs are generic, predictable
Sampling prunes novelty β€” top-p/k removes rare forms Lost in Sampling (arxiv:2605.27268) Creative words unreachable
Analogical memorization β€” morphology via pattern matching, not rules arxiv:2411.07990 Fails on novel morphological forms
No phonotactic awareness β€” doesn't model sound-feel mappings Sound Symbolism (arxiv:2512.12245) Can't target specific vibes

Our Solution: Hybrid Morpheme+Character Transformer

Architecture Overview (v3)

Input: <BOS> <STRATEGY> <CATEGORY> <VIBE> [generated tokens...]

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ HYBRID TOKENIZER                                           β”‚
β”‚ Priority: Control tokens > Morphemes (longest) > UTF-8 bytesβ”‚
β”‚ Vocab: 256 bytes + ~200 morphemes + 35 control tokens      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                 ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ CAUSAL TRANSFORMER (6 layers, 256-dim, 8 heads)            β”‚
β”‚ β€’ Token embedding (weight-tied with output)                 β”‚
β”‚ β€’ Sinusoidal positional encoding                           β”‚
β”‚ β€’ Pre-norm residual blocks                                  β”‚
β”‚ β€’ Multi-head causal self-attention                          β”‚
β”‚ β€’ GELU feed-forward (1024-dim)                              β”‚
β”‚ β€’ Final LayerNorm β†’ tied linear output                      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                 ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ GENERATION (top-k + nucleus sampling)                       β”‚
β”‚ β€’ Temperature-controlled creativity                         β”‚
β”‚ β€’ Top-k filtering (diversity)                               β”‚
β”‚ β€’ Nucleus (top-p) filtering (coherence)                     β”‚
β”‚ β€’ EOS detection for variable-length output                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Key Design Choices (Research-Backed)

Choice Rationale Evidence
Byte-level + morphemes Infinite vocabulary + efficient morpheme learning ByT5 (arxiv:2105.13626) beats token-level on morphological tasks
In-sequence control tokens Better gradient flow than cross-attention conditioning Neologism Learning (arxiv:2510.08506, ICLR 2025)
Causal LM (not enc-dec) Simpler, proven for controlled generation GPT-style architecture
Morfessor morpheme discovery Unsupervised extraction of productive patterns MDL-based segmentation
Sound symbolism labels Universal cross-linguistic signal 27-language study (arxiv:2512.12245)
Weight-tied embeddings 30% param reduction, better generalization Press & Wolf 2017

Controllable Generation

Strategy (how the name is created)

Token Method Example
<BLEND> Combine two meaningful morphemes Cloudify, Datavex, Nexaflow
<MORPH> Add productive suffixes/prefixes Boldster, Craftium, Questify
<PHONETIC> Novel sound combinations Zyphra, Kolvex, Lumara
<CLIP> Short, clipped forms Nex, Zyp, Flox, Drex
<CROSSLANG> Mix language roots Kazeflow, Blitzcraft, Terranova

Category (20 domains)

<TECH>, <FOOD>, <GAMING>, <FASHION>, <MUSIC>, <HEALTH>, <FINANCE>, <TRAVEL>, <SCIENCE>, <ART>, <FITNESS>, <LUXURY>, <SOCIAL>, <CRYPTO>, <AI>, <ECO>, <KIDS>, <SPORTS>, <EDUCATION>, <GENERAL>

Vibe (10 sound-symbolic feels)

<SHARP>, <WARM>, <ELEGANT>, <PLAYFUL>, <POWERFUL>, <MYSTICAL>, <MINIMAL>, <FUTURISTIC>, <NATURAL>, <COSMIC>

Training Data

Dataset Purpose Size Languages
omneity-labs/ipa-dict Phonotactic patterns 5.3M words 25
AdamLucek/youtube-titles Real creative names ~50 channels EN
Curated brand examples Quality signal (200x weighted) 65 examples Multi

All datasets load via streaming or are small enough for Colab's 12GB RAM.

v3 Changes (Fixes from v2)

Bug Fixes

  1. Morfessor API β€” Fixed load_data([[w]]) β†’ load_data([(1, w) for w in words]). The API expects (count, word) tuples.
  2. Tokenizer decode β€” Fixed control token skipping in decode using proper reverse lookup dicts instead of fragile index comparisons.
  3. Memory streaming β€” IPA dict now loads via streaming to avoid 78MB download blocking.

Architecture Improvements

  1. Expanded curated examples β€” 65 brand examples (was 30) covering all 5 strategies with proper category/vibe diversity.
  2. Top-p (nucleus) sampling β€” Added alongside top-k for better generation quality.
  3. Proper save/load β€” Model saves config alongside weights for clean reloading.
  4. Quality analysis cell β€” Added generation metrics (uniqueness, length distribution, morpheme usage, V/C ratio).
  5. Comparison generation β€” New cell to compare all strategyΓ—vibe combinations side-by-side.

Sound Symbolism: Why Names "Feel" Right

Cross-linguistic research proves universal patterns in how sounds map to feelings:

SHARP/TECH: p, t, k, s, z, x, f, h, c + vowels i, e
  β†’ "Apex", "Zyphra", "Kolvex" (precise, cutting-edge)

WARM/FRIENDLY: m, n, l, b, d, g, w, r, y + vowels o, u, a  
  β†’ "Moluna", "Bloom", "Lumara" (approachable, organic)

We use these mappings to automatically label training data with vibe tags, so the model learns sound→feel correlations directly.

Files

File Description
neurolex_train.ipynb Complete Colab notebook β€” run this!
model.py Architecture (Condition Encoder + Character Decoder)
dataset.py Streaming multilingual dataset pipeline
rewards.py Multi-signal reward scoring
train.py Training script
generate.py Inference/generation
requirements.txt Dependencies

Research References

  • Hierarchical Autoregressive Transformers (arxiv:2501.10322, DeepMind 2025)
  • Sound Symbolism across 27 Languages (arxiv:2512.12245, 2025)
  • ByT5: Token-free byte-level models (arxiv:2105.13626, 2021)
  • Neologism Learning for Controllability (arxiv:2510.08506, ICLR 2025)
  • Lost in Sampling: Word Coverage Score (arxiv:2605.27268, 2025)
  • Creativity Has Left the Chat (arxiv:2406.05587, 2024)
  • T-FREE Tokenizer-Free LLMs (arxiv:2406.19223, 2024)
  • Kiki or Bouba? Sound Symbolism (arxiv:2310.16781, 2023)
  • Counting the Bugs in ChatGPT's Wugs (arxiv:2310.15113, 2023)

License

MIT


Generated with ML Intern

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Papers for krystv/neurolex-creative-name-generator