MidiTok-REMI: BPE Tokenizer for Symbolic Music

A Byte-Pair Encoding (BPE) tokenizer trained on REMI (REvamped MIDI-derived events) tokens for efficient symbolic music representation. This tokenizer combines the expressiveness of REMI encoding with the compression benefits of BPE, making it well suited for training large language models on MIDI data.

Model Details

  • Tokenizer Type: BPE (Byte-Pair Encoding)
  • Base Representation: REMI (REvamped MIDI-derived events)
  • Vocabulary Size: 40,000 tokens
  • Training Data: GigaMIDI dataset (~200k-file subset; see Training Details)
  • Library: MidiTok
  • Compatible Models: MusicBERT, Llama-based music models

REMI Token Configuration

The tokenizer uses the following REMI configuration:

  • Velocity Bins: 32 levels (0-127 quantized)
  • Beat Resolution:
    • Beats 0-4: 8 positions per beat (fine-grained)
    • Beats 4-12: 4 positions per beat (standard)
  • Chord Recognition: Enabled
  • Special Tokens: PAD_None, BOS_None, EOS_None, MASK_None
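The configuration above maps onto MidiTok's `TokenizerConfig`/`REMI` classes roughly as follows. This is a sketch for orientation, not the model's actual training script; the exact parameter set used for this tokenizer is an assumption.

```python
from miditok import REMI, TokenizerConfig

# Sketch of a matching configuration (values taken from the list above;
# whether these were the exact training settings is an assumption).
config = TokenizerConfig(
    num_velocities=32,                 # 32 velocity bins
    beat_res={(0, 4): 8, (4, 12): 4},  # positions per beat, per beat range
    use_chords=True,                   # enable chord detection tokens
    special_tokens=["PAD", "BOS", "EOS", "MASK"],
)
tokenizer = REMI(config)

# BPE vocabulary learning (MidiTok delegates to the Hugging Face
# tokenizers library); `list_of_midi_paths` is a placeholder:
# tokenizer.train(vocab_size=40000, files_paths=list_of_midi_paths)
```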

Token Types

REMI represents MIDI files using the following event types:

| Token Type | Description | Example |
|---|---|---|
| Bar_None | Measure/bar boundary | Bar_None |
| TimeSig_X/Y | Time signature | TimeSig_4/4 |
| Position_N | Position within measure (ticks) | Position_16 |
| Tempo_X | Tempo in BPM | Tempo_121.29 |
| Program_N | MIDI program/instrument | Program_0 (Piano) |
| Pitch_N | MIDI note pitch (0-127) | Pitch_69 (A4) |
| Velocity_N | Note velocity (dynamics) | Velocity_63 |
| Duration_X.Y.Z | Note duration | Duration_4.0.4 |
| Chord_X | Chord detection | Chord_C:maj |
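All token strings follow the same `Type_Value` pattern, so a tiny helper (hypothetical, not part of MidiTok's API) can split one into its type and value:

```python
def split_token(token: str) -> tuple[str, str]:
    """Split a 'Type_Value' REMI token string at the first underscore."""
    token_type, _, value = token.partition("_")
    return token_type, value

print(split_token("Pitch_69"))      # ('Pitch', '69')
print(split_token("TimeSig_4/4"))   # ('TimeSig', '4/4')
```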

Usage

Installation

pip install miditok transformers torch

Basic Usage

from miditok import MusicTokenizer
from pathlib import Path

# Load the tokenizer from HuggingFace Hub
tokenizer = MusicTokenizer.from_pretrained("manoskary/miditok-REMI")

# Tokenize a MIDI file
midi_path = Path("your_music.mid")
tok_seq = tokenizer(midi_path)

# Access token IDs (for training models)
token_ids = tok_seq.ids
print(f"Sequence length: {len(token_ids)}")
print(f"Token IDs: {token_ids[:10]}...")  # First 10 tokens

# Access human-readable tokens
print(f"Token strings: {tok_seq.tokens[:10]}")

Complete Pipeline Example

from miditok import MusicTokenizer
from pathlib import Path

# Load tokenizer
tok = MusicTokenizer.from_pretrained("manoskary/miditok-REMI")

# 1. MIDI → Tokens
midi = Path("input.mid")
tok_seq = tok(midi)

print("Original MIDI tokenized:")
print(f"  - Tokens: {tok_seq.tokens[:5]}...")
print(f"  - IDs: {tok_seq.ids[:5]}...")
print(f"  - Length: {len(tok_seq.ids)}")

# 2. Tokens → MIDI (reconstruction)
score = tok.decode(tok_seq.ids)
score.dump_midi("reconstructed.mid")

# 3. Verify reconstruction
tok_seq_reconstructed = tok("reconstructed.mid")
assert tok_seq.ids == tok_seq_reconstructed.ids, "Reconstruction failed!"
print("\n✓ Perfect reconstruction verified!")

Integration with Transformers

import torch
from miditok import MusicTokenizer
from transformers import BertForMaskedLM

# Load tokenizer and model
tokenizer = MusicTokenizer.from_pretrained("manoskary/miditok-REMI")
model = BertForMaskedLM.from_pretrained("your-musicbert-model")

# Tokenize MIDI
midi_path = "song.mid"
tok_seq = tokenizer(midi_path)
input_ids = torch.tensor([tok_seq.ids[:512]])  # Truncate to max length

# Forward pass
with torch.no_grad():
    outputs = model(input_ids=input_ids)
    logits = outputs.logits

# Generate predictions
predictions = logits.argmax(dim=-1)

Batch Processing

from miditok import MusicTokenizer
from pathlib import Path
import torch

tokenizer = MusicTokenizer.from_pretrained("manoskary/miditok-REMI")

# Process multiple MIDI files
midi_files = list(Path("midi_dataset/").glob("*.mid"))
all_sequences = []

for midi_file in midi_files[:100]:  # Process first 100 files
    try:
        tok_seq = tokenizer(midi_file)
        all_sequences.append(tok_seq.ids)
    except Exception as e:
        print(f"Error processing {midi_file}: {e}")

# Pad sequences for batch processing
max_len = 2048
pad_id = tokenizer.pad_token_id  # id of PAD_None (0 by default)
padded_sequences = []
for seq in all_sequences:
    if len(seq) > max_len:
        seq = seq[:max_len]  # Truncate
    else:
        seq = seq + [pad_id] * (max_len - len(seq))  # Pad
    padded_sequences.append(seq)

batch = torch.tensor(padded_sequences)
print(f"Batch shape: {batch.shape}")  # [batch_size, max_len]
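When feeding such a padded batch to a Transformer, you would normally also pass an attention mask so padded positions are ignored. A minimal sketch (assuming PAD id 0, as in the padding loop above):

```python
def attention_mask(seq: list[int], pad_id: int = 0) -> list[int]:
    """1 for real tokens, 0 for padding positions."""
    return [0 if tok == pad_id else 1 for tok in seq]

print(attention_mask([5, 9, 3, 0, 0]))  # [1, 1, 1, 0, 0]
```

With PyTorch, the same mask for a whole batch is `(batch != pad_id).long()`, passed to the model as `attention_mask`.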

Example Output

For a simple MIDI file with a few notes:

TokSequence(
    tokens=[
        'Bar_None',          # New measure
        'TimeSig_4/4',       # 4/4 time signature
        'Position_0',        # Start of measure
        'Tempo_121.29',      # Tempo = 121 BPM
        'Program_0',         # Piano instrument
        'Pitch_69',          # A4 note
        'Velocity_63',       # Medium velocity
        'Duration_4.0.4',    # Quarter note duration
        'Position_16',       # 16 ticks later
        'Program_0',         # Piano
        'Pitch_72',          # C5 note
        'Velocity_63',       # Medium velocity
        'Duration_2.0.8',    # Eighth note duration
        'Program_0',         # Piano
        'Pitch_76',          # E5 note
        'Velocity_63',       # Medium velocity
        'Duration_2.0.8'     # Eighth note duration
    ],
    ids=[532, 4, 531, 190, 374, 580, 850, 2595, 33442, 686],
    # BPE compression: 17 REMI tokens → 10 BPE tokens (41% fewer tokens)
)
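The 41% figure in the comment above is simply the relative reduction in sequence length:

```python
remi_tokens, bpe_tokens = 17, 10
reduction = 1 - bpe_tokens / remi_tokens
print(f"{reduction:.0%} fewer tokens")  # 41% fewer tokens
```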

Training Details

Training Data

  • Dataset: GigaMIDI v2.0.0 (Metacreation/GigaMIDI on HuggingFace)
  • Size: ~200k MIDI files
  • Preprocessing: MIDI → REMI tokens → BPE vocabulary learning
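The BPE vocabulary-learning step in that pipeline can be illustrated with a toy single-merge sketch. This is purely illustrative; the real training is handled by MidiTok via the Hugging Face tokenizers library, not by code like this.

```python
from collections import Counter

def most_frequent_pair(sequences: list[list[int]]) -> tuple[int, int]:
    """Find the most frequent adjacent token-id pair across all sequences."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(seq: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every occurrence of `pair` with a single new token id."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

# Toy corpus of base REMI token ids (made-up values):
corpus = [[580, 850, 2595, 580, 850], [580, 850, 33442]]
pair = most_frequent_pair(corpus)                 # (580, 850)
merged = [merge_pair(s, pair, 40000) for s in corpus]
print(pair, merged)  # (580, 850) [[40000, 2595, 40000], [40000, 33442]]
```

Repeating this merge step until the vocabulary reaches the target size (here, 40,000) is the essence of BPE training.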
