DC-TTS Geralt Voice Model
A Deep Convolutional Text-to-Speech (DC-TTS) model trained to synthesize speech in the voice of Geralt of Rivia from The Witcher series.
Model Description
This model is part of the Deepstory project, which combines Natural Language Generation, Text-to-Speech, and animation technologies to create interactive storytelling experiences.
The DC-TTS architecture is based on the paper:
Hideyuki Tachibana, Katsuya Uenoyama, Shunsuke Aihara. "Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention" (arXiv:1710.08969)
Model Architecture
This model consists of two components:
Text2Mel Network
Converts text input to mel-spectrograms.
| Parameter | Value |
|---|---|
| Embedding Dimension (e) | 128 |
| Hidden Unit Dimension (d) | 512 |
| Vocabulary | PE abcdefghijklmnopqrstuvwxyz'.,!? |
| Max Characters (N) | 259 |
| Max Mel Frames (T) | 326 |
| Basic Block Type | Gated Convolution |
| Normalization | Layer Normalization |
| Dropout Rate | 0.05 |
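
To make the vocabulary row concrete, the following is a minimal sketch of how a string could be turned into the character indices that Text2Mel consumes. It assumes that `P` is a padding symbol and `E` the end-of-sequence marker (the usage code in this card confirms only the latter). `encode` is a hypothetical helper written for illustration, not a function from the Deepstory code.

```python
# Vocabulary string copied from the Text2Mel table.
VOCAB = "PE abcdefghijklmnopqrstuvwxyz'.,!?"

# Each character's index is simply its position in the vocabulary string.
CHAR2IDX = {ch: i for i, ch in enumerate(VOCAB)}


def encode(text):
    """Hypothetical helper: map already-normalized text to indices and append EOS."""
    return [CHAR2IDX[ch] for ch in text] + [CHAR2IDX["E"]]


ids = encode("hello.")
print(len(VOCAB), ids)
```

Any character outside the vocabulary raises a `KeyError` in this sketch, which is why input text has to be normalized first.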
SSRN (Spectrogram Super-Resolution Network)
Upsamples mel-spectrograms to full spectrograms for audio synthesis.
| Parameter | Value |
|---|---|
| Hidden Unit Dimension (c) | 640 (512 + 128) |
| Number of Mel Bins (f) | 80 |
| FFT Points | 2048 |
| Full Spectrogram Dimension | 1025 |
| Reduction Rate | 4 |
| Basic Block Type | Residual |
| Normalization | Weight Normalization |
| Weight Initialization | Kaiming |
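
The dimensions in this table are related by simple arithmetic, sketched below. The snippet assumes the (batch, bins, frames) tensor layout used in the usage code and assumes that SSRN stretches the time axis by the reduction rate, as described in the DC-TTS paper. `ssrn_output_shape` is a hypothetical helper for illustration only.

```python
N_FFT = 2048         # FFT points
N_MELS = 80          # number of mel bins
REDUCTION_RATE = 4   # time-axis reduction between full and mel spectrograms
C_HIDDEN = 512 + 128 # SSRN hidden units, as listed in the table

# A real-valued FFT of N_FFT points yields N_FFT // 2 + 1 unique frequency bins.
FULL_BINS = 1 + N_FFT // 2


def ssrn_output_shape(mel_shape):
    """Hypothetical helper: expected full-spectrogram shape for a mel input shape."""
    batch, _, frames = mel_shape
    return (batch, FULL_BINS, frames * REDUCTION_RATE)


print(FULL_BINS, C_HIDDEN, ssrn_output_shape((1, N_MELS, 326)))
```

Under these assumptions, a mel-spectrogram with the maximum 326 frames would be expanded to 1304 full-resolution frames.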
Audio Parameters
| Parameter | Value |
|---|---|
| Sample Rate | 22050 Hz |
| Frame Shift | 0.0125 s (12.5 ms) |
| Frame Length | 0.05 s (50 ms) |
| Hop Length | 276 samples |
| Win Length | 1102 samples |
| Power | 1.5 |
| Preemphasis | 0.97 |
| Max dB | 100 |
| Reference dB | 20 |
| Griffin-Lim Iterations | 50 |
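
The hop and window lengths follow from the sample rate and the frame durations. The sketch below works through that arithmetic. Note that 22050 × 0.0125 is not a whole number, so the result depends on how it is converted to an integer: truncation gives 275, while rounding up gives the 276 listed in the table. Which conversion the Deepstory code actually applies is not stated in this card, so treat the comparison as illustrative.

```python
import math

SAMPLE_RATE = 22050    # Hz
FRAME_SHIFT = 0.0125   # seconds between successive frames
FRAME_LENGTH = 0.05    # seconds covered by one analysis window

hop_exact = SAMPLE_RATE * FRAME_SHIFT    # about 275.6 samples
win_exact = SAMPLE_RATE * FRAME_LENGTH   # about 1102.5 samples

hop_truncated = int(hop_exact)           # int() drops the fractional part
hop_rounded_up = math.ceil(hop_exact)    # ceil() moves up to the next integer
win_truncated = int(win_exact)

print(hop_truncated, hop_rounded_up, win_truncated)
```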
Files
- t2m_step-102000_first.pth: Text2Mel model checkpoint
- ssrn.pth: SSRN model checkpoint
Usage
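
    import torch
    from modules.dctts import Text2Mel, SSRN, hp, spectrogram2wav
    # normalize_text is also required; it is not among the names imported above, so
    # import it from the Deepstory codebase before calling synthesize().

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Load both networks in eval mode and restore their checkpoints
    text2mel = Text2Mel(hp.vocab).to(device).eval()
    text2mel.load_state_dict(torch.load('t2m_step-102000_first.pth', map_location=device)['state_dict'])
    ssrn = SSRN().to(device).eval()
    ssrn.load_state_dict(torch.load('ssrn.pth', map_location=device)['state_dict'])

    # Synthesize speech
    def synthesize(text, timeout=10000):
        normalized_text = normalize_text(text) + "E"  # E: EOS
        # Character indices, shape (1, N)
        L = torch.tensor([[hp.char2idx[char] for char in normalized_text]], dtype=torch.long, device=device)
        # Initial all-zero mel frame, shape (1, n_mels, 1)
        zeros = torch.zeros((1, hp.n_mels, 1), dtype=torch.float32, device=device)
        Y = zeros
        with torch.no_grad():
            # Autoregressive decoding: feed the mel frames generated so far back in
            for _ in range(timeout):
                _, Y_t, A = text2mel(L, Y, monotonic_attention=True)
                Y = torch.cat((zeros, Y_t), -1)
                # Stop once the newest frame attends most strongly to the EOS character
                attention = torch.argmax(A[0, :, -1]).item()
                if L[0, attention].item() == hp.vocab.index('E'):
                    break
            # Upsample the mel-spectrogram to a full spectrogram
            _, Z = ssrn(Y)
        Z = Z.cpu().numpy()
        wav = spectrogram2wav(Z[0, :, :].T)  # Convert the spectrogram to a waveform
        return wav
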
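
Once `synthesize` returns a waveform, it still has to be written to disk at the model's 22050 Hz sample rate. The sketch below shows one way to do that with only the Python standard library. It assumes the waveform is a sequence of float samples in the range [-1, 1], which this card does not confirm. `save_wav` is a hypothetical helper, and a short sine tone stands in for real model output.

```python
import array
import math
import os
import sys
import tempfile
import wave

SAMPLE_RATE = 22050  # matches the Sample Rate row in Audio Parameters


def save_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Hypothetical helper: write float samples in [-1, 1] as 16-bit mono PCM."""
    clipped = (max(-1.0, min(1.0, float(s))) for s in samples)
    pcm = array.array("h", (int(s * 32767) for s in clipped))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data must be little-endian
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)  # 2 bytes = 16 bits per sample
        f.setframerate(sample_rate)
        f.writeframes(pcm.tobytes())


# Stand-in signal: 0.1 s of a 440 Hz tone instead of real model output.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(2205)]
demo_path = os.path.join(tempfile.gettempdir(), "dctts_demo.wav")
save_wav(demo_path, tone)
```

With the models loaded, the equivalent call would be `save_wav("output.wav", synthesize("some text"))`, provided the returned samples are in the assumed range.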
Training Data
The model was trained on audio samples of Geralt's voice taken from the video game The Witcher 3: Wild Hunt.
Intended Use
This model is intended for:
- Research and experimentation in speech synthesis
- Creative projects and fan content
- Educational purposes
Limitations
- The model works best with English text
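- The vocabulary is limited to lowercase a-z, the space character, and the punctuation marks ' . , ! ?; digits, uppercase letters, and other symbols must be normalized before synthesis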
- Audio quality may vary depending on input text complexity
- The character voice is based on copyrighted material
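
Because of the vocabulary restriction listed above, input text has to be reduced to the supported characters before synthesis. The following is a minimal sketch of such a step. It is not the project's own `normalize_text`, whose exact behavior is not documented in this card. `normalize_for_vocab` is a hypothetical function, and treating `P` and `E` as reserved control symbols is an assumption.

```python
import re
import unicodedata

# Vocabulary string copied from the Text2Mel table.
VOCAB = "PE abcdefghijklmnopqrstuvwxyz'.,!?"
# Assumption: uppercase P and E are control symbols, not typeable characters.
ALLOWED = set(VOCAB) - {"P", "E"}


def normalize_for_vocab(text):
    """Hypothetical sketch: reduce arbitrary text to the model's character set."""
    # Decompose accented letters and drop the combining marks (e.g. "é" becomes "e").
    text = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    text = text.lower()
    # Replace every unsupported character with a space, then collapse repeated spaces.
    text = "".join(ch if ch in ALLOWED else " " for ch in text)
    return re.sub(r" +", " ", text).strip()


print(normalize_for_vocab("Wind's howling... 100%!"))
```

Digits are simply dropped in this sketch. Spelling them out as words before normalization would keep them audible in the synthesized speech.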
Citation
If you use this model, please cite the original DC-TTS paper and the Deepstory project:

    @article{tachibana2018efficiently,
      title={Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention},
      author={Tachibana, Hideyuki and Uenoyama, Katsuya and Aihara, Shunsuke},
      journal={arXiv preprint arXiv:1710.08969},
      year={2018}
    }

    @misc{deepstory,
      author = {Siu King Wai},
      title = {Deepstory},
      year = {2020},
      publisher = {GitHub},
      url = {https://github.com/thetobysiu/deepstory}
    }

License
This model is released under the MIT License. Please note that the voice characteristics are based on copyrighted material from The Witcher 3: Wild Hunt.
Acknowledgments
- Original DC-TTS implementation: tugstugi/pytorch-dc-tts
- The Witcher 3: Wild Hunt by CD Projekt Red