swapnil6969 committed on
Commit 88a3d9c (verified) · 1 Parent(s): 4d2e4dc

Upload 8 files

Files changed (8)
  1. .gitignore +13 -0
  2. LICENSE +7 -0
  3. README.md +70 -13
  4. demo_part1.ipynb +236 -0
  5. demo_part2.ipynb +195 -0
  6. demo_part3.ipynb +145 -0
  7. requirements.txt +16 -0
  8. setup.py +45 -0
.gitignore ADDED
@@ -0,0 +1,13 @@
1
+ __pycache__/
2
+ .ipynb_checkpoints/
3
+ processed
4
+ outputs
5
+ outputs_v2
6
+ checkpoints
7
+ checkpoints_v2
8
+ trash
9
+ examples*
10
+ .env
11
+ build
12
+ *.egg-info/
13
+ *.zip
LICENSE ADDED
@@ -0,0 +1,7 @@
1
+ Copyright 2024 MyShell.ai
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
4
+
5
+ The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
6
+
7
+ THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
README.md CHANGED
@@ -1,13 +1,70 @@
1
- ---
2
- title: Openvoice Api
3
- emoji: 🦀
4
- colorFrom: red
5
- colorTo: green
6
- sdk: gradio
7
- sdk_version: 5.34.2
8
- app_file: app.py
9
- pinned: false
10
- short_description: 'Short Voice over agents '
11
- ---
12
-
13
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
+ <div align="center">
2
+ <div>&nbsp;</div>
3
+ <img src="resources/openvoicelogo.jpg" width="400"/>
4
+
5
+ [Paper](https://arxiv.org/abs/2312.01479) |
6
+ [Website](https://research.myshell.ai/open-voice) <br> <br>
7
+ <a href="https://trendshift.io/repositories/6161" target="_blank"><img src="https://trendshift.io/api/badge/repositories/6161" alt="myshell-ai%2FOpenVoice | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
8
+ </div>
9
+
10
+ ## Introduction
11
+
12
+ ### OpenVoice V1
13
+
14
+ As we detailed in our [paper](https://arxiv.org/abs/2312.01479) and [website](https://research.myshell.ai/open-voice), the advantages of OpenVoice are three-fold:
15
+
16
+ **1. Accurate Tone Color Cloning.**
17
+ OpenVoice can accurately clone the reference tone color and generate speech in multiple languages and accents.
18
+
19
+ **2. Flexible Voice Style Control.**
20
+ OpenVoice enables granular control over voice styles, such as emotion and accent, as well as other style parameters including rhythm, pauses, and intonation.
21
+
22
+ **3. Zero-shot Cross-lingual Voice Cloning.**
23
+ Neither the language of the generated speech nor the language of the reference speech needs to be present in the massive-speaker multi-lingual training dataset.
24
+
25
+ ### OpenVoice V2
26
+
27
+ In April 2024, we released OpenVoice V2, which includes all features of V1 and adds:
28
+
29
+ **1. Better Audio Quality.**
30
+ OpenVoice V2 adopts a different training strategy that delivers better audio quality.
31
+
32
+ **2. Native Multi-lingual Support.**
33
+ English, Spanish, French, Chinese, Japanese and Korean are natively supported in OpenVoice V2.
34
+
35
+ **3. Free Commercial Use.**
36
+ Starting in April 2024, both V2 and V1 are released under the MIT License and are free for commercial use.
37
+
38
+ [Video](https://github.com/myshell-ai/OpenVoice/assets/40556743/3cba936f-82bf-476c-9e52-09f0f417bb2f)
39
+
40
+ OpenVoice has been powering the instant voice cloning capability of [myshell.ai](https://app.myshell.ai/explore) since May 2023. As of Nov 2023, the voice cloning model had been used tens of millions of times by users worldwide and had witnessed explosive user growth on the platform.
41
+
42
+ ## Main Contributors
43
+
44
+ - [Zengyi Qin](https://www.qinzy.tech) at MIT
45
+ - [Wenliang Zhao](https://wl-zhao.github.io) at Tsinghua University
46
+ - [Xumin Yu](https://yuxumin.github.io) at Tsinghua University
47
+ - [Ethan Sun](https://twitter.com/ethan_myshell) at MyShell
48
+
49
+ ## How to Use
50
+ Please see [usage](docs/USAGE.md) for detailed instructions.
51
+
52
+ ## Common Issues
53
+
54
+ Please see [QA](docs/QA.md) for common questions and answers. We will regularly update the question and answer list.
55
+
56
+ ## Citation
57
+ ```
58
+ @article{qin2023openvoice,
59
+ title={OpenVoice: Versatile Instant Voice Cloning},
60
+ author={Qin, Zengyi and Zhao, Wenliang and Yu, Xumin and Sun, Xin},
61
+ journal={arXiv preprint arXiv:2312.01479},
62
+ year={2023}
63
+ }
64
+ ```
65
+
66
+ ## License
67
+ OpenVoice V1 and V2 are MIT Licensed. Free for both commercial and research use.
68
+
69
+ ## Acknowledgements
70
+ This implementation is based on several excellent projects, [TTS](https://github.com/coqui-ai/TTS), [VITS](https://github.com/jaywalnut310/vits), and [VITS2](https://github.com/daniilrobnikov/vits2). Thanks for their awesome work!
demo_part1.ipynb ADDED
@@ -0,0 +1,236 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "b6ee1ede",
6
+ "metadata": {},
7
+ "source": [
8
+ "## Voice Style Control Demo"
9
+ ]
10
+ },
11
+ {
12
+ "cell_type": "code",
13
+ "execution_count": null,
14
+ "id": "b7f043ee",
15
+ "metadata": {},
16
+ "outputs": [],
17
+ "source": [
18
+ "import os\n",
19
+ "import torch\n",
20
+ "from openvoice import se_extractor\n",
21
+ "from openvoice.api import BaseSpeakerTTS, ToneColorConverter"
22
+ ]
23
+ },
24
+ {
25
+ "cell_type": "markdown",
26
+ "id": "15116b59",
27
+ "metadata": {},
28
+ "source": [
29
+ "### Initialization"
30
+ ]
31
+ },
32
+ {
33
+ "cell_type": "code",
34
+ "execution_count": null,
35
+ "id": "aacad912",
36
+ "metadata": {},
37
+ "outputs": [],
38
+ "source": [
39
+ "ckpt_base = 'checkpoints/base_speakers/EN'\n",
40
+ "ckpt_converter = 'checkpoints/converter'\n",
41
+ "device=\"cuda:0\" if torch.cuda.is_available() else \"cpu\"\n",
42
+ "output_dir = 'outputs'\n",
43
+ "\n",
44
+ "base_speaker_tts = BaseSpeakerTTS(f'{ckpt_base}/config.json', device=device)\n",
45
+ "base_speaker_tts.load_ckpt(f'{ckpt_base}/checkpoint.pth')\n",
46
+ "\n",
47
+ "tone_color_converter = ToneColorConverter(f'{ckpt_converter}/config.json', device=device)\n",
48
+ "tone_color_converter.load_ckpt(f'{ckpt_converter}/checkpoint.pth')\n",
49
+ "\n",
50
+ "os.makedirs(output_dir, exist_ok=True)"
51
+ ]
52
+ },
53
+ {
54
+ "cell_type": "markdown",
55
+ "id": "7f67740c",
56
+ "metadata": {},
57
+ "source": [
58
+ "### Obtain Tone Color Embedding"
59
+ ]
60
+ },
61
+ {
62
+ "cell_type": "markdown",
63
+ "id": "f8add279",
64
+ "metadata": {},
65
+ "source": [
66
+ "The `source_se` is the tone color embedding of the base speaker. \n",
67
+ "It is an average of multiple sentences generated by the base speaker. We directly provide the result here but\n",
68
+ "readers are free to extract `source_se` by themselves."
69
+ ]
70
+ },
71
+ {
72
+ "cell_type": "code",
73
+ "execution_count": null,
74
+ "id": "63ff6273",
75
+ "metadata": {},
76
+ "outputs": [],
77
+ "source": [
78
+ "source_se = torch.load(f'{ckpt_base}/en_default_se.pth').to(device)"
79
+ ]
80
+ },
81
+ {
82
+ "cell_type": "markdown",
83
+ "id": "4f71fcc3",
84
+ "metadata": {},
85
+ "source": [
86
+ "The `reference_speaker` path below points to the short audio clip of the reference whose voice we want to clone. We provide an example here. If you use your own reference speakers, please **make sure each speaker has a unique filename.** The `se_extractor` will save the `targeted_se` using the filename of the audio and **will not automatically overwrite it.**"
87
+ ]
88
+ },
89
+ {
90
+ "cell_type": "code",
91
+ "execution_count": null,
92
+ "id": "55105eae",
93
+ "metadata": {},
94
+ "outputs": [],
95
+ "source": [
96
+ "reference_speaker = 'resources/example_reference.mp3' # This is the voice you want to clone\n",
97
+ "target_se, audio_name = se_extractor.get_se(reference_speaker, tone_color_converter, target_dir='processed', vad=True)"
98
+ ]
99
+ },
100
+ {
101
+ "cell_type": "markdown",
102
+ "id": "a40284aa",
103
+ "metadata": {},
104
+ "source": [
105
+ "### Inference"
106
+ ]
107
+ },
108
+ {
109
+ "cell_type": "code",
110
+ "execution_count": null,
111
+ "id": "73dc1259",
112
+ "metadata": {},
113
+ "outputs": [],
114
+ "source": [
115
+ "save_path = f'{output_dir}/output_en_default.wav'\n",
116
+ "\n",
117
+ "# Run the base speaker tts\n",
118
+ "text = \"This audio is generated by OpenVoice.\"\n",
119
+ "src_path = f'{output_dir}/tmp.wav'\n",
120
+ "base_speaker_tts.tts(text, src_path, speaker='default', language='English', speed=1.0)\n",
121
+ "\n",
122
+ "# Run the tone color converter\n",
123
+ "encode_message = \"@MyShell\"\n",
124
+ "tone_color_converter.convert(\n",
125
+ " audio_src_path=src_path, \n",
126
+ " src_se=source_se, \n",
127
+ " tgt_se=target_se, \n",
128
+ " output_path=save_path,\n",
129
+ " message=encode_message)"
130
+ ]
131
+ },
132
+ {
133
+ "cell_type": "markdown",
134
+ "id": "6e3ea28a",
135
+ "metadata": {},
136
+ "source": [
137
+ "**Try with different styles and speed.** The style can be controlled by the `speaker` parameter in the `base_speaker_tts.tts` method. Available choices: friendly, cheerful, excited, sad, angry, terrified, shouting, whispering. Note that the source tone color embedding needs to be updated accordingly. The speed can be controlled by the `speed` parameter. Let's try whispering with speed 0.9."
138
+ ]
139
+ },
140
+ {
141
+ "cell_type": "code",
142
+ "execution_count": null,
143
+ "id": "fd022d38",
144
+ "metadata": {},
145
+ "outputs": [],
146
+ "source": [
147
+ "source_se = torch.load(f'{ckpt_base}/en_style_se.pth').to(device)\n",
148
+ "save_path = f'{output_dir}/output_whispering.wav'\n",
149
+ "\n",
150
+ "# Run the base speaker tts\n",
151
+ "text = \"This audio is generated by OpenVoice.\"\n",
152
+ "src_path = f'{output_dir}/tmp.wav'\n",
153
+ "base_speaker_tts.tts(text, src_path, speaker='whispering', language='English', speed=0.9)\n",
154
+ "\n",
155
+ "# Run the tone color converter\n",
156
+ "encode_message = \"@MyShell\"\n",
157
+ "tone_color_converter.convert(\n",
158
+ " audio_src_path=src_path, \n",
159
+ " src_se=source_se, \n",
160
+ " tgt_se=target_se, \n",
161
+ " output_path=save_path,\n",
162
+ " message=encode_message)"
163
+ ]
164
+ },
165
+ {
166
+ "cell_type": "markdown",
167
+ "id": "5fcfc70b",
168
+ "metadata": {},
169
+ "source": [
170
+ "**Try with different languages.** OpenVoice can achieve multi-lingual voice cloning by simply replacing the base speaker. We provide an example with a Chinese base speaker here, and we encourage readers to try `demo_part2.ipynb` for a detailed demo."
171
+ ]
172
+ },
173
+ {
174
+ "cell_type": "code",
175
+ "execution_count": null,
176
+ "id": "a71d1387",
177
+ "metadata": {},
178
+ "outputs": [],
179
+ "source": [
180
+ "\n",
181
+ "ckpt_base = 'checkpoints/base_speakers/ZH'\n",
182
+ "base_speaker_tts = BaseSpeakerTTS(f'{ckpt_base}/config.json', device=device)\n",
183
+ "base_speaker_tts.load_ckpt(f'{ckpt_base}/checkpoint.pth')\n",
184
+ "\n",
185
+ "source_se = torch.load(f'{ckpt_base}/zh_default_se.pth').to(device)\n",
186
+ "save_path = f'{output_dir}/output_chinese.wav'\n",
187
+ "\n",
188
+ "# Run the base speaker tts\n",
189
+ "text = \"今天天气真好,我们一起出去吃饭吧。\"\n",
190
+ "src_path = f'{output_dir}/tmp.wav'\n",
191
+ "base_speaker_tts.tts(text, src_path, speaker='default', language='Chinese', speed=1.0)\n",
192
+ "\n",
193
+ "# Run the tone color converter\n",
194
+ "encode_message = \"@MyShell\"\n",
195
+ "tone_color_converter.convert(\n",
196
+ " audio_src_path=src_path, \n",
197
+ " src_se=source_se, \n",
198
+ " tgt_se=target_se, \n",
199
+ " output_path=save_path,\n",
200
+ " message=encode_message)"
201
+ ]
202
+ },
203
+ {
204
+ "cell_type": "markdown",
205
+ "id": "8e513094",
206
+ "metadata": {},
207
+ "source": [
208
+ "**Tech for good.** For people who will deploy OpenVoice for public usage: we offer the option to add a watermark to avoid potential misuse. Please see the `ToneColorConverter` class. **MyShell reserves the ability to detect whether an audio is generated by OpenVoice**, whether or not the watermark is added."
209
+ ]
210
+ }
211
+ ],
212
+ "metadata": {
213
+ "interpreter": {
214
+ "hash": "9d70c38e1c0b038dbdffdaa4f8bfa1f6767c43760905c87a9fbe7800d18c6c35"
215
+ },
216
+ "kernelspec": {
217
+ "display_name": "Python 3 (ipykernel)",
218
+ "language": "python",
219
+ "name": "python3"
220
+ },
221
+ "language_info": {
222
+ "codemirror_mode": {
223
+ "name": "ipython",
224
+ "version": 3
225
+ },
226
+ "file_extension": ".py",
227
+ "mimetype": "text/x-python",
228
+ "name": "python",
229
+ "nbconvert_exporter": "python",
230
+ "pygments_lexer": "ipython3",
231
+ "version": "3.9.18"
232
+ }
233
+ },
234
+ "nbformat": 4,
235
+ "nbformat_minor": 5
236
+ }
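The notebook above warns that each reference speaker needs a unique filename, because `se_extractor` caches the extracted embedding under the audio's filename and will not overwrite an existing entry. One way to avoid such collisions is to key the cache on file contents instead. The helper below is a minimal sketch and hypothetical, not part of the OpenVoice API:

```python
import hashlib
from pathlib import Path

def unique_embedding_name(audio_path: str) -> str:
    """Return a collision-safe cache name for a reference clip.

    Hypothetical helper: since the cache is keyed on the filename
    alone, two different clips both named 'sample.mp3' would clash.
    Mixing in a content hash keeps each clip's entry distinct.
    """
    digest = hashlib.sha256(Path(audio_path).read_bytes()).hexdigest()[:12]
    return f"{Path(audio_path).stem}_{digest}"
```

A name derived this way could stand in for the raw filename when laying out `target_dir`.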
demo_part2.ipynb ADDED
@@ -0,0 +1,195 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "b6ee1ede",
6
+ "metadata": {},
7
+ "source": [
8
+ "## Cross-Lingual Voice Clone Demo"
9
+ ]
10
+ },
11
+ {
12
+ "cell_type": "code",
13
+ "execution_count": null,
14
+ "id": "b7f043ee",
15
+ "metadata": {},
16
+ "outputs": [],
17
+ "source": [
18
+ "import os\n",
19
+ "import torch\n",
20
+ "from openvoice import se_extractor\n",
21
+ "from openvoice.api import ToneColorConverter"
22
+ ]
23
+ },
24
+ {
25
+ "cell_type": "markdown",
26
+ "id": "15116b59",
27
+ "metadata": {},
28
+ "source": [
29
+ "### Initialization"
30
+ ]
31
+ },
32
+ {
33
+ "cell_type": "code",
34
+ "execution_count": null,
35
+ "id": "aacad912",
36
+ "metadata": {},
37
+ "outputs": [],
38
+ "source": [
39
+ "ckpt_converter = 'checkpoints/converter'\n",
40
+ "device=\"cuda:0\" if torch.cuda.is_available() else \"cpu\"\n",
41
+ "output_dir = 'outputs'\n",
42
+ "\n",
43
+ "tone_color_converter = ToneColorConverter(f'{ckpt_converter}/config.json', device=device)\n",
44
+ "tone_color_converter.load_ckpt(f'{ckpt_converter}/checkpoint.pth')\n",
45
+ "\n",
46
+ "os.makedirs(output_dir, exist_ok=True)"
47
+ ]
48
+ },
49
+ {
50
+ "cell_type": "markdown",
51
+ "id": "3db80fcf",
52
+ "metadata": {},
53
+ "source": [
54
+ "In this demo, we will use OpenAI TTS as the base speaker to produce multi-lingual speech audio. Users can flexibly change the base speaker according to their own needs. Please create a file named `.env` and place your OpenAI key in it as `OPENAI_API_KEY=xxx`. We have also provided a Chinese base speaker model (see `demo_part1.ipynb`)."
55
+ ]
56
+ },
57
+ {
58
+ "cell_type": "code",
59
+ "execution_count": null,
60
+ "id": "3b245ca3",
61
+ "metadata": {},
62
+ "outputs": [],
63
+ "source": [
64
+ "from openai import OpenAI\n",
65
+ "from dotenv import load_dotenv\n",
66
+ "\n",
67
+ "# Please create a file named .env and place your\n",
68
+ "# OpenAI key as OPENAI_API_KEY=xxx\n",
69
+ "load_dotenv() \n",
70
+ "\n",
71
+ "client = OpenAI(api_key=os.environ.get(\"OPENAI_API_KEY\"))\n",
72
+ "\n",
73
+ "response = client.audio.speech.create(\n",
74
+ " model=\"tts-1\",\n",
75
+ " voice=\"nova\",\n",
76
+ " input=\"This audio will be used to extract the base speaker tone color embedding. \" + \\\n",
77
+ " \"Typically a very short audio should be sufficient, but increasing the audio \" + \\\n",
78
+ " \"length will also improve the output audio quality.\"\n",
79
+ ")\n",
80
+ "\n",
81
+ "response.stream_to_file(f\"{output_dir}/openai_source_output.mp3\")"
82
+ ]
83
+ },
84
+ {
85
+ "cell_type": "markdown",
86
+ "id": "7f67740c",
87
+ "metadata": {},
88
+ "source": [
89
+ "### Obtain Tone Color Embedding"
90
+ ]
91
+ },
92
+ {
93
+ "cell_type": "markdown",
94
+ "id": "f8add279",
95
+ "metadata": {},
96
+ "source": [
97
+ "The `source_se` is the tone color embedding of the base speaker. \n",
98
+ "It is an average over multiple sentences with multiple emotions\n",
99
+ "of the base speaker. We directly provide the result here but\n",
100
+ "readers are free to extract `source_se` by themselves."
101
+ ]
102
+ },
103
+ {
104
+ "cell_type": "code",
105
+ "execution_count": null,
106
+ "id": "63ff6273",
107
+ "metadata": {},
108
+ "outputs": [],
109
+ "source": [
110
+ "base_speaker = f\"{output_dir}/openai_source_output.mp3\"\n",
111
+ "source_se, audio_name = se_extractor.get_se(base_speaker, tone_color_converter, vad=True)\n",
112
+ "\n",
113
+ "reference_speaker = 'resources/example_reference.mp3' # This is the voice you want to clone\n",
114
+ "target_se, audio_name = se_extractor.get_se(reference_speaker, tone_color_converter, vad=True)"
115
+ ]
116
+ },
117
+ {
118
+ "cell_type": "markdown",
119
+ "id": "a40284aa",
120
+ "metadata": {},
121
+ "source": [
122
+ "### Inference"
123
+ ]
124
+ },
125
+ {
126
+ "cell_type": "code",
127
+ "execution_count": null,
128
+ "id": "73dc1259",
129
+ "metadata": {},
130
+ "outputs": [],
131
+ "source": [
132
+ "# Run the base speaker tts\n",
133
+ "text = [\n",
134
+ " \"MyShell is a decentralized and comprehensive platform for discovering, creating, and staking AI-native apps.\",\n",
135
+ " \"MyShell es una plataforma descentralizada y completa para descubrir, crear y apostar por aplicaciones nativas de IA.\",\n",
136
+ " \"MyShell est une plateforme décentralisée et complète pour découvrir, créer et miser sur des applications natives d'IA.\",\n",
137
+ " \"MyShell ist eine dezentralisierte und umfassende Plattform zum Entdecken, Erstellen und Staken von KI-nativen Apps.\",\n",
138
+ " \"MyShell è una piattaforma decentralizzata e completa per scoprire, creare e scommettere su app native di intelligenza artificiale.\",\n",
139
+ " \"MyShellは、AIネイティブアプリの発見、作成、およびステーキングのための分散型かつ包括的なプラットフォームです。\",\n",
140
+ " \"MyShell — это децентрализованная и всеобъемлющая платформа для обнаружения, создания и стейкинга AI-ориентированных приложений.\",\n",
141
+ " \"MyShell هي منصة لامركزية وشاملة لاكتشاف وإنشاء ورهان تطبيقات الذكاء الاصطناعي الأصلية.\",\n",
142
+ " \"MyShell是一个去中心化且全面的平台,用于发现、创建和投资AI原生应用程序。\",\n",
143
+ " \"MyShell एक विकेंद्रीकृत और व्यापक मंच है, जो AI-मूल ऐप्स की खोज, सृजन और स्टेकिंग के लिए है।\",\n",
144
+ " \"MyShell é uma plataforma descentralizada e abrangente para descobrir, criar e apostar em aplicativos nativos de IA.\"\n",
145
+ "]\n",
146
+ "src_path = f'{output_dir}/tmp.wav'\n",
147
+ "\n",
148
+ "for i, t in enumerate(text):\n",
149
+ "\n",
150
+ " response = client.audio.speech.create(\n",
151
+ " model=\"tts-1\",\n",
152
+ " voice=\"nova\",\n",
153
+ " input=t,\n",
154
+ " )\n",
155
+ "\n",
156
+ " response.stream_to_file(src_path)\n",
157
+ "\n",
158
+ " save_path = f'{output_dir}/output_crosslingual_{i}.wav'\n",
159
+ "\n",
160
+ " # Run the tone color converter\n",
161
+ " encode_message = \"@MyShell\"\n",
162
+ " tone_color_converter.convert(\n",
163
+ " audio_src_path=src_path, \n",
164
+ " src_se=source_se, \n",
165
+ " tgt_se=target_se, \n",
166
+ " output_path=save_path,\n",
167
+ " message=encode_message)"
168
+ ]
169
+ }
170
+ ],
171
+ "metadata": {
172
+ "interpreter": {
173
+ "hash": "9d70c38e1c0b038dbdffdaa4f8bfa1f6767c43760905c87a9fbe7800d18c6c35"
174
+ },
175
+ "kernelspec": {
176
+ "display_name": "Python 3 (ipykernel)",
177
+ "language": "python",
178
+ "name": "python3"
179
+ },
180
+ "language_info": {
181
+ "codemirror_mode": {
182
+ "name": "ipython",
183
+ "version": 3
184
+ },
185
+ "file_extension": ".py",
186
+ "mimetype": "text/x-python",
187
+ "name": "python",
188
+ "nbconvert_exporter": "python",
189
+ "pygments_lexer": "ipython3",
190
+ "version": "3.9.18"
191
+ }
192
+ },
193
+ "nbformat": 4,
194
+ "nbformat_minor": 5
195
+ }
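The `.env` setup described above is handled by `python-dotenv`'s `load_dotenv()`. As a rough, simplified sketch of what that call amounts to (the real library also handles quoting rules, multiline values, and interpolation), a stdlib-only version might look like:

```python
import os

def load_env_file(path: str = ".env") -> None:
    """Simplified sketch of python-dotenv's load_dotenv: read
    KEY=VALUE lines, skip blanks and comments, and export the
    pairs without clobbering variables that are already set."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip().strip("'\""))
```

In the notebook itself, prefer the real `load_dotenv()`; this sketch only shows why a one-line `.env` file is enough for the OpenAI client to find `OPENAI_API_KEY`.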
demo_part3.ipynb ADDED
@@ -0,0 +1,145 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "## Multi-Accent and Multi-Lingual Voice Clone Demo with MeloTTS"
8
+ ]
9
+ },
10
+ {
11
+ "cell_type": "code",
12
+ "execution_count": null,
13
+ "metadata": {},
14
+ "outputs": [],
15
+ "source": [
16
+ "import os\n",
17
+ "import torch\n",
18
+ "from openvoice import se_extractor\n",
19
+ "from openvoice.api import ToneColorConverter"
20
+ ]
21
+ },
22
+ {
23
+ "cell_type": "markdown",
24
+ "metadata": {},
25
+ "source": [
26
+ "### Initialization\n",
27
+ "\n",
28
+ "In this example, we will use the checkpoints from OpenVoiceV2. OpenVoiceV2 is trained with more aggressive augmentations and thus demonstrates better robustness in some cases."
29
+ ]
30
+ },
31
+ {
32
+ "cell_type": "code",
33
+ "execution_count": null,
34
+ "metadata": {},
35
+ "outputs": [],
36
+ "source": [
37
+ "ckpt_converter = 'checkpoints_v2/converter'\n",
38
+ "device = \"cuda:0\" if torch.cuda.is_available() else \"cpu\"\n",
39
+ "output_dir = 'outputs_v2'\n",
40
+ "\n",
41
+ "tone_color_converter = ToneColorConverter(f'{ckpt_converter}/config.json', device=device)\n",
42
+ "tone_color_converter.load_ckpt(f'{ckpt_converter}/checkpoint.pth')\n",
43
+ "\n",
44
+ "os.makedirs(output_dir, exist_ok=True)"
45
+ ]
46
+ },
47
+ {
48
+ "cell_type": "markdown",
49
+ "metadata": {},
50
+ "source": [
51
+ "### Obtain Tone Color Embedding\n",
52
+ "We only extract the tone color embedding for the target speaker. The source tone color embeddings can be directly loaded from the `checkpoints_v2/ses` folder."
53
+ ]
54
+ },
55
+ {
56
+ "cell_type": "code",
57
+ "execution_count": null,
58
+ "metadata": {},
59
+ "outputs": [],
60
+ "source": [
61
+ "\n",
62
+ "reference_speaker = 'resources/example_reference.mp3' # This is the voice you want to clone\n",
63
+ "target_se, audio_name = se_extractor.get_se(reference_speaker, tone_color_converter, vad=True)"
64
+ ]
65
+ },
66
+ {
67
+ "cell_type": "markdown",
68
+ "metadata": {},
69
+ "source": [
70
+ "#### Use MeloTTS as Base Speakers\n",
71
+ "\n",
72
+ "MeloTTS is a high-quality multi-lingual text-to-speech library by @MyShell.ai, supporting English (American, British, Indian, Australian, and default accents), Spanish, French, Chinese, Japanese, and Korean. In the following example, we will use the models in MeloTTS as the base speakers."
73
+ ]
74
+ },
75
+ {
76
+ "cell_type": "code",
77
+ "execution_count": null,
78
+ "metadata": {},
79
+ "outputs": [],
80
+ "source": [
81
+ "from melo.api import TTS\n",
82
+ "\n",
83
+ "texts = {\n",
84
+ " 'EN_NEWEST': \"Did you ever hear a folk tale about a giant turtle?\", # The newest English base speaker model\n",
85
+ " 'EN': \"Did you ever hear a folk tale about a giant turtle?\",\n",
86
+ " 'ES': \"El resplandor del sol acaricia las olas, pintando el cielo con una paleta deslumbrante.\",\n",
87
+ " 'FR': \"La lueur dorée du soleil caresse les vagues, peignant le ciel d'une palette éblouissante.\",\n",
88
+ " 'ZH': \"在这次vacation中,我们计划去Paris欣赏埃菲尔铁塔和卢浮宫的美景。\",\n",
89
+ " 'JP': \"彼は毎朝ジョギングをして体を健康に保っています。\",\n",
90
+ " 'KR': \"안녕하세요! 오늘은 날씨가 정말 좋네요.\",\n",
91
+ "}\n",
92
+ "\n",
93
+ "\n",
94
+ "src_path = f'{output_dir}/tmp.wav'\n",
95
+ "\n",
96
+ "# Speed is adjustable\n",
97
+ "speed = 1.0\n",
98
+ "\n",
99
+ "for language, text in texts.items():\n",
100
+ " model = TTS(language=language, device=device)\n",
101
+ " speaker_ids = model.hps.data.spk2id\n",
102
+ " \n",
103
+ " for speaker_key in speaker_ids.keys():\n",
104
+ " speaker_id = speaker_ids[speaker_key]\n",
105
+ " speaker_key = speaker_key.lower().replace('_', '-')\n",
106
+ " \n",
107
+ " source_se = torch.load(f'checkpoints_v2/base_speakers/ses/{speaker_key}.pth', map_location=device)\n",
108
+ " if torch.backends.mps.is_available() and device == 'cpu':\n",
109
+ " torch.backends.mps.is_available = lambda: False\n",
110
+ " model.tts_to_file(text, speaker_id, src_path, speed=speed)\n",
111
+ " save_path = f'{output_dir}/output_v2_{speaker_key}.wav'\n",
112
+ "\n",
113
+ " # Run the tone color converter\n",
114
+ " encode_message = \"@MyShell\"\n",
115
+ " tone_color_converter.convert(\n",
116
+ " audio_src_path=src_path, \n",
117
+ " src_se=source_se, \n",
118
+ " tgt_se=target_se, \n",
119
+ " output_path=save_path,\n",
120
+ " message=encode_message)"
121
+ ]
122
+ }
123
+ ],
124
+ "metadata": {
125
+ "kernelspec": {
126
+ "display_name": "melo",
127
+ "language": "python",
128
+ "name": "python3"
129
+ },
130
+ "language_info": {
131
+ "codemirror_mode": {
132
+ "name": "ipython",
133
+ "version": 3
134
+ },
135
+ "file_extension": ".py",
136
+ "mimetype": "text/x-python",
137
+ "name": "python",
138
+ "nbconvert_exporter": "python",
139
+ "pygments_lexer": "ipython3",
140
+ "version": "3.9.18"
141
+ }
142
+ },
143
+ "nbformat": 4,
144
+ "nbformat_minor": 2
145
+ }
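The loop above derives each embedding path by normalizing the MeloTTS speaker key (lowercase, underscores to hyphens) before loading it from `checkpoints_v2/base_speakers/ses`. Pulled out as a small pure function for clarity (illustrative only, not part of the repository):

```python
def se_path_for_speaker(speaker_key: str,
                        root: str = "checkpoints_v2/base_speakers/ses") -> str:
    """Map a MeloTTS speaker key such as 'EN_NEWEST' to the
    tone color embedding file loaded in the loop above."""
    return f"{root}/{speaker_key.lower().replace('_', '-')}.pth"
```

For example, `se_path_for_speaker("EN_NEWEST")` yields `checkpoints_v2/base_speakers/ses/en-newest.pth`.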
requirements.txt ADDED
@@ -0,0 +1,16 @@
1
+ librosa==0.9.1
2
+ faster-whisper==0.9.0
3
+ pydub==0.25.1
4
+ wavmark==0.0.3
5
+ numpy==1.22.0
6
+ eng_to_ipa==0.0.2
7
+ inflect==7.0.0
8
+ unidecode==1.3.7
9
+ whisper-timestamped==1.14.2
10
+ openai
11
+ python-dotenv
12
+ pypinyin==0.50.0
13
+ cn2an==0.5.22
14
+ jieba==0.42.1
15
+ gradio==3.48.0
16
+ langid==1.1.6
setup.py ADDED
@@ -0,0 +1,45 @@
1
+ from setuptools import setup, find_packages
2
+
3
+
4
+ setup(name='MyShell-OpenVoice',
5
+ version='0.0.0',
6
+ description='Instant voice cloning by MyShell.',
7
+ long_description=open('README.md').read().strip(),
8
+ long_description_content_type='text/markdown',
9
+ keywords=[
10
+ 'text-to-speech',
11
+ 'tts',
12
+ 'voice-clone',
13
+ 'zero-shot-tts'
14
+ ],
15
+ url='https://github.com/myshell-ai/OpenVoice',
16
+ project_urls={
17
+ 'Documentation': 'https://github.com/myshell-ai/OpenVoice/blob/main/docs/USAGE.md',
18
+ 'Changes': 'https://github.com/myshell-ai/OpenVoice/releases',
19
+ 'Code': 'https://github.com/myshell-ai/OpenVoice',
20
+ 'Issue tracker': 'https://github.com/myshell-ai/OpenVoice/issues',
21
+ },
22
+ author='MyShell',
23
+ author_email='[email protected]',
24
+ license='MIT License',
25
+ packages=find_packages(),
26
+
27
+ python_requires='>=3.9',
28
+ install_requires=[
29
+ 'librosa==0.9.1',
30
+ 'faster-whisper==0.9.0',
31
+ 'pydub==0.25.1',
32
+ 'wavmark==0.0.3',
33
+ 'numpy==1.22.0',
34
+ 'eng_to_ipa==0.0.2',
35
+ 'inflect==7.0.0',
36
+ 'unidecode==1.3.7',
37
+ 'whisper-timestamped==1.14.2',
38
+ 'pypinyin==0.50.0',
39
+ 'cn2an==0.5.22',
40
+ 'jieba==0.42.1',
41
+ 'gradio==3.48.0',
42
+ 'langid==1.1.6'
43
+ ],
44
+ zip_safe=False
45
+ )