inoryQwQ committed on
Commit ca02ffa · 1 Parent(s): 58d5563

Update README, Update python API

README.md CHANGED
@@ -5,7 +5,199 @@ pipeline_tag: automatic-speech-recognition
 
 # Whisper
 
- ## CPP
+ OpenAI Whisper on Axera
+ 
+ - Both C++ and Python are currently supported
+ - Prebuilt model download
+   - [Huggingface](https://huggingface.co/AXERA-TECH/Whisper)
+ 
+ - To convert the models yourself, see [Model Conversion](https://github.com/ml-inory/whisper.axera/blob/main/model_convert/README.md)
+ 
+ ## Supported Platforms
+ 
+ - [x] AX650N
+ - [x] AX630C
+ 
+ ## Model Conversion
+ 
+ [Model Conversion](https://github.com/ml-inory/whisper.axera/blob/main/model_convert/README.md)
+ 
+ ## On-Board Deployment
+ 
+ - Devices based on AX650N and AX630C ship with Ubuntu 22.04 preinstalled
+ - Connect the device to the internet and make sure commands such as `apt install` and `pip install` work
+ - Verified devices:
+   - [爱芯派Pro (AX650N)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
+   - [M.2 Accelerator card (AX650N)](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)
+   - [爱芯派2 (AX630C)](https://axera-pi-2-docs-cn.readthedocs.io/zh-cn/latest/index.html)
+   - [Module-LLM (AX630C)](https://docs.m5stack.com/zh_CN/module/Module-LLM)
+   - [LLM630 Compute Kit (AX630C)](https://docs.m5stack.com/zh_CN/core/LLM630%20Compute%20Kit)
+ - Supported programming languages:
+   - [Python](#Python)
+   - [C++](#CPP)
+ 
+ <h3 id="Python">Python</h3>
+ 
+ #### Requirements
+ 
+ We recommend installing Miniconda on the board to manage virtual environments:
+ ```
+ mkdir -p ~/miniconda3
+ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-aarch64.sh -O ~/miniconda3/miniconda.sh
+ bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
+ rm ~/miniconda3/miniconda.sh
+ 
+ source ~/miniconda3/bin/activate
+ 
+ conda init --all
+ ```
+ 
+ Install the Whisper dependencies:
+ ```
+ cd python
+ 
+ conda create -n whisper python=3.12
+ conda activate whisper
+ pip3 install -r requirements.txt
+ ```
+ 
+ #### Install pyaxengine
+ 
+ Install the NPU Python API by following https://github.com/AXERA-TECH/pyaxengine
+ 
+ Tested against 0.1.3rc2, which can be installed with
+ ```
+ pip install https://github.com/AXERA-TECH/pyaxengine/releases/download/0.1.3.rc2/axengine-0.1.3-py3-none-any.whl
+ ```
+ or change the version number in the URL to the release you want to use.
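For reference, a minimal sketch of driving a compiled `.axmodel` through the pyaxengine API, mirroring how `python/whisper.py` in this commit loads the encoder (the model path and the 80×3000 mel shape are illustrative values for tiny/base/small):

```python
import numpy as np
import axengine as axe

# Open the compiled encoder on the NPU (the path is a placeholder).
session = axe.InferenceSession(
    "../models-ax650/small/small-encoder.axmodel",
    providers=["AxEngineExecutionProvider"],
)

# The encoder expects a log-mel spectrogram input named "mel".
mel = np.zeros((1, 80, 3000), dtype=np.float32)
n_layer_cross_k, n_layer_cross_v = session.run(None, input_feed={"mel": mel})
print(n_layer_cross_k.shape, n_layer_cross_v.shape)
```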
+ 
+ #### Run
+ 
+ Log in to the dev board and run:
+ 
+ ```
+ cd python
+ conda activate whisper
+ python3 main.py --model_type small --model_path ../models-ax650 --wav ../demo.wav --language zh
+ ```
+ 
+ Example output:
+ 
+ ```
+ root@ax650:/mnt/qtang/whisper.axera/python# python3 main.py --wav ../demo.wav --model_type small --model_path ../models/ --language zh
+ [INFO] Available providers: ['AxEngineExecutionProvider']
+ wav: ../demo.wav
+ model_type: small
+ model_path: ../models/
+ language: zh
+ [INFO] Using provider: AxEngineExecutionProvider
+ [INFO] Chip type: ChipType.MC50
+ [INFO] VNPU type: VNPUType.DISABLED
+ [INFO] Engine version: 2.10.1s
+ [INFO] Model type: 2 (triple core)
+ [INFO] Compiler version: 3.2-patch1 117f5fd4
+ [INFO] Using provider: AxEngineExecutionProvider
+ [INFO] Model type: 2 (triple core)
+ [INFO] Compiler version: 3.2-patch1 117f5fd4
+ [INFO] Using provider: AxEngineExecutionProvider
+ [INFO] Model type: 2 (triple core)
+ [INFO] Compiler version: 3.2-patch1 117f5fd4
+ Load models take 2322.563409805298ms
+ Preprocess wav take 6971.68493270874ms
+ Run encoder take 211.52877807617188ms
+ Run decoder_main take 79.00094985961914ms
+ First token: 17556
+ Run decoder_loop take 101.91774368286133ms
+ Iter 0 Token: 20844
+ Run decoder_loop take 60.30416488647461ms
+ Iter 1 Token: 7781
+ Run decoder_loop take 60.22000312805176ms
+ Iter 2 Token: 20204
+ Run decoder_loop take 60.23716926574707ms
+ Iter 3 Token: 28455
+ Run decoder_loop take 60.214996337890625ms
+ Iter 4 Token: 31962
+ Run decoder_loop take 60.17565727233887ms
+ Iter 5 Token: 6336
+ Run decoder_loop take 60.94002723693848ms
+ Iter 6 Token: 254
+ Run decoder_loop take 60.71639060974121ms
+ Iter 7 Token: 2930
+ Run decoder_loop take 60.225725173950195ms
+ Iter 8 Token: 236
+ Run decoder_loop take 60.167789459228516ms
+ Iter 9 Token: 36135
+ Run decoder_loop take 60.29987335205078ms
+ Iter 10 Token: 15868
+ Run decoder_loop take 61.163902282714844ms
+ Iter 11 Token: 252
+ Run decoder_loop take 60.273170471191406ms
+ Iter 12 Token: 1546
+ Run decoder_loop take 60.23144721984863ms
+ Iter 13 Token: 46514
+ Run decoder_loop take 60.31966209411621ms
+ Iter 14 Token: 50257
+ Result: 甚至出现交易几乎停滞的情况
+ ```
+ 
+ Command-line options:
+ | Option | Description | Default |
+ | --- | --- | --- |
+ | --wav | Input audio file | |
+ | --model_type/-t | Model type: tiny/base/small | |
+ | --model_path/-p | Directory containing the model files | ../models |
+ | --language/-l | Recognition language | zh |
+ 
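The same pipeline can also be used as a library; a minimal sketch based on the `Whisper` class added in `python/whisper.py` below (paths are the example values used above):

```python
from whisper import Whisper

# Arguments mirror the CLI options: model_type, model_path, language, task.
model = Whisper("small", "../models-ax650", "zh", "transcribe")

# run() accepts a wav path (or an already-loaded 16 kHz sample array) and returns the text.
print(model.run("../demo.wav"))
```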
+ 
+ <h3 id="CPP">CPP</h3>
+ 
+ #### Run
+ 
+ On an AX650N device, run:
+ 
+ ```
+ cd cpp
+ ./whisper -w ../demo.wav
+ ```
+ 
+ or specify the model type and path explicitly:
+ 
+ ```
+ cd cpp
+ ./whisper --model_type small --model_path ../models -w ../demo.wav
+ ```
+ 
+ Example output:
+ 
+ ```
+ root@ax650:/mnt/qtang/whisper.axera/cpp# ./install/whisper --wav ../demo.wav --model_type small --model_path ../models/ --language zh
+ wav_file: ../demo.wav
+ model_path: ../models/
+ model_type: small
+ language: zh
+ Encoder run take 188.30 ms
+ First token: 17556 take 81.88ms
+ Next Token: 20844 take 29.64ms
+ Next Token: 7781 take 29.70ms
+ Next Token: 20204 take 29.64ms
+ Next Token: 28455 take 29.65ms
+ Next Token: 31962 take 29.61ms
+ Next Token: 6336 take 29.67ms
+ Next Token: 254 take 29.63ms
+ Next Token: 2930 take 29.61ms
+ Next Token: 236 take 29.56ms
+ Next Token: 36135 take 29.64ms
+ Next Token: 15868 take 29.71ms
+ Next Token: 252 take 29.51ms
+ Next Token: 1546 take 29.63ms
+ Next Token: 46514 take 29.51ms
+ Next Token: 50257 take 29.69ms
+ All take 801.13 ms
+ Result: 甚至出现交易几乎停滞的情况
+ ```
 
 ### 服务端 (Server)
 
@@ -16,10 +208,86 @@ cd cpp
 
 ### 客户端 (Client)
 
- curl command-line test:
+ curl command-line test (replace the IP and port with your own):
 ```
 ffmpeg -i demo.wav -f f32le -c:a pcm_f32le - 2>/dev/null | \
 curl -X POST 10.126.33.192:8080/asr \
 -H "Content-Type: application/octet-stream" \
 --data-binary @-
- ```
+ ```
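For reference, a rough Python equivalent of the curl test above; it assumes the server consumes raw float32 PCM exactly like the ffmpeg pipe, and reuses the example host and port:

```python
import librosa
import requests

# Decode the file to 16 kHz mono float32 samples, which is what the Whisper
# pipeline in this repo expects.
samples, _ = librosa.load("demo.wav", sr=16000, mono=True)

resp = requests.post(
    "http://10.126.33.192:8080/asr",
    headers={"Content-Type": "application/octet-stream"},
    data=samples.astype("float32").tobytes(),
)
print(resp.text)
```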
+ 
+ ## Model Performance
+ 
+ ### Latency
+ 
+ RTF: Real-Time Factor (processing time divided by audio duration; lower is better)
+ 
+ CPP:
+ 
+ | Models | AX650N | AX630C |
+ | ------------- | ------ | ------ |
+ | Whisper-Tiny | 0.08 | |
+ | Whisper-Base | 0.11 | 0.35 |
+ | Whisper-Small | 0.24 | |
+ | Whisper-Turbo | 0.48 | |
+ 
+ Python:
+ 
+ | Models | AX650N | AX630C |
+ | ------------- | ------ | ------ |
+ | Whisper-Tiny | 0.12 | |
+ | Whisper-Base | 0.16 | 0.35 |
+ | Whisper-Small | 0.50 | |
+ | Whisper-Turbo | 0.60 | |
+ 
+ ### Word Error Rate (tested on the AIShell dataset)
+ 
+ | Models | AX650N | AX630C |
+ | ------------- | ------ | ------ |
+ | Whisper-Tiny | 0.24 | |
+ | Whisper-Base | 0.18 | |
+ | Whisper-Small | 0.11 | |
+ | Whisper-Turbo | 0.06 | |
+ 
+ To reproduce the test results, follow these steps:
+ 
+ Unpack the dataset:
+ ```
+ unzip datasets.zip
+ ```
+ 
+ Run the test script:
+ ```
+ cd python
+ conda activate whisper
+ python test_wer.py -d aishell --gt_path ../datasets/ground_truth.txt --model_type tiny
+ ```
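The error rate above is computed at the character level: `python/test_wer.py` (added below in this commit) takes the edit distance between the reference and the hypothesis and divides by the reference length. A condensed sketch of the same metric (the example strings are made up):

```python
def edit_distance(ref: str, hyp: str) -> int:
    # Classic dynamic-programming Levenshtein distance over characters.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

# Per-utterance error rate: edits needed / reference length.
ref = "甚至出现交易几乎停滞的情况"
hyp = "甚至出现交易几乎停止的情况"   # one substituted character
print(edit_distance(ref, hyp) / len(ref))  # 1/13 ≈ 0.077
```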
+ 
+ ### Memory Usage
+ 
+ * CMM stands for the physical memory used by Axera modules such as VDEC (video decoder), VENC (video encoder), the NPU, etc.
+ 
+ Python:
+ 
+ | Models | CMM(MB)| OS(MB) |
+ | ------------- | ------ | ------ |
+ | Whisper-Tiny | 332 | 512 |
+ | Whisper-Base | 533 | 644 |
+ | Whisper-Small | 1106 | 906 |
+ | Whisper-Turbo | 2065 | 2084 |
+ 
+ C++:
+ 
+ | Models | CMM(MB)| OS(MB) |
+ | ------------- | ------ | ------ |
+ | Whisper-Tiny | 332 | 31 |
+ | Whisper-Base | 533 | 54 |
+ | Whisper-Small | 1106 | 146 |
+ | Whisper-Turbo | 2065 | 86 |
+ 
+ ## Technical Discussion
+ 
+ - Github issues
+ - QQ group: 139953715
python/main.py ADDED
@@ -0,0 +1,54 @@
1
+ import argparse
2
+ import os
3
+ from whisper import Whisper
4
+ import time
5
+
6
+
7
+ def get_args():
8
+ parser = argparse.ArgumentParser(
9
+ prog="whisper",
10
+ description="Run Whisper on input audio file"
11
+ )
12
+ parser.add_argument("--wav", "-w", type=str, required=True, help="Input audio file")
13
+ parser.add_argument("--model_type", "-t", type=str, choices=["tiny", "base", "small", "large", "large-v3", "turbo"], required=True, help="model type, only support tiny, base and small currently")
14
+ parser.add_argument("--model_path", "-p", type=str, required=False, default="../models/models-ax650", help="model path for *.axmodel, tokens.txt, positional_embedding.bin")
15
+ parser.add_argument("--language", "-l", type=str, required=False, default="zh", help="Target language, support en, zh, ja, and others. See languages.py for more options.")
16
+ parser.add_argument("--task", type=str, required=False, choices=["translate", "transcribe"], default="transcribe")
17
+ parser.add_argument("--print_rtf", action="store_true", help="Print Real-Time Factor")
18
+ return parser.parse_args()
19
+
20
+
21
+ def print_args(args):
22
+ print(f"wav: {args.wav}")
23
+ print(f"model_type: {args.model_type}")
24
+ print(f"model_path: {args.model_path}")
25
+ print(f"language: {args.language}")
26
+ print(f"task: {args.task}")
27
+
28
+
29
+ def main():
30
+ args = get_args()
31
+ print_args(args)
32
+
33
+ # Check wav existence
34
+ wav_path = args.wav
35
+ assert os.path.exists(wav_path), f"{wav_path} NOT exist"
36
+
37
+ model = Whisper(args.model_type, args.model_path, args.language, args.task)
38
+
39
+
40
+
41
+ print("\n预测结果:")
42
+ start = time.time()
43
+ print(model.run(wav_path))
44
+ end = time.time()
45
+
46
+ if args.print_rtf:
47
+ import librosa
48
+ samples, sr = librosa.load(wav_path, sr=16000)
49
+ duration = len(samples) / sr
50
+ process_time = end - start
51
+ print(f"RTF: {process_time / duration}")
52
+
53
+ if __name__ == "__main__":
54
+ main()
python/requirements.txt CHANGED
@@ -1,4 +1,6 @@
1
  numpy==1.26.4
2
  soundfile
3
- librosa
4
- zhconv
 
 
 
1
  numpy==1.26.4
2
  soundfile
3
+ librosa==0.9.1
4
+ zhconv
5
+ jiwer
6
+ tiktoken
python/test_wer.py ADDED
@@ -0,0 +1,272 @@
1
+ import argparse
2
+ import os
3
+ import logging
4
+ import re
5
+ from whisper import Whisper
6
+
7
+
8
+ def setup_logging():
9
+ """配置日志系统,同时输出到控制台和文件"""
10
+ # 获取脚本所在目录
11
+ script_dir = os.path.dirname(os.path.abspath(__file__))
12
+ log_file = os.path.join(script_dir, "test_wer.log")
13
+
14
+ # 配置日志格式
15
+ log_format = '%(asctime)s - %(levelname)s - %(message)s'
16
+ date_format = '%Y-%m-%d %H:%M:%S'
17
+
18
+ # 创建logger
19
+ logger = logging.getLogger()
20
+ logger.setLevel(logging.INFO)
21
+
22
+ # 清除现有的handler
23
+ for handler in logger.handlers[:]:
24
+ logger.removeHandler(handler)
25
+
26
+ # 创建文件handler
27
+ file_handler = logging.FileHandler(log_file, mode='a', encoding='utf-8')
28
+ file_handler.setLevel(logging.INFO)
29
+ file_formatter = logging.Formatter(log_format, date_format)
30
+ file_handler.setFormatter(file_formatter)
31
+
32
+ # 创建控制台handler
33
+ console_handler = logging.StreamHandler()
34
+ console_handler.setLevel(logging.INFO)
35
+ console_formatter = logging.Formatter(log_format, date_format)
36
+ console_handler.setFormatter(console_formatter)
37
+
38
+ # 添加handler到logger
39
+ logger.addHandler(file_handler)
40
+ logger.addHandler(console_handler)
41
+
42
+ return logger
43
+
44
+
45
+ class AIShellDataset:
46
+ def __init__(self, gt_path: str):
47
+ """
48
+ 初始化数据集
49
+
50
+ Args:
51
+ json_path: voice.json文件的路径
52
+ """
53
+ self.gt_path = gt_path
54
+ self.dataset_dir = os.path.dirname(gt_path)
55
+ self.voice_dir = os.path.join(self.dataset_dir, "aishell_S0764")
56
+
57
+ # 检查必要文件和文件夹是否存在
58
+ assert os.path.exists(gt_path), f"gt文件不存在: {gt_path}"
59
+ assert os.path.exists(self.voice_dir), f"aishell_S0764文件夹不存在: {self.voice_dir}"
60
+
61
+ # 加载数据
62
+ self.data = []
63
+ with open(gt_path, 'r', encoding='utf-8') as f:
64
+ for line in f:
65
+ line = line.strip()
66
+ audio_path, gt = line.split(" ")
67
+ audio_path = os.path.join(self.voice_dir, audio_path + ".wav")
68
+ self.data.append({"audio_path": audio_path, "gt": gt})
69
+
70
+ # 使用logging而不是print
71
+ logger = logging.getLogger()
72
+ logger.info(f"加载了 {len(self.data)} 条数据")
73
+
74
+ def __iter__(self):
75
+ """返回迭代器"""
76
+ self.index = 0
77
+ return self
78
+
79
+ def __next__(self):
80
+ """返回下一个数据项"""
81
+ if self.index >= len(self.data):
82
+ raise StopIteration
83
+
84
+ item = self.data[self.index]
85
+ audio_path = item["audio_path"]
86
+ ground_truth = item["gt"]
87
+
88
+ self.index += 1
89
+ return audio_path, ground_truth
90
+
91
+ def __len__(self):
92
+ """返回数据集大小"""
93
+ return len(self.data)
94
+
95
+
96
+ class CommonVoiceDataset:
97
+ """Common Voice数据集解析器"""
98
+
99
+ def __init__(self, tsv_path: str):
100
+ """
101
+ 初始化数据集
102
+
103
+ Args:
104
+ json_path: voice.json文件的路径
105
+ """
106
+ self.tsv_path = tsv_path
107
+ self.dataset_dir = os.path.dirname(tsv_path)
108
+ self.voice_dir = os.path.join(self.dataset_dir, "clips")
109
+
110
+ # 检查必要文件和文件夹是否存在
111
+ assert os.path.exists(tsv_path), f"{tsv_path}文件不存在: {tsv_path}"
112
+ assert os.path.exists(self.voice_dir), f"voice文件夹不存在: {self.voice_dir}"
113
+
114
+ # 加载JSON数据
115
+ self.data = []
116
+ with open(tsv_path, 'r', encoding='utf-8') as f:
117
+ f.readline()
118
+ for line in f:
119
+ line = line.strip()
120
+ splits = line.split("\t")
121
+ audio_path = splits[1]
122
+ gt = splits[2]
123
+ audio_path = os.path.join(self.voice_dir, audio_path)
124
+ self.data.append({"audio_path": audio_path, "gt": gt})
125
+
126
+ # 使用logging而不是print
127
+ logger = logging.getLogger()
128
+ logger.info(f"加载了 {len(self.data)} 条数据")
129
+
130
+ def __iter__(self):
131
+ """返回迭代器"""
132
+ self.index = 0
133
+ return self
134
+
135
+ def __next__(self):
136
+ """返回下一个数据项"""
137
+ if self.index >= len(self.data):
138
+ raise StopIteration
139
+
140
+ item = self.data[self.index]
141
+ audio_path = item["audio_path"]
142
+ ground_truth = item["gt"]
143
+
144
+ self.index += 1
145
+ return audio_path, ground_truth
146
+
147
+ def __len__(self):
148
+ """返回数据集大小"""
149
+ return len(self.data)
150
+
151
+ def get_args():
152
+ parser = argparse.ArgumentParser(
153
+ prog="whisper",
154
+ description="Test WER on dataset"
155
+ )
156
+ parser.add_argument("--dataset", "-d", type=str, required=True, choices=["aishell", "common_voice"], help="Test dataset")
157
+ parser.add_argument("--gt_path", "-g", type=str, required=True, help="Test dataset ground truth file")
158
+ parser.add_argument("--max_num", type=int, default=-1, required=False, help="Maximum test data num")
159
+ parser.add_argument("--model_type", "-t", type=str, choices=["tiny", "base", "small", "large", "large-v3", "turbo"], required=True, help="model type, only support tiny, base and small currently")
160
+ parser.add_argument("--model_path", "-p", type=str, required=False, default="../models/models-ax650", help="model path for *.axmodel, tokens.txt, positional_embedding.bin")
161
+ parser.add_argument("--language", "-l", type=str, required=False, default="zh", help="Target language, support en, zh, ja, and others. See languages.py for more options.")
162
+ return parser.parse_args()
163
+
164
+
165
+ def print_args(args):
166
+ logger = logging.getLogger()
167
+ logger.info(f"dataset: {args.dataset}")
168
+ logger.info(f"gt_path: {args.gt_path}")
169
+ logger.info(f"max_num: {args.max_num}")
170
+ logger.info(f"model_type: {args.model_type}")
171
+ logger.info(f"model_path: {args.model_path}")
172
+ logger.info(f"language: {args.language}")
173
+
174
+
175
+ def min_distance(word1: str, word2: str) -> int:
176
+
177
+ row = len(word1) + 1
178
+ column = len(word2) + 1
179
+
180
+ cache = [ [0]*column for i in range(row) ]
181
+
182
+ for i in range(row):
183
+ for j in range(column):
184
+
185
+ if i ==0 and j ==0:
186
+ cache[i][j] = 0
187
+ elif i == 0 and j!=0:
188
+ cache[i][j] = j
189
+ elif j == 0 and i!=0:
190
+ cache[i][j] = i
191
+ else:
192
+ if word1[i-1] == word2[j-1]:
193
+ cache[i][j] = cache[i-1][j-1]
194
+ else:
195
+ replace = cache[i-1][j-1] + 1
196
+ insert = cache[i][j-1] + 1
197
+ remove = cache[i-1][j] + 1
198
+
199
+ cache[i][j] = min(replace, insert, remove)
200
+
201
+ return cache[row-1][column-1]
202
+
203
+
204
+ def remove_punctuation(text):
205
+ # 定义正则表达式模式,匹配所有标点符号
206
+ # 这个模式包括常见的标点符号和中文标点
207
+ pattern = r'[^\w\s]|_'
208
+
209
+ # 使用sub方法将所有匹配的标点符号替换为空字符串
210
+ cleaned_text = re.sub(pattern, '', text)
211
+
212
+ return cleaned_text
213
+
214
+
215
+ def main():
216
+ # 设置日志系统
217
+ logger = setup_logging()
218
+
219
+ args = get_args()
220
+ print_args(args)
221
+
222
+ dataset_type = args.dataset.lower()
223
+ if dataset_type == "aishell":
224
+ dataset = AIShellDataset(args.gt_path)
225
+ elif dataset_type == "common_voice":
226
+ dataset = CommonVoiceDataset(args.gt_path)
227
+ else:
228
+ raise ValueError(f"Unknown dataset type {dataset_type}")
229
+
230
+ max_num = args.max_num
231
+
232
+ # Load model
233
+ model = Whisper(args.model_type, args.model_path, args.language, "transcribe")
234
+
235
+ # Iterate over dataset
236
+ references = []
237
+ hyp = []
238
+ all_character_error_num = 0
239
+ all_character_num = 0
240
+ wer_file = open("wer.txt", "w")
241
+ max_data_num = max_num if max_num > 0 else len(dataset)
242
+ for n, (audio_path, reference) in enumerate(dataset):
243
+ hypothesis = model.run(audio_path)
244
+
245
+ hypothesis = remove_punctuation(hypothesis)
246
+ reference = remove_punctuation(reference)
247
+
248
+ character_error_num = min_distance(reference, hypothesis)
249
+ character_num = len(reference)
250
+ character_error_rate = character_error_num / character_num * 100
251
+
252
+ all_character_error_num += character_error_num
253
+ all_character_num += character_num
254
+
255
+ hyp.append(hypothesis)
256
+ references.append(reference)
257
+
258
+ line_content = f"({n+1}/{max_data_num}) {os.path.basename(audio_path)} gt: {reference} predict: {hypothesis} WER: {character_error_rate}%"
259
+ wer_file.write(line_content + "\n")
260
+ logger.info(line_content)
261
+
262
+ if n + 1 >= max_data_num:
263
+ break
264
+
265
+ total_character_error_rate = all_character_error_num / all_character_num * 100
266
+
267
+ logger.info(f"Total WER: {total_character_error_rate}%")
268
+ wer_file.write(f"Total WER: {total_character_error_rate}%")
269
+ wer_file.close()
270
+
271
+ if __name__ == "__main__":
272
+ main()
python/whisper.py CHANGED
@@ -1,240 +1,224 @@
1
- import argparse
2
  import axengine as axe
3
  import numpy as np
4
  import librosa
5
  import os
6
- from typing import Tuple
7
- import soundfile as sf
8
- import base64
 
9
  import zhconv
10
- import time
11
- from languages import WHISPER_LANGUAGES
12
 
13
 
14
- WHISPER_N_MELS = 80
15
- WHISPER_SAMPLE_RATE = 16000
16
- WHISPER_N_FFT = 480
17
- WHISPER_HOP_LENGTH = 160
18
 
19
- WHISPER_SOT = 50258
20
- WHISPER_EOT = 50257
21
- WHISPER_BLANK = 220
22
- WHISPER_NO_TIMESTAMPS = 50363
23
- WHISPER_NO_SPEECH = 50362
24
- WHISPER_TRANSLATE = 50358
25
- WHISPER_TRANSCRIBE = 50359
26
- WHISPER_VOCAB_SIZE = 51865
27
- WHISPER_N_TEXT_CTX = 448
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
28
 
29
- NEG_INF = float("-inf")
30
- SOT_SEQUENCE = np.array([WHISPER_SOT,WHISPER_SOT + 1 + tuple(WHISPER_LANGUAGES).index("zh"),WHISPER_TRANSCRIBE,WHISPER_NO_TIMESTAMPS], dtype=np.int32)
31
- WHISPER_N_TEXT_STATE_MAP = {
32
- "tiny": 384,
33
- "base": 512,
34
- "small": 768
35
- }
36
-
37
-
38
- def get_args():
39
- parser = argparse.ArgumentParser(
40
- prog="whisper",
41
- description="Run Whisper on input audio file"
42
- )
43
- parser.add_argument("--wav", "-w", type=str, required=True, help="Input audio file")
44
- parser.add_argument("--model_type", "-t", type=str, choices=["tiny", "base", "small"], required=True, help="model type, only support tiny, base and small currently")
45
- parser.add_argument("--model_path", "-p", type=str, required=False, default="../models", help="model path for *.axmodel, tokens.txt, positional_embedding.bin")
46
- parser.add_argument("--language", "-l", type=str, required=False, default="zh", help="Target language, support en, zh, ja, and others. See languages.py for more options.")
47
- return parser.parse_args()
48
-
49
-
50
- def print_args(args):
51
- print(f"wav: {args.wav}")
52
- print(f"model_type: {args.model_type}")
53
- print(f"model_path: {args.model_path}")
54
- print(f"language: {args.language}")
55
-
56
-
57
- def load_audio(filename: str) -> Tuple[np.ndarray, int]:
58
- data, sample_rate = sf.read(
59
- filename,
60
- always_2d=True,
61
- dtype="float32",
62
- )
63
- data = data[:, 0] # use only the first channel
64
- data = librosa.resample(data, orig_sr=sample_rate, target_sr=WHISPER_SAMPLE_RATE)
65
- samples = np.ascontiguousarray(data)
66
- return samples, sample_rate
67
-
68
-
69
- def load_models(model_path, model_type):
70
- encoder_path = f"{model_type}-encoder.axmodel"
71
- decoder_main_path = f"{model_type}-decoder-main.axmodel"
72
- decoder_loop_path = f"{model_type}-decoder-loop.axmodel"
73
- pe_path = f"{model_type}-positional_embedding.bin"
74
- token_path = f"{model_type}-tokens.txt"
75
-
76
- required_files = [os.path.join(model_path, i) for i in (encoder_path, decoder_main_path, decoder_loop_path, pe_path, token_path)]
77
- # Check file existence
78
- for i, file_path in enumerate(required_files):
79
- assert os.path.exists(file_path), f"{file_path} NOT exist"
80
-
81
- # Load encoder
82
- encoder = axe.InferenceSession(required_files[0])
83
- # Load decoder main
84
- decoder_main = axe.InferenceSession(required_files[1])
85
- # Load decoder loop
86
- decoder_loop = axe.InferenceSession(required_files[2])
87
- # Load position embedding
88
- pe = np.fromfile(required_files[3], dtype=np.float32)
89
- # Load tokens
90
- tokens = []
91
- with open(required_files[4], "r") as f:
92
- for line in f:
93
- line = line.strip()
94
- tokens.append(line.split(" ")[0])
95
-
96
- return encoder, decoder_main, decoder_loop, pe, tokens
97
-
98
-
99
- def compute_feature(wav_path, n_mels = WHISPER_N_MELS, padding = 480000):
100
- audio, sr = load_audio(wav_path)
101
-
102
- audio = np.concatenate((audio, np.zeros((padding,), dtype=np.float32)), axis=-1)
103
-
104
- mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=WHISPER_N_FFT, hop_length=WHISPER_HOP_LENGTH, window="hann", center=True, pad_mode="reflect", power=2.0, n_mels=n_mels)
105
- log_spec = np.log10(np.maximum(mel, 1e-10))
106
- log_spec = np.maximum(log_spec, log_spec.max() - 8.0)
107
- mel = (log_spec + 4.0) / 4.0
108
-
109
- # We pad 1500 frames at the end so that it is able to detect eot
110
- # You can use another value instead of 1500.
111
- # mel = np.concatenate((mel, np.zeros((n_mels, 1500), dtype=np.float32)), axis=-1)
112
-
113
- target = 3000
114
- if mel.shape[1] > target:
115
- # -50 so that there are some zero tail paddings.
116
- mel = mel[:, : target]
117
- mel[:, -50:] = 0
118
-
119
- # We don't need to pad it to 30 seconds now!
120
- if mel.shape[1] < target:
121
- mel = np.concatenate((mel, np.zeros((n_mels, target - mel.shape[1]), dtype=np.float32)), axis=-1)
122
-
123
- return mel
124
-
125
-
126
- def supress_tokens(logits, is_initial):
127
- if is_initial:
128
- logits[WHISPER_EOT] = NEG_INF
129
- logits[WHISPER_BLANK] = NEG_INF
130
-
131
- logits[WHISPER_NO_TIMESTAMPS] = NEG_INF
132
- logits[WHISPER_SOT] = NEG_INF
133
- logits[WHISPER_NO_SPEECH] = NEG_INF
134
- logits[WHISPER_TRANSLATE] = NEG_INF
135
- return logits
136
-
137
-
138
- def choose_language(lang):
139
- if lang not in WHISPER_LANGUAGES.keys():
140
- raise Exception(f"Unknown language: {lang}. Check languages.py for correct options.")
141
- SOT_SEQUENCE[1] = WHISPER_SOT + 1 + tuple(WHISPER_LANGUAGES.keys()).index(lang)
142
-
143
-
144
- def main():
145
- args = get_args()
146
- print_args(args)
147
-
148
- # Check wav existence
149
- wav_path = args.wav
150
- assert os.path.exists(wav_path), f"{wav_path} NOT exist"
151
-
152
- # Choose language
153
- choose_language(args.language)
154
-
155
- # Load models and other stuff
156
- start = time.time()
157
- encoder, decoder_main, decoder_loop, pe, token_table = load_models(args.model_path, args.model_type)
158
- print(f"Load models take {(time.time() - start) * 1000}ms")
159
- WHISPER_N_TEXT_STATE = WHISPER_N_TEXT_STATE_MAP[args.model_type]
160
-
161
- # Preprocess
162
- start = time.time()
163
- mel = compute_feature(wav_path, n_mels=WHISPER_N_MELS)
164
- print(f"Preprocess wav take {(time.time() - start) * 1000}ms")
165
- # mel.tofile("mel.bin")
166
-
167
- # Run encoder
168
- start = time.time()
169
- x = encoder.run(None, input_feed={"mel": mel[None, ...]})
170
- n_layer_cross_k, n_layer_cross_v = x
171
- print(f"Run encoder take {(time.time() - start) * 1000}ms")
172
-
173
- # n_layer_cross_k.tofile("n_layer_cross_k.bin")
174
- # n_layer_cross_v.tofile("n_layer_cross_v.bin")
175
-
176
- # Run decoder_main
177
- start = time.time()
178
- x = decoder_main.run(None, input_feed={
179
- "tokens": SOT_SEQUENCE[None, ...],
180
- "n_layer_cross_k": n_layer_cross_k,
181
- "n_layer_cross_v": n_layer_cross_v
182
- })
183
- logits, n_layer_self_k_cache, n_layer_self_v_cache = x
184
- print(f"Run decoder_main take {(time.time() - start) * 1000}ms")
185
-
186
- # Decode token
187
- logits = logits[0, -1, :]
188
- logits = supress_tokens(logits, is_initial=True)
189
- # logits.tofile("logits.bin")
190
- max_token_id = np.argmax(logits)
191
- output_tokens = []
192
- print(f"First token: {max_token_id}")
193
-
194
- # Position embedding offset
195
- offset = SOT_SEQUENCE.shape[0]
196
-
197
- # Autoregressively run decoder until token meets EOT
198
- for i in range(WHISPER_N_TEXT_CTX - SOT_SEQUENCE.shape[0]):
199
- if max_token_id == WHISPER_EOT:
200
- break
201
-
202
- output_tokens.append(max_token_id)
203
-
204
- mask = np.zeros((WHISPER_N_TEXT_CTX,), dtype=np.float32)
205
- mask[: WHISPER_N_TEXT_CTX - offset - 1] = NEG_INF
206
-
207
- # Run decoder_loop
208
- start = time.time()
209
- x = decoder_loop.run(None, input_feed={
210
- "tokens": np.array([[output_tokens[-1]]], dtype=np.int32),
211
- "in_n_layer_self_k_cache": n_layer_self_k_cache,
212
- "in_n_layer_self_v_cache": n_layer_self_v_cache,
213
  "n_layer_cross_k": n_layer_cross_k,
214
- "n_layer_cross_v": n_layer_cross_v,
215
- "positional_embedding": pe[offset * WHISPER_N_TEXT_STATE : (offset + 1) * WHISPER_N_TEXT_STATE][None, ...],
216
- "mask": mask
217
  })
218
  logits, n_layer_self_k_cache, n_layer_self_v_cache = x
219
- print(f"Run decoder_loop take {(time.time() - start) * 1000}ms")
220
 
221
  # Decode token
222
- offset += 1
223
- logits = supress_tokens(logits.flatten(), is_initial=False)
 
224
  max_token_id = np.argmax(logits)
225
-
226
- print(f"Iter {i} \t Token: {max_token_id}")
227
-
228
- s = b""
229
- for i in output_tokens:
230
- s += base64.b64decode(token_table[i])
231
- # print(s.decode().strip())
232
- pd = s.decode().strip()
233
- if args.language == "zh":
234
- pd = zhconv.convert(pd, 'zh-hans')
235
-
236
- print(f"Result: {pd}")
237
-
238
-
239
- if __name__ == "__main__":
240
- main()
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  import axengine as axe
2
  import numpy as np
3
  import librosa
4
  import os
5
+ from typing import Union
6
+ from whisper_tokenizer import *
7
+ import json
8
+ from dataclasses import dataclass, field
9
  import zhconv
 
 
10
 
11
 
12
+ NEG_INF = float("-inf")
 
 
 
13
 
14
+ @dataclass
15
+ class WhisperConfig:
16
+ n_mels : int = 0
17
+ sample_rate : int = 0
18
+ n_fft : int = 0
19
+ hop_length : int = 0
20
+
21
+ sot : int = 0
22
+ eot : int = 0
23
+ blank_id : int = 0
24
+ no_timestamps : int = 0
25
+ no_speech : int = 0
26
+ translate : int = 0
27
+ transcribe : int = 0
28
+ n_vocab : int = 0
29
+ n_text_ctx : int = 0
30
+ n_text_state : int = 0
31
+
32
+ sot_sequence : np.ndarray = field(default_factory=lambda: np.array([0,0,0,0], dtype=np.int32))
33
+
34
+
35
+ class Whisper:
36
+ def __init__(self, model_type: str, model_path: str, language: str, task: str):
37
+ assert task in ["translate", "transcribe"]
38
+
39
+ self.language = language
40
+ self.task = task
41
+ self.encoder, self.decoder_main, self.decoder_loop, self.pe, self.tokenizer, model_config = \
42
+ self.load_model(model_type, model_path, language, task)
43
+ self.config = self.load_config(model_config)
44
+
45
+
46
+ def load_model(self, model_type, model_path, language, task):
47
+ encoder_path = f"{model_type}/{model_type}-encoder.axmodel"
48
+ decoder_main_path = f"{model_type}/{model_type}-decoder-main.axmodel"
49
+ decoder_loop_path = f"{model_type}/{model_type}-decoder-loop.axmodel"
50
+ pe_path = f"{model_type}/{model_type}-positional_embedding.bin"
51
+ model_config_file = f"{model_type}/{model_type}_config.json"
52
+
53
+ required_files = [os.path.join(model_path, i) for i in (encoder_path, decoder_main_path, decoder_loop_path, pe_path, model_config_file)]
54
+ # Check file existence
55
+ for i, file_path in enumerate(required_files):
56
+ assert os.path.exists(file_path), f"{file_path} NOT exist"
57
+
58
+ # Load encoder
59
+ encoder = axe.InferenceSession(required_files[0], providers=['AxEngineExecutionProvider'])
60
+ # Load decoder main
61
+ decoder_main = axe.InferenceSession(required_files[1], providers=['AxEngineExecutionProvider'])
62
+ # Load decoder loop
63
+ decoder_loop = axe.InferenceSession(required_files[2], providers=['AxEngineExecutionProvider'])
64
+ # Load position embedding
65
+ pe = np.fromfile(required_files[3], dtype=np.float32)
66
+ # Load tokens
67
+ model_config = json.load(open(required_files[4], "r"))
68
+ model_config["all_language_tokens"] = [int(i) for i in model_config["all_language_tokens"].split(",")]
69
+ model_config["all_language_codes"] = [i for i in model_config["all_language_codes"].split(",")]
70
+ tokenizer = get_tokenizer(
71
+ model_config["is_multilingual"],
72
+ num_languages=len(model_config["all_language_codes"]),
73
+ language=language,
74
+ task=task,
75
+ )
76
+
77
+ return encoder, decoder_main, decoder_loop, pe, tokenizer, model_config
78
+
79
 
80
+ def load_config(self, model_config):
81
+ config = WhisperConfig()
82
+ config.n_mels = model_config["n_mels"]
83
+ config.sample_rate = 16000
84
+ config.n_fft = 480
85
+ config.hop_length = 160
86
+
87
+ config.sot = model_config["sot"]
88
+ config.eot = model_config["eot"]
89
+ config.blank_id = model_config["blank_id"]
90
+ config.no_timestamps = model_config["no_timestamps"]
91
+ config.no_speech = model_config["no_speech"]
92
+ config.translate = model_config["translate"]
93
+ config.transcribe = model_config["transcribe"]
94
+ config.n_vocab = model_config["n_vocab"]
95
+ config.n_text_ctx = model_config["n_text_ctx"]
96
+ config.n_text_state = model_config["n_text_state"]
97
+
98
+ lang_token = model_config["all_language_tokens"][model_config["all_language_codes"].index(self.language)]
99
+ task_token = config.transcribe if self.task == "transcribe" else config.translate
100
+ config.sot_sequence = np.array([config.sot, lang_token, task_token, config.no_timestamps], dtype=np.int32)
101
+
102
+ return config
103
+
104
+
105
+ def load_audio(self, audio: str):
106
+ data, sample_rate = librosa.load(audio, sr=self.config.sample_rate)
107
+ samples = np.ascontiguousarray(data)
108
+ return samples, sample_rate
109
+
110
+
111
+ def compute_feature(self, audio: np.ndarray, padding = 480000):
112
+ if padding > 0:
113
+ audio = np.concatenate((audio, np.zeros((padding,), dtype=np.float32)), axis=-1)
114
+
115
+ mel = librosa.feature.melspectrogram(y=audio,
116
+ sr=self.config.sample_rate,
117
+ n_fft=self.config.n_fft,
118
+ hop_length=self.config.hop_length,
119
+ window="hann",
120
+ center=True,
121
+ pad_mode="reflect",
122
+ power=2.0,
123
+ n_mels=self.config.n_mels)
124
+
125
+ log_spec = np.log10(np.maximum(mel, 1e-10))
126
+ log_spec = np.maximum(log_spec, log_spec.max() - 8.0)
127
+ mel = (log_spec + 4.0) / 4.0
128
+
129
+ target = 3000
130
+ if mel.shape[1] > target:
131
+ # -50 so that there are some zero tail paddings.
132
+ mel = mel[:, : target]
133
+ mel[:, -50:] = 0
134
+
135
+ # We don't need to pad it to 30 seconds now!
136
+ if mel.shape[1] < target:
137
+ mel = np.concatenate((mel, np.zeros((self.config.n_mels, target - mel.shape[1]), dtype=np.float32)), axis=-1)
138
+
139
+ return mel
140
+
141
+
142
+ def supress_tokens(self, logits, is_initial):
143
+ if is_initial:
144
+ logits[self.config.eot] = NEG_INF
145
+ logits[self.config.blank_id] = NEG_INF
146
+
147
+ logits[self.config.no_timestamps] = NEG_INF
148
+ logits[self.config.sot] = NEG_INF
149
+ logits[self.config.no_speech] = NEG_INF
150
+
151
+ if self.task == "transcribe":
152
+ logits[self.config.translate] = NEG_INF
153
+ else:
154
+ logits[self.config.transcribe] = NEG_INF
155
+ return logits
156
+
157
+
158
+ def run(self, audio: Union[str, np.ndarray]) -> str:
159
+ if isinstance(audio, str):
160
+ audio, sample_rate = self.load_audio(audio)
161
+
162
+ mel = self.compute_feature(audio)
163
+
164
+ # Run encoder
165
+ x = self.encoder.run(None, input_feed={"mel": mel[None, ...]})
166
+ n_layer_cross_k, n_layer_cross_v = x
167
+
168
+ # Run decoder_main
169
+ x = self.decoder_main.run(None, input_feed={
170
+ "tokens": self.config.sot_sequence[None, ...],
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
171
  "n_layer_cross_k": n_layer_cross_k,
172
+ "n_layer_cross_v": n_layer_cross_v
 
 
173
  })
174
  logits, n_layer_self_k_cache, n_layer_self_v_cache = x
 
175
 
176
  # Decode token
177
+ logits = logits[0, -1, :]
178
+ logits = self.supress_tokens(logits, is_initial=True)
179
+ # logits.tofile("logits.bin")
180
  max_token_id = np.argmax(logits)
181
+ output_tokens = []
182
+
183
+ # Position embedding offset
184
+ offset = self.config.sot_sequence.shape[0]
185
+
186
+ # Autoregressively run decoder until token meets EOT
187
+ for i in range(self.config.n_text_ctx - self.config.sot_sequence.shape[0]):
188
+ if max_token_id >= self.config.eot:
189
+ break
190
+
191
+ output_tokens.append(max_token_id)
192
+
193
+ mask = np.zeros((self.config.n_text_ctx,), dtype=np.float32)
194
+ mask[: self.config.n_text_ctx - offset - 1] = NEG_INF
195
+
196
+ # Run decoder_loop
197
+ x = self.decoder_loop.run(None, input_feed={
198
+ "tokens": np.array([[output_tokens[-1]]], dtype=np.int32),
199
+ "in_n_layer_self_k_cache": n_layer_self_k_cache,
200
+ "in_n_layer_self_v_cache": n_layer_self_v_cache,
201
+ "n_layer_cross_k": n_layer_cross_k,
202
+ "n_layer_cross_v": n_layer_cross_v,
203
+ "positional_embedding": self.pe[offset * self.config.n_text_state : (offset + 1) * self.config.n_text_state][None, ...],
204
+ "mask": mask
205
+ })
206
+ logits, n_layer_self_k_cache, n_layer_self_v_cache = x
207
+
208
+ # Decode token
209
+ offset += 1
210
+ logits = self.supress_tokens(logits.flatten(), is_initial=False)
211
+ max_token_id = np.argmax(logits)
212
+
213
+ text = self.tokenizer.decode(output_tokens)
214
+
215
+ if self.language == "zh":
216
+ try:
217
+ sim_zh = zhconv.convert(text, 'zh-hans')
218
+ return sim_zh
219
+ except Exception:
220
+ return text
221
+
222
+ return text
223
+
224
+
python/whisper_cli.py ADDED
@@ -0,0 +1,45 @@
1
+ import requests
2
+
3
+ def transcribe_audio(
4
+ server_url: str,
5
+ wav_path: str,
6
+ model_type: str = "tiny",
7
+ model_path: str = "../models/models-ax650",
8
+ language: str = "zh",
9
+ task: str = "transcribe"
10
+ ):
11
+ url = f"{server_url.rstrip('/')}/asr"
12
+
13
+ files = {
14
+ "wav": open(wav_path, "rb"),
15
+ }
16
+
17
+ data = {
18
+ "model_type": model_type,
19
+ "model_path": model_path,
20
+ "language": language,
21
+ "task": task,
22
+ }
23
+
24
+ print(f"Sending request to: {url}")
25
+
26
+ response = requests.post(url, files=files, data=data)
27
+ if response.status_code != 200:
28
+ print("❌ Error:", response.text)
29
+ return None
30
+
31
+ result = response.json()
32
+ print("服务器返回结果:")
33
+ print(result)
34
+
35
+ return result
36
+
37
+
38
+ if __name__ == "__main__":
39
+ # 你的服务器地址
40
+ SERVER = "http://127.0.0.1:8000"
41
+
42
+ # 本地 wav 文件路径
43
+ WAV = "../demo.wav"
44
+
45
+ transcribe_audio(SERVER, WAV)
python/whisper_onnx.py DELETED
@@ -1,239 +0,0 @@
1
- import argparse
2
- import onnxruntime as ort
3
- import numpy as np
4
- import librosa
5
- import os
6
- from typing import Tuple
7
- import soundfile as sf
8
- import base64
9
- import zhconv
10
- import time
11
- import torch
12
- from torch.nn import functional as F
13
- from languages import WHISPER_LANGUAGES
14
-
15
-
16
- WHISPER_N_MELS = 80
17
- WHISPER_SAMPLE_RATE = 16000
18
- WHISPER_N_FFT = 480
19
- WHISPER_HOP_LENGTH = 160
20
-
21
- WHISPER_SOT = 50258
22
- WHISPER_EOT = 50257
23
- WHISPER_BLANK = 220
24
- WHISPER_NO_TIMESTAMPS = 50363
25
- WHISPER_NO_SPEECH = 50362
26
- WHISPER_TRANSLATE = 50358
27
- WHISPER_TRANSCRIBE = 50359
28
- WHISPER_VOCAB_SIZE = 51865
29
- WHISPER_N_TEXT_CTX = 448
30
-
31
- NEG_INF = float("-inf")
32
- SOT_SEQUENCE = np.array([WHISPER_SOT,WHISPER_SOT + 1 + tuple(WHISPER_LANGUAGES).index("zh"),WHISPER_TRANSCRIBE,WHISPER_NO_TIMESTAMPS], dtype=np.int64)
33
- WHISPER_N_TEXT_STATE_MAP = {
34
- "tiny": 384,
35
- "base": 512,
36
- "small": 768
37
- }
38
-
39
-
40
- def get_args():
41
- parser = argparse.ArgumentParser(
42
- prog="whisper",
43
- description="Run Whisper on input audio file"
44
- )
45
- parser.add_argument("--wav", "-w", type=str, required=True, help="Input audio file")
46
- parser.add_argument("--model_type", "-t", type=str, choices=["tiny", "base", "small"], required=True, help="model type, only support tiny/base/small currently")
47
- parser.add_argument("--model_path", "-p", type=str, required=False, default="../models", help="model path for *.axmodel, tokens.txt, positional_embedding.bin")
48
- parser.add_argument("--language", "-l", type=str, required=False, default="zh", help="Target language, support en, zh, ja, and others. See languages.py for more options.")
49
- return parser.parse_args()
50
-
51
-
52
- def print_args(args):
53
- print(f"wav: {args.wav}")
54
- print(f"model_type: {args.model_type}")
55
- print(f"model_path: {args.model_path}")
56
- print(f"language: {args.language}")
57
-
58
-
59
- def load_audio(filename: str) -> Tuple[np.ndarray, int]:
60
- data, sample_rate = sf.read(
61
- filename,
62
- always_2d=True,
63
- dtype="float32",
64
- )
65
- data = data[:, 0] # use only the first channel
66
- data = librosa.resample(data, orig_sr=sample_rate, target_sr=WHISPER_SAMPLE_RATE)
67
- samples = np.ascontiguousarray(data)
68
- return samples, sample_rate
69
-
70
-
71
- def load_models(model_path, model_type):
72
- encoder_path = f"{model_type}-encoder.onnx"
73
- decoder_main_path = f"{model_type}-decoder-main.onnx"
74
- decoder_loop_path = f"{model_type}-decoder-loop.onnx"
75
- pe_path = f"{model_type}-positional_embedding.bin"
76
- token_path = f"{model_type}-tokens.txt"
77
-
78
- required_files = [os.path.join(model_path, i) for i in (encoder_path, decoder_main_path, decoder_loop_path, pe_path, token_path)]
79
- # Check file existence
80
- for i, file_path in enumerate(required_files):
81
- assert os.path.exists(file_path), f"{file_path} NOT exist"
82
-
83
- # Load encoder
84
- encoder = ort.InferenceSession(required_files[0], providers=['CPUExecutionProvider'])
85
- # Load decoder main
86
- decoder_main = ort.InferenceSession(required_files[1], providers=['CPUExecutionProvider'])
87
- # Load decoder loop
88
- decoder_loop = ort.InferenceSession(required_files[2], providers=['CPUExecutionProvider'])
89
- # Load position embedding
90
- pe = np.fromfile(required_files[3], dtype=np.float32)
91
- # Load tokens
92
- tokens = []
93
- with open(required_files[4], "r") as f:
94
- for line in f:
95
- line = line.strip()
96
- tokens.append(line.split(" ")[0])
97
-
98
- return encoder, decoder_main, decoder_loop, pe, tokens
99
-
100
-
101
- def compute_feature(wav_path, n_mels = WHISPER_N_MELS, padding = 480000):
102
- audio, sr = load_audio(wav_path)
103
-
104
- audio = np.concatenate((audio, np.zeros((padding,), dtype=np.float32)), axis=-1)
105
-
106
- mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=WHISPER_N_FFT, hop_length=WHISPER_HOP_LENGTH, window="hann", center=True, pad_mode="reflect", power=2.0, n_mels=n_mels)
107
- log_spec = np.log10(np.maximum(mel, 1e-10))
108
- log_spec = np.maximum(log_spec, log_spec.max() - 8.0)
109
- mel = (log_spec + 4.0) / 4.0
110
-
111
- # We pad 1500 frames at the end so that it is able to detect eot
112
- # You can use another value instead of 1500.
113
- # mel = np.concatenate((mel, np.zeros((n_mels, 1500), dtype=np.float32)), axis=-1)
114
-
115
- target = 3000
116
- if mel.shape[1] > target:
117
- # -50 so that there are some zero tail paddings.
118
- mel = mel[:, : target]
119
- mel[:, -50:] = 0
120
-
121
- # We don't need to pad it to 30 seconds now!
122
- if mel.shape[1] < target:
123
- mel = np.concatenate((mel, np.zeros((n_mels, target - mel.shape[1]), dtype=np.float32)), axis=-1)
124
-
125
- return mel
126
-
127
-
128
- def supress_tokens(logits, is_initial):
129
- if is_initial:
130
- logits[WHISPER_EOT] = NEG_INF
131
- logits[WHISPER_BLANK] = NEG_INF
132
-
133
- logits[WHISPER_NO_TIMESTAMPS] = NEG_INF
134
- logits[WHISPER_SOT] = NEG_INF
135
- logits[WHISPER_NO_SPEECH] = NEG_INF
136
- logits[WHISPER_TRANSLATE] = NEG_INF
137
- return logits
138
-
139
-
140
- def choose_language(lang):
141
- if lang not in WHISPER_LANGUAGES.keys():
142
- raise Exception(f"Unknown language: {lang}. Check languages.py for correct options.")
143
- SOT_SEQUENCE[1] = WHISPER_SOT + 1 + tuple(WHISPER_LANGUAGES.keys()).index(lang)
144
-
145
-
146
- def main():
147
- args = get_args()
148
- print_args(args)
149
-
150
- # Check wav existence
151
- wav_path = args.wav
152
- assert os.path.exists(wav_path), f"{wav_path} NOT exist"
153
-
154
- # Choose language
155
- choose_language(args.language)
156
-
157
- # Load models and other stuff
158
- encoder, decoder_main, decoder_loop, pe, token_table = load_models(args.model_path, args.model_type)
159
- WHISPER_N_TEXT_STATE = WHISPER_N_TEXT_STATE_MAP[args.model_type]
160
-
161
- # Preprocess
162
- mel = compute_feature(wav_path, n_mels=WHISPER_N_MELS)
163
- # mel.tofile("mel.bin")
164
- # mel = np.load("../mel.npy")[..., :3000]
165
-
166
- # Run encoder
167
- start = time.time()
168
- x = encoder.run(None, input_feed={"mel": mel[None, ...]})
169
- n_layer_cross_k, n_layer_cross_v = x
170
- print(f"Run encoder take {(time.time() - start) * 1000}ms")
171
-
172
- # n_layer_cross_k.tofile("n_layer_cross_k.bin")
173
- # n_layer_cross_v.tofile("n_layer_cross_v.bin")
174
-
175
- # Run decoder_main
176
- start = time.time()
177
- x = decoder_main.run(None, input_feed={
178
- "tokens": SOT_SEQUENCE[None, ...],
179
- "n_layer_cross_k": n_layer_cross_k,
180
- "n_layer_cross_v": n_layer_cross_v
181
- })
182
- logits, n_layer_self_k_cache, n_layer_self_v_cache = x
183
- print(f"Run decoder_main take {(time.time() - start) * 1000}ms")
184
-
185
- # Decode token
186
- logits = logits[0, -1, :]
187
- logits = supress_tokens(logits, is_initial=True)
188
- # logits.tofile("logits.bin")
189
- max_token_id = np.argmax(logits)
190
- output_tokens = []
191
- print(f"First token: {max_token_id}")
192
-
193
- # Position embedding offset
194
- offset = SOT_SEQUENCE.shape[0]
195
-
196
- # Autoregressively run decoder until token meets EOT
197
- for i in range(WHISPER_N_TEXT_CTX - SOT_SEQUENCE.shape[0]):
198
- if max_token_id == WHISPER_EOT:
199
- break
200
-
201
- output_tokens.append(max_token_id)
202
-
203
- mask = np.zeros((WHISPER_N_TEXT_CTX,), dtype=np.float32)
204
- mask[: WHISPER_N_TEXT_CTX - offset - 1] = NEG_INF
205
-
206
- # Run decoder_loop
207
- start = time.time()
208
- x = decoder_loop.run(None, input_feed={
209
- "tokens": np.array([[output_tokens[-1]]], dtype=np.int64),
210
- "in_n_layer_self_k_cache": n_layer_self_k_cache,
211
- "in_n_layer_self_v_cache": n_layer_self_v_cache,
212
- "n_layer_cross_k": n_layer_cross_k,
213
- "n_layer_cross_v": n_layer_cross_v,
214
- "positional_embedding": pe[offset * WHISPER_N_TEXT_STATE : (offset + 1) * WHISPER_N_TEXT_STATE][None, ...],
215
- "mask": mask
216
- })
217
- logits, n_layer_self_k_cache, n_layer_self_v_cache = x
218
- print(f"Run decoder_loop take {(time.time() - start) * 1000}ms")
219
-
220
- # Decode token
221
- offset += 1
222
- logits = supress_tokens(logits.flatten(), is_initial=False)
223
- max_token_id = np.argmax(logits)
224
-
225
- print(f"Iter {i} \t Token: {max_token_id}")
226
-
227
- s = b""
228
- for i in output_tokens:
229
- s += base64.b64decode(token_table[i])
230
- # print(s.decode().strip())
231
- pd = s.decode().strip()
232
- if args.language == "zh":
233
- pd = zhconv.convert(pd, 'zh-hans')
234
-
235
- print(f"Result: {pd}")
236
-
237
-
238
- if __name__ == "__main__":
239
- main()
python/whisper_svr.py ADDED
@@ -0,0 +1,104 @@
1
+ import argparse
2
+ import json
3
+ import os
4
+ import tempfile
5
+ from http.server import BaseHTTPRequestHandler, HTTPServer
6
+ from urllib.parse import parse_qs
7
+
8
+ from whisper import Whisper
9
+ import cgi
10
+
11
+
12
+ # Model cache: avoid reloading the model on every request
13
+ _model_cache = {}
14
+
15
+ def get_model(model_type, model_path, language, task):
16
+ key = (model_type, model_path, language, task)
17
+ if key not in _model_cache:
18
+ print(f"Loading model: type={model_type}, path={model_path}, lang={language}, task={task}")
19
+ _model_cache[key] = Whisper(model_type, model_path, language, task)
20
+ return _model_cache[key]
21
+
22
+
23
+ class WhisperHandler(BaseHTTPRequestHandler):
24
+
25
+ def _send_json(self, obj, status=200):
26
+ data = json.dumps(obj, ensure_ascii=False).encode("utf-8")
27
+ self.send_response(status)
28
+ self.send_header("Content-Type", "application/json; charset=utf-8")
29
+ self.send_header("Content-Length", str(len(data)))
30
+ self.end_headers()
31
+ self.wfile.write(data)
32
+
33
+ def do_GET(self):
34
+ if self.path == "/health":
35
+ self._send_json({"status": "ok"})
36
+ else:
37
+ self._send_json({"error": "not found"}, 404)
38
+
39
+ def do_POST(self):
40
+ if self.path != "/asr":
41
+ self._send_json({"error": "not found"}, 404)
42
+ return
43
+
44
+ # Parse multipart/form-data
45
+ content_type = self.headers.get('Content-Type')
46
+ if not content_type:
47
+ self._send_json({"error": "Missing Content-Type"}, 400)
48
+ return
49
+
50
+ ctype, pdict = cgi.parse_header(content_type)
51
+
52
+ if ctype != 'multipart/form-data':
53
+ self._send_json({"error": "Only multipart/form-data is supported"}, 400)
54
+ return
55
+
56
+ pdict['boundary'] = bytes(pdict['boundary'], "utf-8")
57
+ pdict['CONTENT-LENGTH'] = int(self.headers['Content-Length'])
58
+
59
+ form = cgi.parse_multipart(self.rfile, pdict)
60
+
61
+ # A wav file field is required
62
+ if "wav" not in form:
63
+ self._send_json({"error": "Field 'wav' is required"}, 400)
64
+ return
65
+
66
+ # Read optional parameters (fall back to defaults if omitted)
67
+ model_type = form.get("model_type", ["tiny"])[0]
68
+ model_path = form.get("model_path", ["../models/models-ax650"])[0]
69
+ language = form.get("language", ["zh"])[0]
70
+ task = form.get("task", ["transcribe"])[0]
71
+
72
+ if task not in ("transcribe", "translate"):
73
+ self._send_json({"error": "task must be 'transcribe' or 'translate'"}, 400)
74
+ return
75
+
76
+ wav_bytes = form["wav"][0]
77
+
78
+ # Write the upload to a temporary file
79
+ with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as tmp:
80
+ tmp.write(wav_bytes)
81
+ wav_path = tmp.name
82
+
83
+ # Load the model and run inference
84
+ try:
85
+ model = get_model(model_type, model_path, language, task)
86
+ result_text = model.run(wav_path)
87
+ except Exception as e:
88
+ self._send_json({"error": str(e)}, 500)
89
+ return
90
+ finally:
91
+ if os.path.exists(wav_path):
92
+ os.remove(wav_path)
93
+
94
+ self._send_json({"text": result_text})
95
+
96
+
97
+ if __name__ == "__main__":
98
+ parser = argparse.ArgumentParser(description="Whisper Server")
99
+ parser.add_argument("--port", type=int, default=8000, help="Port to run the server on")
100
+ args = parser.parse_args()
101
+ port = args.port
102
+ server = HTTPServer(("0.0.0.0", port), WhisperHandler)
103
+ print(f"Server started at http://0.0.0.0:{port}")
104
+ server.serve_forever()
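For reference, a minimal way to exercise the new server and client together, using the defaults hard-coded in the two files above:

```python
# Terminal 1: start the HTTP server (defaults to port 8000).
#   python whisper_svr.py --port 8000
#
# Terminal 2: send a wav file with the bundled client.
from whisper_cli import transcribe_audio

result = transcribe_audio(
    server_url="http://127.0.0.1:8000",
    wav_path="../demo.wav",
    model_type="tiny",
    language="zh",
)
print(result)  # e.g. {"text": "..."}
```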
python/whisper_tokenizer.py ADDED
@@ -0,0 +1,395 @@
1
+ import base64
2
+ import os
3
+ import string
4
+ from dataclasses import dataclass, field
5
+ from functools import cached_property, lru_cache
6
+ from typing import Dict, List, Optional, Tuple
7
+
8
+ import tiktoken
9
+
10
+ LANGUAGES = {
11
+ "en": "english",
12
+ "zh": "chinese",
13
+ "de": "german",
14
+ "es": "spanish",
15
+ "ru": "russian",
16
+ "ko": "korean",
17
+ "fr": "french",
18
+ "ja": "japanese",
19
+ "pt": "portuguese",
20
+ "tr": "turkish",
21
+ "pl": "polish",
22
+ "ca": "catalan",
23
+ "nl": "dutch",
24
+ "ar": "arabic",
25
+ "sv": "swedish",
26
+ "it": "italian",
27
+ "id": "indonesian",
28
+ "hi": "hindi",
29
+ "fi": "finnish",
30
+ "vi": "vietnamese",
31
+ "he": "hebrew",
32
+ "uk": "ukrainian",
33
+ "el": "greek",
34
+ "ms": "malay",
35
+ "cs": "czech",
36
+ "ro": "romanian",
37
+ "da": "danish",
38
+ "hu": "hungarian",
39
+ "ta": "tamil",
40
+ "no": "norwegian",
41
+ "th": "thai",
42
+ "ur": "urdu",
43
+ "hr": "croatian",
44
+ "bg": "bulgarian",
45
+ "lt": "lithuanian",
46
+ "la": "latin",
47
+ "mi": "maori",
48
+ "ml": "malayalam",
49
+ "cy": "welsh",
50
+ "sk": "slovak",
51
+ "te": "telugu",
52
+ "fa": "persian",
53
+ "lv": "latvian",
54
+ "bn": "bengali",
55
+ "sr": "serbian",
56
+ "az": "azerbaijani",
57
+ "sl": "slovenian",
58
+ "kn": "kannada",
59
+ "et": "estonian",
60
+ "mk": "macedonian",
61
+ "br": "breton",
62
+ "eu": "basque",
63
+ "is": "icelandic",
64
+ "hy": "armenian",
65
+ "ne": "nepali",
66
+ "mn": "mongolian",
67
+ "bs": "bosnian",
68
+ "kk": "kazakh",
69
+ "sq": "albanian",
70
+ "sw": "swahili",
71
+ "gl": "galician",
72
+ "mr": "marathi",
73
+ "pa": "punjabi",
74
+ "si": "sinhala",
75
+ "km": "khmer",
76
+ "sn": "shona",
77
+ "yo": "yoruba",
78
+ "so": "somali",
79
+ "af": "afrikaans",
80
+ "oc": "occitan",
81
+ "ka": "georgian",
82
+ "be": "belarusian",
83
+ "tg": "tajik",
84
+ "sd": "sindhi",
85
+ "gu": "gujarati",
86
+ "am": "amharic",
87
+ "yi": "yiddish",
88
+ "lo": "lao",
89
+ "uz": "uzbek",
90
+ "fo": "faroese",
91
+ "ht": "haitian creole",
92
+ "ps": "pashto",
93
+ "tk": "turkmen",
94
+ "nn": "nynorsk",
95
+ "mt": "maltese",
96
+ "sa": "sanskrit",
97
+ "lb": "luxembourgish",
98
+ "my": "myanmar",
99
+ "bo": "tibetan",
100
+ "tl": "tagalog",
101
+ "mg": "malagasy",
102
+ "as": "assamese",
103
+ "tt": "tatar",
104
+ "haw": "hawaiian",
105
+ "ln": "lingala",
106
+ "ha": "hausa",
107
+ "ba": "bashkir",
108
+ "jw": "javanese",
109
+ "su": "sundanese",
110
+ "yue": "cantonese",
111
+ }
112
+
113
+ # language code lookup by name, with a few language aliases
114
+ TO_LANGUAGE_CODE = {
115
+ **{language: code for code, language in LANGUAGES.items()},
116
+ "burmese": "my",
117
+ "valencian": "ca",
118
+ "flemish": "nl",
119
+ "haitian": "ht",
120
+ "letzeburgesch": "lb",
121
+ "pushto": "ps",
122
+ "panjabi": "pa",
123
+ "moldavian": "ro",
124
+ "moldovan": "ro",
125
+ "sinhalese": "si",
126
+ "castilian": "es",
127
+ "mandarin": "zh",
128
+ }
129
+
130
+
131
+ @dataclass
132
+ class Tokenizer:
133
+ """A thin wrapper around `tiktoken` providing quick access to special tokens"""
134
+
135
+ encoding: tiktoken.Encoding
136
+ num_languages: int
137
+ language: Optional[str] = None
138
+ task: Optional[str] = None
139
+ sot_sequence: Tuple[int] = ()
140
+ special_tokens: Dict[str, int] = field(default_factory=dict)
141
+
142
+ def __post_init__(self):
143
+ for special in self.encoding.special_tokens_set:
144
+ special_token = self.encoding.encode_single_token(special)
145
+ self.special_tokens[special] = special_token
146
+
147
+ sot: int = self.special_tokens["<|startoftranscript|>"]
148
+ translate: int = self.special_tokens["<|translate|>"]
149
+ transcribe: int = self.special_tokens["<|transcribe|>"]
150
+
151
+ langs = tuple(LANGUAGES.keys())[: self.num_languages]
152
+ sot_sequence = [sot]
153
+ if self.language is not None:
154
+ sot_sequence.append(sot + 1 + langs.index(self.language))
155
+ if self.task is not None:
156
+ task_token: int = transcribe if self.task == "transcribe" else translate
157
+ sot_sequence.append(task_token)
158
+
159
+ self.sot_sequence = tuple(sot_sequence)
160
+
161
+ def encode(self, text, **kwargs):
162
+ return self.encoding.encode(text, **kwargs)
163
+
164
+ def decode(self, token_ids: List[int], **kwargs) -> str:
165
+ token_ids = [t for t in token_ids if t < self.timestamp_begin]
166
+ return self.encoding.decode(token_ids, **kwargs)
167
+
168
+ def decode_with_timestamps(self, token_ids: List[int], **kwargs) -> str:
169
+ """
170
+ Timestamp tokens are above other special tokens' id range and are ignored by `decode()`.
171
+ This method decodes given tokens with timestamps tokens annotated, e.g. "<|1.08|>".
172
+ """
173
+ return self.encoding.decode(token_ids, **kwargs)
174
+
+     @cached_property
+     def eot(self) -> int:
+         return self.encoding.eot_token
+
+     @cached_property
+     def transcribe(self) -> int:
+         return self.special_tokens["<|transcribe|>"]
+
+     @cached_property
+     def translate(self) -> int:
+         return self.special_tokens["<|translate|>"]
+
+     @cached_property
+     def sot(self) -> int:
+         return self.special_tokens["<|startoftranscript|>"]
+
+     @cached_property
+     def sot_lm(self) -> int:
+         return self.special_tokens["<|startoflm|>"]
+
+     @cached_property
+     def sot_prev(self) -> int:
+         return self.special_tokens["<|startofprev|>"]
+
+     @cached_property
+     def no_speech(self) -> int:
+         return self.special_tokens["<|nospeech|>"]
+
+     @cached_property
+     def no_timestamps(self) -> int:
+         return self.special_tokens["<|notimestamps|>"]
+
+     @cached_property
+     def timestamp_begin(self) -> int:
+         return self.special_tokens["<|0.00|>"]
+
+     @cached_property
+     def language_token(self) -> int:
+         """Returns the token id corresponding to the value of the `language` field"""
+         if self.language is None:
+             raise ValueError("This tokenizer does not have language token configured")
+
+         return self.to_language_token(self.language)
+
+     def to_language_token(self, language):
+         if token := self.special_tokens.get(f"<|{language}|>", None):
+             return token
+
+         raise KeyError(f"Language {language} not found in tokenizer.")
+
+     @cached_property
+     def all_language_tokens(self) -> Tuple[int]:
+         result = []
+         for token, token_id in self.special_tokens.items():
+             if token.strip("<|>") in LANGUAGES:
+                 result.append(token_id)
+         return tuple(result)[: self.num_languages]
+
+     @cached_property
+     def all_language_codes(self) -> Tuple[str]:
+         return tuple(self.decode([_l]).strip("<|>") for _l in self.all_language_tokens)
+
+     @cached_property
+     def sot_sequence_including_notimestamps(self) -> Tuple[int]:
+         return tuple(list(self.sot_sequence) + [self.no_timestamps])
+
+     @cached_property
+     def non_speech_tokens(self) -> Tuple[int]:
+         """
+         Returns the list of tokens to suppress in order to avoid any speaker tags or non-speech
+         annotations, to prevent sampling texts that are not actually spoken in the audio, e.g.
+
+         - ♪♪♪
+         - ( SPEAKING FOREIGN LANGUAGE )
+         - [DAVID] Hey there,
+
+         keeping basic punctuations like commas, periods, question marks, exclamation points, etc.
+         """
+         symbols = list('"#()*+/:;<=>@[\\]^_`{|}~「」『』')
+         symbols += (
+             "<< >> <<< >>> -- --- -( -[ (' (\" (( )) ((( ))) [[ ]] {{ }} ♪♪ ♪♪♪".split()
+         )
+
+         # symbols that may be a single token or multiple tokens depending on the tokenizer.
+         # In case they're multiple tokens, suppress the first token, which is safe because:
+         # These are between U+2640 and U+267F miscellaneous symbols that are okay to suppress
+         # in generations, and in the 3-byte UTF-8 representation they share the first two bytes.
+         miscellaneous = set("♩♪♫♬♭♮♯")
+         assert all(0x2640 <= ord(c) <= 0x267F for c in miscellaneous)
+
+         # allow hyphens "-" and single quotes "'" between words, but not at the beginning of a word
+         result = {self.encoding.encode(" -")[0], self.encoding.encode(" '")[0]}
+         for symbol in symbols + list(miscellaneous):
+             for tokens in [
+                 self.encoding.encode(symbol),
+                 self.encoding.encode(" " + symbol),
+             ]:
+                 if len(tokens) == 1 or symbol in miscellaneous:
+                     result.add(tokens[0])
+
+         return tuple(sorted(result))
+
+     def split_to_word_tokens(self, tokens: List[int]):
+         if self.language in {"zh", "ja", "th", "lo", "my", "yue"}:
+             # These languages don't typically use spaces, so it is difficult to split words
+             # without morpheme analysis. Here, we instead split words at any
+             # position where the tokens are decoded as valid unicode points
+             return self.split_tokens_on_unicode(tokens)
+
+         return self.split_tokens_on_spaces(tokens)
+
+     def split_tokens_on_unicode(self, tokens: List[int]):
+         decoded_full = self.decode_with_timestamps(tokens)
+         replacement_char = "\ufffd"
+
+         words = []
+         word_tokens = []
+         current_tokens = []
+         unicode_offset = 0
+
+         for token in tokens:
+             current_tokens.append(token)
+             decoded = self.decode_with_timestamps(current_tokens)
+
+             if (
+                 replacement_char not in decoded
+                 or decoded_full[unicode_offset + decoded.index(replacement_char)]
+                 == replacement_char
+             ):
+                 words.append(decoded)
+                 word_tokens.append(current_tokens)
+                 current_tokens = []
+                 unicode_offset += len(decoded)
+
+         return words, word_tokens
+
+     def split_tokens_on_spaces(self, tokens: List[int]):
+         subwords, subword_tokens_list = self.split_tokens_on_unicode(tokens)
+         words = []
+         word_tokens = []
+
+         for subword, subword_tokens in zip(subwords, subword_tokens_list):
+             special = subword_tokens[0] >= self.eot
+             with_space = subword.startswith(" ")
+             punctuation = subword.strip() in string.punctuation
+             if special or with_space or punctuation or len(words) == 0:
+                 words.append(subword)
+                 word_tokens.append(subword_tokens)
+             else:
+                 words[-1] = words[-1] + subword
+                 word_tokens[-1].extend(subword_tokens)
+
+         return words, word_tokens
+
+
+ @lru_cache(maxsize=None)
+ def get_encoding(name: str = "gpt2", num_languages: int = 99):
+     vocab_path = os.path.join(os.path.dirname(__file__), "assets", f"{name}.tiktoken")
+     ranks = {
+         base64.b64decode(token): int(rank)
+         for token, rank in (line.split() for line in open(vocab_path) if line)
+     }
+     n_vocab = len(ranks)
+     special_tokens = {}
+
+     specials = [
+         "<|endoftext|>",
+         "<|startoftranscript|>",
+         *[f"<|{lang}|>" for lang in list(LANGUAGES.keys())[:num_languages]],
+         "<|translate|>",
+         "<|transcribe|>",
+         "<|startoflm|>",
+         "<|startofprev|>",
+         "<|nospeech|>",
+         "<|notimestamps|>",
+         *[f"<|{i * 0.02:.2f}|>" for i in range(1501)],
+     ]
+
+     for token in specials:
+         special_tokens[token] = n_vocab
+         n_vocab += 1
+
+     return tiktoken.Encoding(
+         name=os.path.basename(vocab_path),
+         explicit_n_vocab=n_vocab,
+         pat_str=r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""",
+         mergeable_ranks=ranks,
+         special_tokens=special_tokens,
+     )
+
+
+ @lru_cache(maxsize=None)
+ def get_tokenizer(
+     multilingual: bool,
+     *,
+     num_languages: int = 99,
+     language: Optional[str] = None,
+     task: Optional[str] = None,  # Literal["transcribe", "translate", None]
+ ) -> Tokenizer:
+     if language is not None:
+         language = language.lower()
+         if language not in LANGUAGES:
+             if language in TO_LANGUAGE_CODE:
+                 language = TO_LANGUAGE_CODE[language]
+             else:
+                 raise ValueError(f"Unsupported language: {language}")
+
+     if multilingual:
+         encoding_name = "multilingual"
+         language = language or "en"
+         task = task or "transcribe"
+     else:
+         encoding_name = "gpt2"
+         language = None
+         task = None
+
+     encoding = get_encoding(name=encoding_name, num_languages=num_languages)
+
+     return Tokenizer(
+         encoding=encoding, num_languages=num_languages, language=language, task=task
+     )