---
license: mit
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
pipeline_tag: text-generation
library_name: litert-lm
tags:
- chat
---

# litert-community/DeepSeek-R1-Distill-Qwen-1.5B

This repository provides a few variants of
[deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) that are ready for
deployment on Android using the
[LiteRT (formerly TFLite) stack](https://ai.google.dev/edge/litert),
[MediaPipe LLM Inference API](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference) and
[LiteRT-LM](https://github.com/google-ai-edge/LiteRT-LM).
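
To see which variants are available, the files in this repository can be listed programmatically. The following is a minimal sketch, assuming the `huggingface_hub` package is installed and that deployable variants use one of the extensions `.task`, `.litertlm` or `.tflite`. That extension list and the helper names are assumptions made for illustration, not something this model card specifies.

```python
from typing import Iterable, List

REPO_ID = "litert-community/DeepSeek-R1-Distill-Qwen-1.5B"

# Assumption: deployable variants are stored with one of these extensions.
MODEL_EXTENSIONS = (".task", ".litertlm", ".tflite")


def select_model_files(filenames: Iterable[str]) -> List[str]:
    """Keep only file names that look like deployable model variants."""
    return sorted(
        name for name in filenames if name.lower().endswith(MODEL_EXTENSIONS)
    )


def list_model_variants(repo_id: str = REPO_ID) -> List[str]:
    """List candidate model files in the repository (needs network access)."""
    # Imported lazily so the pure helper above works without the package.
    from huggingface_hub import list_repo_files

    return select_model_files(list_repo_files(repo_id))
```

Calling `list_model_variants()` returns the matching file names. Any of them can then be fetched with `huggingface_hub.hf_hub_download` by passing the repository id and the chosen file name.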

## Use the models

### Colab

*Disclaimer: The target deployment surface for the LiteRT models is
Android/iOS/Web, and the stack has been optimized for performance on these
targets. Trying out the system in Colab is an easier way to familiarize yourself
with the LiteRT stack, with the caveat that the performance (memory and latency)
on Colab could be much worse than on a local device.*

[Open the notebook in Colab](https://colab.research.google.com/#fileId=https://huggingface.co/litert-community/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/notebook.ipynb)

### Android

#### Edge Gallery App

* Download or build the [app](https://github.com/google-ai-edge/gallery?tab=readme-ov-file#-get-started-in-minutes) from GitHub.
* Alternatively, install the [app](https://play.google.com/store/apps/details?id=com.google.ai.edge.gallery&pli=1) from Google Play.
* Follow the instructions in the app.

#### LLM Inference API

* Download and install [the apk](https://github.com/google-ai-edge/mediapipe-samples/releases/latest/download/llm_inference-debug.apk).
* Follow the instructions in the app.

To build the demo app from source, please follow the
[instructions](https://github.com/google-ai-edge/mediapipe-samples/blob/main/examples/llm_inference/android/README.md)
from the GitHub repository.

## Performance

### Android

All benchmark stats are from a Samsung S24 Ultra with a
KV cache size of 1280 and multiple prefill signatures enabled.

<table border="1">
<tr>
<th>Backend</th>
<th>Quantization</th>
<th>Context Length</th>
<th>Prefill (tokens/sec)</th>
<th>Decode (tokens/sec)</th>
<th>Time-to-first-token (sec)</th>
<th>Model size (MB)</th>
<th>Peak RSS Memory (MB)</th>
<th>GPU Memory (MB)</th>
</tr>
<tr>
<td><p style="text-align: right">CPU</p></td>
<td><p style="text-align: right">dynamic_int8</p></td>
<td><p style="text-align: right">4096</p></td>
<td><p style="text-align: right">166.50 tk/s</p></td>
<td><p style="text-align: right">26.35 tk/s</p></td>
<td><p style="text-align: right">6.41 s</p></td>
<td><p style="text-align: right">1831.43 MB</p></td>
<td><p style="text-align: right">2221 MB</p></td>
<td><p style="text-align: right">N/A</p></td>
</tr>
<tr>
<td><p style="text-align: right">GPU</p></td>
<td><p style="text-align: right">dynamic_int8</p></td>
<td><p style="text-align: right">4096</p></td>
<td><p style="text-align: right">927.54 tk/s</p></td>
<td><p style="text-align: right">26.98 tk/s</p></td>
<td><p style="text-align: right">5.46 s</p></td>
<td><p style="text-align: right">1831.43 MB</p></td>
<td><p style="text-align: right">2096 MB</p></td>
<td><p style="text-align: right">1659 MB</p></td>
</tr>
</table>

* Model Size: measured by the size of the .tflite flatbuffer (the serialization
format for LiteRT models).
* Memory: indicator of peak RAM usage.
* Inference on CPU is accelerated via the LiteRT
[XNNPACK](https://github.com/google/XNNPACK) delegate with 4 threads.
* Benchmarks assume that the XNNPACK cache is enabled.
* Benchmarks are run with the cache enabled and initialized. During the first run, the time to first token may differ.
* dynamic_int8: quantized model with int8 weights and float activations.
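
To illustrate what "int8 weights and float activations" means, here is a minimal pure-Python sketch of symmetric per-tensor weight quantization. It is a generic textbook scheme chosen for illustration, not the LiteRT implementation. The function names, the per-tensor scale and the sample values are all assumptions.

```python
from typing import List, Tuple


def quantize_weights(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization of float weights to int8 values."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for all-zero weights.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dot_int8_weights(
    activations: List[float], quantized: List[int], scale: float
) -> float:
    """Dot product with float activations and int8 weights.

    Weights are dequantized on the fly (q * scale); activations stay float.
    """
    return sum(a * (q * scale) for a, q in zip(activations, quantized))


# Hypothetical sample values, used only to show the approximation error.
weights = [0.5, -1.0, 0.25, 0.75]
activations = [1.0, 2.0, -1.0, 0.5]

q_weights, w_scale = quantize_weights(weights)
exact = sum(a * w for a, w in zip(activations, weights))
approx = dot_int8_weights(activations, q_weights, w_scale)
print("int8 weights:", q_weights, "scale:", w_scale)
print("float result:", exact, "quantized result:", approx)
```

Storing each weight in one byte instead of four is what reduces the model size, at the cost of a small rounding error in each product.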
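
The benchmark figures can be compared with a little arithmetic. The sketch below copies the throughput and time-to-first-token values from the table and computes the GPU-over-CPU speedup, plus a rough end-to-end estimate. The 256-token response length is a hypothetical value, and the estimate assumes the reported time-to-first-token applies unchanged to your prompt, which may not hold.

```python
# Figures copied from the benchmark table (Samsung S24 Ultra).
BENCHMARKS = {
    "CPU": {"prefill_tps": 166.50, "decode_tps": 26.35, "ttft_s": 6.41},
    "GPU": {"prefill_tps": 927.54, "decode_tps": 26.98, "ttft_s": 5.46},
}


def speedup(metric: str) -> float:
    """GPU throughput divided by CPU throughput for one metric."""
    return BENCHMARKS["GPU"][metric] / BENCHMARKS["CPU"][metric]


def estimated_response_time(backend: str, new_tokens: int) -> float:
    """Rough estimate: reported time-to-first-token plus decode time.

    Assumption: the prompt resembles the benchmark prompt, so the reported
    time-to-first-token is reused as-is.
    """
    stats = BENCHMARKS[backend]
    return stats["ttft_s"] + new_tokens / stats["decode_tps"]


print(f"prefill speedup: {speedup('prefill_tps'):.2f}x")
print(f"decode speedup: {speedup('decode_tps'):.2f}x")
for backend in BENCHMARKS:
    # 256 generated tokens is a hypothetical response length.
    print(backend, f"{estimated_response_time(backend, 256):.1f} s")
```

On these numbers, the GPU backend mainly helps with prefill (roughly 5.6 times faster), while decode throughput is nearly the same on both backends.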