Add missing quant_config.json for compatibility with vLLM backends out of the box.
Thank you.
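For anyone hitting the same problem with a local copy of an AWQ repo, here is a minimal sketch of adding the file by hand. It assumes the AutoAWQ-style keys (`zero_point`, `q_group_size`, `w_bit`, `version`) and the common 4-bit, group-size-128 defaults; the values actually used for this quant are not stated in the thread, so check them against how the weights were produced. The helper name `ensure_quant_config` is made up for illustration.

```python
import json
from pathlib import Path

# Assumed AutoAWQ-style defaults (4-bit weights, group size 128, GEMM kernels).
# These must match how the weights were really quantized.
ASSUMED_QUANT_CONFIG = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
}


def ensure_quant_config(model_dir, config=None):
    """Write quant_config.json into model_dir if it is missing; return its path."""
    path = Path(model_dir) / "quant_config.json"
    if not path.exists():
        # Never overwrite a file the quant author already shipped.
        path.write_text(json.dumps(config or ASSUMED_QUANT_CONFIG, indent=2))
    return path
```

Calling `ensure_quant_config("/path/to/local/model")` leaves an existing file untouched, so it is safe to run on a repo that already has one.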
Would you know how to AWQ Starling-LM-7B-beta? It seems that it could be a better model still.
I just tested it at full bfloat16 and it doesn't seem to respond well. It also has a tiny context window (8192 tokens) compared to other Mistral fine-tunes.
Today I compared Nous Hermes 2 Pro 7B with Gorilla LLM 7B, Raven v2 13B and Starling 7B.
Did you try the alpha version, TheBloke/Starling-LM-7B-alpha-AWQ?
I can make a quant of the beta now if you like.
It is simple: I just use the example script from the casper-hansen/AutoAWQ repo.
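For reference, the AutoAWQ example flow looks roughly like the sketch below. The quantization settings are the commonly used defaults and are an assumption here, not necessarily what was used for these uploads. The wrapper function `quantize_to_awq` and the output path in the usage comment are made up for illustration. Running it needs the `autoawq` and `transformers` packages and a GPU with enough memory for the full-precision model.

```python
# Commonly used AutoAWQ settings (assumed, not confirmed for these quants).
QUANT_CONFIG = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
}


def quantize_to_awq(model_path, quant_path, quant_config=QUANT_CONFIG):
    """Quantize a Hugging Face causal LM to AWQ and save it to quant_path."""
    # Imported lazily so this file can be loaded without the heavy dependencies.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # Runs calibration and replaces the linear layers with quantized ones.
    model.quantize(tokenizer, quant_config=quant_config)

    # Save the quantized weights together with the tokenizer files.
    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)


# Example call (downloads the model and takes a while):
# quantize_to_awq("Nexusflow/Starling-LM-7B-beta", "Starling-LM-7B-beta-AWQ")
```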
OK, the 'Nexusflow/Starling-LM-7B-beta' model is in the AWQ quant queue now.
"Nous Hermes 2 - Mistral 7B - DPO" is a fine-tune originally based on Mistral-7B-v0.1, which has an 8k-token context. Only the newer Mistral-7B-v0.2 has a 32k context.
I tried the EagleX on CPU today. Incredibly slow.
The original Mistral model being limited to a 16k context with a 4k sliding window does not mean that fine-tuned variants have the same limitation. This Nous Hermes 2 Pro handles up to 32k context.
I have only been able to use it with a 16k context, due to a VRAM limitation. Maybe check some examples of Llama with 128k context to learn more about how these authors are widening the default context window.
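To put a rough number on the VRAM side of this, here is a back-of-the-envelope sketch of the KV cache size at different context lengths. It assumes the Mistral-7B shape (32 layers, 8 key/value heads, head dimension 128) and a 16-bit cache. It ignores the model weights, activations, and any savings from a sliding-window cache, and serving frameworks may preallocate differently, so treat it as an estimate only.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_value=2):
    """Approximate KV cache size in bytes for one sequence of seq_len tokens."""
    # Factor 2 accounts for storing both keys and values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len


for tokens in (8192, 16384, 32768):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens:>6} tokens: {gib:.1f} GiB")
# Prints 1.0 GiB, 2.0 GiB and 4.0 GiB for the three lengths.
```

Under these assumptions, going from 16k to 32k tokens costs about 2 GiB more per sequence on top of the weights, which is easy to run out of on a smaller card.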
The Starling quant is on its way. Uploading the AWQ now: https://huggingface.co/solidrust/Starling-LM-7B-beta-AWQ
Hermes-2-Pro-Mistral-7B is interesting, but I suspect that for chat without functions, the DPO version will be better.
You were right, the Starling-LM-7B-beta-AWQ is not that good. It sounds very ChatGPT-like and does not follow instructions. I am testing the Hermes-2-Pro-Mistral-7B.