How to avoid running into memory/storage problems associated with HF Spaces while using tiiuae/falcon-7b, 40b, etc.
Please note that I did not encounter the problems explained here with the many other LLMs I tried. I am trying to host an app using this model (in fact, I tried the 40b and instruct models as well). While the container is being built, it runs into memory/storage issues related to the HF Spaces free account.
The first problem is this error:
"ValueError: The current `device_map` had weights offloaded to the disk. Please provide an `offload_folder` for them. Alternatively, make sure you have `safetensors` installed if the model you are using offers the weights in this format."
So I installed `safetensors` and tried again, but the problem persists. I therefore assume the Falcon models are not published as safetensors (I hope someone can confirm). When I pass `offload_folder="offload"` to `AutoModelForCausalLM.from_pretrained`, it does seem to work, but it then runs into a memory issue while loading the checkpoint shards.
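For anyone who wants to check the safetensors question directly rather than assume, here is a minimal sketch. It relies on `huggingface_hub.list_repo_files`, which needs network access; the helper names `safetensors_files` and `repo_has_safetensors` and the sample file lists are made up for illustration and are not the actual contents of the Falcon repos.

```python
from typing import Iterable, List


def safetensors_files(filenames: Iterable[str]) -> List[str]:
    """Return only the names that look like safetensors weight files."""
    return [name for name in filenames if name.endswith(".safetensors")]


def repo_has_safetensors(repo_id: str) -> bool:
    """Ask the Hub for the repo's file list (requires network access)."""
    # Imported lazily so the pure helper above works without huggingface_hub.
    from huggingface_hub import list_repo_files

    return bool(safetensors_files(list_repo_files(repo_id)))


# Hypothetical listings, only to show what the helper looks at.
pytorch_only = ["config.json", "pytorch_model-00001-of-00002.bin"]
with_safetensors = ["config.json", "model-00001-of-00002.safetensors"]
print(safetensors_files(pytorch_only))
print(safetensors_files(with_safetensors))

# To check the real repo, run: repo_has_safetensors("tiiuae/falcon-7b")
```

If the check returns `False`, installing `safetensors` cannot change the loading path, which would be consistent with the error persisting.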
- When I perform the above step with the 40B model, it actually runs out of the 50G storage space limit.
I would appreciate it if someone could help with some suggestions here.
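A rough back-of-the-envelope estimate may explain the storage failure with the 40B model. The sketch below multiplies a nominal parameter count by the bytes per parameter for a few dtypes. The parameter counts are taken from the model names (7e9 and 40e9) and are approximations, and the assumption that the checkpoints are stored at 2 bytes per parameter is not confirmed in this thread.

```python
# Nominal parameter counts read off the model names (approximate, an assumption).
NOMINAL_PARAMS = {"falcon-7b": 7e9, "falcon-40b": 40e9}

# Storage cost per parameter for common dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_size_gb(n_params: float, dtype: str) -> float:
    """Approximate size of the raw weights in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for name, n_params in NOMINAL_PARAMS.items():
    for dtype in BYTES_PER_PARAM:
        size = weight_size_gb(n_params, dtype)
        print(f"{name:11s} {dtype:9s} ~{size:6.1f} GB")
```

Under these assumptions the 40B weights come to roughly 80 GB at 2 bytes per parameter, which is already above a 50G limit before any offload folder is written. The 7B weights come to roughly 14 GB at 2 bytes per parameter and 28 GB at 4.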
You might want to load the model this way:
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", offload_folder="offload")
That is how I have loaded the model already:
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", offload_folder="offload", trust_remote_code=True)
Any solution to this problem please?
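One thing the call above leaves at its default is the weight dtype. In transformers versions where `from_pretrained` defaults to float32, passing `torch_dtype="auto"` asks the library to use the dtype recorded in the checkpoint instead, which may reduce peak memory if the checkpoint is stored in a 16-bit format. This is a sketch under that assumption, not a confirmed fix for this thread. The helper names `build_load_kwargs` and `load_model` are hypothetical, `device_map="auto"` requires the `accelerate` package, and newer transformers releases may prefer the `dtype` argument name.

```python
from typing import Any, Dict


def build_load_kwargs(offload_dir: str = "offload") -> Dict[str, Any]:
    """Keyword arguments for from_pretrained, kept separate so they are easy to inspect."""
    return {
        "device_map": "auto",           # let accelerate place layers on GPU/CPU/disk
        "offload_folder": offload_dir,  # where disk-offloaded weights are written
        "torch_dtype": "auto",          # use the checkpoint's dtype, not float32
        "trust_remote_code": True,      # same flag as in the original call
    }


def load_model(checkpoint: str = "tiiuae/falcon-7b"):
    """Load the model; this downloads the weights and needs transformers + accelerate."""
    # Imported lazily so build_load_kwargs can be used without transformers installed.
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(checkpoint, **build_load_kwargs())


print(build_load_kwargs())
```

Whether the resulting footprint fits a free Space is not established here. For the 40B model, the download size alone is the more likely blocker.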
Hello Vinayaru, did you find a solution to the problem? I have the same error, using the same model.
Best regards
Noureddine
