Instructions to use masonbarnes/open-llm-search with libraries, inference providers, notebooks, and local apps.

Libraries

How to use masonbarnes/open-llm-search with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="masonbarnes/open-llm-search", trust_remote_code=True)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("masonbarnes/open-llm-search", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("masonbarnes/open-llm-search", trust_remote_code=True)

Local Apps

vLLM

How to use masonbarnes/open-llm-search with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "masonbarnes/open-llm-search"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "masonbarnes/open-llm-search",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
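
The same request can be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_completion_request` and `complete` are illustrative and not part of vLLM, and the response parsing assumes the server returns the usual OpenAI-style completions shape (`choices[0].text`).

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://localhost:8000",
                             model="masonbarnes/open-llm-search",
                             max_tokens=512, temperature=0.5):
    """Build a POST request mirroring the curl call above."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the text of the first completion.

    Assumes an OpenAI-style response body: {"choices": [{"text": ...}]}.
    """
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Example (requires the server started above to be running):
# print(complete("Once upon a time,"))
```

Passing `base_url="http://localhost:30000"` would point the same helper at a server listening on port 30000 instead.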

Use Docker

docker model run hf.co/masonbarnes/open-llm-search

SGLang

How to use masonbarnes/open-llm-search with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "masonbarnes/open-llm-search" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "masonbarnes/open-llm-search",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "masonbarnes/open-llm-search" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "masonbarnes/open-llm-search",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use masonbarnes/open-llm-search with Docker Model Runner:
```
docker model run hf.co/masonbarnes/open-llm-search
```

Model Overview

As demand for large language models grows, a common limitation surfaces: they cannot search the internet directly. Companies such as Google (with Bard), Microsoft (with Bing), and Perplexity are addressing this challenge, but their proprietary services raise data-logging concerns.

Introducing Open LLM Search: a specialized adaptation of Together AI's llama-2-7b-32k model, purpose-built for extracting information from web pages. Although the model has only 7 billion parameters, its fine-tuning and expanded context limit enable it to excel in search tasks.

License: This model uses Meta's Llama 2 license.

Fine-Tuning Process

The model was fine-tuned on synthetic data generated with a combination of GPT-4 and GPT-4-32k. The training workflow was as follows:

1. Use GPT-4 to generate a multitude of queries.
2. For each query, identify the top five website results from Google.
3. Extract content from these websites and use GPT-4-32k to summarize it.
4. Record the text and summaries from GPT-4-32k for fine-tuning.
5. Feed the summaries from all five sources to GPT-4 to craft a cohesive response.
6. Document both the input and output from GPT-4 for fine-tuning.
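
The workflow above can be sketched as a small data-collection loop. This is an illustration only, not the author's actual pipeline: every function name here is hypothetical, the GPT-4, GPT-4-32k, and Google calls are passed in as plain callables, and the choice of what to store as each example's input is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TrainingExample:
    query: str
    input_text: str   # what the teacher model was given (assumed layout)
    output_text: str  # what the teacher model produced


def build_examples(
    generate_queries: Callable[[int], List[str]],  # e.g. backed by GPT-4
    search: Callable[[str], List[str]],            # query -> ranked result URLs
    fetch: Callable[[str], str],                   # URL -> extracted page text
    summarize: Callable[[str, str], str],          # (query, page) -> summary, e.g. GPT-4-32k
    answer: Callable[[str, List[str]], str],       # (query, summaries) -> response, e.g. GPT-4
    num_queries: int = 10,
    top_k: int = 5,
) -> List[TrainingExample]:
    """Collect summarization and answer examples for fine-tuning."""
    examples: List[TrainingExample] = []
    for query in generate_queries(num_queries):
        summaries: List[str] = []
        # Summarize each of the top results and record page/summary pairs.
        for url in search(query)[:top_k]:
            page = fetch(url)
            summary = summarize(query, page)
            examples.append(TrainingExample(query, page, summary))
            summaries.append(summary)
        # Combine all summaries into one response and record that pair too.
        response = answer(query, summaries)
        examples.append(TrainingExample(query, "\n\n".join(summaries), response))
    return examples
```

With five results per query, each query yields six examples: five page-to-summary pairs and one summaries-to-response pair.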

Fine-tuning used a prompt format built from <instructions>:, <user>:, and <assistant>: markers.
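
A prompt in this format might be assembled as below. This is a sketch under assumptions: the source names only the three markers, so the order, the single space after each colon, and the newline separators are guesses, and `build_prompt` and `extract_reply` are hypothetical helpers. Check the GitHub repository for the exact template before relying on it.

```python
def build_prompt(instructions: str, user: str) -> str:
    """Assemble a prompt using the three markers named above.

    Separator and whitespace choices are assumptions, not confirmed
    by the model card.
    """
    return (
        f"<instructions>: {instructions}\n"
        f"<user>: {user}\n"
        f"<assistant>:"
    )


def extract_reply(generated: str) -> str:
    """Return the text after the last <assistant>: marker, trimmed."""
    return generated.rsplit("<assistant>:", 1)[-1].strip()
```

The resulting string can be passed as the prompt to any of the serving options above, and `extract_reply` applied to output that echoes the prompt.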

Getting Started

Experience it firsthand! Check out the live demo here.
For DIY enthusiasts, explore or self-deploy this solution using our GitHub repository.
