---
title: vLLM
---

[vLLM](https://docs.vllm.ai/) is a high-performance inference and serving engine for large language models. It is designed to maximize throughput and memory efficiency, which makes it well suited to running models locally.

## Prerequisites

1. Install vLLM:

```bash
pip install vllm
```

2. Start the vLLM server:

```bash
# For testing with a small model
vllm serve microsoft/DialoGPT-medium --port 8000

# For production with a larger model (requires GPU)
vllm serve Qwen/Qwen2.5-32B-Instruct --port 8000
```

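Large models can take a while to load, so it can help to wait until the server reports healthy before sending requests. The following is a minimal sketch, not part of mem0 or vLLM: the helper names `health_url`, `wait_for_vllm`, and `_http_probe` are hypothetical, and it assumes the server exposes `/health` at the server root (without the `/v1` suffix), as in the troubleshooting command further down.

```python
import time
import urllib.error
import urllib.request


def health_url(server_root: str) -> str:
    """Build the health-check URL from the server root (no /v1 suffix)."""
    return server_root.rstrip("/") + "/health"


def _http_probe(url: str) -> bool:
    """Return True if the URL answers with HTTP 200, False on any network error."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_for_vllm(server_root="http://localhost:8000", timeout_s=300.0,
                  interval_s=2.0, probe=None) -> bool:
    """Poll the health endpoint until it succeeds or timeout_s elapses.

    `probe` is an injectable callable (url -> bool) so the loop can be
    exercised without a running server; it defaults to a real HTTP request.
    """
    probe = probe or _http_probe
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(health_url(server_root)):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_for_vllm()` before `Memory.from_config(config)` avoids connection errors while the model is still loading. The default timeout of 300 seconds is an arbitrary choice; adjust it to the model size.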
## Usage

```python
import os
from mem0 import Memory

os.environ["OPENAI_API_KEY"] = "your-api-key"  # used for embedding model

config = {
    "llm": {
        "provider": "vllm",
        "config": {
            "model": "Qwen/Qwen2.5-32B-Instruct",
            "vllm_base_url": "http://localhost:8000/v1",
            "temperature": 0.1,
            "max_tokens": 2000,
        }
    }
}

m = Memory.from_config(config)
messages = [
    {"role": "user", "content": "I'm planning to watch a movie tonight. Any recommendations?"},
    {"role": "assistant", "content": "How about thriller movies? They can be quite engaging."},
    {"role": "user", "content": "I'm not a big fan of thrillers, but I love sci-fi movies."},
    {"role": "assistant", "content": "Got it! I'll avoid thrillers and suggest sci-fi movies instead."}
]
m.add(messages, user_id="alice", metadata={"category": "movies"})
```

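Once memories are stored, they can be read back with a search. The helper below is a small sketch for pulling the memory text out of a search result. The name `memory_texts` is hypothetical, and the result shape is an assumption: depending on the mem0 version, a search may return either a list of dicts or a dict with a `"results"` list, where each item carries a `"memory"` string.

```python
def memory_texts(search_result):
    """Return the stored memory strings from a search result.

    Assumption: the result is either a list of dicts or a dict of the form
    {"results": [...]}, and each item holds its text under the "memory" key.
    Items without that key are skipped.
    """
    if isinstance(search_result, dict):
        items = search_result.get("results", [])
    else:
        items = search_result
    return [item["memory"] for item in items if "memory" in item]
```

With the `m` object from the example above, a call such as `memory_texts(m.search("What kind of movies does she like?", user_id="alice"))` would list the matching memories as plain strings.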
## Configuration Parameters

| Parameter       | Description                                  | Default                       | Environment Variable |
| --------------- | -------------------------------------------- | ----------------------------- | -------------------- |
| `model`         | Model name running on the vLLM server        | `"Qwen/Qwen2.5-32B-Instruct"` | -                    |
| `vllm_base_url` | vLLM server URL                              | `"http://localhost:8000/v1"`  | `VLLM_BASE_URL`      |
| `api_key`       | API key (placeholder value for local server) | `"vllm-api-key"`              | `VLLM_API_KEY`       |
| `temperature`   | Sampling temperature                         | `0.1`                         | -                    |
| `max_tokens`    | Maximum tokens to generate                   | `2000`                        | -                    |

## Environment Variables

You can set these environment variables instead of specifying them in config:

```bash
export VLLM_BASE_URL="http://localhost:8000/v1"
export VLLM_API_KEY="your-vllm-api-key"
export OPENAI_API_KEY="your-openai-api-key"  # for embeddings
```

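If you prefer to make the fallback order explicit in your own code, the sketch below builds the `llm` config from environment variables and the documented defaults. The function name `vllm_llm_config` is hypothetical, and the precedence it applies (explicit overrides, then environment variables, then defaults) is this sketch's own choice rather than a statement about how mem0 resolves values internally.

```python
import os


def vllm_llm_config(env=None, **overrides):
    """Build the "llm" section of a mem0 config for the vLLM provider.

    Precedence in this sketch: keyword overrides > environment > defaults.
    `env` is any mapping of variable names to values; it defaults to os.environ.
    """
    env = os.environ if env is None else env
    config = {
        "model": "Qwen/Qwen2.5-32B-Instruct",
        "vllm_base_url": env.get("VLLM_BASE_URL", "http://localhost:8000/v1"),
        "api_key": env.get("VLLM_API_KEY", "vllm-api-key"),
        "temperature": 0.1,
        "max_tokens": 2000,
    }
    config.update(overrides)  # explicit arguments win over everything else
    return {"llm": {"provider": "vllm", "config": config}}
```

The returned dict can be passed to `Memory.from_config`, for example `Memory.from_config(vllm_llm_config(temperature=0.0))`.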
## Benefits

- High Performance: The vLLM project reports up to 24x higher throughput than Hugging Face Transformers; actual gains depend on model, hardware, and workload
- Memory Efficient: PagedAttention reduces memory wasted on the attention key-value cache
- Local Deployment: Keep your data private and reduce API costs
- Easy Integration: Drop-in replacement for other LLM providers
- Flexible: Works with any model supported by vLLM

## Troubleshooting

1. Server not responding: Make sure the vLLM server is running

```bash
curl http://localhost:8000/health
```

2. 404 errors: Make sure the base URL ends in `/v1`

```python
"vllm_base_url": "http://localhost:8000/v1"  # Note the /v1
```

3. Model not found: Make sure the `model` value in the config matches the model name the server was started with

4. Out of memory: Try a smaller model or reduce the maximum context length with `--max-model-len`

```bash
vllm serve Qwen/Qwen2.5-32B-Instruct --max-model-len 4096
```

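The base URL and model name checks (items 2 and 3) can also be done from Python. This is a rough sketch with hypothetical helper names (`base_url_problems`, `served_model_ids`, `fetch_served_model_ids`). It assumes the server's `/v1/models` endpoint returns the OpenAI-style list format (`{"data": [{"id": ...}]}`) and that the server was started without an API key requirement.

```python
import json
import urllib.request
from urllib.parse import urlparse


def base_url_problems(base_url):
    """Return a list of human-readable problems with a vLLM base URL."""
    problems = []
    parsed = urlparse(base_url)
    if parsed.scheme not in ("http", "https"):
        problems.append("URL should start with http:// or https://")
    if not parsed.path.rstrip("/").endswith("/v1"):
        problems.append("URL path should end in /v1")
    return problems


def served_model_ids(models_payload):
    """Extract model names from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def fetch_served_model_ids(base_url, timeout_s=5.0):
    """Ask the server which models it serves (needs a running server)."""
    url = base_url.rstrip("/") + "/models"
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return served_model_ids(json.load(resp))
```

If `base_url_problems(url)` returns an empty list and your configured `model` appears in `fetch_served_model_ids(url)`, the 404 and "model not found" cases are unlikely to be the cause.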
## Config

All available parameters for the `vllm` config are listed in the [Master List of All Params in Config](../config).

