---
title: vLLM
---

[vLLM](https://docs.vllm.ai/) is a high-performance inference engine for large language models, designed to maximize throughput and memory efficiency when serving LLMs locally.

## Prerequisites

1. **Install vLLM**:

```bash
pip install vllm
```

2. **Start vLLM server**:

```bash
# For testing with a small model
vllm serve microsoft/DialoGPT-medium --port 8000

# For production with a larger model (requires GPU)
vllm serve Qwen/Qwen2.5-32B-Instruct --port 8000
```

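Loading a large model can take a while, so it may help to wait until the server answers before using it. The following is a minimal sketch that polls the `/health` endpoint used in the Troubleshooting section; the helper name `wait_for_vllm` and its timeout parameters are illustrative and not part of vLLM or mem0.

```python
import time
import urllib.error
import urllib.request


def wait_for_vllm(base_url="http://localhost:8000", timeout_s=300.0, interval_s=2.0):
    """Poll GET {base_url}/health until it returns HTTP 200 or the timeout expires.

    Hypothetical helper: returns True once the server answers, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    health_url = base_url.rstrip("/") + "/health"
    while True:
        try:
            with urllib.request.urlopen(health_url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not reachable yet (connection refused, timeout, non-2xx)
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example: block until the server started above is reachable.
# ready = wait_for_vllm("http://localhost:8000")
```

Note that the health check uses the server root, without the `/v1` suffix that the mem0 config expects.
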
## Usage

```python
import os
from mem0 import Memory

os.environ["OPENAI_API_KEY"] = "your-api-key"  # used for embedding model

config = {
    "llm": {
        "provider": "vllm",
        "config": {
            "model": "Qwen/Qwen2.5-32B-Instruct",
            "vllm_base_url": "http://localhost:8000/v1",
            "temperature": 0.1,
            "max_tokens": 2000,
        }
    }
}

m = Memory.from_config(config)
messages = [
    {"role": "user", "content": "I'm planning to watch a movie tonight. Any recommendations?"},
    {"role": "assistant", "content": "How about thriller movies? They can be quite engaging."},
    {"role": "user", "content": "I'm not a big fan of thrillers, but I love sci-fi movies."},
    {"role": "assistant", "content": "Got it! I'll avoid thrillers and suggest sci-fi movies instead."}
]
m.add(messages, user_id="alice", metadata={"category": "movies"})
```

## Configuration Parameters

| Parameter       | Description                       | Default                       | Environment Variable |
| --------------- | --------------------------------- | ----------------------------- | -------------------- |
| `model`         | Model name running on vLLM server | `"Qwen/Qwen2.5-32B-Instruct"` | -                    |
| `vllm_base_url` | vLLM server URL                   | `"http://localhost:8000/v1"`  | `VLLM_BASE_URL`      |
| `api_key`       | API key (dummy for local)         | `"vllm-api-key"`              | `VLLM_API_KEY`       |
| `temperature`   | Sampling temperature              | `0.1`                         | -                    |
| `max_tokens`    | Maximum tokens to generate        | `2000`                        | -                    |

## Environment Variables

You can set these environment variables instead of specifying them in config:

```bash
export VLLM_BASE_URL="http://localhost:8000/v1"
export VLLM_API_KEY="your-vllm-api-key"
export OPENAI_API_KEY="your-openai-api-key" # for embeddings
```

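To make the interplay between explicit config values, environment variables, and the defaults from the table concrete, here is a small sketch that assembles the `llm` block by hand. The helper `build_vllm_llm_config` is hypothetical, and the precedence it applies (explicit argument, then environment variable, then default) is an assumption for illustration rather than a statement of how mem0 resolves these values internally.

```python
import os


def build_vllm_llm_config(model=None, base_url=None, api_key=None,
                          temperature=0.1, max_tokens=2000):
    """Return the "llm" section of a mem0 config for the vllm provider.

    Hypothetical helper. Assumed precedence: explicit argument,
    then environment variable, then the default listed in the table.
    """
    return {
        "provider": "vllm",
        "config": {
            "model": model or "Qwen/Qwen2.5-32B-Instruct",
            "vllm_base_url": base_url
            or os.environ.get("VLLM_BASE_URL", "http://localhost:8000/v1"),
            "api_key": api_key
            or os.environ.get("VLLM_API_KEY", "vllm-api-key"),
            "temperature": temperature,
            "max_tokens": max_tokens,
        },
    }


# The result has the same shape as the config passed to Memory.from_config.
config = {"llm": build_vllm_llm_config()}
```

Building the block explicitly like this keeps the effective values visible in one place, which can help when debugging a deployment that mixes config files and exported variables.
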
## Benefits

- **High Performance**: The vLLM project reports up to 24x higher throughput than Hugging Face Transformers
- **Memory Efficient**: Optimized memory usage with PagedAttention
- **Local Deployment**: Keep your data private and reduce API costs
- **Easy Integration**: Switch from another LLM provider by changing only the `llm` block of the config
- **Flexible**: Works with any model supported by vLLM

## Troubleshooting

1. **Server not responding**: Make sure the vLLM server is running. The health endpoint returns HTTP 200 once it is up.

```bash
curl http://localhost:8000/health
```

2. **404 errors**: Ensure the base URL ends with `/v1`

```python
"vllm_base_url": "http://localhost:8000/v1"  # Note the /v1
```

3. **Model not found**: Check that `model` in the config matches the model name the server was started with

4. **Out of memory**: Try a smaller model or reduce the context length with `--max-model-len`

```bash
vllm serve Qwen/Qwen2.5-32B-Instruct --max-model-len 4096
```

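The 404 and model-name checks above can also be scripted. The sketch below assumes the server exposes the OpenAI-compatible `GET /v1/models` listing; the helper names `normalize_base_url` and `served_models` are illustrative and not part of mem0 or vLLM.

```python
import json
import urllib.request


def normalize_base_url(url):
    """Ensure the URL ends with /v1, the prefix of the OpenAI-compatible API."""
    url = url.rstrip("/")
    return url if url.endswith("/v1") else url + "/v1"


def served_models(base_url, api_key="vllm-api-key"):
    """Return the model ids reported by GET {base_url}/models."""
    req = urllib.request.Request(
        normalize_base_url(base_url) + "/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]


# Example: confirm the configured model is the one being served.
# assert "Qwen/Qwen2.5-32B-Instruct" in served_models("http://localhost:8000")
```

If the configured `model` is missing from the returned list, the "Model not found" error is the likely result; copy the id from the list into the config.
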
## Config

All available parameters for the `vllm` config are present in [Master List of All Params in Config](../config).