# Mem0: Building Production‑Ready AI Agents with Scalable Long‑Term Memory

[![arXiv](https://img.shields.io/badge/arXiv-Paper-b31b1b.svg)](https://arxiv.org/abs/2504.19413)
[![Website](https://img.shields.io/badge/Website-Project-blue)](https://mem0.ai/research)

This repository contains the code and dataset for our paper: Mem0: Building Production‑Ready AI Agents with Scalable Long‑Term Memory.

## 📋 Overview

This project evaluates Mem0 and compares it with different memory and retrieval techniques for AI systems:

1. Established LOCOMO Benchmarks: We evaluate against five established approaches from the literature: LoCoMo, ReadAgent, MemoryBank, MemGPT, and A-Mem.
2. Open-Source Memory Solutions: We test promising open-source memory architectures including LangMem, which provides flexible memory management capabilities.
3. RAG Systems: We implement Retrieval-Augmented Generation with various configurations, testing different chunk sizes and retrieval counts to optimize performance.
4. Full-Context Processing: We examine the effectiveness of passing the entire conversation history within the context window of the LLM as a baseline approach.
5. Proprietary Memory Systems: We evaluate OpenAI's built-in memory feature available in their ChatGPT interface to compare against commercial solutions.
6. Third-Party Memory Providers: We incorporate Zep, a specialized memory management platform designed for AI agents, to assess the performance of dedicated memory infrastructure.

We test these techniques on the LOCOMO dataset, which contains conversational data with various question types to evaluate memory recall and understanding.

## 🔍 Dataset

The LOCOMO dataset used in our experiments can be downloaded from our Google Drive repository:

[Download LOCOMO Dataset](https://drive.google.com/drive/folders/1L-cTjTm0ohMsitsHg4dijSPJtqNflwX-?usp=drive_link)

The dataset contains conversational data specifically designed to test memory recall and understanding across various question types and complexity levels.

Place the dataset files in the `dataset/` directory:
- `locomo10.json`: Original dataset
- `locomo10_rag.json`: Dataset formatted for RAG experiments

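Before running anything, it can help to confirm that both files are where the scripts expect them. The following is a minimal sketch, not part of the repository: the helper name `check_dataset` is hypothetical, and it assumes nothing about the JSON schema beyond the files being valid JSON.

```python
import json
from pathlib import Path

# File names come from the list above; the directory is relative to the repo root.
EXPECTED_FILES = ["locomo10.json", "locomo10_rag.json"]


def check_dataset(dataset_dir="dataset"):
    """Return a {filename: status} report for the expected dataset files."""
    report = {}
    for name in EXPECTED_FILES:
        path = Path(dataset_dir) / name
        if not path.is_file():
            report[name] = "missing"
            continue
        try:
            with path.open(encoding="utf-8") as fh:
                data = json.load(fh)
        except json.JSONDecodeError:
            report[name] = "invalid JSON"
            continue
        # Only the top-level container type is reported; no schema is assumed.
        report[name] = f"ok (top-level {type(data).__name__})"
    return report


if __name__ == "__main__":
    for name, status in check_dataset().items():
        print(f"{name}: {status}")
```

Running it from the repository root prints one status line per file, which makes a missing or truncated download easy to spot.
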
## 📁 Project Structure

```
.
├── src/                  # Source code for different memory techniques
│   ├── mem0/             # Implementation of the Mem0 technique
│   ├── openai/           # Implementation of the OpenAI memory
│   ├── zep/              # Implementation of the Zep memory
│   ├── rag.py            # Implementation of the RAG technique
│   └── langmem.py        # Implementation of the Language-based memory
├── metrics/              # Code for evaluation metrics
├── results/              # Results of experiments
├── dataset/              # Dataset files
├── evals.py              # Evaluation script
├── run_experiments.py    # Script to run experiments
├── generate_scores.py    # Script to generate scores from results
└── prompts.py            # Prompts used for the models
```

## 🚀 Getting Started

### Prerequisites

Create a `.env` file with your API keys and configurations. The following keys are required:

```
# OpenAI API key for GPT models and embeddings
OPENAI_API_KEY="your-openai-api-key"

# Mem0 API keys (for Mem0 and Mem0+ techniques)
MEM0_API_KEY="your-mem0-api-key"
MEM0_PROJECT_ID="your-mem0-project-id"
MEM0_ORGANIZATION_ID="your-mem0-organization-id"

# Model configuration
MODEL="gpt-4o-mini" # or your preferred model
EMBEDDING_MODEL="text-embedding-3-small" # or your preferred embedding model

# Zep API key (for Zep experiments)
ZEP_API_KEY="api-key-from-zep"
```

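A quick way to catch a missing key before a long run is to check the environment up front. This is a minimal sketch, not code from the repository: the helper `missing_keys` is hypothetical, and it assumes the `.env` values have already been loaded into the process environment (how the scripts load `.env` is not stated here).

```python
import os

# Key names are taken from the .env example above.
REQUIRED_KEYS = [
    "OPENAI_API_KEY",
    "MEM0_API_KEY",
    "MEM0_PROJECT_ID",
    "MEM0_ORGANIZATION_ID",
    "MODEL",
    "EMBEDDING_MODEL",
    "ZEP_API_KEY",
]


def missing_keys(env=None):
    """Return the required keys that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing configuration keys: " + ", ".join(missing))
    else:
        print("All required keys are set.")
```

Passing a plain dictionary instead of relying on `os.environ` makes the check easy to exercise in isolation.
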
### Running Experiments

You can run experiments using the provided Makefile commands:

#### Memory Techniques

```bash
# Run Mem0 experiments
make run-mem0-add # Add memories using Mem0
make run-mem0-search # Search memories using Mem0

# Run Mem0+ experiments (with graph-based search)
make run-mem0-plus-add # Add memories using Mem0+
make run-mem0-plus-search # Search memories using Mem0+

# Run RAG experiments
make run-rag # Run RAG with chunk size 500
make run-full-context # Run RAG with full context

# Run LangMem experiments
make run-langmem # Run LangMem

# Run Zep experiments
make run-zep-add # Add memories using Zep
make run-zep-search # Search memories using Zep

# Run OpenAI experiments
make run-openai # Run OpenAI experiments
```

Alternatively, you can run experiments directly with custom parameters:

```bash
python run_experiments.py --technique_type [mem0|rag|langmem] [additional parameters]
```

#### Command-line Parameters:

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--technique_type` | Memory technique to use (mem0, rag, langmem) | mem0 |
| `--method` | Method to use (add, search) | add |
| `--chunk_size` | Chunk size for processing | 1000 |
| `--top_k` | Number of top memories to retrieve | 30 |
| `--filter_memories` | Whether to filter memories | False |
| `--is_graph` | Whether to use graph-based search | False |
| `--num_chunks` | Number of chunks to process for RAG | 1 |

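To make the table concrete, here is a sketch of an `argparse` parser with the same flags and defaults. It is an illustration only and may differ from what `run_experiments.py` actually does; in particular, treating `--filter_memories` and `--is_graph` as on/off switches is an assumption.

```python
import argparse


def build_parser():
    """Build a parser whose flags and defaults mirror the parameter table."""
    parser = argparse.ArgumentParser(description="Illustrative experiment options")
    parser.add_argument("--technique_type", type=str, default="mem0",
                        help="Memory technique to use (mem0, rag, langmem)")
    parser.add_argument("--method", type=str, default="add",
                        help="Method to use (add, search)")
    parser.add_argument("--chunk_size", type=int, default=1000,
                        help="Chunk size for processing")
    parser.add_argument("--top_k", type=int, default=30,
                        help="Number of top memories to retrieve")
    # Assumption: the two boolean options are plain switches that default to False.
    parser.add_argument("--filter_memories", action="store_true",
                        help="Whether to filter memories")
    parser.add_argument("--is_graph", action="store_true",
                        help="Whether to use graph-based search")
    parser.add_argument("--num_chunks", type=int, default=1,
                        help="Number of chunks to process for RAG")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--technique_type", "rag", "--chunk_size", "500"])
print(vars(args))
```

Any flag left out of the argument list falls back to the default shown in the table.
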
### 📊 Evaluation

To evaluate results, run:

```bash
python evals.py --input_file [path_to_results] --output_file [output_path]
```

This script:
1. Processes each question-answer pair
2. Calculates BLEU and F1 scores automatically
3. Uses an LLM judge to evaluate answer correctness
4. Saves the combined results to the output file

### 📈 Generating Scores

Generate final scores with:

```bash
python generate_scores.py
```

This script:
1. Loads the evaluation metrics data
2. Calculates mean scores for each category (BLEU, F1, LLM)
3. Reports the number of questions per category
4. Calculates overall mean scores across all categories

Example output:
```
Mean Scores Per Category:
          bleu_score  f1_score  llm_score  count
category
1             0.xxxx    0.xxxx     0.xxxx     xx
2             0.xxxx    0.xxxx     0.xxxx     xx
3             0.xxxx    0.xxxx     0.xxxx     xx

Overall Mean Scores:
bleu_score    0.xxxx
f1_score      0.xxxx
llm_score     0.xxxx
```

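The aggregation that `generate_scores.py` is described as performing can be sketched with the standard library. This is not the script itself: the record layout (one dictionary per question with `category`, `bleu_score`, `f1_score`, and `llm_score` keys) is a hypothetical input format, and averaging the overall scores over all questions rather than over category means is an assumption.

```python
from collections import defaultdict
from statistics import mean

METRICS = ("bleu_score", "f1_score", "llm_score")


def summarize(rows):
    """Return (per-category means with counts, overall means) for scored rows."""
    by_category = defaultdict(list)
    for row in rows:
        by_category[row["category"]].append(row)

    per_category = {}
    for category in sorted(by_category):
        items = by_category[category]
        summary = {m: mean([item[m] for item in items]) for m in METRICS}
        summary["count"] = len(items)
        per_category[category] = summary

    # Assumption: overall means are taken over all questions,
    # not over the per-category means.
    overall = {m: mean([row[m] for row in rows]) for m in METRICS}
    return per_category, overall


# Made-up rows, used only to show the shape of the input and output.
example_rows = [
    {"category": 1, "bleu_score": 0.5, "f1_score": 0.5, "llm_score": 1},
    {"category": 1, "bleu_score": 0.0, "f1_score": 0.25, "llm_score": 0},
    {"category": 2, "bleu_score": 1.0, "f1_score": 0.75, "llm_score": 1},
]
per_category, overall = summarize(example_rows)
print(per_category)
print(overall)
```

Keeping the question count next to each category mean makes it clear how much weight each category carries in the overall figure.
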
## 📏 Evaluation Metrics

We use several metrics to evaluate the performance of different memory techniques:

1. BLEU Score: Measures the n-gram overlap between the model's response and the ground truth.
2. F1 Score: The harmonic mean of precision and recall between the response and the ground truth.
3. LLM Score: A binary score (0 or 1) assigned by an LLM judge that evaluates whether the response is correct.
4. Token Consumption: The number of tokens required to generate the final answer.
5. Latency: The time spent on search and on generating the response.

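As an illustration of the F1 metric, the sketch below computes a token-level F1 between a response and a reference answer. The exact tokenization and normalization used in `metrics/` are not described here, so lowercasing and whitespace splitting are assumptions, and `token_f1` is a hypothetical name.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-level F1 between two strings, using lowercase whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # With an empty side, score 1.0 only when both sides are empty.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("the red car", "a red car"))
```

In this example two of the three tokens match on each side, so precision and recall are both 2/3 and the F1 score is also 2/3.
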
## 📚 Citation

If you use this code or dataset in your research, please cite our paper:

```bibtex
@article{mem0,
  title={Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory},
  author={Chhikara, Prateek and Khant, Dev and Aryan, Saket and Singh, Taranjeet and Yadav, Deshraj},
  journal={arXiv preprint arXiv:2504.19413},
  year={2025}
}
```

## 📄 License

[MIT License](LICENSE)

## 👥 Contributors

- [Prateek Chhikara](https://github.com/prateekchhikara)
- [Dev Khant](https://github.com/Dev-Khant)
- [Saket Aryan](https://github.com/whysosaket)
- [Taranjeet Singh](https://github.com/taranjeet)
- [Deshraj Yadav](https://github.com/deshraj)
