AI Memory & Agent Glossary

Technical definitions for AI memory, vector search, and agent architectures. Built for engineers.

Context Engineering
The systematic discipline of designing, structuring, and managing the information provided to a Large Language Model (LLM) to maximize its reasoning capability and accuracy.
Agentic Engineering
The discipline of designing, orchestrating, and evaluating systems where AI agents autonomously plan and execute multi-step workflows.
Flow Engineering
The practice of designing deterministic, multi-step architectures where an LLM's output is iteratively refined, tested, and corrected through a predefined pipeline.
Test-Time Compute (TTC)
The practice of allocating more computational resources during inference (the 'test time') to allow an AI model to perform internal reasoning, planning, and self-correction before outputting an answer.
Model Context Protocol (MCP)
An open-source, standardized protocol that acts as a universal API allowing AI models to securely connect to, read from, and interact with external data sources and tools.
Semantic Caching
An optimization technique that stores previous LLM responses and uses vector similarity to return a cached answer for any new query whose semantic meaning is close enough to a stored one, bypassing the LLM entirely.
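A minimal sketch of the lookup-before-generate pattern, in plain Python. The `embed` function here is a toy bag-of-words stand-in for a real embedding model, and the 0.85 threshold is an illustrative choice:

```python
import math

def embed(text: str) -> list:
    # Toy bag-of-words embedding (stands in for a real embedding model).
    vocab = ["refund", "policy", "return", "item", "weather"]
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.85):
        self.threshold = threshold
        self.entries = []  # list of (query_vector, cached_response)

    def get(self, query):
        qv = embed(query)
        best = max(self.entries, key=lambda e: cosine(qv, e[0]), default=None)
        if best and cosine(qv, best[0]) >= self.threshold:
            return best[1]  # cache hit: the LLM is never called
        return None

    def put(self, query, response):
        self.entries.append((embed(query), response))

cache = SemanticCache()
cache.put("what is the refund policy", "Refunds are issued within 30 days.")
hit = cache.get("refund policy what is it")   # same meaning, reordered words
miss = cache.get("weather today")             # semantically unrelated
```

The cache hit is decided purely by vector similarity, so paraphrases of an already-answered question skip the LLM call.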
Matryoshka Representation Learning (MRL)
A technique for training embedding models so that the most critical semantic information is front-loaded into the earliest dimensions of the vector, allowing the vector to be truncated to save space without catastrophic accuracy loss.
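The payoff of MRL at serving time is a one-line operation: slice the vector and renormalize. A sketch with a hand-made 8-dimensional vector standing in for a real MRL-trained embedding:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else v

def truncate_mrl(embedding, dim):
    # With MRL-trained models the leading dimensions carry the most
    # semantic signal, so truncation degrades quality gracefully.
    return normalize(embedding[:dim])

# Toy vector whose magnitude is front-loaded, mimicking an MRL embedding.
full = normalize([0.9, 0.4, 0.1, 0.05, 0.02, 0.01, 0.005, 0.001])
short = truncate_mrl(full, 4)  # keep only the first 4 dimensions
```

Renormalizing after the slice keeps cosine-similarity comparisons valid against other truncated vectors.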
Cross-Encoder Reranking
An architecture in which a transformer model processes the user's query and a candidate document together through its attention layers to produce a highly precise relevance score, typically used to rerank the results of a faster first-stage retriever.
Late Chunking
A RAG optimization where an entire document is processed by the embedding model first to capture global context, and only then split into individual searchable chunks.
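The ordering is the whole trick: token vectors are computed over the full document, then pooled per chunk. A sketch where `contextual_token_embeddings` is a hypothetical stand-in for a long-context embedding model:

```python
def contextual_token_embeddings(tokens):
    # Stand-in for a long-context embedding model: each token's vector
    # is a function of the *whole* document, which is the property
    # late chunking relies on.
    doc_len = len(tokens)
    return [[len(tok) / doc_len, i / doc_len] for i, tok in enumerate(tokens)]

def late_chunk(tokens, chunk_size):
    token_vecs = contextual_token_embeddings(tokens)  # embed first...
    chunks = []
    for start in range(0, len(tokens), chunk_size):
        vecs = token_vecs[start:start + chunk_size]   # ...then pool per chunk
        pooled = [sum(col) / len(vecs) for col in zip(*vecs)]
        chunks.append((" ".join(tokens[start:start + chunk_size]), pooled))
    return chunks

doc = "late chunking embeds the whole document before splitting".split()
chunks = late_chunk(doc, 3)
```

Naive chunking embeds each chunk in isolation; here every chunk's vector inherits document-level context through the token embeddings.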
GraphRAG
An advanced retrieval technique that extracts entities and relationships from documents to build a Knowledge Graph, allowing an LLM to answer complex queries that span multiple disparate documents.
Tool Calling
A capability fine-tuned into modern LLMs that allows them to intelligently select an external tool and output the exact JSON arguments required to execute it.
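The application's side of the contract is dispatch: parse the model's JSON arguments and invoke the matching function. A sketch with a hard-coded model output and a hypothetical `get_weather` tool:

```python
import json

def get_weather(city):
    return f"Sunny in {city}"  # hypothetical tool implementation

TOOLS = {"get_weather": get_weather}

# In a real system the LLM returns this tool-call payload; it is
# hard-coded here to show only the dispatch step.
llm_output = '{"tool": "get_weather", "arguments": {"city": "Berlin"}}'

call = json.loads(llm_output)
result = TOOLS[call["tool"]](**call["arguments"])
```

The result is normally fed back to the model as a new message so it can compose a final answer.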
PagedAttention
An attention algorithm, inspired by operating-system memory paging, that stores an LLM's KV cache in non-contiguous fixed-size blocks, dramatically increasing the throughput of LLM serving systems.
KV Cache Eviction
Algorithms that selectively delete less important tokens from an LLM's Key-Value cache during inference to prevent GPU memory exhaustion during long conversations.
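A sketch of one common policy family, in the spirit of heavy-hitter eviction (e.g. H2O): always keep the first few "sink" tokens, then keep the tokens with the highest accumulated attention scores. The scores and cache entries here are toy placeholders:

```python
def evict_kv(kv_cache, attn_scores, budget, sink=2):
    # Keep the first `sink` tokens plus the highest-scoring "heavy
    # hitter" tokens until the memory budget is met.
    if len(kv_cache) <= budget:
        return kv_cache
    keep = set(range(sink))
    ranked = sorted(range(sink, len(kv_cache)),
                    key=lambda i: attn_scores[i], reverse=True)
    for i in ranked:
        if len(keep) >= budget:
            break
        keep.add(i)
    return [kv for i, kv in enumerate(kv_cache) if i in keep]

cache = [f"kv_{i}" for i in range(8)]
scores = [0.9, 0.8, 0.05, 0.7, 0.01, 0.6, 0.02, 0.5]
pruned = evict_kv(cache, scores, budget=5)
```

Positional order is preserved so the remaining cache entries still line up with their token positions.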
Semantic Chunking
A method of splitting text for vector databases that evaluates the embedding distance between sentences, ensuring that logically coherent thoughts are kept together in a single chunk.
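A sketch of the splitting rule: start a new chunk wherever the similarity between adjacent sentence embeddings drops below a threshold. The `embed` function is a toy two-topic counter standing in for a real sentence-embedding model, and 0.5 is an illustrative threshold:

```python
import math

def embed(sentence):
    # Toy topic embedding: counts of words from two themes (a stand-in
    # for a real sentence-embedding model).
    cooking = {"recipe", "bake", "oven", "flour"}
    space = {"rocket", "orbit", "launch", "mars"}
    words = set(sentence.lower().split())
    return [len(words & cooking), len(words & space)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunk(sentences, threshold=0.5):
    chunks, current = [], [sentences[0]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            chunks.append(current)   # similarity dropped: topic shift
            current = []
        current.append(cur)
    chunks.append(current)
    return chunks

sentences = [
    "Preheat the oven and measure the flour",
    "Bake the recipe for twenty minutes",
    "The rocket will launch toward mars",
    "It will reach orbit in nine minutes",
]
chunks = semantic_chunk(sentences)
```

The two cooking sentences land in one chunk and the two spaceflight sentences in another, instead of being split at an arbitrary character count.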
Hypothetical Document Embeddings (HyDE)
A zero-shot retrieval technique where a user's question is first answered by an LLM (hallucinating a hypothetical document), and that hallucinated answer is embedded and used to search the vector database.
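The pipeline in three lines: hallucinate, embed the hallucination, search with its vector. Both `fake_llm` and `embed` are scripted toy stand-ins for a real LLM and embedding model:

```python
def fake_llm(question):
    # Stand-in for an LLM asked to write a plausible (hallucinated)
    # answer document for the question.
    return "HNSW builds a layered graph over vectors for fast ANN search"

def embed(text):
    keywords = ["hnsw", "graph", "layered", "ann", "vectors"]
    words = text.lower().split()
    return [float(words.count(k)) for k in keywords]

def hyde_search(question, corpus):
    hypothetical = fake_llm(question)   # step 1: hallucinate an answer
    qv = embed(hypothetical)            # step 2: embed the hallucination
    # step 3: search with the hypothetical's vector, not the question's
    return max(corpus,
               key=lambda doc: sum(a * b for a, b in zip(qv, embed(doc))))

corpus = [
    "HNSW is a layered graph index for ANN over vectors",
    "Postgres stores rows in heap pages",
]
best = hyde_search("how does hnsw work?", corpus)
```

The intuition: an answer-shaped text sits closer to answer documents in embedding space than the short question does.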
Self-Querying Retrieval
An intelligent retrieval mechanism where an LLM parses a user's natural language query to automatically extract exact metadata filters (like dates or authors) while simultaneously performing a semantic vector search.
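A sketch of the filter-extraction step. Real systems have an LLM emit the structured filter as JSON; here a toy regex parser stands in so the example is self-contained:

```python
import re

def self_query(natural_query):
    # Toy parser standing in for an LLM that extracts metadata filters
    # from a natural-language query.
    filters = {}
    year = re.search(r"\b(19|20)\d{2}\b", natural_query)
    if year:
        filters["year"] = int(year.group())
    author = re.search(r"by ([A-Z][a-z]+)", natural_query)
    if author:
        filters["author"] = author.group(1)
    # Whatever is left becomes the semantic search string.
    semantic = re.sub(r"\b(19|20)\d{2}\b|by [A-Z][a-z]+", "", natural_query)
    return {"filters": filters, "semantic_query": " ".join(semantic.split())}

parsed = self_query("papers about vector compression by Malkov from 2018")
```

The vector database then applies `filters` as exact metadata constraints while running a similarity search on `semantic_query`.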
Bi-Encoder Architecture
A neural network architecture where two separate inputs (like a query and a document) are passed through an embedding model independently to create two distinct vectors that can be compared mathematically.
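A sketch highlighting the key operational property: document vectors can be precomputed offline, unlike a cross-encoder, which must re-run the model for every query-document pair. The shared `encode` function is a toy stand-in for a real bi-encoder model:

```python
import math

def encode(text):
    # Toy encoder shared by queries and documents; each input is
    # embedded *independently* of the other.
    vocab = ["vector", "database", "index", "cache", "python"]
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Document vectors are computed once, offline.
docs = ["a vector database index", "a python web cache"]
doc_vecs = [encode(d) for d in docs]

# At query time, only the query needs encoding.
query_vec = encode("how to index a vector database")
best = max(range(len(docs)), key=lambda i: cosine(query_vec, doc_vecs[i]))
```

This independence is what makes bi-encoders the standard first-stage retriever: the comparison reduces to a cheap vector operation.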
Product Quantization (PQ)
A highly efficient lossy compression algorithm used in vector databases to drastically reduce the memory footprint of embeddings while maintaining fast nearest-neighbor search capabilities.
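A tiny sketch of the mechanism: split each vector into subvectors and replace each subvector with the index of its nearest codebook centroid. The codebooks here are hand-made toys; real systems learn them with k-means:

```python
def pq_encode(vec, codebooks):
    # Each subvector is replaced by one small integer (its centroid id),
    # e.g. 4 floats -> 2 one-byte codes in a real system.
    m = len(codebooks)
    d = len(vec) // m
    codes = []
    for i, book in enumerate(codebooks):
        sub = vec[i * d:(i + 1) * d]
        codes.append(min(
            range(len(book)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(sub, book[j]))))
    return codes

def pq_decode(codes, codebooks):
    # Lossy reconstruction: concatenate the chosen centroids.
    out = []
    for code, book in zip(codes, codebooks):
        out.extend(book[code])
    return out

codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],   # centroids for dimensions 0-1
    [[0.0, 1.0], [1.0, 0.0]],   # centroids for dimensions 2-3
]
codes = pq_encode([0.9, 1.1, 0.1, 0.8], codebooks)
approx = pq_decode(codes, codebooks)
```

Distances to query vectors can then be approximated from the codes alone via per-subspace lookup tables, which is where the search speed comes from.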
Hierarchical Navigable Small World (HNSW)
A graph-based algorithm that organizes vector embeddings into multiple layers, enabling sub-millisecond Approximate Nearest Neighbor (ANN) search across billions of data points.
ReAct Prompting
An agentic framework that strictly interleaves internal reasoning (Chain of Thought) with external actions (Tool Calling), allowing an LLM to solve complex, multi-step problems autonomously.
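A sketch of the Thought → Action → Observation loop. The model's turns are scripted here to stand in for real LLM calls, and `calculator` is a hypothetical tool:

```python
def calculator(expression):
    return str(eval(expression))  # toy tool; never eval untrusted input

TOOLS = {"calculator": calculator}

# Scripted "LLM" turns: each step is a thought plus either an action
# (tool call) or a final answer.
SCRIPT = [
    {"thought": "I need to compute 17 * 23.",
     "action": ("calculator", "17 * 23")},
    {"thought": "I have the result, so I can answer.",
     "final": "17 * 23 = 391"},
]

def react_loop(script):
    observations = []
    for step in script:
        if "final" in step:
            return step["final"]
        tool, arg = step["action"]
        # In a real agent, the observation is appended to the prompt
        # before the next model call.
        observations.append(TOOLS[tool](arg))
    return None

answer = react_loop(SCRIPT)
```

The interleaving is the point: each observation grounds the next reasoning step, so the model cannot drift far from tool-verified reality.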
Plan-and-Solve Framework
An agentic architecture where a complex objective is broken down into a complete, sequential list of sub-tasks before any actual execution or tool calling begins.
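A sketch contrasting the two phases: the complete plan is produced up front, then executed in order (unlike ReAct's interleaved loop). Both `plan` and `execute` are scripted stand-ins for LLM and tool calls:

```python
def plan(objective):
    # Stand-in for an LLM planner: the full task list is produced
    # *before* any execution starts.
    return ["fetch sales data", "compute quarterly totals", "draft summary"]

def execute(task):
    return f"done: {task}"   # stand-in for tool calls / sub-agent work

def plan_and_solve(objective):
    steps = plan(objective)               # phase 1: complete plan
    return [execute(s) for s in steps]    # phase 2: sequential execution

results = plan_and_solve("write a Q3 sales report")
```

Separating planning from execution makes the agent's intent inspectable before any side effects occur.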
Speculative Decoding
An inference optimization technique that accelerates text generation by using a small, fast 'draft' model to predict multiple upcoming tokens, which are then verified simultaneously by the main, larger model.
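A sketch of one draft-and-verify step with both models scripted as lookup tables. A real target model checks all drafted positions in a single parallel forward pass; the sequential check here just keeps the toy readable:

```python
def draft_model(prefix, k=4):
    # Fast draft model guesses the next k tokens (scripted here).
    guesses = {"the": ["cat", "sat", "on", "a"]}
    return guesses.get(prefix[-1], [])

def target_model(prefix):
    # Slow target model's true next token for a given prefix (scripted).
    truth = {("the",): "cat", ("the", "cat"): "sat",
             ("the", "cat", "sat"): "on", ("the", "cat", "sat", "on"): "the"}
    return truth[tuple(prefix)]

def speculative_step(prefix):
    drafted = draft_model(prefix)
    accepted = []
    for tok in drafted:
        if target_model(prefix + accepted) == tok:
            accepted.append(tok)          # draft matched: accept for free
        else:
            accepted.append(target_model(prefix + accepted))  # correction
            break                          # discard the rest of the draft
    return accepted

tokens = speculative_step(["the"])
```

Here three drafted tokens are accepted and the fourth is corrected, so one verification pass yields four tokens instead of one.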
Agentic Router
A lightweight classifier model or programmatic logic layer that analyzes an incoming user query and routes it to the specific AI agent, tool, or LLM best suited to handle it.
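A sketch of the programmatic-logic variant. Production routers often use a small classifier model instead; the keyword rules and agent names here are illustrative:

```python
def route(query):
    # Toy rule-based router mapping queries to hypothetical agents.
    q = query.lower()
    if any(w in q for w in ("refund", "invoice", "billing")):
        return "billing_agent"
    if any(w in q for w in ("bug", "error", "crash")):
        return "support_agent"
    return "general_llm"   # fallback for everything else

dest = route("I got an error during checkout")
```

Routing cheap queries to small models and hard ones to specialists is a common cost and latency optimization.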
JSON Mode vs. Structured Outputs
JSON Mode heavily biases an LLM toward producing valid JSON syntax, whereas Structured Outputs use constrained decoding to guarantee that the generated tokens conform exactly to a supplied JSON Schema.
DSPy
An open-source framework developed at Stanford that replaces manual prompt engineering with programmatic compilation, allowing developers to write declarative logic that automatically optimizes the prompts sent to an LLM.