AI Memory & Agent Glossary

Technical definitions for AI memory, vector search, and agent architectures. Built for engineers.

Context Engineering
The systematic discipline of designing, structuring, and managing the information provided to a Large Language Model (LLM) to maximize its reasoning capability and accuracy.
Agentic Engineering
The discipline of designing, orchestrating, and evaluating systems where AI agents autonomously plan and execute multi-step workflows.
Flow Engineering
The practice of designing deterministic, multi-step architectures where an LLM's output is iteratively refined, tested, and corrected through a predefined pipeline.
Test-Time Compute (TTC)
The practice of allocating more computational resources during inference (the 'test time') to allow an AI model to perform internal reasoning, planning, and self-correction before outputting an answer.
Model Context Protocol (MCP)
An open-source, standardized protocol that acts as a universal API allowing AI models to securely connect to, read from, and interact with external data sources and tools.
Semantic Caching
An optimization technique that stores previous LLM responses and uses vector similarity to return a cached answer for any new query whose semantic meaning is close enough to a stored one, bypassing the LLM entirely.
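A minimal sketch of the lookup-before-generate pattern, in plain Python. The `embed` function here is a toy bag-of-words stand-in for a real embedding model, and the 0.85 threshold is an illustrative choice:

```python
import math

def embed(text: str) -> list:
    # Toy bag-of-words embedding (stands in for a real embedding model).
    vocab = ["refund", "policy", "return", "item", "weather"]
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.85):
        self.threshold = threshold
        self.entries = []  # list of (query_vector, cached_response)

    def get(self, query):
        qv = embed(query)
        best = max(self.entries, key=lambda e: cosine(qv, e[0]), default=None)
        if best and cosine(qv, best[0]) >= self.threshold:
            return best[1]  # cache hit: the LLM is never called
        return None

    def put(self, query, response):
        self.entries.append((embed(query), response))

cache = SemanticCache()
cache.put("what is the refund policy", "Refunds are issued within 30 days.")
hit = cache.get("refund policy what is it")   # same meaning, reordered words
miss = cache.get("weather today")             # semantically unrelated
```

The cache hit is decided purely by vector similarity, so paraphrases of an already-answered question skip the LLM call.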
Matryoshka Representation Learning (MRL)
A technique for training embedding models so that the most critical semantic information is front-loaded into the earliest dimensions of the vector, allowing the vector to be truncated to save space without catastrophic accuracy loss.
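The payoff of MRL at serving time is a one-line operation: slice the vector and renormalize. A sketch with a hand-made 8-dimensional vector standing in for a real MRL-trained embedding:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else v

def truncate_mrl(embedding, dim):
    # With MRL-trained models the leading dimensions carry the most
    # semantic signal, so truncation degrades quality gracefully.
    return normalize(embedding[:dim])

# Toy vector whose magnitude is front-loaded, mimicking an MRL embedding.
full = normalize([0.9, 0.4, 0.1, 0.05, 0.02, 0.01, 0.005, 0.001])
short = truncate_mrl(full, 4)  # keep only the first 4 dimensions
```

Renormalizing after the slice keeps cosine-similarity comparisons valid against other truncated vectors.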
Cross-Encoder Reranking
An architecture in which a transformer model processes the user's query and a candidate document together through its attention layers to produce a highly precise relevance score, typically used to rerank the results of a faster first-stage retriever.
Late Chunking
A RAG optimization where an entire document is processed by the embedding model first to capture global context, and only then split into individual searchable chunks.
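The ordering is the whole trick: token vectors are computed over the full document, then pooled per chunk. A sketch where `contextual_token_embeddings` is a hypothetical stand-in for a long-context embedding model:

```python
def contextual_token_embeddings(tokens):
    # Stand-in for a long-context embedding model: each token's vector
    # is a function of the *whole* document, which is the property
    # late chunking relies on.
    doc_len = len(tokens)
    return [[len(tok) / doc_len, i / doc_len] for i, tok in enumerate(tokens)]

def late_chunk(tokens, chunk_size):
    token_vecs = contextual_token_embeddings(tokens)  # embed first...
    chunks = []
    for start in range(0, len(tokens), chunk_size):
        vecs = token_vecs[start:start + chunk_size]   # ...then pool per chunk
        pooled = [sum(col) / len(vecs) for col in zip(*vecs)]
        chunks.append((" ".join(tokens[start:start + chunk_size]), pooled))
    return chunks

doc = "late chunking embeds the whole document before splitting".split()
chunks = late_chunk(doc, 3)
```

Naive chunking embeds each chunk in isolation; here every chunk's vector inherits document-level context through the token embeddings.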
GraphRAG
An advanced retrieval technique that extracts entities and relationships from documents to build a Knowledge Graph, allowing an LLM to answer complex queries that span multiple disparate documents.
Tool Calling
A capability fine-tuned into modern LLMs that allows them to intelligently select an external tool and output the exact JSON arguments required to execute it.
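The application's side of the contract is dispatch: parse the model's JSON arguments and invoke the matching function. A sketch with a hard-coded model output and a hypothetical `get_weather` tool:

```python
import json

def get_weather(city):
    return f"Sunny in {city}"  # hypothetical tool implementation

TOOLS = {"get_weather": get_weather}

# In a real system the LLM returns this tool-call payload; it is
# hard-coded here to show only the dispatch step.
llm_output = '{"tool": "get_weather", "arguments": {"city": "Berlin"}}'

call = json.loads(llm_output)
result = TOOLS[call["tool"]](**call["arguments"])
```

The result is normally fed back to the model as a new message so it can compose a final answer.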
PagedAttention
An attention algorithm, inspired by operating-system memory paging, that stores an LLM's KV cache in non-contiguous fixed-size blocks, dramatically increasing the throughput of LLM serving systems.
KV Cache Eviction
Algorithms that selectively delete less important tokens from an LLM's Key-Value cache during inference to prevent GPU memory exhaustion during long conversations.
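A sketch of one common policy family, in the spirit of heavy-hitter eviction (e.g. H2O): always keep the first few "sink" tokens, then keep the tokens with the highest accumulated attention scores. The scores and cache entries here are toy placeholders:

```python
def evict_kv(kv_cache, attn_scores, budget, sink=2):
    # Keep the first `sink` tokens plus the highest-scoring "heavy
    # hitter" tokens until the memory budget is met.
    if len(kv_cache) <= budget:
        return kv_cache
    keep = set(range(sink))
    ranked = sorted(range(sink, len(kv_cache)),
                    key=lambda i: attn_scores[i], reverse=True)
    for i in ranked:
        if len(keep) >= budget:
            break
        keep.add(i)
    return [kv for i, kv in enumerate(kv_cache) if i in keep]

cache = [f"kv_{i}" for i in range(8)]
scores = [0.9, 0.8, 0.05, 0.7, 0.01, 0.6, 0.02, 0.5]
pruned = evict_kv(cache, scores, budget=5)
```

Positional order is preserved so the remaining cache entries still line up with their token positions.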
Semantic Chunking
A method of splitting text for vector databases that evaluates the embedding distance between sentences, ensuring that logically coherent thoughts are kept together in a single chunk.
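A sketch of the splitting rule: start a new chunk wherever the similarity between adjacent sentence embeddings drops below a threshold. The `embed` function is a toy two-topic counter standing in for a real sentence-embedding model, and 0.5 is an illustrative threshold:

```python
import math

def embed(sentence):
    # Toy topic embedding: counts of words from two themes (a stand-in
    # for a real sentence-embedding model).
    cooking = {"recipe", "bake", "oven", "flour"}
    space = {"rocket", "orbit", "launch", "mars"}
    words = set(sentence.lower().split())
    return [len(words & cooking), len(words & space)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunk(sentences, threshold=0.5):
    chunks, current = [], [sentences[0]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            chunks.append(current)   # similarity dropped: topic shift
            current = []
        current.append(cur)
    chunks.append(current)
    return chunks

sentences = [
    "Preheat the oven and measure the flour",
    "Bake the recipe for twenty minutes",
    "The rocket will launch toward mars",
    "It will reach orbit in nine minutes",
]
chunks = semantic_chunk(sentences)
```

The two cooking sentences land in one chunk and the two spaceflight sentences in another, instead of being split at an arbitrary character count.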
Hypothetical Document Embeddings (HyDE)
A zero-shot retrieval technique where a user's question is first answered by an LLM (hallucinating a hypothetical document), and that hallucinated answer is embedded and used to search the vector database.
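The pipeline in three lines: hallucinate, embed the hallucination, search with its vector. Both `fake_llm` and `embed` are scripted toy stand-ins for a real LLM and embedding model:

```python
def fake_llm(question):
    # Stand-in for an LLM asked to write a plausible (hallucinated)
    # answer document for the question.
    return "HNSW builds a layered graph over vectors for fast ANN search"

def embed(text):
    keywords = ["hnsw", "graph", "layered", "ann", "vectors"]
    words = text.lower().split()
    return [float(words.count(k)) for k in keywords]

def hyde_search(question, corpus):
    hypothetical = fake_llm(question)   # step 1: hallucinate an answer
    qv = embed(hypothetical)            # step 2: embed the hallucination
    # step 3: search with the hypothetical's vector, not the question's
    return max(corpus,
               key=lambda doc: sum(a * b for a, b in zip(qv, embed(doc))))

corpus = [
    "HNSW is a layered graph index for ANN over vectors",
    "Postgres stores rows in heap pages",
]
best = hyde_search("how does hnsw work?", corpus)
```

The intuition: an answer-shaped text sits closer to answer documents in embedding space than the short question does.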
Self-Querying Retrieval
An intelligent retrieval mechanism where an LLM parses a user's natural language query to automatically extract exact metadata filters (like dates or authors) while simultaneously performing a semantic vector search.
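A sketch of the filter-extraction step. Real systems have an LLM emit the structured filter as JSON; here a toy regex parser stands in so the example is self-contained:

```python
import re

def self_query(natural_query):
    # Toy parser standing in for an LLM that extracts metadata filters
    # from a natural-language query.
    filters = {}
    year = re.search(r"\b(19|20)\d{2}\b", natural_query)
    if year:
        filters["year"] = int(year.group())
    author = re.search(r"by ([A-Z][a-z]+)", natural_query)
    if author:
        filters["author"] = author.group(1)
    # Whatever is left becomes the semantic search string.
    semantic = re.sub(r"\b(19|20)\d{2}\b|by [A-Z][a-z]+", "", natural_query)
    return {"filters": filters, "semantic_query": " ".join(semantic.split())}

parsed = self_query("papers about vector compression by Malkov from 2018")
```

The vector database then applies `filters` as exact metadata constraints while running a similarity search on `semantic_query`.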
Bi-Encoder Architecture
A neural network architecture where two separate inputs (like a query and a document) are passed through an embedding model independently to create two distinct vectors that can be compared mathematically.
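A sketch highlighting the key operational property: document vectors can be precomputed offline, unlike a cross-encoder, which must re-run the model for every query-document pair. The shared `encode` function is a toy stand-in for a real bi-encoder model:

```python
import math

def encode(text):
    # Toy encoder shared by queries and documents; each input is
    # embedded *independently* of the other.
    vocab = ["vector", "database", "index", "cache", "python"]
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Document vectors are computed once, offline.
docs = ["a vector database index", "a python web cache"]
doc_vecs = [encode(d) for d in docs]

# At query time, only the query needs encoding.
query_vec = encode("how to index a vector database")
best = max(range(len(docs)), key=lambda i: cosine(query_vec, doc_vecs[i]))
```

This independence is what makes bi-encoders the standard first-stage retriever: the comparison reduces to a cheap vector operation.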
Product Quantization (PQ)
A highly efficient lossy compression algorithm used in vector databases to drastically reduce the memory footprint of embeddings while maintaining fast nearest-neighbor search capabilities.
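A tiny sketch of the mechanism: split each vector into subvectors and replace each subvector with the index of its nearest codebook centroid. The codebooks here are hand-made toys; real systems learn them with k-means:

```python
def pq_encode(vec, codebooks):
    # Each subvector is replaced by one small integer (its centroid id),
    # e.g. 4 floats -> 2 one-byte codes in a real system.
    m = len(codebooks)
    d = len(vec) // m
    codes = []
    for i, book in enumerate(codebooks):
        sub = vec[i * d:(i + 1) * d]
        codes.append(min(
            range(len(book)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(sub, book[j]))))
    return codes

def pq_decode(codes, codebooks):
    # Lossy reconstruction: concatenate the chosen centroids.
    out = []
    for code, book in zip(codes, codebooks):
        out.extend(book[code])
    return out

codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],   # centroids for dimensions 0-1
    [[0.0, 1.0], [1.0, 0.0]],   # centroids for dimensions 2-3
]
codes = pq_encode([0.9, 1.1, 0.1, 0.8], codebooks)
approx = pq_decode(codes, codebooks)
```

Distances to query vectors can then be approximated from the codes alone via per-subspace lookup tables, which is where the search speed comes from.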
Hierarchical Navigable Small World (HNSW)
A graph-based algorithm that organizes vector embeddings into multiple layers, enabling sub-millisecond Approximate Nearest Neighbor (ANN) search across billions of data points.
ReAct Prompting
An agentic framework that strictly interleaves internal reasoning (Chain of Thought) with external actions (Tool Calling), allowing an LLM to solve complex, multi-step problems autonomously.
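A sketch of the Thought → Action → Observation loop. The model's turns are scripted here to stand in for real LLM calls, and `calculator` is a hypothetical tool:

```python
def calculator(expression):
    return str(eval(expression))  # toy tool; never eval untrusted input

TOOLS = {"calculator": calculator}

# Scripted "LLM" turns: each step is a thought plus either an action
# (tool call) or a final answer.
SCRIPT = [
    {"thought": "I need to compute 17 * 23.",
     "action": ("calculator", "17 * 23")},
    {"thought": "I have the result, so I can answer.",
     "final": "17 * 23 = 391"},
]

def react_loop(script):
    observations = []
    for step in script:
        if "final" in step:
            return step["final"]
        tool, arg = step["action"]
        # In a real agent, the observation is appended to the prompt
        # before the next model call.
        observations.append(TOOLS[tool](arg))
    return None

answer = react_loop(SCRIPT)
```

The interleaving is the point: each observation grounds the next reasoning step, so the model cannot drift far from tool-verified reality.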
Plan-and-Solve Framework
An agentic architecture where a complex objective is broken down into a complete, sequential list of sub-tasks before any actual execution or tool calling begins.
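A sketch contrasting the two phases: the complete plan is produced up front, then executed in order (unlike ReAct's interleaved loop). Both `plan` and `execute` are scripted stand-ins for LLM and tool calls:

```python
def plan(objective):
    # Stand-in for an LLM planner: the full task list is produced
    # *before* any execution starts.
    return ["fetch sales data", "compute quarterly totals", "draft summary"]

def execute(task):
    return f"done: {task}"   # stand-in for tool calls / sub-agent work

def plan_and_solve(objective):
    steps = plan(objective)               # phase 1: complete plan
    return [execute(s) for s in steps]    # phase 2: sequential execution

results = plan_and_solve("write a Q3 sales report")
```

Separating planning from execution makes the agent's intent inspectable before any side effects occur.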
Speculative Decoding
An inference optimization technique that accelerates text generation by using a small, fast 'draft' model to predict multiple upcoming tokens, which are then verified simultaneously by the main, larger model.
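A sketch of one draft-and-verify step with both models scripted as lookup tables. A real target model checks all drafted positions in a single parallel forward pass; the sequential check here just keeps the toy readable:

```python
def draft_model(prefix, k=4):
    # Fast draft model guesses the next k tokens (scripted here).
    guesses = {"the": ["cat", "sat", "on", "a"]}
    return guesses.get(prefix[-1], [])

def target_model(prefix):
    # Slow target model's true next token for a given prefix (scripted).
    truth = {("the",): "cat", ("the", "cat"): "sat",
             ("the", "cat", "sat"): "on", ("the", "cat", "sat", "on"): "the"}
    return truth[tuple(prefix)]

def speculative_step(prefix):
    drafted = draft_model(prefix)
    accepted = []
    for tok in drafted:
        if target_model(prefix + accepted) == tok:
            accepted.append(tok)          # draft matched: accept for free
        else:
            accepted.append(target_model(prefix + accepted))  # correction
            break                          # discard the rest of the draft
    return accepted

tokens = speculative_step(["the"])
```

Here three drafted tokens are accepted and the fourth is corrected, so one verification pass yields four tokens instead of one.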
Agentic Router
A lightweight classifier model or programmatic logic layer that analyzes an incoming user query and routes it to the specific AI agent, tool, or LLM best suited to handle it.
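A sketch of the programmatic-logic variant. Production routers often use a small classifier model instead; the keyword rules and agent names here are illustrative:

```python
def route(query):
    # Toy rule-based router mapping queries to hypothetical agents.
    q = query.lower()
    if any(w in q for w in ("refund", "invoice", "billing")):
        return "billing_agent"
    if any(w in q for w in ("bug", "error", "crash")):
        return "support_agent"
    return "general_llm"   # fallback for everything else

dest = route("I got an error during checkout")
```

Routing cheap queries to small models and hard ones to specialists is a common cost and latency optimization.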
JSON Mode vs. Structured Outputs
JSON Mode heavily biases an LLM toward producing valid JSON syntax, whereas Structured Outputs use constrained decoding to guarantee that the generated tokens conform exactly to a supplied JSON Schema.
DSPy
An open-source framework developed at Stanford that replaces manual prompt engineering with programmatic compilation, allowing developers to write declarative logic that automatically optimizes the prompts sent to an LLM.