Semantic Caching
Caching LLM responses based on the semantic meaning of queries rather than exact string matching. A new query is embedded and compared against previously cached queries; if it is similar enough to one of them, the cached answer is returned, reducing latency and cost.
Why It Matters
Semantic caching can reduce LLM API calls by 30-60% for applications with repetitive queries, dramatically cutting costs and improving response times.
Example
Caching the answer to 'What is your return policy?' and serving the same cached response for 'How do I return a product?' and 'Can I send something back?' — same meaning, different words.
Think of it like...
Like a smart FAQ that recognizes that you're asking the same question even when you phrase it differently — you get an instant answer instead of waiting.
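The idea above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the `SemanticCache` class, the `toy_embed` bag-of-words embedding, and the 0.9 similarity threshold are all assumptions made for the example — a real system would use embeddings from an embedding model and a vector index for fast lookup.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy stand-in for a real embedding model: counts matches against a
# tiny fixed vocabulary. Purely illustrative.
VOCAB = ["return", "policy", "product", "send", "back", "refund"]

def toy_embed(text):
    words = text.lower().split()
    return [float(sum(w.startswith(v) for w in words)) for v in VOCAB]

class SemanticCache:
    """Stores (embedding, response) pairs; serves a cached response
    when a new query embeds close enough to a stored one."""

    def __init__(self, embed_fn, threshold=0.9):
        self.embed_fn = embed_fn    # maps text -> vector
        self.threshold = threshold  # similarity cutoff for a cache hit
        self.entries = []           # list of (vector, response)

    def get(self, query):
        qv = self.embed_fn(query)
        for vec, response in self.entries:
            if cosine_similarity(qv, vec) >= self.threshold:
                return response     # cache hit: similar enough
        return None                 # cache miss: call the LLM instead

    def put(self, query, response):
        self.entries.append((self.embed_fn(query), response))
```

On a miss, the application would call the LLM, then `put` the result so future rephrasings of the same question hit the cache. The linear scan over entries is fine for a sketch; at scale, a vector database replaces it.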
Related Terms
Embedding
A numerical representation of data (text, images, etc.) as a vector of numbers in a high-dimensional space. Similar items are placed closer together in this space, enabling machines to understand semantic relationships.
Cosine Similarity
A metric that measures the similarity between two vectors by calculating the cosine of the angle between them. Values range from -1 (opposite) to 1 (identical), with 0 meaning unrelated.
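The value range described above is easy to see with unit vectors; this small sketch (the function name `cosine` is just illustrative) computes the metric directly from its definition:

```python
import math

def cosine(a, b):
    # dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

print(cosine([1, 0], [1, 0]))   # same direction -> 1.0
print(cosine([1, 0], [0, 1]))   # orthogonal (unrelated) -> 0.0
print(cosine([1, 0], [-1, 0]))  # opposite direction -> -1.0
```

Because the metric depends only on the angle, two embeddings of very different magnitudes can still score near 1 if they point the same way — which is why it is a common choice for comparing embeddings.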
Inference
The process of using a trained model to make predictions on new, previously unseen data. Inference is what happens when an AI model is deployed and actively serving results to users.