Prompt Compression
Techniques for reducing the token count of prompts while preserving their essential meaning, enabling more efficient use of context windows and reducing API costs.
Why It Matters
Prompt compression can often reduce token usage by roughly 50-70% while maintaining output quality, directly cutting API costs and fitting more context into limited windows.
Example
Compressing a 2,000-token RAG context into 600 tokens by removing redundant information and preserving only the key facts relevant to the query.
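A minimal sketch of this kind of extractive compression, using simple lexical overlap with the query as the relevance score (production systems typically use learned or model-based scoring instead; the function name and `keep_ratio` parameter here are illustrative assumptions, not a standard API):

```python
import re

def compress_context(context: str, query: str, keep_ratio: float = 0.3) -> str:
    """Keep only the sentences most lexically relevant to the query.

    This is a toy heuristic: each sentence is scored by how many of its
    words also appear in the query, and only the top-scoring fraction
    (keep_ratio) of sentences is retained, in their original order.
    """
    sentences = re.split(r"(?<=[.!?])\s+", context.strip())
    query_terms = set(re.findall(r"\w+", query.lower()))

    def score(sentence: str) -> int:
        # Count query-term overlaps as a crude relevance signal.
        return sum(1 for w in re.findall(r"\w+", sentence.lower())
                   if w in query_terms)

    n_keep = max(1, int(len(sentences) * keep_ratio))
    # Rank sentences by relevance, then restore document order so the
    # compressed context still reads coherently.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    kept = sorted(ranked[:n_keep])
    return " ".join(sentences[i] for i in kept)
```

For example, given a three-sentence context and the query "What is the capital of France?", only the sentence mentioning the capital survives; the unrelated sentences are dropped, shrinking the context while preserving the fact the query needs.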
Think of it like...
Like summarizing a briefing document before a meeting — you capture the essential points in fewer words so the decision-maker can process it efficiently.
Related Terms
Context Window
The maximum amount of text (measured in tokens) that a language model can process in a single interaction. It includes both the input prompt and the generated output. Larger context windows allow models to handle longer documents.
Token
The basic unit of text that language models process. A token can be a word, part of a word, or a punctuation mark. Text is broken into tokens before being fed into an LLM, and the model generates output one token at a time.
Summarization
The NLP task of condensing a longer text into a shorter version while preserving the key information and main points. Summarization can be extractive (selecting key sentences) or abstractive (generating new text).
Retrieval-Augmented Generation
A technique that enhances LLM outputs by first retrieving relevant information from external knowledge sources and then using that information as context for generation. RAG combines the power of search with the fluency of language models.