AI Glossary
The definitive dictionary for AI, Machine Learning, and Governance terminology. From Flash Attention to RAG — look up any term.
C
Concept Drift
A change in the underlying relationship between inputs and outputs over time. Unlike data drift, concept drift means the rules of the game have changed, not just the distribution of inputs.
Crowdsourcing
Using a large group of distributed workers (often through platforms like Amazon Mechanical Turk or Scale AI) to perform data annotation and labeling tasks.
D
Data Annotation Pipeline
An end-to-end workflow for producing labeled training data, from task design through annotator training, quality assurance, and delivery of labeled datasets.
Data Augmentation
Techniques for artificially expanding a training dataset by creating modified versions of existing data. This helps models generalize better, especially when training data is limited.
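One common augmentation for numeric data is adding small random noise to existing examples while keeping their labels. A minimal sketch (the noise level and copy count here are illustrative choices, not recommended defaults):

```python
import random

def augment(samples, copies=3, noise_std=0.05, seed=0):
    """Expand a dataset of (features, label) pairs by adding small
    Gaussian noise to the features; labels are preserved."""
    rng = random.Random(seed)
    augmented = list(samples)
    for features, label in samples:
        for _ in range(copies):
            noisy = [x + rng.gauss(0.0, noise_std) for x in features]
            augmented.append((noisy, label))
    return augmented

data = [([1.0, 2.0], "a"), ([3.0, 4.0], "b")]
bigger = augment(data)
print(len(bigger))  # 2 originals + 2 * 3 noisy copies = 8
```

For images, the same idea appears as flips, rotations, and crops; for text, as paraphrasing or synonym substitution.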
Data Drift
A change in the statistical properties of the input data over time compared to the data the model was trained on. When data drifts, model predictions become less reliable.
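A simple way to monitor for data drift is to compare summary statistics of live inputs against the training distribution. The sketch below flags a feature whose live mean has moved several training standard deviations; the threshold and z-score heuristic are illustrative (production systems often use tests like Kolmogorov-Smirnov or the Population Stability Index):

```python
import statistics

def drift_report(train_values, live_values, threshold=2.0):
    """Flag drift when the live mean sits more than `threshold`
    training standard deviations from the training mean."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    z = abs(statistics.mean(live_values) - mu) / sigma
    return {"z_score": z, "drifted": z > threshold}

train = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8]
stable = [10.1, 10.4, 9.9]
shifted = [14.0, 15.2, 14.8]

print(drift_report(train, stable)["drifted"])   # False
print(drift_report(train, shifted)["drifted"])  # True
```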
Data Engineering
The practice of designing, building, and maintaining the systems and infrastructure that collect, store, and prepare data for analysis and machine learning.
Data Labeling
The process of assigning meaningful tags or annotations to raw data so it can be used for supervised learning. Labels tell the model what the correct answer should be for each training example.
Data Lake
A centralized repository that stores vast amounts of raw data in its native format until needed. Data lakes accept structured, semi-structured, and unstructured data at any scale.
Data Lineage
The tracking of data's origins, transformations, and movements throughout its lifecycle. Data lineage answers the question 'Where did this data come from and what happened to it?'
Data Mesh
A decentralized approach to data architecture where domain teams own and manage their own data as products, rather than centralizing all data in a single warehouse or lake.
Data Pipeline
An automated workflow that extracts data from sources, transforms it through processing steps, and loads it into a destination for use. In ML, data pipelines ensure consistent data flow from raw sources to model training.
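The extract-transform-load shape can be sketched in a few lines. This is a toy example, assuming newline-delimited JSON as the source and a plain list as the destination:

```python
import json

def extract(raw_lines):
    """Extract: parse one JSON record per line."""
    return [json.loads(line) for line in raw_lines]

def transform(records):
    """Transform: drop invalid rows and normalise fields."""
    return [
        {"name": r["name"].strip().lower(), "age": int(r["age"])}
        for r in records
        if r.get("name") and r.get("age") is not None
    ]

def load(rows, sink):
    """Load: write cleaned rows to the destination (here, a list)."""
    sink.extend(rows)
    return len(rows)

raw = ['{"name": " Ada ", "age": "36"}', '{"name": null, "age": 5}']
sink = []
load(transform(extract(raw)), sink)
print(sink)  # [{'name': 'ada', 'age': 36}]
```

In a real ML pipeline each stage would be scheduled, logged, and retried, but the structure is the same.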
Data Preprocessing
The process of cleaning, transforming, and organizing raw data into a format suitable for machine learning. This includes handling missing values, encoding categories, scaling features, and removing outliers.
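Two of the listed steps, imputing missing values and scaling features, can be shown on a single column. Mean imputation and min-max scaling are one strategy among many, chosen here only for brevity:

```python
def preprocess(column):
    """Fill missing values (None) with the column mean,
    then min-max scale the result to [0, 1]."""
    present = [v for v in column if v is not None]
    mean = sum(present) / len(present)
    filled = [mean if v is None else v for v in column]
    lo, hi = min(filled), max(filled)
    return [(v - lo) / (hi - lo) for v in filled]

print(preprocess([0.0, None, 10.0]))  # [0.0, 0.5, 1.0]
```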
Data Quality
The degree to which data is accurate, complete, consistent, timely, and fit for its intended use. Data quality directly impacts the reliability and performance of AI models.
Data Warehouse
A structured, organized repository of cleaned and processed data optimized for analysis and reporting. Unlike data lakes, data warehouses store data in defined schemas.
F
Feature Store
A centralized repository for storing, managing, and serving machine learning features. It ensures consistent feature computation between training and serving, and enables feature reuse across teams.
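The training/serving consistency point can be illustrated with a toy in-memory store: because both paths call the same lookup, they are guaranteed to see the same feature values. The class and feature names below are hypothetical:

```python
class FeatureStore:
    """Minimal in-memory feature store sketch: one lookup
    serves both training and online inference."""
    def __init__(self):
        self._features = {}  # (entity_id, feature_name) -> value

    def put(self, entity_id, name, value):
        self._features[(entity_id, name)] = value

    def get_vector(self, entity_id, names):
        return [self._features[(entity_id, n)] for n in names]

store = FeatureStore()
store.put("user_42", "purchases_30d", 7)
store.put("user_42", "avg_basket", 31.5)
# The identical call is made at training time and at serving time:
print(store.get_vector("user_42", ["purchases_30d", "avg_basket"]))
```

Real feature stores add versioning, point-in-time correctness, and separate offline/online storage, but the core contract is this shared lookup.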
Federated Analytics
Techniques for computing analytics and insights across distributed datasets without moving or centralizing the raw data. Each participant computes locally and only shares aggregated results.
K
Knowledge Base
A structured or semi-structured collection of information used by AI systems to retrieve factual data. In the context of RAG, it typically refers to the document collection that the system can search.
Knowledge Graph
A structured representation of real-world entities and the relationships between them, stored as a network of nodes (entities) and edges (relationships). Knowledge graphs capture factual information in a machine-readable format.
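The nodes-and-edges structure is often stored as (subject, relation, object) triples, which makes simple queries easy to express. A minimal sketch with made-up facts:

```python
def neighbors(triples, entity, relation=None):
    """Return objects linked from `entity`, optionally
    filtered by relation type."""
    return [o for s, r, o in triples
            if s == entity and (relation is None or r == relation)]

graph = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Marie Curie", "field", "Physics"),
    ("Marie Curie", "field", "Chemistry"),
    ("Warsaw", "located_in", "Poland"),
]
print(neighbors(graph, "Marie Curie", "field"))  # ['Physics', 'Chemistry']
```

Production knowledge graphs use dedicated graph stores and query languages (e.g. SPARQL or Cypher), but the triple model is the same.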
S
Semantic Web
A vision for extending the World Wide Web so that data is machine-readable and interconnected through shared standards and ontologies. It enables automated reasoning and knowledge discovery.
Semi-Structured Data
Data that has some organizational structure but does not conform to a rigid schema like a relational database. Examples include JSON, XML, and HTML.
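The "structure without a rigid schema" property is easy to see in JSON: records can share some fields while differing in others, and consumers must tolerate missing keys. An illustrative example:

```python
import json

# Two records overlap in fields but follow no fixed schema:
records = [
    '{"id": 1, "name": "sensor-a", "tags": ["indoor", "temp"]}',
    '{"id": 2, "name": "sensor-b", "firmware": "2.1"}',
]
parsed = [json.loads(r) for r in records]
# Consumers handle fields that may or may not be present:
print([r.get("firmware", "unknown") for r in parsed])  # ['unknown', '2.1']
```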
Structured Data
Data organized in a predefined format with clear rows and columns, like spreadsheets and relational databases. Each field has a defined type and meaning.
Synthetic Data
Artificially generated data that mimics the statistical properties and patterns of real data. It is created using algorithms, simulations, or generative models rather than collected from real-world events.
Synthetic Data Generation
The process of using algorithms, rules, or generative models to create artificial datasets that statistically mirror real data. Used when real data is scarce, sensitive, or biased.
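The simplest form of "statistically mirroring real data" is sampling from a distribution fitted to the real sample. The sketch below assumes a normal distribution, which is a strong simplification; practical generators (GANs, copulas, simulators) model far richer structure:

```python
import random
import statistics

def synthesize(real, n, seed=0):
    """Generate n synthetic values matching the mean and standard
    deviation of the real sample, assuming normality."""
    mu = statistics.mean(real)
    sigma = statistics.stdev(real)
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n)]

real = [4.8, 5.1, 5.0, 4.9, 5.2]
fake = synthesize(real, 1000)
print(round(statistics.mean(fake), 1))  # close to the real mean of 5.0
```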
Synthetic Reasoning Data
Training data specifically generated to improve AI reasoning capabilities, often using techniques like chain-of-thought examples, math problems, and logical puzzles.
T
Test Data
A separate portion of data held back from training that is used to evaluate a model's performance on unseen examples. Test data provides an unbiased estimate of how well the model will perform in the real world.
Training Data
The dataset used to teach a machine learning model. It contains examples (and often labels) that the model learns patterns from during the training process. The quality and quantity of training data directly impact model performance.
V
Validation Data
A subset of data used during training to tune hyperparameters and monitor model performance without touching the test set. It acts as an intermediate checkpoint between training and final evaluation.
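The three splits above (training, validation, test) are typically carved from one dataset after a single shuffle. A minimal sketch; the 80/10/10 fractions are illustrative, not a recommendation:

```python
import random

def split(data, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once, then carve out test and validation sets;
    whatever remains is the training set."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

train, val, test = split(range(100))
print(len(train), len(val), len(test))  # 80 10 10
```

Fixing the seed keeps the split reproducible, so the test set stays untouched across experiments.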
Vector Database
A specialized database designed to store, index, and search high-dimensional vector embeddings efficiently. It enables fast similarity searches across millions or billions of vectors.
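At its core, similarity search scores every stored vector against a query, for example by cosine similarity. A brute-force sketch with made-up document vectors; real vector databases replace the full scan with approximate indexes such as HNSW:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def nearest(index, query, k=2):
    """Brute-force k-nearest-neighbour search by cosine similarity."""
    scored = sorted(index.items(),
                    key=lambda kv: cosine(kv[1], query),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

index = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.8, 0.2, 0.1],
    "doc_tax":  [0.0, 0.1, 0.9],
}
print(nearest(index, [1.0, 0.0, 0.0]))  # ['doc_cats', 'doc_dogs']
```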