AI Glossary
The definitive dictionary for AI, Machine Learning, and Governance terminology. From Flash Attention to RAG — look up any term.
C
Concept Drift
A change in the underlying relationship between inputs and outputs over time. Unlike data drift, concept drift means the rules of the game have changed, not just the distribution of inputs.
Crowdsourcing
Using a large group of distributed workers (often through platforms like Amazon Mechanical Turk or Scale AI) to perform data annotation and labeling tasks.
D
Data Annotation Pipeline
An end-to-end workflow for producing labeled training data, from task design through annotator training, quality assurance, and delivery of labeled datasets.
Data Augmentation
Techniques for artificially expanding a training dataset by creating modified versions of existing data. This helps models generalize better, especially when training data is limited.
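One common augmentation for numeric data is adding small random noise to existing examples while keeping their labels. A minimal sketch (the noise level and copy count here are illustrative choices, not recommended defaults):

```python
import random

def augment(samples, copies=3, noise_std=0.05, seed=0):
    """Expand a dataset of (features, label) pairs by adding small
    Gaussian noise to the features; labels are preserved."""
    rng = random.Random(seed)
    augmented = list(samples)
    for features, label in samples:
        for _ in range(copies):
            noisy = [x + rng.gauss(0.0, noise_std) for x in features]
            augmented.append((noisy, label))
    return augmented

data = [([1.0, 2.0], "a"), ([3.0, 4.0], "b")]
bigger = augment(data)
print(len(bigger))  # 2 originals + 2 * 3 noisy copies = 8
```

For images, the same idea appears as flips, rotations, and crops; for text, as paraphrasing or synonym substitution.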
Data Drift
A change in the statistical properties of the input data over time compared to the data the model was trained on. When data drifts, model predictions become less reliable.
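A simple way to monitor for data drift is to compare summary statistics of live inputs against the training distribution. The sketch below flags a feature whose live mean has moved several training standard deviations; the threshold and z-score heuristic are illustrative (production systems often use tests like Kolmogorov-Smirnov or the Population Stability Index):

```python
import statistics

def drift_report(train_values, live_values, threshold=2.0):
    """Flag drift when the live mean sits more than `threshold`
    training standard deviations from the training mean."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    z = abs(statistics.mean(live_values) - mu) / sigma
    return {"z_score": z, "drifted": z > threshold}

train = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8]
stable = [10.1, 10.4, 9.9]
shifted = [14.0, 15.2, 14.8]

print(drift_report(train, stable)["drifted"])   # False
print(drift_report(train, shifted)["drifted"])  # True
```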
Data Engineering
The practice of designing, building, and maintaining the systems and infrastructure that collect, store, and prepare data for analysis and machine learning.
Data Labeling
The process of assigning meaningful tags or annotations to raw data so it can be used for supervised learning. Labels tell the model what the correct answer should be for each training example.
Data Lake
A centralized repository that stores vast amounts of raw data in its native format until needed. Data lakes accept structured, semi-structured, and unstructured data at any scale.
Data Lineage
The tracking of data's origins, transformations, and movements throughout its lifecycle. Data lineage answers the question 'Where did this data come from and what happened to it?'
Data Mesh
A decentralized approach to data architecture where domain teams own and manage their own data as products, rather than centralizing all data in a single warehouse or lake.
Data Pipeline
An automated workflow that extracts data from sources, transforms it through processing steps, and loads it into a destination for use. In ML, data pipelines ensure consistent data flow from raw sources to model training.
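The extract-transform-load shape can be sketched in a few lines. This is a toy example, assuming newline-delimited JSON as the source and a plain list as the destination:

```python
import json

def extract(raw_lines):
    """Extract: parse one JSON record per line."""
    return [json.loads(line) for line in raw_lines]

def transform(records):
    """Transform: drop invalid rows and normalise fields."""
    return [
        {"name": r["name"].strip().lower(), "age": int(r["age"])}
        for r in records
        if r.get("name") and r.get("age") is not None
    ]

def load(rows, sink):
    """Load: write cleaned rows to the destination (here, a list)."""
    sink.extend(rows)
    return len(rows)

raw = ['{"name": " Ada ", "age": "36"}', '{"name": null, "age": 5}']
sink = []
load(transform(extract(raw)), sink)
print(sink)  # [{'name': 'ada', 'age': 36}]
```

In a real ML pipeline each stage would be scheduled, logged, and retried, but the structure is the same.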
Data Preprocessing
The process of cleaning, transforming, and organizing raw data into a format suitable for machine learning. This includes handling missing values, encoding categories, scaling features, and removing outliers.
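Two of the listed steps, imputing missing values and scaling features, can be shown on a single column. Mean imputation and min-max scaling are one strategy among many, chosen here only for brevity:

```python
def preprocess(column):
    """Fill missing values (None) with the column mean,
    then min-max scale the result to [0, 1]."""
    present = [v for v in column if v is not None]
    mean = sum(present) / len(present)
    filled = [mean if v is None else v for v in column]
    lo, hi = min(filled), max(filled)
    return [(v - lo) / (hi - lo) for v in filled]

print(preprocess([0.0, None, 10.0]))  # [0.0, 0.5, 1.0]
```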
Data Quality
The degree to which data is accurate, complete, consistent, timely, and fit for its intended use. Data quality directly impacts the reliability and performance of AI models.
Data Warehouse
A structured, organized repository of cleaned and processed data optimized for analysis and reporting. Unlike data lakes, data warehouses store data in defined schemas.
F
Feature Store
A centralized repository for storing, managing, and serving machine learning features. It ensures consistent feature computation between training and serving, and enables feature reuse across teams.
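The training/serving consistency point can be illustrated with a toy in-memory store: because both paths call the same lookup, they are guaranteed to see the same feature values. The class and feature names below are hypothetical:

```python
class FeatureStore:
    """Minimal in-memory feature store sketch: one lookup
    serves both training and online inference."""
    def __init__(self):
        self._features = {}  # (entity_id, feature_name) -> value

    def put(self, entity_id, name, value):
        self._features[(entity_id, name)] = value

    def get_vector(self, entity_id, names):
        return [self._features[(entity_id, n)] for n in names]

store = FeatureStore()
store.put("user_42", "purchases_30d", 7)
store.put("user_42", "avg_basket", 31.5)
# The identical call is made at training time and at serving time:
print(store.get_vector("user_42", ["purchases_30d", "avg_basket"]))
```

Real feature stores add versioning, point-in-time correctness, and separate offline/online storage, but the core contract is this shared lookup.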
Federated Analytics
Techniques for computing analytics and insights across distributed datasets without moving or centralizing the raw data. Each participant computes locally and only shares aggregated results.
K
Knowledge Base
A structured or semi-structured collection of information used by AI systems to retrieve factual data. In the context of RAG, it typically refers to the document collection that the system can search.
Knowledge Graph
A structured representation of real-world entities and the relationships between them, stored as a network of nodes (entities) and edges (relationships). Knowledge graphs capture factual information in a machine-readable format.
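The nodes-and-edges structure is often stored as (subject, relation, object) triples, which makes simple queries easy to express. A minimal sketch with made-up facts:

```python
def neighbors(triples, entity, relation=None):
    """Return objects linked from `entity`, optionally
    filtered by relation type."""
    return [o for s, r, o in triples
            if s == entity and (relation is None or r == relation)]

graph = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Marie Curie", "field", "Physics"),
    ("Marie Curie", "field", "Chemistry"),
    ("Warsaw", "located_in", "Poland"),
]
print(neighbors(graph, "Marie Curie", "field"))  # ['Physics', 'Chemistry']
```

Production knowledge graphs use dedicated graph stores and query languages (e.g. SPARQL or Cypher), but the triple model is the same.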
S
Semantic Web
A vision for extending the World Wide Web so that data is machine-readable and interconnected through shared standards and ontologies. It enables automated reasoning and knowledge discovery.
Semi-Structured Data
Data that has some organizational structure but does not conform to a rigid schema like a relational database. Examples include JSON, XML, and HTML.
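The "structure without a rigid schema" property is easy to see in JSON: records can share some fields while differing in others, and consumers must tolerate missing keys. An illustrative example:

```python
import json

# Two records overlap in fields but follow no fixed schema:
records = [
    '{"id": 1, "name": "sensor-a", "tags": ["indoor", "temp"]}',
    '{"id": 2, "name": "sensor-b", "firmware": "2.1"}',
]
parsed = [json.loads(r) for r in records]
# Consumers handle fields that may or may not be present:
print([r.get("firmware", "unknown") for r in parsed])  # ['unknown', '2.1']
```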
Structured Data
Data organized in a predefined format with clear rows and columns, like spreadsheets and relational databases. Each field has a defined type and meaning.
Synthetic Data
Artificially generated data that mimics the statistical properties and patterns of real data. It is created using algorithms, simulations, or generative models rather than collected from real-world events.
Synthetic Data Generation
The process of using algorithms, rules, or generative models to create artificial datasets that statistically mirror real data. Used when real data is scarce, sensitive, or biased.
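The simplest form of "statistically mirroring real data" is sampling from a distribution fitted to the real sample. The sketch below assumes a normal distribution, which is a strong simplification; practical generators (GANs, copulas, simulators) model far richer structure:

```python
import random
import statistics

def synthesize(real, n, seed=0):
    """Generate n synthetic values matching the mean and standard
    deviation of the real sample, assuming normality."""
    mu = statistics.mean(real)
    sigma = statistics.stdev(real)
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n)]

real = [4.8, 5.1, 5.0, 4.9, 5.2]
fake = synthesize(real, 1000)
print(round(statistics.mean(fake), 1))  # close to the real mean of 5.0
```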
Synthetic Reasoning Data
Training data specifically generated to improve AI reasoning capabilities, often using techniques like chain-of-thought examples, math problems, and logical puzzles.
T
Test Data
A separate portion of data held back from training that is used to evaluate a model's performance on unseen examples. Test data provides an unbiased estimate of how well the model will perform in the real world.
Training Data
The dataset used to teach a machine learning model. It contains examples (and often labels) that the model learns patterns from during the training process. The quality and quantity of training data directly impact model performance.
V
Validation Data
A subset of data used during training to tune hyperparameters and monitor model performance without touching the test set. It acts as an intermediate checkpoint between training and final evaluation.
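The three splits above (training, validation, test) are typically carved from one dataset after a single shuffle. A minimal sketch; the 80/10/10 fractions are illustrative, not a recommendation:

```python
import random

def split(data, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once, then carve out test and validation sets;
    whatever remains is the training set."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

train, val, test = split(range(100))
print(len(train), len(val), len(test))  # 80 10 10
```

Fixing the seed keeps the split reproducible, so the test set stays untouched across experiments.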
Vector Database
A specialized database designed to store, index, and search high-dimensional vector embeddings efficiently. It enables fast similarity searches across millions or billions of vectors.
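At its core, similarity search scores every stored vector against a query, for example by cosine similarity. A brute-force sketch with made-up document vectors; real vector databases replace the full scan with approximate indexes such as HNSW:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def nearest(index, query, k=2):
    """Brute-force k-nearest-neighbour search by cosine similarity."""
    scored = sorted(index.items(),
                    key=lambda kv: cosine(kv[1], query),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

index = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.8, 0.2, 0.1],
    "doc_tax":  [0.0, 0.1, 0.9],
}
print(nearest(index, [1.0, 0.0, 0.0]))  # ['doc_cats', 'doc_dogs']
```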