ETL
Extract, Transform, Load — a data integration process that extracts data from source systems, transforms it into a usable format, and loads it into a destination system.
Why It Matters
ETL is the backbone of data infrastructure. Without it, raw data stays siloed in individual source systems and is difficult to use for analytics or AI/ML applications.
Example
Extracting customer data from Salesforce, transforming it by cleaning addresses, normalizing phone numbers, and deduplicating records, then loading it into a data warehouse.
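The transform step above can be sketched in a few lines of Python. This is a minimal, hypothetical illustration using only the standard library; the record fields and cleaning rules are assumptions for the example, not a real Salesforce schema or API.

```python
import re

def transform(records):
    """Clean addresses, normalize phone numbers, and deduplicate records."""
    seen = set()
    cleaned = []
    for rec in records:
        # Clean address: collapse runs of whitespace, strip edges
        address = re.sub(r"\s+", " ", rec["address"]).strip()
        # Normalize phone: keep digits only, e.g. "(555) 123-4567" -> "5551234567"
        phone = re.sub(r"\D", "", rec["phone"])
        # Deduplicate on a (email, phone) key
        key = (rec["email"].lower(), phone)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"email": rec["email"].lower(),
                        "address": address,
                        "phone": phone})
    return cleaned

raw = [
    {"email": "Ann@Example.com", "address": "1  Main   St ", "phone": "(555) 123-4567"},
    {"email": "ann@example.com", "address": "1 Main St", "phone": "555-123-4567"},
]
print(transform(raw))  # the two raw records collapse into one clean record
```

In a real pipeline the "extract" side would pull these records from the source system's API and the "load" side would bulk-insert the cleaned output into the warehouse; the transform logic in the middle is the part sketched here.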
Think of it like...
Like a food processing plant — raw ingredients come in, are washed, cut, packaged, and placed on shelves ready for use.
Related Terms
Data Pipeline
An automated workflow that extracts data from sources, transforms it through processing steps, and loads it into a destination for use. In ML, data pipelines ensure consistent data flow from raw sources to model training.
Data Warehouse
A structured, organized repository of cleaned and processed data optimized for analysis and reporting. Unlike data lakes, data warehouses store data in defined schemas.
Data Lake
A centralized repository that stores vast amounts of raw data in its native format until needed. Data lakes accept structured, semi-structured, and unstructured data at any scale.
Data Preprocessing
The process of cleaning, transforming, and organizing raw data into a format suitable for machine learning. This includes handling missing values, encoding categories, scaling features, and removing outliers.
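The preprocessing steps named in this definition (imputing missing values, encoding categories, scaling features) can be sketched as follows. The column names and toy values are hypothetical; real projects would typically use a library such as pandas or scikit-learn rather than the hand-rolled loops shown here.

```python
# Toy customer rows with a missing numeric value and a categorical column
rows = [
    {"age": 25, "plan": "basic"},
    {"age": None, "plan": "pro"},
    {"age": 35, "plan": "basic"},
]

# 1. Impute missing "age" values with the mean of the observed ages
ages = [r["age"] for r in rows if r["age"] is not None]
mean_age = sum(ages) / len(ages)
for r in rows:
    if r["age"] is None:
        r["age"] = mean_age

# 2. One-hot encode the categorical "plan" column
plans = sorted({r["plan"] for r in rows})
for r in rows:
    for p in plans:
        r[f"plan_{p}"] = 1 if r["plan"] == p else 0
    del r["plan"]

# 3. Min-max scale "age" into the [0, 1] range
lo = min(r["age"] for r in rows)
hi = max(r["age"] for r in rows)
for r in rows:
    r["age"] = (r["age"] - lo) / (hi - lo)

print(rows)
```

Outlier removal, the remaining step in the definition, would slot in before scaling, since extreme values distort the min/max used to rescale.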
Data Engineering
The practice of designing, building, and maintaining the systems and infrastructure that collect, store, and prepare data for analysis and machine learning.