Data Lake
A centralized repository that stores vast amounts of raw data in its native format until needed. Data lakes accept structured, semi-structured, and unstructured data at any scale.
Why It Matters
Data lakes are a common source of training data for enterprise AI projects. They hold the raw material that data pipelines transform into ML-ready datasets.
Example
An enterprise data lake on AWS S3 storing clickstream logs, customer records, PDF documents, images, and IoT sensor data — all in their original formats.
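To make "original formats" concrete, here is a minimal sketch of landing files in a lake without parsing them. It assumes a local temporary directory standing in for an S3 bucket; the helper names (land_raw, list_lake) and the raw/<source>/ key layout are hypothetical conventions chosen for illustration, not an AWS API.

```python
import json
import tempfile
from pathlib import Path


def land_raw(lake_root: Path, source: str, filename: str, payload: bytes) -> Path:
    """Write payload unchanged under <lake_root>/raw/<source>/<filename>."""
    target = lake_root / "raw" / source / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)  # no parsing, no schema: bytes land as-is
    return target


def list_lake(lake_root: Path) -> list[str]:
    """Return every stored object as a sorted, POSIX-style relative key."""
    return sorted(
        p.relative_to(lake_root).as_posix()
        for p in lake_root.rglob("*")
        if p.is_file()
    )


# Assumption: a temp directory plays the role of the bucket.
lake = Path(tempfile.mkdtemp())

# Three sources, three formats, all stored exactly as received.
land_raw(lake, "clickstream", "2024-01-01.jsonl",
         json.dumps({"user": "u1", "page": "/home"}).encode() + b"\n")
land_raw(lake, "crm", "customers.csv", b"id,name\n1,Ada\n")
land_raw(lake, "sensors", "device-7.bin", bytes([0x01, 0x02, 0xFF]))

keys = list_lake(lake)
print(keys)
```

Grouping objects by source under a common prefix is one common way to keep a lake navigable; real deployments often add date partitions and separate zones for cleaned data.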
Think of it like...
Like a reservoir that collects water from many streams — the water (data) flows in from everywhere and is available for different uses (analytics, ML, reporting).
Related Terms
Data Warehouse
A structured, organized repository of cleaned and processed data optimized for analysis and reporting. Unlike data lakes, which defer structure until the data is read, data warehouses enforce a defined schema when data is loaded.
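The contrast with a data lake can be sketched in a few lines. This is an illustration only: an in-memory SQLite table stands in for a warehouse, a list of JSON strings stands in for raw lake files, and the table name, column names, and sample records are all made up.

```python
import json
import sqlite3

# Raw records as they might sit in a lake: fields can vary per record.
raw_events = [
    '{"user_id": "u1", "amount": "19.99"}',
    '{"user_id": "u2", "amount": "5.00", "coupon": "SPRING"}',  # extra field
]


def read_amounts(lines):
    """Lake-style access: interpret the raw text only at read time."""
    return [float(json.loads(line)["amount"]) for line in lines]


# Warehouse-style access: declare the schema first, then load rows that fit it.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (user_id TEXT NOT NULL, amount REAL NOT NULL)"
)
for line in raw_events:
    record = json.loads(line)
    # Only the declared columns are kept; "coupon" has no place in the table.
    conn.execute(
        "INSERT INTO orders (user_id, amount) VALUES (?, ?)",
        (record["user_id"], float(record["amount"])),
    )

amounts = read_amounts(raw_events)
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
columns = [row[1] for row in conn.execute("PRAGMA table_info(orders)")]
print(amounts, columns)
```

The lake keeps the coupon field available for later use, while the warehouse table holds only what its schema declares.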
Data Pipeline
An automated workflow that extracts data from sources, transforms it through processing steps, and loads it into a destination for use. In ML, data pipelines ensure consistent data flow from raw sources to model training.
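The extract, transform, and load steps described above can be sketched as three small functions. This is a minimal illustration under stated assumptions: the CSV text, the column names, and the in-memory list used as the destination are invented for the example, and a production pipeline would typically use an orchestration tool and durable storage.

```python
import csv
import io

# Assumption: a small CSV string stands in for a raw source file.
RAW_CSV = "user_id,age,clicked\n1,34,yes\n2,,no\n3,27,yes\n"


def extract(text):
    """Extract: read raw rows from the source as dictionaries of strings."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing age and cast fields to numbers."""
    clean = []
    for row in rows:
        if not row["age"]:
            continue  # skip incomplete records
        clean.append({
            "user_id": int(row["user_id"]),
            "age": int(row["age"]),
            "clicked": 1 if row["clicked"] == "yes" else 0,
        })
    return clean


def load(rows, destination):
    """Load: append the cleaned rows to the destination, return the count."""
    destination.extend(rows)
    return len(rows)


def run_pipeline(text, destination):
    """Run the three stages in order."""
    return load(transform(extract(text)), destination)


training_table = []  # hypothetical destination for model training data
loaded = run_pipeline(RAW_CSV, training_table)
print(loaded, training_table)
```

Keeping each stage a separate function makes it straightforward to test the stages in isolation and to rerun the same steps on new data.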
Data Engineering
The practice of designing, building, and maintaining the systems and infrastructure that collect, store, and prepare data for analysis and machine learning.
Data Governance
The overall management of data availability, usability, integrity, and security in an organization. It includes policies, standards, and practices for how data is collected, stored, and used.
Cloud Computing
On-demand access to computing resources (servers, storage, databases, AI services) over the internet. Cloud providers like AWS, Azure, and GCP offer scalable infrastructure, so customers do not need to own physical hardware.