Data Science

Data Lake

A centralized repository that stores vast amounts of raw data in its native format until needed. Data lakes accept structured, semi-structured, and unstructured data at any scale.

Why It Matters

Data lakes are a common source of training data for enterprise AI projects. They hold the raw material that data pipelines transform into ML-ready datasets.
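As a rough illustration of that transformation step, the sketch below turns raw clickstream records into flat rows that a model could consume. It assumes the raw data is stored as JSON lines; the function name `clickstream_to_rows` and the fields `user_id`, `page`, and `seconds_on_page` are hypothetical and chosen only for this example.

```python
import json


def clickstream_to_rows(raw_lines):
    """Turn raw JSON-lines click events into flat feature rows.

    Hypothetical schema: each event may carry user_id, page, and
    seconds_on_page. Malformed lines are skipped, not repaired, so the
    raw data in the lake stays untouched and can be reprocessed later.
    """
    rows = []
    for line in raw_lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip records that are not valid JSON
        if not isinstance(event, dict):
            continue  # skip valid JSON that is not an object
        rows.append({
            "user_id": event.get("user_id"),
            "page": event.get("page"),
            # Missing or null durations default to 0.0 seconds.
            "seconds_on_page": float(event.get("seconds_on_page") or 0.0),
        })
    return rows
```

The raw records are only read, never modified, which matches the idea that the lake keeps data in its native format and structure is applied when the data is used.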

Example

An enterprise data lake on Amazon S3 storing clickstream logs, customer records, PDF documents, images, and IoT sensor data, all in their original formats.
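One way such a lake might be organized is by giving every file an object key that records its source and ingestion date while leaving the file itself unchanged. The sketch below only builds key strings and does not call any storage API; the `raw/<source>/ingest_date=<date>/` layout and the helper name `lake_key` are assumptions for illustration, not a required convention.

```python
from datetime import date


def lake_key(source, filename, ingest_date):
    """Build an object key for a raw file, keeping its original name.

    The layout (raw/<source>/ingest_date=<YYYY-MM-DD>/<filename>) is a
    hypothetical convention used here only as an example.
    """
    return f"raw/{source}/ingest_date={ingest_date.isoformat()}/{filename}"


# Files of different types sit side by side, each in its native format.
keys = [
    lake_key("clickstream", "events.jsonl", date(2024, 1, 15)),
    lake_key("documents", "contract.pdf", date(2024, 1, 15)),
    lake_key("iot", "sensor-readings.csv", date(2024, 1, 16)),
]
```

Grouping by source and date keeps unrelated formats apart and lets a pipeline pick up only the files ingested on a given day.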

Think of it like...

A reservoir that collects water from many streams: the water (data) flows in from many sources and is available for different uses (analytics, ML, reporting).

Related Terms