Training Data
The dataset used to teach a machine learning model. It contains examples (and often labels) that the model learns patterns from during the training process. The quality and quantity of training data directly impact model performance.
Why It Matters
Garbage in, garbage out — training data quality is often the single biggest factor in model success. Biased or incomplete data leads to biased or unreliable models.
Example
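ImageNet is a dataset of more than 14 million labeled images across roughly 20,000 categories. Its 1,000-category subset, used in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), served as training data for many breakthrough computer vision models.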
Think of it like...
Like the textbooks and practice problems a student uses to learn — better study materials lead to better understanding and test performance.
Related Terms
Data Labeling
The process of assigning meaningful tags or annotations to raw data so it can be used for supervised learning. Labels tell the model what the correct answer should be for each training example.
Data Augmentation
Techniques for artificially expanding a training dataset by creating modified versions of existing data. This helps models generalize better, especially when training data is limited.
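As a rough illustration of the idea, the sketch below mirrors images horizontally and adds the mirrored copies to the dataset with the original labels. It assumes images are stored as plain lists of pixel rows; the names `horizontal_flip`, `augment`, and `flip_probability` are hypothetical and not taken from any particular library.

```python
import random

def horizontal_flip(image):
    """Return a mirrored copy of an image stored as a list of pixel rows."""
    return [list(reversed(row)) for row in image]

def augment(dataset, flip_probability=0.5, seed=0):
    """Return the original (image, label) pairs plus flipped copies of some of them."""
    rng = random.Random(seed)
    augmented = list(dataset)
    for image, label in dataset:
        # The label is reused unchanged: a mirrored cat is still a cat.
        if rng.random() < flip_probability:
            augmented.append((horizontal_flip(image), label))
    return augmented

# Hypothetical 2x3 "image" used only for illustration.
tiny_image = [[1, 2, 3],
              [4, 5, 6]]
dataset = [(tiny_image, "cat")]
expanded = augment(dataset, flip_probability=1.0)
```

Flipping only makes sense when the label survives the transformation. It would be a poor choice for handwritten digits or text, for instance, where mirroring changes the meaning.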
Synthetic Data
Artificially generated data that mimics the statistical properties and patterns of real data. It is created using algorithms, simulations, or generative models rather than collected from real-world events.
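One very simple way to picture this is to estimate summary statistics from real measurements and then sample new values from a distribution with the same statistics. The sketch below assumes a single numeric feature that is roughly normally distributed; the helper names and the sample values are made up for illustration, and real synthetic-data generators are usually far more elaborate.

```python
import random
import statistics

def fit_gaussian(values):
    """Estimate the mean and sample standard deviation of the real data."""
    return statistics.mean(values), statistics.stdev(values)

def generate_synthetic(values, n_samples, seed=0):
    """Draw new values from a normal distribution fitted to the real data."""
    mean, std = fit_gaussian(values)
    rng = random.Random(seed)
    return [rng.gauss(mean, std) for _ in range(n_samples)]

# Hypothetical measurements, invented purely for this example.
real_heights_cm = [158.0, 162.5, 170.0, 175.5, 181.0, 166.0]
synthetic_heights_cm = generate_synthetic(real_heights_cm, n_samples=1000)
```

The synthetic values share the mean and spread of the originals but correspond to no real individual. They also inherit any bias present in the real sample, so synthetic data does not by itself fix a skewed dataset.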
Test Data
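A separate portion of data held back from training that is used to evaluate a model's performance on unseen examples. Test data provides an unbiased estimate of how well the model will perform in the real world, provided it resembles the data the model will actually encounter and is never used to train or tune the model.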
Validation Data
A subset of data, kept separate from the examples the model learns from, that is used during training to tune hyperparameters and monitor model performance without touching the test set. It acts as an intermediate checkpoint between training and final evaluation.
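The relationship between training, validation, and test data can be sketched as a single shuffled split. The function name `train_val_test_split` and the 80/10/10 proportions below are assumptions chosen for illustration, not a standard; suitable fractions depend on how much data is available.

```python
import random

def train_val_test_split(examples, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle the examples once, then cut them into three disjoint subsets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]                  # held back for final evaluation
    val = shuffled[n_test:n_test + n_val]     # used for tuning during training
    train = shuffled[n_test + n_val:]         # what the model learns from
    return train, val, test

# Stand-in for 100 labeled examples.
examples = list(range(100))
train, val, test = train_val_test_split(examples)
```

Shuffling before splitting avoids subsets that differ only because of how the data was ordered. A plain random split like this is not appropriate for every dataset; time series, for example, are usually split chronologically instead.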