Inference
The process of using a trained model to make predictions on new, previously unseen data. Inference is what happens when an AI model is deployed and actively serving results to users.
Why It Matters
Inference speed and cost determine the viability of AI applications in production. A model that is accurate but too slow or expensive to run is impractical.
Example
When you type a query into ChatGPT and receive a response, the model is performing inference — applying its learned knowledge to your specific input.
Think of it like...
Like the difference between studying for a test (training) and taking the test (inference) — you use what you learned to answer new questions.
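The training/inference split above can be sketched with a tiny model. This is a minimal illustration, not a real model: the weights and bias are hypothetical values standing in for parameters learned during training, and inference is simply applying them to a new input.

```python
def predict(features, weights, bias):
    """Apply already-learned parameters to one new input (inference)."""
    return sum(f * w for f, w in zip(features, weights)) + bias

# Parameters "learned" during training (illustrative values only).
weights = [0.4, -0.2, 0.1]
bias = 0.5

# Inference: the model has never seen this input before.
score = predict([1.0, 2.0, 3.0], weights, bias)
print(round(score, 2))
```

No parameters change here; that is the defining property of inference as opposed to training.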
Related Terms
Latency
The time delay between sending a request to an AI model and receiving the response. In ML systems, latency includes data preprocessing, model inference, and network transmission time.
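Latency is typically measured by timing a single request end to end. A minimal sketch, with `model_inference` as a hypothetical stand-in for a real model call (the `time.sleep` simulates compute time):

```python
import time

def model_inference(x):
    # Stand-in for a real model call; sleep simulates inference work.
    time.sleep(0.01)
    return x * 2

start = time.perf_counter()
result = model_inference(21)
latency_ms = (time.perf_counter() - start) * 1000
print(f"latency: {latency_ms:.1f} ms")
```

In a real system you would measure from the client side, so that preprocessing and network time are included, and report percentiles (p50, p99) rather than a single number.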
Throughput
The number of requests or predictions a model can process in a given time period. High throughput means the system can serve many users simultaneously.
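Throughput is the inverse view of the same measurement: requests completed divided by wall-clock time. A sketch, again with a hypothetical stand-in model:

```python
import time

def model_inference(x):
    return x * 2  # stand-in; a real model call would take far longer

n_requests = 1000
start = time.perf_counter()
results = [model_inference(i) for i in range(n_requests)]
elapsed = time.perf_counter() - start

throughput = n_requests / elapsed
print(f"{throughput:.0f} requests/sec")
```

Note that latency and throughput can trade off against each other: batching requests raises throughput but adds queueing delay to each individual request.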
Model Serving
The infrastructure and process of deploying trained ML models to production, where they can receive requests and return predictions in real time. This includes autoscaling, load balancing, and model version management.
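At its core, model serving wraps a model behind a network endpoint that accepts requests and returns predictions. A minimal sketch using only the Python standard library; the `/predict` route, the JSON schema, and the averaging "model" are all hypothetical, and a production system would use a dedicated serving framework instead:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

def predict(features):
    # Hypothetical stand-in for a trained model: just averages the input.
    return sum(features) / len(features)

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        payload = json.dumps({"prediction": predict(body["features"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # silence per-request logging for the demo

# Port 0 asks the OS for any free port; serve in a background thread.
server = HTTPServer(("127.0.0.1", 0), PredictHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Act as a client: send one prediction request to the running server.
req = Request(
    f"http://127.0.0.1:{server.server_port}/predict",
    data=json.dumps({"features": [1.0, 2.0, 3.0]}).encode(),
    headers={"Content-Type": "application/json"},
)
with urlopen(req) as resp:
    result = json.loads(resp.read())
print(result)
server.shutdown()
```

The scaling, load balancing, and versioning mentioned above sit around this core loop: many such workers behind a balancer, each pinned to a specific model version.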
Edge Inference
Running AI models directly on local devices (phones, IoT sensors, cameras) rather than sending data to the cloud. This reduces latency, preserves privacy, and works without internet connectivity.
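Because edge devices have limited memory and compute, models are often quantized before deployment; int8 quantization is one common approach, though the source entry does not prescribe a specific technique. A hedged sketch: float weights are mapped to int8 values, and inference then runs entirely in-process, with no network call, as in edge inference.

```python
def quantize(weights, scale=127.0):
    """Map float weights in [-1, 1] to int8 values (illustrative scheme)."""
    return [round(w * scale) for w in weights]

def quantized_predict(features, q_weights, scale=127.0):
    # Inference happens locally on the device -- no cloud round trip.
    return sum(f * (q / scale) for f, q in zip(features, q_weights))

weights = [0.5, -0.25]          # hypothetical trained weights
q = quantize(weights)           # compact int8 representation
score = quantized_predict([2.0, 1.0], q)
print(round(score, 3))
```

The exact float result would be 0.75; the small deviation in the quantized output is the accuracy cost traded for the memory savings that make on-device inference feasible.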