Prometheus is an open-source monitoring and alerting toolkit designed for recording real-time metrics in a time-series database (TSDB), built especially for cloud-native and container-based environments like Kubernetes.
It collects and stores metrics as time series data (i.e., values with timestamps), supports powerful querying via PromQL, and integrates with Grafana and Alertmanager.
Key Features of Prometheus:
| Feature | Description |
|---|---|
| Time Series Storage | Stores metrics as time series with labels |
| Pull-based model | Prometheus scrapes metrics from targets, unlike push-based systems |
| PromQL | Built-in query language for filtering, calculations, and alert conditions |
| Service Discovery | Automatically finds services via Kubernetes, Consul, EC2, etc. |
| Visualization | Built-in graph UI; best used with Grafana dashboards |
| Alerting | Define alerts based on thresholds; send notifications via Alertmanager |
Why Use Prometheus?
- Open source and widely used
- Lightweight and easy to install
- Well suited to microservices and containerized apps
- Strong community and support ecosystem
- Compatible with exporters for system, database, and application monitoring
Prometheus Components
Prometheus has several core components that work together to
collect, store, query, and alert based on metrics.
| Component | Description |
|---|---|
| Prometheus Server | Core component that collects (scrapes), stores, and queries metrics |
| Exporters | Services or agents that expose metrics in Prometheus format |
| Alerting Rules | PromQL-based rules to define alert conditions |
| Alertmanager | Manages alerts: grouping, silencing, routing to email, Slack, etc. |
| Pushgateway | Lets short-lived jobs push metrics to an intermediary that Prometheus then scrapes |
| Service Discovery | Auto-discovers scrape targets (e.g., Kubernetes pods, EC2 instances) |
| Prometheus UI | Built-in web interface to run queries, check targets, and view alerts |
| Grafana | External tool for dashboards and visualizations using Prometheus data |
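To make the exporter and server roles concrete, here is a minimal Python sketch using only the standard library. It is not how Prometheus or any real exporter is implemented: the metric name `demo_requests_total`, the `REQUESTS_TOTAL` state, and the helper functions are all hypothetical. It shows a tiny HTTP endpoint exposing one counter at `/metrics` in the text exposition format, and a client pulling it once, the way a scrape would.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory state for this sketch: a single counter value.
REQUESTS_TOTAL = {"value": 0}


def render_metrics() -> str:
    """Render the counter in the Prometheus text exposition format."""
    return (
        "# HELP demo_requests_total Total requests seen by the demo app.\n"
        "# TYPE demo_requests_total counter\n"
        f"demo_requests_total {REQUESTS_TOTAL['value']}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    """Plays the exporter role: serves current metric values on /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the sketch quiet


def scrape_once(url: str) -> str:
    """Plays the server role: pull the metrics page over HTTP."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def demo_scrape() -> str:
    """Start the exporter on a free local port, scrape it once, shut it down."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0 = any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        REQUESTS_TOTAL["value"] = 3
        port = server.server_address[1]
        return scrape_once(f"http://127.0.0.1:{port}/metrics")
    finally:
        server.shutdown()
        server.server_close()
```

Calling `demo_scrape()` returns the plain-text metrics page. The point of the design is that the exporter only answers requests; the scraper decides when data is collected.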
Prometheus vs. Other Monitoring Tools
| Feature | Prometheus | Graphite | Elastic Stack | Datadog |
|---|---|---|---|---|
| Architecture | Pull-based | Push-based | Push (Beats/Logstash shippers) | Agent-based |
| Primary Data Model | Time series | Time series | Logs and documents | Metrics, traces, and logs |
| Query Language | PromQL | Graphite functions | KQL, Lucene, Elasticsearch Query DSL | Proprietary query syntax |
| Horizontal Scaling | Via federation or sharding | Limited | Supported | Fully managed service |
| Open source | Yes | Yes | Yes | No |
Basic Terminologies in Prometheus
Basic Prometheus terminology, explained simply and with practical context:
| Term | Meaning |
|---|---|
| Time Series | A stream of timestamped values for one metric and one set of labels |
| Metric | A measured value collected from a system, such as CPU, memory, or request count (e.g., http_requests_total) |
| Label | Key-value pair to differentiate time series |
| Job | Logical group of scrape targets |
| Instance | A single scrape target (usually host:port) |
| Target | Actual endpoint Prometheus scrapes metrics from |
| Scraping | Prometheus pulling data from targets at regular intervals |
| PromQL | Prometheus Query Language to analyze and fetch data |
| Exporter | Service or agent that exposes metrics in Prometheus format |
| Recording Rule | Precomputed PromQL query result stored as a new time series |
| Alerting Rule | PromQL expression that triggers alerts when conditions are met |
| Alertmanager | Handles alerts: grouping, deduping, routing to email, Slack, etc. |
| Pushgateway | Lets short-lived jobs push metrics to an intermediary that Prometheus then scrapes |
| Retention | Duration Prometheus keeps time-series data (e.g., 15 days) |
| TSDB | Time Series Database used internally by Prometheus |
| Service Discovery | Automatically finds targets (Kubernetes, Consul, EC2, etc.) |
| Histogram | Metric type that counts observations in configurable buckets |
| Summary | Metric type similar to histogram but provides quantile estimation |
| Gauge | Metric that can go up or down (e.g., memory usage, temperature) |
| Counter | Monotonically increasing metric (e.g., number of requests) |
| Label Set | Complete set of labels attached to a metric time series |
| Expression Browser | Prometheus web UI for querying and visualizing time series data |
Architecture of Prometheus
✅ 1. Time Series
- What: A series of metric values tracked over time.
- Example: CPU usage of a server every 10 seconds.
✅ 2. Metric
- What: The actual measurement name.
- Example: http_requests_total (total HTTP requests)
✅ 3. Label
- What: Key-value pair to give more information about a metric.
- Example: http_requests_total{method="GET", status="200"}, where method and status are labels.
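The relationship between metric names, labels, and time series can be sketched in a few lines of Python. This is an illustration only, not how the Prometheus TSDB is implemented; `series_key`, `append_sample`, and `store` are hypothetical names. The idea it shows is that each distinct combination of metric name and labels identifies its own time series, regardless of the order in which the labels are written.

```python
from typing import Dict, FrozenSet, List, Tuple

# A series is identified by its metric name plus its full set of labels.
SeriesKey = Tuple[str, FrozenSet[Tuple[str, str]]]

# Hypothetical in-memory store: series key -> list of (timestamp, value).
store: Dict[SeriesKey, List[Tuple[float, float]]] = {}


def series_key(metric: str, labels: Dict[str, str]) -> SeriesKey:
    """Build an order-independent identity for a time series."""
    return (metric, frozenset(labels.items()))


def append_sample(metric: str, labels: Dict[str, str], timestamp: float, value: float) -> None:
    """Append one sample to the series identified by metric name and labels."""
    store.setdefault(series_key(metric, labels), []).append((timestamp, value))


# Same labels in a different order land in the same series.
append_sample("http_requests_total", {"method": "GET", "status": "200"}, 1000.0, 10.0)
append_sample("http_requests_total", {"status": "200", "method": "GET"}, 1010.0, 12.0)
# A different label value creates a separate series.
append_sample("http_requests_total", {"method": "POST", "status": "200"}, 1010.0, 3.0)
```

One practical consequence: every new label value creates a new series, so labels with many possible values (user IDs, for example) multiply the number of series that must be stored.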
✅ 4. Job
- What: A group of similar targets.
- Example: All Node Exporters can be under job node.
✅ 5. Instance
- What: A single target with address (host:port).
- Example: 10.1.2.3:9100 for one Node Exporter.
✅ 6. Target
- What: The actual endpoint Prometheus collects data from.
- Includes: job, instance, labels.
✅ 7. Scraping
- What: The process where Prometheus collects data from a target.
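What a scrape returns is plain text, one sample per line. The Python sketch below parses a simplified subset of that text format into (name, labels, value) tuples. It is an assumption-laden illustration: `parse_exposition` is a hypothetical helper, and it ignores escaped quotes in label values, optional timestamps, and other details that a real parser handles.

```python
import re
from typing import Dict, List, Tuple

# metric_name, optional {labels}, then the sample value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# label="value" pairs; this simplified pattern does not support escaped quotes.
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_exposition(text: str) -> List[Tuple[str, Dict[str, str], float]]:
    """Parse scraped text into (metric name, labels, value) samples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this sketch cannot understand
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


SCRAPED_TEXT = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027
http_requests_total{method="POST",status="500"} 3
process_resident_memory_bytes 2.5e+07
"""

parsed = parse_exposition(SCRAPED_TEXT)
```

At scrape time the server also attaches the target's job and instance labels to each sample, which this sketch leaves out.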
✅ 8. PromQL
- What: Prometheus Query Language.
- Use: To filter, calculate, and display data.
- Example: rate(http_requests_total[5m])
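To give a feel for what rate() computes, here is a simplified Python sketch. It is not the actual PromQL implementation: the real function also extrapolates to the edges of the time window, which is omitted here, and `simple_rate` is a hypothetical name. The sketch shows the core idea: sum the increases of a counter across a window, treat any drop as a counter reset, and divide by the elapsed seconds.

```python
from typing import List, Tuple


def simple_rate(samples: List[Tuple[float, float]]) -> float:
    """Per-second increase of a counter over (timestamp, value) samples.

    Simplified: no window-boundary extrapolation as in PromQL's rate().
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")

    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # The counter dropped, so assume the process restarted at zero
            # and count everything accumulated since the reset.
            increase += current
    return increase / elapsed


# Samples every 60 seconds; the counter resets between t=120 and t=180.
window = [(0.0, 100.0), (60.0, 160.0), (120.0, 220.0), (180.0, 10.0), (240.0, 70.0)]
requests_per_second = simple_rate(window)
```

Here the increases are 60, 60, 10 (after the reset), and 60, so the result is 190 requests over 240 seconds.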
✅ 9. Exporter
- What: A tool that exposes metrics in a Prometheus-readable format.
- Example:
  - node_exporter for Linux server metrics
  - mysqld_exporter for MySQL metrics
✅ 10. Recording Rule
- What: Saves the result of a query as a new metric.
- Why: Reduces query load and speeds up dashboards.
✅ 11. Alerting Rule
- What: A rule that defines when an alert should fire.
- Example: Alert when CPU usage > 90% for 5 minutes.
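The "for 5 minutes" part of an alerting rule means the condition must hold continuously before the alert fires. The Python sketch below models that behaviour under simplifying assumptions: `alert_states` is a hypothetical function, and each sample stands in for one rule evaluation. An alert moves from inactive to pending when the condition first becomes true, and to firing once it has stayed true for the whole duration.

```python
from typing import List, Optional, Tuple


def alert_states(
    samples: List[Tuple[float, float]],
    threshold: float = 90.0,
    for_seconds: float = 300.0,
) -> List[str]:
    """Return the alert state at each (timestamp, value) evaluation."""
    states = []
    active_since: Optional[float] = None  # when the condition first became true
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp
            held_for = timestamp - active_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            active_since = None  # condition cleared, so the timer starts over
            states.append("inactive")
    return states


# CPU usage (%) evaluated once per minute.
cpu = [50.0, 95.0, 96.0, 97.0, 98.0, 99.0, 99.0, 40.0, 95.0]
evaluations = [(60.0 * i, value) for i, value in enumerate(cpu)]
states = alert_states(evaluations)
```

The pending stage is what keeps short spikes from paging anyone: a single reading above 90% never reaches firing.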
✅ 12. Alertmanager
- What: Manages alerts – sends them via Email, Slack, etc.
- Also handles:
  - Grouping
  - Silencing
  - Routing
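Grouping and silencing can be illustrated with a small Python sketch. This is not Alertmanager's implementation or configuration format; `group_alerts`, `is_silenced`, and the alert dictionaries are hypothetical. It shows the two ideas: alerts that share the same values for chosen labels are bundled into one notification group, and alerts whose labels match a silence are suppressed.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Alert = Dict[str, Dict[str, str]]  # {"labels": {...}} in this sketch


def group_alerts(alerts: List[Alert], group_by: List[str]) -> Dict[Tuple, List[Alert]]:
    """Bundle alerts that share the same values for the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple((name, alert["labels"].get(name, "")) for name in group_by)
        groups[key].append(alert)
    return dict(groups)


def is_silenced(alert: Alert, silences: List[Dict[str, str]]) -> bool:
    """An alert is silenced if every label in some non-empty silence matches."""
    return any(
        matchers and all(alert["labels"].get(k) == v for k, v in matchers.items())
        for matchers in silences
    )


alerts = [
    {"labels": {"alertname": "HighCPU", "cluster": "a", "instance": "10.1.2.3:9100"}},
    {"labels": {"alertname": "HighCPU", "cluster": "a", "instance": "10.1.2.4:9100"}},
    {"labels": {"alertname": "HighCPU", "cluster": "b", "instance": "10.2.0.1:9100"}},
]

groups = group_alerts(alerts, ["alertname", "cluster"])
active = [a for a in alerts if not is_silenced(a, [{"cluster": "b"}])]
```

With grouping by alertname and cluster, the two alerts from cluster a would arrive as a single notification instead of two.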
✅ 13. Pushgateway
- What: Lets short-lived jobs push metrics to an intermediary that Prometheus then scrapes.
- Why needed: Those jobs may finish before Prometheus can scrape them.
✅ 14. Retention
- What: How long Prometheus stores data.
- Default: 15 days (can be customized)
✅ 15. TSDB (Time Series DB)
- What: Internal database where Prometheus stores its data.
- Supports: Fast reads, writes, and compression.
✅ 16. Service Discovery
- What: Automatically detects targets (like Kubernetes pods, EC2).
- Benefit: No need to manually add each new server.
✅ 17. Histogram
- What: Metric type that counts values in buckets.
- Used for: Request duration, response sizes.
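The bucket idea behind a histogram can be sketched in Python. This is an illustration, not a client library: `SketchHistogram` and its bucket bounds are hypothetical. It shows that each observation lands in the first bucket whose upper bound is greater than or equal to the value, and that buckets are reported cumulatively, so each one counts every observation at or below its bound, ending with a catch-all +Inf bucket.

```python
import bisect
import math
from typing import List, Tuple


class SketchHistogram:
    """Counts observations in buckets defined by inclusive upper bounds."""

    def __init__(self, upper_bounds: List[float]):
        # Always end with +Inf so every observation falls into some bucket.
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.upper_bounds)  # per-bucket, not cumulative
        self.sum = 0.0
        self.count = 0

    def observe(self, value: float) -> None:
        # bisect_left finds the first bound that is >= value.
        index = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[index] += 1
        self.sum += value
        self.count += 1

    def cumulative(self) -> List[Tuple[float, int]]:
        """Return (upper bound, observations <= bound) pairs."""
        result = []
        running = 0
        for bound, bucket_count in zip(self.upper_bounds, self.bucket_counts):
            running += bucket_count
            result.append((bound, running))
        return result


# Request durations in seconds, with buckets at 0.1s, 0.5s, and 1s.
durations = SketchHistogram([0.1, 0.5, 1.0])
for seconds in [0.05, 0.1, 0.3, 0.7, 2.0]:
    durations.observe(seconds)
buckets = durations.cumulative()
```

Because only bucket counts are kept, not the raw values, storage stays small, at the cost of percentiles being estimates bounded by the bucket edges.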
✅ 18. Summary
- What: Like a histogram, but reports quantiles (e.g., the 95th percentile) computed on the client side.
- Used for: Latency measurement.
✅ 19. Gauge
- What: Metric that goes up and down.
- Examples: Memory usage, temperature.
✅ 20. Counter
- What: A value that only increases, or resets to zero when the process restarts.
- Examples: Total HTTP requests, total errors.
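The difference between a gauge and a counter comes down to which operations are allowed. The Python sketch below illustrates this with two hypothetical classes, `SketchGauge` and `SketchCounter`; they are not the API of any Prometheus client library, although real libraries enforce similar rules.

```python
class SketchCounter:
    """A value that can only go up (it starts again at zero on restart)."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


class SketchGauge:
    """A value that can be set, raised, or lowered freely."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value

    def inc(self, amount: float = 1.0) -> None:
        self.value += amount

    def dec(self, amount: float = 1.0) -> None:
        self.value -= amount


http_requests_total = SketchCounter()
http_requests_total.inc()
http_requests_total.inc(4)

memory_usage_bytes = SketchGauge()
memory_usage_bytes.set(512.0)
memory_usage_bytes.dec(128.0)
```

A rule of thumb that follows from this: if the current value is meaningful on its own (memory in use), use a gauge; if only the rate of change is meaningful (requests served), use a counter.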
✅ 21. Label Set
- What: All labels assigned to a time series.
- Helps to: Identify and group metrics.
✅ 22. Expression Browser
- What: Built-in UI in Prometheus to test and run queries.
- Use: For debugging or checking real-time metrics.
