Grafana Stack Setup
Notes on standing up a Grafana-based monitoring stack for infrastructure. The core components are Grafana for dashboards, Prometheus for metrics, Loki for logs, Alertmanager for routing alerts, and a handful of exporters to get data in.
Architecture
At a minimum you’ve got Prometheus scraping metrics from exporters, Grafana querying Prometheus for dashboards, and Alertmanager handling any alert rules that fire. Adding Loki gives you logs in the same UI. Exporters do the actual data collection: SNMP for network devices, Node Exporter for Linux hosts, Windows Exporter for Windows, and community or bespoke exporters for anything else.
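Sketched as a data flow (Loki and its log agent are the optional path):

```text
  exporters (node, snmp, windows, ...)
        ▲ scrape
        │
   Prometheus ── alert rules ──► Alertmanager ──► email / chat / etc.
        │
        │ PromQL                        LogQL
        └──────────► Grafana ◄──────────────── Loki ◄── log agent (push)
```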
Prometheus will usually be the primary data source, but Grafana talks to plenty of others if you need them — InfluxDB, Elasticsearch, Azure Monitor, CloudWatch.
Core Components
Prometheus
Prometheus scrapes metrics from configured targets at regular intervals.
Basic prometheus.yml configuration:
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

# Load rules once and periodically evaluate them
rule_files:
  - "alerts/*.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'snmp'
    metrics_path: /snmp
    params:
      module: [if_mib]
    static_configs:
      - targets:
          - 192.168.1.1  # Network devices
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9116
```

Grafana
Install and configure Grafana for visualization.
Add Prometheus data source:
- Navigate to Configuration → Data Sources
- Add Prometheus
- Set URL: http://localhost:9090
- Save & Test
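The same data source can be provisioned from a file instead of clicked through the UI, which suits the Docker Compose stack below since it already mounts ./grafana/provisioning. A sketch (the filename is arbitrary):

```yaml
# ./grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
```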
Loki (Optional)
Loki provides log aggregation in the same space as Elasticsearch, but it is built around Grafana and indexes only log labels rather than full log content, which keeps it lightweight to run.
Use cases:
- Application logs
- System logs (syslog)
- Authentication logs (Active Directory, etc.)
- Container logs
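Loki only stores and queries logs; an agent has to ship them in, and Promtail is the usual choice. A minimal Promtail sketch (the paths and labels here are illustrative):

```yaml
# promtail-config.yaml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://localhost:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          __path__: /var/log/*.log
```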
Example Loki server configuration:
```yaml
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
  chunk_idle_period: 5m
  chunk_retain_period: 30s

schema_config:
  configs:
    - from: 2020-05-15
      store: boltdb
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 168h

storage_config:
  boltdb:
    directory: /loki/index
  filesystem:
    directory: /loki/chunks

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
```

Common Exporters
SNMP Exporter
For network devices (switches, routers, firewalls).
See: Prometheus SNMP Exporter Configuration
Node Exporter
For Linux server metrics (CPU, memory, disk, network).
```bash
# Install
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-*.tar.gz
sudo cp node_exporter-*/node_exporter /usr/local/bin/

# Systemd service
sudo tee /etc/systemd/system/node_exporter.service <<EOF
[Unit]
Description=Node Exporter
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
```

Windows Exporter
For Windows server metrics.
Download from: https://github.com/prometheus-community/windows_exporter
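Once installed as a Windows service it listens on port 9182 by default, so the Prometheus side is just another scrape job (the hostname below is illustrative):

```yaml
- job_name: 'windows_exporter'
  static_configs:
    - targets: ['winserver01.example.com:9182']
```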
Community and custom exporters
The Prometheus exporters page lists officially maintained and community-contributed exporters for most databases, message brokers, hypervisors, and cloud platforms you’re likely to need. If there isn’t one for your system, Prometheus exposes a simple HTTP text format so it’s straightforward to write your own in the language you prefer.
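As a sketch of how little that text format demands, here is a toy exporter using only the Python standard library — it serves a single gauge on /metrics (the metric name, value, and port are made up; real exporters should use the official client libraries, which handle label escaping and the registry for you):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def render_metrics():
    """Render one gauge in the Prometheus text exposition format."""
    value = 42  # stand-in for whatever you actually measure
    return (
        "# HELP myapp_queue_depth Current depth of the work queue.\n"
        "# TYPE myapp_queue_depth gauge\n"
        f"myapp_queue_depth {value}\n"
    )

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode()
        self.send_response(200)
        # version=0.0.4 identifies the Prometheus text format
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To run it:
#   HTTPServer(("", 9200), MetricsHandler).serve_forever()
```

Point Prometheus at it with an ordinary scrape job and the gauge shows up like any other metric.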
Alert Configuration
Alertmanager Setup
alertmanager.yml example:
```yaml
global:
  resolve_timeout: 5m

route:
  receiver: 'email-notifications'
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h

receivers:
  - name: 'email-notifications'
    email_configs:
      - to: 'alerts@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alertmanager@example.com'
        auth_password: 'password'
```

Common Alert Rules
prometheus/alerts/infrastructure.yml:
```yaml
groups:
  - name: infrastructure
    interval: 30s
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} has been down for more than 5 minutes."

      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 85% for more than 10 minutes."

      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"

      - alert: InterfaceDown
        expr: ifOperStatus{job="snmp"} == 2
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Interface {{ $labels.ifDescr }} down on {{ $labels.instance }}"
```

Network Device Monitoring
For comprehensive network monitoring, combine:
- SNMP Exporter - Metrics from switches, routers, firewalls
- Grafana Dashboards - Visualization of bandwidth, errors, status
- AlertManager - Notifications for interface down, high utilization
Example monitoring setup:
| Device Type | Metrics Collected | Module |
|---|---|---|
| Cisco Switches | CPU, Memory, Temp, Interfaces | cisco_devices, if_mib |
| FortiGate Firewalls | CPU, Sessions, Policies, VPN | fortigate_devices, if_mib |
| MikroTik Routers | Interfaces, Routes, Wireless | if_mib |
| Ubiquiti APs | Clients, Signal, Bandwidth | ubnt_devices |
| UPS Devices | Battery, Load, Runtime | ups_mib |
Dashboard design
Put the most important metrics at the top and group related ones together. Sync time ranges across related panels so you’re comparing apples to apples. Use colours and thresholds to surface when something’s wrong without requiring the viewer to read every number. Link panels to detailed views so a dashboard is a starting point for investigation, not a dead end.
For panel types: time series for anything changing over time (bandwidth, CPU, memory), stat panels for single current values, gauges for current percentages, tables for inventories, and status panels for up/down indicators.
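A few starting-point queries for those panels, assuming the node_exporter and if_mib metrics configured above (interface and label names vary by device):

```promql
# Time series: CPU utilisation per host
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Time series: inbound interface bandwidth in bits/s (64-bit SNMP counters)
irate(ifHCInOctets{job="snmp"}[5m]) * 8

# Gauge: memory used, as a percentage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Stat / status panel: scrape target up or down
up
```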
Docker Compose Stack
docker-compose.yml example:
```yaml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    ports:
      - 9090:9090

  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    ports:
      - 3000:3000

  loki:
    image: grafana/loki:latest
    volumes:
      - ./loki:/etc/loki
    ports:
      - 3100:3100
    command: -config.file=/etc/loki/local-config.yaml

  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./alertmanager:/etc/alertmanager
    command:
      - '--config.file=/etc/alertmanager/config.yml'
    ports:
      - 9093:9093

  snmp_exporter:
    image: prom/snmp-exporter:latest
    volumes:
      - ./snmp_exporter:/etc/snmp_exporter
    command:
      - '--config.file=/etc/snmp_exporter/snmp.yml'
    ports:
      - 9116:9116

volumes:
  prometheus_data:
  grafana_data:
```

Performance notes
Scrape intervals in the 15–60s range cover most infrastructure use cases; going tighter than 15s costs more load than it’s worth unless you’ve got a specific reason. Set retention on Prometheus to something sensible for your actual needs (15–90 days is typical) and use recording rules or a long-term store for anything beyond that. Be picky about what metrics you actually collect — every exporter will happily give you a thousand metrics nobody’s ever going to look at. Give Prometheus enough CPU and RAM that ingestion doesn’t lag behind scraping, which shows up as gaps on dashboards.
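Concretely, retention is a flag on the Prometheus process, and a recording rule precomputes an expression so dashboards and long-range queries stay cheap. A sketch (the rule name follows the common level:metric:operation convention but is otherwise up to you):

```yaml
# Prometheus startup flag (e.g. appended under `command:` in docker-compose.yml):
#   --storage.tsdb.retention.time=30d

# Recording rule, loaded via rule_files like the alert rules above:
groups:
  - name: recording
    rules:
      - record: instance:node_cpu_utilisation:rate5m
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```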
Related Documentation
- Prometheus SNMP Exporter Configuration
- Switch Monitoring with CheckMK and SNMP
- Netbox - Network documentation and IPAM integration