Grafana Stack Setup

6 October 2025 · Updated 6 October 2025

Notes on standing up a Grafana-based monitoring stack for infrastructure. The core components are Grafana for dashboards, Prometheus for metrics, Loki for logs, Alertmanager for routing alerts, and a handful of exporters to get data in.

Architecture

At a minimum you’ve got Prometheus scraping metrics from exporters, Grafana querying Prometheus for dashboards, and Alertmanager handling any alert rules that fire. Adding Loki gives you logs in the same UI. Exporters do the actual data collection: SNMP for network devices, Node Exporter for Linux hosts, Windows Exporter for Windows, and community or bespoke exporters for anything else.

Prometheus will usually be the primary data source, but Grafana talks to plenty of others if you need them — InfluxDB, Elasticsearch, Azure Monitor, CloudWatch.

Core Components

Prometheus

Prometheus scrapes metrics from configured targets at regular intervals.

Basic prometheus.yml configuration:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

# Load rules once and periodically evaluate them
rule_files:
  - "alerts/*.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'snmp'
    metrics_path: /snmp
    params:
      module: [if_mib]
    static_configs:
      - targets:
        - 192.168.1.1  # Network devices
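    relabel_configs:
      # Pass the device address to the exporter as the ?target= URL parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the device address as the instance label on the resulting series
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape to the SNMP exporter rather than to the device
      - target_label: __address__
        replacement: localhost:9116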

Grafana

Install and configure Grafana for visualization.

Add Prometheus data source:

  1. Navigate to Connections → Data sources (Configuration → Data Sources in older Grafana versions)
  2. Add Prometheus
  3. Set URL: http://localhost:9090
  4. Save & Test
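
The same data source can also be provisioned from a file rather than clicked through the UI, which fits the ./grafana/provisioning mount in the Docker Compose stack further down. A minimal sketch, assuming the file lives under provisioning/datasources/ (the filename is an assumption; any YAML file in that directory should be picked up at startup):

```yaml
# Hypothetical path: grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                  # Grafana's backend makes the requests
    url: http://localhost:9090     # same URL as step 3 above
    isDefault: true
```

When Grafana and Prometheus run as separate containers, localhost will point at the Grafana container itself, so the URL would need to be the Prometheus service name instead (for example http://prometheus:9090).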

Loki (Optional)

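Loki provides log aggregation in the same space as Elasticsearch, but it indexes only a small set of labels per log stream rather than the full text, and it is built to be queried from Grafana.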

Use cases:

  • Application logs
  • System logs (syslog)
  • Authentication logs (Active Directory, etc.)
  • Container logs

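Configuration example (an older single-binary layout; recent Loki releases have removed or renamed some of these options, enforce_metric_name among them, so check it against the version you run):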

auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
  chunk_idle_period: 5m
  chunk_retain_period: 30s

schema_config:
  configs:
  - from: 2020-05-15
    store: boltdb
    object_store: filesystem
    schema: v11
    index:
      prefix: index_
      period: 168h

storage_config:
  boltdb:
    directory: /loki/index
  filesystem:
    directory: /loki/chunks

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h

Common Exporters

SNMP Exporter

For network devices (switches, routers, firewalls).

See: Prometheus SNMP Exporter Configuration

Node Exporter

For Linux server metrics (CPU, memory, disk, network).

# Install
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-*.tar.gz
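sudo cp node_exporter-*/node_exporter /usr/local/bin/

# Create the service account the unit runs as (skip if it already exists)
sudo useradd --system --no-create-home --shell /usr/sbin/nologin prometheus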

# Systemd service
sudo tee /etc/systemd/system/node_exporter.service <<EOF
[Unit]
Description=Node Exporter
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter

Windows Exporter

For Windows server metrics.

Download from: https://github.com/prometheus-community/windows_exporter

Community and custom exporters

The Prometheus exporters page lists officially maintained and community-contributed exporters for most databases, message brokers, hypervisors, and cloud platforms you’re likely to need. If there isn’t one for your system, the exposition format Prometheus scrapes is plain text served over HTTP, so it’s straightforward to write your own exporter in the language you prefer.

Alert Configuration

Alertmanager Setup

alertmanager.yml example:

global:
  resolve_timeout: 5m

route:
  receiver: 'email-notifications'
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h

receivers:
- name: 'email-notifications'
  email_configs:
  - to: 'alerts@example.com'
    from: 'alertmanager@example.com'
    smarthost: 'smtp.example.com:587'
    auth_username: 'alertmanager@example.com'
    auth_password: 'password'

Common Alert Rules

prometheus/alerts/infrastructure.yml:

groups:
- name: infrastructure
  interval: 30s
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} has been down for more than 5 minutes."

  - alert: HighCPUUsage
    # rate() rather than irate(): irate only looks at the last two samples, which makes alerts flap
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: "CPU usage is above 85% for more than 10 minutes."

  - alert: HighMemoryUsage
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage on {{ $labels.instance }}"

  - alert: DiskSpaceLow
    expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Disk space low on {{ $labels.instance }}"

  - alert: InterfaceDown
    expr: ifOperStatus{job="snmp"} == 2
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Interface {{ $labels.ifDescr }} down on {{ $labels.instance }}"
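
The severity labels set on these rules can drive routing in Alertmanager, so critical alerts go somewhere more urgent than warnings. A sketch of a route tree that does this, assuming a second receiver; the receiver name critical-email and the oncall@example.com address are placeholders, not part of the setup above:

```yaml
# Sketch of an alertmanager.yml with severity-based routing
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'

route:
  receiver: 'email-notifications'      # default for anything not matched below
  group_by: ['alertname', 'cluster']
  routes:
    - matchers:
        - 'severity="critical"'
      receiver: 'critical-email'       # hypothetical receiver
      repeat_interval: 1h              # re-notify more often than the default route

receivers:
  - name: 'email-notifications'
    email_configs:
      - to: 'alerts@example.com'
  - name: 'critical-email'
    email_configs:
      - to: 'oncall@example.com'       # placeholder address
```

Alerts that match no child route fall back to the top-level receiver, so warnings keep going to email-notifications.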

Network Device Monitoring

For comprehensive network monitoring, combine:

  1. SNMP Exporter - Metrics from switches, routers, firewalls
  2. Grafana Dashboards - Visualization of bandwidth, errors, status
  3. Alertmanager - Notifications for interface down, high utilization

Example monitoring setup:

Device Type         | Metrics Collected             | Module
Cisco Switches      | CPU, Memory, Temp, Interfaces | cisco_devices, if_mib
FortiGate Firewalls | CPU, Sessions, Policies, VPN  | fortigate_devices, if_mib
MikroTik Routers    | Interfaces, Routes, Wireless  | if_mib
Ubiquiti APs        | Clients, Signal, Bandwidth    | ubnt_devices
UPS Devices         | Battery, Load, Runtime        | ups_mib
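
The alert rules earlier cover interface down but not high utilization. One way to express that, as a sketch only: it assumes the if_mib module walks ifHCInOctets and ifHighSpeed for the device and attaches the same labels to both, and the alert name and 80% threshold are illustrative.

```yaml
groups:
  - name: network
    rules:
      - alert: InterfaceHighUtilization
        # ifHCInOctets is a byte counter; ifHighSpeed is the link speed in Mbit/s.
        # The "> 0" filter skips interfaces that report no speed.
        expr: rate(ifHCInOctets{job="snmp"}[5m]) * 8 / ((ifHighSpeed{job="snmp"} > 0) * 1e6) * 100 > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Inbound utilization above 80% on {{ $labels.ifDescr }} ({{ $labels.instance }})"
```

A matching rule on ifHCOutOctets would cover the outbound direction.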

Dashboard design

Put the most important metrics at the top and group related ones together. Sync time ranges across related panels so you’re comparing apples to apples. Use colours and thresholds to surface when something’s wrong without requiring the viewer to read every number. Link panels to detailed views so a dashboard is a starting point for investigation, not a dead end.

For panel types: time series for anything changing over time (bandwidth, CPU, memory), stat panels for single current values, gauges for current percentages, tables for inventories, and status panels for up/down indicators.

Docker Compose Stack

docker-compose.yml example:

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    ports:
      - 9090:9090

  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    ports:
      - 3000:3000

  loki:
    image: grafana/loki:latest
    volumes:
      - ./loki:/etc/loki
    ports:
      - 3100:3100
    command: -config.file=/etc/loki/local-config.yaml

  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./alertmanager:/etc/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
    ports:
      - 9093:9093

  snmp_exporter:
    image: prom/snmp-exporter:latest
    volumes:
      - ./snmp_exporter:/etc/snmp_exporter
    command:
      - '--config.file=/etc/snmp_exporter/snmp.yml'
    ports:
      - 9116:9116

volumes:
  prometheus_data:
  grafana_data:
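
Inside the Compose network, containers reach each other by service name rather than localhost, so the prometheus.yml shown earlier would need its targets adjusted before it works in this stack. A sketch of the affected parts, using the service names from the file above:

```yaml
# prometheus.yml fragment for use inside the Compose network
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']   # service name instead of localhost

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']        # Prometheus scraping itself, so localhost still applies

  - job_name: 'snmp'
    metrics_path: /snmp
    params:
      module: [if_mib]
    static_configs:
      - targets:
        - 192.168.1.1
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: snmp_exporter:9116    # service name instead of localhost
```

Node Exporter is not part of this Compose file, so its target would be the address of whichever host it runs on.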

Performance notes

Scrape intervals in the 15–60s range cover most infrastructure use cases; going tighter than 15s costs more load than it’s worth unless you’ve got a specific reason. Set retention on Prometheus to something sensible for your actual needs (15–90 days is typical; the --storage.tsdb.retention.time flag controls it) and send anything you need beyond that to a long-term store via remote write. Recording rules don’t extend retention, but they keep long-range dashboard queries cheap by precomputing the expensive expressions. Be picky about what metrics you actually collect: every exporter will happily give you a thousand metrics nobody’s ever going to look at. Give Prometheus enough CPU and RAM that ingestion doesn’t lag behind scraping, which shows up as gaps on dashboards.
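
Being picky about metrics can be done on the Prometheus side with metric_relabel_configs, which runs after the scrape and before storage. A minimal sketch for the node_exporter job; the regex is only an illustration of the pattern, not a recommendation of which series to drop:

```yaml
scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']
    metric_relabel_configs:
      # Drop series by metric name so they are never written to the TSDB
      - source_labels: [__name__]
        regex: 'go_.*|node_scrape_collector_.*'
        action: drop
```

Dropped series still cost a little at scrape time, so where an exporter lets you disable collectors at the source, that is usually the cheaper option.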

  • Prometheus SNMP Exporter Configuration
  • Switch Monitoring with CheckMK and SNMP
  • Netbox - Network documentation and IPAM integration