DigVPS

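Server Metrics Monitoring

Documentation and tutorials on server monitoring.

📘 Prometheus + Grafana + Node Exporter

Host monitoring with Prometheus + Grafana + Node Exporter

Best followed alongside the video tutorial: YouTube, Bilibili

Introduction

Prometheus is an open-source monitoring and alerting system. It was started at SoundCloud in 2012 and joined the CNCF (Cloud Native Computing Foundation) in 2016.
With its time-series storage, flexible query language (PromQL), and automated service discovery, it is widely used in cloud-native environments.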

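Component           Main role
Prometheus Server   Scrapes metrics, stores time-series data, and serves the query API
Exporter            Converts a monitored target's metrics into a format Prometheus can read (e.g. Node Exporter)
Pushgateway         Lets short-lived jobs push metrics, which Prometheus then scrapes from the gateway (not the recommended mainstream approach)
Alertmanager        Handles alerts sent by Prometheus; supports grouping, inhibition, routing, and notification (email, Slack, etc.)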

Server installation

docker-compose configuration file

docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    pull_policy: always
    ports:
      - "9090:9090"
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=15d
      - --web.enable-lifecycle
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/rules:/etc/prometheus/rules:ro
      - prometheus-data:/prometheus
    depends_on:
      - alertmanager
      - blackbox-exporter

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    pull_policy: always
    ports:
      - "9093:9093"
    command:
      - --config.file=/etc/alertmanager/alertmanager.yml
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager-data:/alertmanager

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    pull_policy: always
    ports:
      - "9100:9100"
    # Let the container read the host's /proc and /sys so it collects host metrics
    pid: "host"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - --path.procfs=/host/proc
      - --path.sysfs=/host/sys
      - --path.rootfs=/rootfs
      - --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)

  blackbox-exporter:
    image: prom/blackbox-exporter:latest
    container_name: blackbox-exporter
    restart: unless-stopped
    pull_policy: always
    ports:
      - "9115:9115"
    volumes:
      - ./blackbox/blackbox.yml:/etc/blackbox_exporter/config.yml:ro
    command:
      - --config.file=/etc/blackbox_exporter/config.yml

  grafana:
    image: grafana/grafana-oss:latest
    container_name: grafana
    restart: unless-stopped
    pull_policy: always
    ports:
      - "3000:3000"
    environment:
      # Default login; change the password before exposing Grafana to the internet
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana

volumes:
  prometheus-data:
  alertmanager-data:
  grafana-data:
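
The compose file above bind-mounts ./blackbox/blackbox.yml into the blackbox-exporter container, but this page does not show that file. If the file is missing, Docker typically creates a directory at that path and the exporter then fails to load its config. The following is a minimal sketch of what the file could contain, not the author's actual config; the module names http_2xx and icmp are conventional choices, and the timeouts are assumed values.

```yaml
# blackbox/blackbox.yml (illustrative sketch; module names and timeouts are assumptions)
modules:
  # HTTP probe that succeeds on any 2xx response
  http_2xx:
    prober: http
    timeout: 5s
    http:
      preferred_ip_protocol: ip4
  # ICMP ping probe
  icmp:
    prober: icmp
    timeout: 5s
```

Create this file before starting the stack so the bind mount resolves to a regular file.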

Prometheus configuration file

prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["prometheus:9090"]

  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]
        labels:
          instance: "Prometheus-Instance"
          nodename: "Prometheus-NodeName"
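
The compose file starts a blackbox-exporter container, but the scrape_configs above never query it. Probing through the exporter is usually done with a relabeling job like the sketch below. The job name, the http_2xx module, and the https://example.com target are placeholders chosen for illustration, so adjust them to your own setup.

```yaml
# Appended under scrape_configs in prometheus/prometheus.yml (illustrative sketch)
  - job_name: "blackbox-http"          # hypothetical job name
    metrics_path: /probe
    params:
      module: [http_2xx]               # must match a module defined in blackbox.yml
    static_configs:
      - targets:
          - https://example.com        # placeholder URL to probe
    relabel_configs:
      # Pass the listed target to the exporter as the ?target= parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape request to the blackbox exporter container
      - target_label: __address__
        replacement: blackbox-exporter:9115
```

The probe result is exposed as the probe_success metric, which can be alerted on in the same way as up.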

Alertmanager configuration file

alertmanager/alertmanager.yml
route:
  receiver: "null"

receivers:
  - name: "null"
    # Example: switch to email / WeCom / DingTalk / Slack / webhook, etc.
    # email_configs:
    #   - to: "ops@example.com"

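Access

Once the stack is running, the services listen on the ports published in docker-compose.yml:
Prometheus: http://<server IP>:9090
Alertmanager: http://<server IP>:9093
Grafana: http://<server IP>:3000 (login admin / admin, as set in the compose file)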

Client: install Node Exporter

terminal
#!/bin/bash
# ============================================================
# Node Exporter install script - for Debian 13 / Ubuntu
# Version: v1.10.2 (set by VERSION below)
# Author: 胖哥
# Site: https://digvps.com/
# ============================================================

set -e

VERSION="1.10.2"
ARCH="linux-amd64"
DOWNLOAD_URL="https://github.com/prometheus/node_exporter/releases/download/v${VERSION}/node_exporter-${VERSION}.${ARCH}.tar.gz"

echo "📦 Downloading Node Exporter v${VERSION} ..."
wget -q "${DOWNLOAD_URL}" -O /tmp/node_exporter.tar.gz

echo "📂 Extracting ..."
tar -xzf /tmp/node_exporter.tar.gz -C /tmp
cd "/tmp/node_exporter-${VERSION}.${ARCH}"

echo "🚀 Installing the binary to /usr/local/bin ..."
cp node_exporter /usr/local/bin/
chmod +x /usr/local/bin/node_exporter

echo "👤 Creating the nodeusr user (if missing) ..."
if ! id "nodeusr" &>/dev/null; then
  useradd --no-create-home --shell /usr/sbin/nologin nodeusr
fi

echo "🧾 Creating the systemd service file ..."
cat >/etc/systemd/system/node_exporter.service <<'EOF'
[Unit]
Description=Prometheus Node Exporter
Documentation=https://github.com/prometheus/node_exporter
After=network-online.target

[Service]
User=nodeusr
Group=nodeusr
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --web.listen-address=":9100" 

Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

echo "🔄 Reloading the systemd daemon ..."
systemctl daemon-reload

echo "▶️ Starting the Node Exporter service ..."
systemctl enable --now node_exporter

echo "✅ Node Exporter installed!"
echo "------------------------------------------------------------"
echo "Metrics URL:    http://<server-IP>:9100/metrics"
echo "Service status: systemctl status node_exporter"
echo "Follow logs:    journalctl -u node_exporter -f"
echo "------------------------------------------------------------"
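
After the script has run on a client, Prometheus still needs to be told to scrape that host. One way to do this, sketched below, is to add the host as a second target under the existing node-exporter job in prometheus/prometheus.yml. The address 203.0.113.10 and the vps-01 label values are hypothetical placeholders.

```yaml
# prometheus/prometheus.yml, node-exporter job extended with a remote host (illustrative sketch)
  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]
        labels:
          instance: "Prometheus-Instance"
          nodename: "Prometheus-NodeName"
      # Hypothetical remote host where the install script was run
      - targets: ["203.0.113.10:9100"]
        labels:
          instance: "vps-01"
          nodename: "vps-01"
```

Keeping the remote host in the same job means rules that filter on job="node-exporter" apply to it without changes. Port 9100 on the client must be reachable from the Prometheus server; restricting it to that server's IP in the firewall is a sensible precaution.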

📘 Prometheus + Alertmanager

Alert configuration with Prometheus + Alertmanager

Alert rule configuration

Server offline (Node Down)

prometheus/rules/node_down.yml
groups:
  - name: node_down
    rules:
      - alert: NodeDown
        expr: up{job="node-exporter"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "💀 Node {{ $labels.instance }} is offline"
          description: "The exporter is unreachable (the server may be down or the network may have failed)"

High CPU usage

prometheus/rules/node_usage.yml
groups:
  - name: node_cpu
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "🔥 High CPU usage ({{ $labels.instance }})"
          description: "CPU usage has exceeded 90%"

High memory usage

prometheus/rules/node_ram.yml
groups:
  - name: node_memory
    rules:
      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "💾 High memory usage ({{ $labels.instance }})"
          description: "Memory usage has exceeded 90%"

Abnormal network traffic (e.g. excessive upload/download)

prometheus/rules/node_network.yml
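groups:
  - name: node_network
    rules:
      # The 100 MiB/s threshold is an assumed example value; tune it to your link speed
      - alert: HighNetworkReceive
        expr: sum by (instance) (rate(node_network_receive_bytes_total{device!="lo"}[5m])) > 100 * 1024 * 1024
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "🌐 High inbound traffic ({{ $labels.instance }})"
          description: "Inbound traffic has stayed above 100 MiB/s for 2 minutes"
      - alert: HighNetworkTransmit
        expr: sum by (instance) (rate(node_network_transmit_bytes_total{device!="lo"}[5m])) > 100 * 1024 * 1024
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "🌐 High outbound traffic ({{ $labels.instance }})"
          description: "Outbound traffic has stayed above 100 MiB/s for 2 minutes"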

Disk usage

prometheus/rules/node_disk.yml
groups:
  - name: node_disk
    rules:
      # 1) High disk usage (warning / critical)
      - alert: DiskUsageHigh
        expr: |
          100 * (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay|squashfs|devtmpfs|nsfs|tracefs|cgroup2.*|autofs|proc|sysfs|bpf|ramfs"}
                     / node_filesystem_size_bytes{fstype!~"tmpfs|overlay|squashfs|devtmpfs|nsfs|tracefs|cgroup2.*|autofs|proc|sysfs|bpf|ramfs"}) > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "📦 Disk usage is too high ({{ $labels.instance }} {{ $labels.mountpoint }})"
          description: "Disk usage > 90% for 5 minutes; current value={{ $value | printf \"%.1f\" }}%"

      - alert: DiskUsageHigh
        expr: |
          100 * (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay|squashfs|devtmpfs|nsfs|tracefs|cgroup2.*|autofs|proc|sysfs|bpf|ramfs"}
                     / node_filesystem_size_bytes{fstype!~"tmpfs|overlay|squashfs|devtmpfs|nsfs|tracefs|cgroup2.*|autofs|proc|sysfs|bpf|ramfs"}) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "📦 Disk usage is elevated ({{ $labels.instance }} {{ $labels.mountpoint }})"
          description: "Disk usage > 80% for 10 minutes; current value={{ $value | printf \"%.1f\" }}%"

      # 2) High inode usage (guards against "free space but no inodes left")
      - alert: InodeUsageHigh
        expr: |
          100 * (1 - node_filesystem_files_free{fstype!~"tmpfs|overlay|squashfs|devtmpfs|nsfs|tracefs|cgroup2.*|autofs|proc|sysfs|bpf|ramfs"}
                     / node_filesystem_files{fstype!~"tmpfs|overlay|squashfs|devtmpfs|nsfs|tracefs|cgroup2.*|autofs|proc|sysfs|bpf|ramfs"}) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "📁 inode usage is elevated ({{ $labels.instance }} {{ $labels.mountpoint }})"
          description: "inode usage > 80% for 10 minutes; current value={{ $value | printf \"%.1f\" }}%"
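
Because the two DiskUsageHigh rules share an alert name, a filesystem above 90% satisfies both expressions, so the warning and the critical alert fire together. The introduction mentions that Alertmanager supports inhibition, and a rule along the following lines could mute the warning while the critical alert is active. This is a sketch that assumes Alertmanager 0.22 or newer (for source_matchers / target_matchers); it belongs at the top level of alertmanager.yml, not in the Prometheus rule file.

```yaml
# Top-level section of alertmanager/alertmanager.yml (illustrative sketch)
inhibit_rules:
  # While a critical alert is firing, suppress the matching warning
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    # Only inhibit when both alerts refer to the same alert name, host, and mount point
    equal: ['alertname', 'instance', 'mountpoint']
```

The equal list keeps the inhibition narrow: a critical disk alert on one mount point does not hide warnings about other mount points or other hosts.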

Alertmanager notification configuration

/root/monitoring/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

  # Global email settings
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alert@example.com'
  smtp_auth_username: 'alert@example.com'
  smtp_auth_password: 'your_password_here'

route:
  # Root route
  receiver: 'telegram'            # 👈 default receiver; set to 'all' to send both email and Telegram
  group_by: ['alertname']         # group alerts by alert name
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h

  # Conditional routing can go here (example)
  # routes:
  #   - match:
  #       severity: critical
  #     receiver: 'telegram'
  #   - match:
  #       severity: warning
  #     receiver: 'email'

receivers:
  # 🔔 Combined receiver that sends both email and Telegram
  - name: 'all'
    email_configs:
      - to: 'admin@example.com'
        send_resolved: true
    telegram_configs:
      - bot_token: '123456789:ABCDEF_xxxxx'
        chat_id: -1001234567890
        parse_mode: 'HTML'
        send_resolved: true
        message: |-
          🚨 <b>{{ .Status | toUpper }}</b> - {{ .CommonLabels.alertname }}
          Host: {{ .CommonLabels.instance }}
          Severity: {{ .CommonLabels.severity }}
          Summary: {{ .CommonAnnotations.summary }}
          Description: {{ .CommonAnnotations.description }}
          <i>Started at: {{ (index .Alerts 0).StartsAt }}</i>

  # You can also define single-channel receivers as fallbacks
  - name: 'email'
    email_configs:
      - to: 'ops@example.com'
        send_resolved: true

  - name: 'telegram'
    telegram_configs:
      - bot_token: '123456789:ABCDEF_xxxxx'
        chat_id: -1001234567890
        parse_mode: 'Markdown'
        send_resolved: true
        message: |-
          🚨 *{{ .Status | toUpper }}* - {{ .CommonLabels.alertname }}

          Host: {{ .CommonLabels.instance }}
          Severity: {{ .CommonLabels.severity }}
          Summary: {{ .CommonAnnotations.summary }}
          Description: {{ .CommonAnnotations.description }}

          Started at: {{ (index .Alerts 0).StartsAt }}
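
The commented-out conditional routing in the config above uses the older match key. Recent Alertmanager releases (0.22 and later, as far as I know) prefer matchers, so a severity-based split could be sketched as follows. The receiver names are the ones defined above; sending critical alerts to 'all' and warnings to 'email' is an assumed policy for illustration, not the author's.

```yaml
# route section of alertmanager/alertmanager.yml (illustrative sketch)
route:
  receiver: 'telegram'            # fallback when no child route matches
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  routes:
    # Critical alerts go to the combined email + Telegram receiver
    - matchers:
        - severity="critical"
      receiver: 'all'
    # Warnings go to email only
    - matchers:
        - severity="warning"
      receiver: 'email'
```

Child routes are evaluated in order and the first match wins unless continue: true is set on a route.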

Testing and verification

Reload Prometheus

terminal
curl -X POST http://localhost:9090/-/reload

Send a test alert through Alertmanager

terminal
# Fire
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[
    {
      "labels": {
        "alertname": "ManualTest",
        "severity": "warning",
        "instance": "test-node"
      },
      "annotations": {
        "summary": "Manual test from curl",
        "description": "Verifying email + Telegram routes."
      },
      "startsAt": "'$(date -Is)'",
      "endsAt":   "'$(date -Is -d "+10 minutes")'"
    }
  ]'
  
  
# Resolve
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[
    {
      "labels": { "alertname": "ManualTest", "severity": "warning", "instance": "test-node" },
      "startsAt": "'$(date -Is -d "-2 minutes")'",
      "endsAt":   "'$(date -Is -d "-1 minutes")'"
    }
  ]'

📘 Nezha Server Monitoring

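Nezha Monitoring: an open-source, lightweight, easy-to-use server monitoring and operations tool.

Video tutorial

Please follow along with the video tutorial: the details are explained there, and this page mainly exists so you can copy the code.

Watch: YouTube, Bilibili

Older version: YouTube, Bilibili, docs

Nezha dashboard

docker-compose configuration file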

yml
services:
  dashboard:
    image: ghcr.io/nezhahq/nezha
    container_name: nezha-dashboard
    restart: always
    volumes:
      - ./data:/dashboard/data
    ports:
      - 8008:8008

Dashboard configuration file

yml
debug: false
realipheader: ""
language: zh_CN
sitename: DigVPS.COM
jwtsecretkey: 75VV5b9jtTGCktY8XuoK0BhCp2hMcMEVP9XXk3WVUf0PEpyYvFUWOxXGczyWVDCvwUVvzusZL54AvZfdkjmzU45f1lJ64zjr0uNasJ8KsCDlHkQN3ODRstVojGC1S4WRcQb2S3BZj5mZVVRjb0GZZdmFybpmx7DSZJUtGIftkRmGvEywDTepUXEpysMaAulVrdkI920Zt7YZhkAdsc3qMw1hpUD6r8q0ERWugdkf1BjTBHFtHTYPka7lri7HQcdRRIB11f5pmbejjBtVwfzV4lM8eTaz0j0SwKMC2le3SejoriHvcH3sbnhfuGJY9ZfmJKnhACBllxt9NuQjDFcstLztNi79aT5wDsrwHmFS8N7CriXwhyR0DdFRQiitX0tWp4X7SLhYyiLuqGgq4bmNlIkGIKdmcFupDT3YA8Pi0qgVnPTFA2nCRyYfCgCkzRb7M4Gym9EaaSrp5gHJGo5uyOh81iXNkJSlyXH1kwc7MAqrLD5gq3jpSF54jciNy0yGtQTNCh98Nz3qeWGw9bT0lOAcSEtnZlvKNc4fvaBFU3c9Js1V4B1pTGFjdZJvVRaEuD065kkORtxR6eaKmo5NBv5qNk32lsxcCaOiuYNMCHFtGbUWGmCKct3rtk6kzh0lGfImYlHzo2xu0IiytAs11FDzUE7fT1yugf3wcJ2GboDol8r12anMgleHZevFx8LI9O3Gf3UgkbIaqHVYc7njTl41r489wte7vuXur2A0dyv5MSR8PJ0TeLdWsSbLVHxfkZ0yYM5HAChnGInCkkgPE3DFfG6ukjQmpu3m3KGK0JMfHqbg1XjA7gVVCFcImZ1iJSbhK77N17fkN8HErNt5Dbqp4tJ74RWy3N1bcKDki3YODeU64fQudHqv4U7EDpy3IIEBChGLXcXEl7ZkJDE7CmY5cbfCCA7zALHdcGVcCU3sW0l1B4coYRqYJPPA1nnLzUdZUwsJoT3GpkfOMdx9tgQcMZVuVDdmqjtbBpkZ1GsqftKY6D3DqavEcj2vjEqN
# Sample keys from the tutorial; use your own values instead of copying these
agentsecretkey: WYGjFRqQPhcBCfmfPoXMjXUNIanxceKw
listenport: 8008
listenhost: ""
installhost: DigVPS.COM # Domain or IP that agents connect to; change it to your own.
tls: true
location: Asia/Shanghai
enableplainipinnotification: false
enableipchangenotification: false
ipchangenotificationgroupid: 0
cover: 1
ignoredipnotification: ""
ignoredipnotificationserverids: {}
avgpingcount: 2
dnsservers: ""
customcode: ""
customcodedashboard: ""

Reverse proxy configuration

underscores_in_headers on;
ignore_invalid_headers off;

location /dashboard {
    proxy_pass http://$server:$port;
    proxy_set_header Host $http_host;
    proxy_set_header Upgrade $http_upgrade;
}
# WebSocket
location ~* ^/api/v1/ws/(server|terminal|file)(.*)$ {
    proxy_set_header Host $host;
    proxy_set_header nz-realip $remote_addr;
    proxy_set_header Origin https://$host;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 3600s;
    proxy_send_timeout 3600s;
    proxy_pass http://$server:$port;
}
# gRPC
location ^~ /proto.NezhaService/ {
    grpc_set_header Host $host;
    grpc_set_header nz-realip $remote_addr;
    grpc_set_header client_secret $http_client_secret;
    grpc_set_header client_uuid $http_client_uuid;
    grpc_read_timeout 600s;
    grpc_send_timeout 600s;
    grpc_socket_keepalive on;
    client_max_body_size 10m;
    grpc_buffer_size 4m;
    grpc_pass grpc://$server:$port;
}

Latency monitoring IPs

Region          Telecom          Mobile           Unicom           Education
Shanghai        202.96.209.133   221.183.90.237   210.22.97.1      202.120.2.119
Beijing         49.7.37.74       112.34.111.194   111.206.209.44   101.6.15.66
Guangzhou       183.47.126.35    120.233.18.250   157.148.58.29    202.116.64.8
Shenzhen        218.17.11.168    120.196.165.24   58.250.90.114
Hebei           27.185.242.215   111.62.229.100   61.182.138.156
Shanxi          1.71.157.41      183.201.244.91   60.221.18.41
Liaoning        123.184.58.41    36.131.156.145   218.61.211.132
Jilin           123.172.127.217  111.27.127.176   122.143.8.41
Heilongjiang    42.101.84.132    111.42.190.25    113.7.211.140
Jiangsu         58.215.210.220   36.156.92.132    122.96.235.165
Zhejiang        115.220.14.91    117.147.213.41   101.69.194.224
Anhui           223.247.108.251  112.29.198.100   112.132.208.41
Fujian          106.126.10.28    112.50.96.88     36.248.48.139
Jiangxi         106.227.22.132   117.168.150.249  116.153.69.224
Shandong        144.123.160.140  120.220.145.91   112.240.56.143
Henan           171.15.110.220   111.7.99.220     123.6.65.101
Hubei           111.170.8.60     111.47.131.101   122.189.226.138
Hunan           113.240.117.108  120.226.192.91   116.162.28.220
Guangdong       183.36.23.111    183.240.65.191   112.90.211.100
Hainan          124.225.43.220   111.29.29.219    153.0.226.35
Sichuan         118.123.218.220  183.220.151.41   101.206.163.49
Guizhou         58.42.61.132     61.243.18.220    117.187.254.132
Yunnan          222.221.102.220  36.147.44.219    14.204.150.41
Shaanxi         124.115.14.100   111.19.148.100   123.139.127.132
Gansu           118.182.228.91   117.157.16.41    59.81.94.53
Qinghai         223.221.216.219  111.12.152.170   116.177.237.137
Inner Mongolia  110.76.186.70    117.161.76.41    116.114.98.41
Guangxi         222.217.93.55    36.136.112.41    171.39.5.51
Tibet           113.62.176.89    117.180.234.41   43.242.165.35
Ningxia         222.75.44.220    111.51.155.214   116.129.226.28
Xinjiang        110.157.243.45   36.189.208.164   116.178.77.40
Tianjin         42.81.98.35      111.31.236.35    116.78.119.56
Chongqing       119.84.131.101   221.178.81.101   221.7.92.98