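Server Metrics Monitoring

Documentation and tutorials on server monitoring.

📘 Prometheus + Grafana + Node Exporter

Host monitoring with Prometheus + Grafana + Node Exporter + Alertmanager

Introduction

Prometheus is an open-source monitoring and alerting system. It was created at SoundCloud in 2012 and joined the CNCF (Cloud Native Computing Foundation) in 2016.
With its time-series storage, flexible query language (PromQL), and automated service discovery, it is widely used in cloud-native environments.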

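Component          Main role
Prometheus Server  Scrapes monitoring data, stores the time series, and provides the query interface
Exporter           Converts a monitored target's metrics into a format Prometheus can read (for example, Node Exporter)
Pushgateway        Accepts metrics pushed by short-lived jobs so that Prometheus can scrape them (not the recommended mainstream approach)
Alertmanager       Handles alerts sent by Prometheus; supports grouping, inhibition, routing, and notifications (Email, Slack, and so on)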

Server-side installation

docker-compose configuration file

docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    pull_policy: always
    ports:
      - "9090:9090"
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=15d
      - --web.enable-lifecycle
      # Enable this if Prometheus should accept remote-write requests:
      # - --web.enable-remote-write-receiver
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/rules:/etc/prometheus/rules:ro
      - prometheus-data:/prometheus
    depends_on:
      - alertmanager

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    pull_policy: always
    ports:
      - "9093:9093"
    command:
      - --config.file=/etc/alertmanager/alertmanager.yml
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager-data:/alertmanager

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    pull_policy: always
    ports:
      - "9100:9100"
    # Let the container read the host's /proc and /sys so that it collects host metrics
    pid: "host"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - --path.procfs=/host/proc
      - --path.sysfs=/host/sys
      - --path.rootfs=/rootfs
      - --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)

  grafana:
    image: grafana/grafana-oss:latest
    container_name: grafana
    restart: unless-stopped
    pull_policy: always
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana

volumes:
  prometheus-data:
  alertmanager-data:
  grafana-data:

Prometheus configuration file

prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["prometheus:9090"]

  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]
        labels:
          instance: "Prometheus-Instance"
          nodename: "Prometheus-NodeName"
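
Hosts that run Node Exporter outside this compose stack (for example, machines set up with the client install script) need their own entries under the "node-exporter" job before Prometheus will scrape them. The helper below is a minimal sketch that prints such entries; the function name and the host names are hypothetical, and port 9100 is assumed to match the exporter's listen address.

```shell
# Print extra static_configs entries for the "node-exporter" job, one per host.
# Host names are placeholders; 9100 is the Node Exporter port used in this guide.
render_node_targets() {
  local host
  for host in "$@"; do
    printf '      - targets: ["%s:9100"]\n' "$host"
    printf '        labels:\n'
    printf '          nodename: "%s"\n' "$host"
  done
}

# Usage: paste the output under static_configs of the node-exporter job,
# then reload Prometheus.
#   render_node_targets web-1.example.com db-1.example.com
```

Leaving the instance label unset here lets Prometheus fall back to the host:port address, which keeps each machine distinguishable in alert messages.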

Alertmanager configuration file

alertmanager/alertmanager.yml
route:
  receiver: "null"

receivers:
  - name: "null"
    # Example: replace with email / WeCom / DingTalk / Slack / Webhook, etc.
    # email_configs:
    #   - to: "ops@example.com"

terminal
# Update the package list
sudo apt update

# Install fail2ban
sudo apt install fail2ban -y

# Check the service status
sudo systemctl status fail2ban

# Enable start on boot
sudo systemctl enable fail2ban
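
With docker-compose.yml, prometheus/prometheus.yml and alertmanager/alertmanager.yml in place, the stack can be started and probed roughly as follows. This is a sketch under assumptions: the function names are hypothetical, curl is assumed to be installed, and the ports are the ones published in the compose file.

```shell
# Print one "name url" pair per service, using the ports from docker-compose.yml.
health_urls() {
  local host="${1:-localhost}"
  printf '%s %s\n' \
    prometheus    "http://${host}:9090/-/healthy" \
    alertmanager  "http://${host}:9093/-/healthy" \
    node-exporter "http://${host}:9100/metrics" \
    grafana       "http://${host}:3000/api/health"
}

# Probe every endpoint and report OK or FAIL (needs curl and a running stack).
check_stack() {
  local name url
  health_urls "${1:-localhost}" | while read -r name url; do
    if curl -fsS -o /dev/null --max-time 5 "$url"; then
      echo "OK   $name $url"
    else
      echo "FAIL $name $url"
    fi
  done
}

# Usage (run from the directory that holds docker-compose.yml):
#   docker compose up -d
#   docker compose ps
#   check_stack localhost
```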

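Access

Once the containers are running, the services are reachable on the ports published in docker-compose.yml: Prometheus at http://<server-ip>:9090, Alertmanager at http://<server-ip>:9093, Node Exporter metrics at http://<server-ip>:9100/metrics, and Grafana at http://<server-ip>:3000 (initial login admin / admin, as set in the compose file).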

Client-side installation of Node Exporter

terminal
#!/bin/bash
# ============================================================
# Node Exporter install script - for Debian 13 / Ubuntu
# Version: v1.9.1
# Author: 胖哥
# Site: https://digvps.com/
# ============================================================

set -e

VERSION="1.9.1"
ARCH="linux-amd64"
DOWNLOAD_URL="https://github.com/prometheus/node_exporter/releases/download/v${VERSION}/node_exporter-${VERSION}.${ARCH}.tar.gz"

echo "📦 Downloading Node Exporter v${VERSION} ..."
wget -q "${DOWNLOAD_URL}" -O /tmp/node_exporter.tar.gz

echo "📂 Extracting the archive ..."
tar -xzf /tmp/node_exporter.tar.gz -C /tmp
cd "/tmp/node_exporter-${VERSION}.${ARCH}"

echo "🚀 Installing the binary to /usr/local/bin ..."
cp node_exporter /usr/local/bin/
chmod +x /usr/local/bin/node_exporter

echo "👤 Creating the nodeusr user (if it does not exist) ..."
if ! id "nodeusr" &>/dev/null; then
  useradd --no-create-home --shell /usr/sbin/nologin nodeusr
fi

echo "🧾 Creating the systemd service file ..."
cat >/etc/systemd/system/node_exporter.service <<'EOF'
[Unit]
Description=Prometheus Node Exporter
Documentation=https://github.com/prometheus/node_exporter
After=network-online.target

[Service]
User=nodeusr
Group=nodeusr
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --web.listen-address=":9100" 

Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

echo "🔄 Reloading the systemd daemon ..."
systemctl daemon-reload

echo "▶️ Starting the Node Exporter service ..."
systemctl enable --now node_exporter

echo "✅ Node Exporter installation complete!"
echo "------------------------------------------------------------"
echo "Metrics URL: http://<server-ip>:9100/metrics"
echo "Service status: systemctl status node_exporter"
echo "View logs: journalctl -u node_exporter -f"
echo "------------------------------------------------------------"
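
The script above hardcodes ARCH="linux-amd64", so it downloads the wrong binary on ARM hosts. A possible way to pick the release suffix from the machine type is sketched below; the function name is hypothetical, and only the three mappings listed are covered.

```shell
# Map a machine type (default: `uname -m`) to the suffix used in
# node_exporter release file names.
detect_arch() {
  local machine="${1:-$(uname -m)}"
  case "$machine" in
    x86_64|amd64)  echo "linux-amd64" ;;
    aarch64|arm64) echo "linux-arm64" ;;
    armv7l)        echo "linux-armv7" ;;
    *)
      echo "unsupported architecture: $machine" >&2
      return 1
      ;;
  esac
}

# Usage inside the install script, replacing the fixed ARCH line:
#   ARCH="$(detect_arch)"
# After installation, a quick local check that metrics are being served:
#   curl -s http://localhost:9100/metrics | grep -c '^node_'
```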

📘 Prometheus + Alertmanager

Prometheus + Alertmanager alert configuration

Alert rule configuration

Server offline (Node Down)

prometheus/rules/node_down.yml
groups:
  - name: node_down
    rules:
      - alert: NodeDown
        expr: up{job="node-exporter"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "💀 Node {{ $labels.instance }} is offline"
          description: "The exporter cannot be reached (the server may be down or the network may have failed)"

High CPU usage

prometheus/rules/node_usage.yml
groups:
  - name: node_cpu
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "🔥 CPU usage too high ({{ $labels.instance }})"
          description: "Current CPU usage is above 90%"
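
The CPU expression can be read as "100 minus the average idle share across all cores". The sketch below reproduces that arithmetic outside Prometheus; the function name and the sample numbers are illustrative, not taken from a real host.

```shell
# Usage percentage as computed by the HighCPUUsage expression.
#   $1 = increase in idle CPU seconds, summed over all cores, during the window
#   $2 = window length in seconds (the rule uses 5m = 300)
#   $3 = number of cores
cpu_usage_pct() {
  awk -v idle="$1" -v win="$2" -v cores="$3" \
    'BEGIN { printf "%.1f\n", 100 - (idle / (win * cores)) * 100 }'
}

# Example: 4 cores idle for a combined 1080 s within a 300 s window
# means 90% idle on average, so usage is 10%.
#   cpu_usage_pct 1080 300 4
```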

High memory usage

prometheus/rules/node_ram.yml
groups:
  - name: node_memory
    rules:
      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "💾 Memory usage too high ({{ $labels.instance }})"
          description: "Current memory usage is above 90%"
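
Node Exporter takes node_memory_MemTotal_bytes and node_memory_MemAvailable_bytes from /proc/meminfo, so the same percentage can be cross-checked directly on a host. A minimal sketch, with a hypothetical function name:

```shell
# Print (1 - MemAvailable / MemTotal) * 100 from a meminfo-style file.
# Defaults to /proc/meminfo; pass another path to use a saved copy.
mem_used_pct() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END {
         if (total > 0) printf "%.1f\n", (1 - avail / total) * 100
         else exit 1
       }' "${1:-/proc/meminfo}"
}

# Usage on a Linux host:
#   mem_used_pct
```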

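Abnormal network traffic (for example, excessive upload or download)

prometheus/rules/node_network.yml
groups:
  - name: node_network
    rules:
      # The 100 MiB/s threshold is an example value; tune it to the link capacity.
      - alert: HighNetworkReceive
        expr: rate(node_network_receive_bytes_total{device!~"lo|veth.*|docker.*|br-.*"}[5m]) > 100 * 1024 * 1024
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "🌐 Inbound traffic too high ({{ $labels.instance }} {{ $labels.device }})"
          description: "Receive rate has been above 100 MiB/s for 2 minutes"

      - alert: HighNetworkTransmit
        expr: rate(node_network_transmit_bytes_total{device!~"lo|veth.*|docker.*|br-.*"}[5m]) > 100 * 1024 * 1024
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "🌐 Outbound traffic too high ({{ $labels.instance }} {{ $labels.device }})"
          description: "Transmit rate has been above 100 MiB/s for 2 minutes"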

Disk usage

prometheus/rules/node_disk.yml
groups:
  - name: node_disk
    rules:
      # 1) Disk usage too high (Warning / Critical)
      - alert: DiskUsageHigh
        expr: |
          100 * (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay|squashfs|devtmpfs|nsfs|tracefs|cgroup2.*|autofs|proc|sysfs|bpf|ramfs"}
                     / node_filesystem_size_bytes{fstype!~"tmpfs|overlay|squashfs|devtmpfs|nsfs|tracefs|cgroup2.*|autofs|proc|sysfs|bpf|ramfs"}) > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "📦 Disk usage too high ({{ $labels.instance }} {{ $labels.mountpoint }})"
          description: "Disk usage > 90% for 5 minutes; current value={{ $value | printf \"%.1f\" }}%"

      - alert: DiskUsageHigh
        expr: |
          100 * (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay|squashfs|devtmpfs|nsfs|tracefs|cgroup2.*|autofs|proc|sysfs|bpf|ramfs"}
                     / node_filesystem_size_bytes{fstype!~"tmpfs|overlay|squashfs|devtmpfs|nsfs|tracefs|cgroup2.*|autofs|proc|sysfs|bpf|ramfs"}) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "📦 Disk usage elevated ({{ $labels.instance }} {{ $labels.mountpoint }})"
          description: "Disk usage > 80% for 10 minutes; current value={{ $value | printf \"%.1f\" }}%"

      # 2) inode usage too high (guards against "space left but no inodes")
      - alert: InodeUsageHigh
        expr: |
          100 * (1 - node_filesystem_files_free{fstype!~"tmpfs|overlay|squashfs|devtmpfs|nsfs|tracefs|cgroup2.*|autofs|proc|sysfs|bpf|ramfs"}
                     / node_filesystem_files{fstype!~"tmpfs|overlay|squashfs|devtmpfs|nsfs|tracefs|cgroup2.*|autofs|proc|sysfs|bpf|ramfs"}) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "📁 inode usage elevated ({{ $labels.instance }} {{ $labels.mountpoint }})"
          description: "inode usage > 80% for 10 minutes; current value={{ $value | printf \"%.1f\" }}%"
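
The disk rules divide the space available to unprivileged users by the filesystem size, which can differ slightly from the Capacity column printed by df (that column excludes reserved blocks from the total). The sketch below applies the rule's formula to df output; the function name is hypothetical.

```shell
# Read `df -P -k <mountpoint>` output on stdin and print
# 100 * (1 - available / size), the same formula as the DiskUsageHigh rules.
disk_used_pct() {
  awk 'NR == 2 { printf "%.1f\n", 100 * (1 - $4 / $2) }'
}

# Usage:
#   df -P -k / | disk_used_pct
```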

Alertmanager notification configuration

/root/monitoring/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

  # Global email settings
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alert@example.com'
  smtp_auth_username: 'alert@example.com'
  smtp_auth_password: 'your_password_here'

route:
  # Root route
  receiver: 'telegram'            # 👈 default receiver; set to 'all' to send both email and Telegram
  group_by: ['alertname']         # group alerts by alert name
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h

  # Conditional routing can be added here (example)
  # routes:
  #   - match:
  #       severity: critical
  #     receiver: 'telegram'
  #   - match:
  #       severity: warning
  #     receiver: 'email'

receivers:
  # 🔔 Combined receiver that sends both email and Telegram
  - name: 'all'
    email_configs:
      - to: 'admin@example.com'
        send_resolved: true
    telegram_configs:
      - bot_token: '123456789:ABCDEF_xxxxx'
        chat_id: -1001234567890
        parse_mode: 'HTML'
        send_resolved: true
        message: |-
          🚨 <b>{{ .Status | toUpper }}</b> - {{ .CommonLabels.alertname }}
          Host: {{ .CommonLabels.instance }}
          Severity: {{ .CommonLabels.severity }}
          Summary: {{ .CommonAnnotations.summary }}
          Description: {{ .CommonAnnotations.description }}
          <i>Started at: {{ (index .Alerts 0).StartsAt }}</i>

  # Separate per-channel receivers can also be defined as fallbacks
  - name: 'email'
    email_configs:
      - to: 'ops@example.com'
        send_resolved: true

  - name: 'telegram'
    telegram_configs:
      - bot_token: '123456789:ABCDEF_xxxxx'
        chat_id: -1001234567890
        parse_mode: 'Markdown'
        send_resolved: true
        message: |-
          🚨 *{{ .Status | toUpper }}* - {{ .CommonLabels.alertname }}

          Host: {{ .CommonLabels.instance }}
          Severity: {{ .CommonLabels.severity }}
          Summary: {{ .CommonAnnotations.summary }}
          Description: {{ .CommonAnnotations.description }}

          Started at: {{ (index .Alerts 0).StartsAt }}

Testing and verification

Reload Prometheus

terminal
curl -X POST http://<prometheus-host>:9090/-/reload

Send a test alert through Alertmanager

terminal
# Fire the alert
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[
    {
      "labels": {
        "alertname": "ManualTest",
        "severity": "warning",
        "instance": "test-node"
      },
      "annotations": {
        "summary": "Manual test from curl",
        "description": "Verifying email + Telegram routes."
      },
      "startsAt": "'$(date -Is)'",
      "endsAt":   "'$(date -Is -d "+10 minutes")'"
    }
  ]'
  
  
# Resolve the alert
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[
    {
      "labels": { "alertname": "ManualTest", "severity": "warning", "instance": "test-node" },
      "startsAt": "'$(date -Is -d "-2 minutes")'",
      "endsAt":   "'$(date -Is -d "-1 minutes")'"
    }
  ]'

📘 Nezha Server Monitoring

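Nezha Monitoring is an open-source, lightweight, easy-to-use server monitoring and operations tool.

Video tutorial

Be sure to follow along with the video tutorial, since the details are explained there; this page mainly serves as a place to copy the code from.

Watch the video: YouTube, Bilibili

Older video: YouTube, Bilibili, documentation

Nezha dashboard

docker-compose configuration file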

yml
services:
  dashboard:
    image: ghcr.io/nezhahq/nezha
    container_name: nezha-dashboard
    restart: always
    volumes:
      - ./data:/dashboard/data
    ports:
      - 8008:8008

Dashboard configuration file

yml
debug: false
realipheader: ""
language: zh_CN
sitename: DigVPS.COM
jwtsecretkey: 75VV5b9jtTGCktY8XuoK0BhCp2hMcMEVP9XXk3WVUf0PEpyYvFUWOxXGczyWVDCvwUVvzusZL54AvZfdkjmzU45f1lJ64zjr0uNasJ8KsCDlHkQN3ODRstVojGC1S4WRcQb2S3BZj5mZVVRjb0GZZdmFybpmx7DSZJUtGIftkRmGvEywDTepUXEpysMaAulVrdkI920Zt7YZhkAdsc3qMw1hpUD6r8q0ERWugdkf1BjTBHFtHTYPka7lri7HQcdRRIB11f5pmbejjBtVwfzV4lM8eTaz0j0SwKMC2le3SejoriHvcH3sbnhfuGJY9ZfmJKnhACBllxt9NuQjDFcstLztNi79aT5wDsrwHmFS8N7CriXwhyR0DdFRQiitX0tWp4X7SLhYyiLuqGgq4bmNlIkGIKdmcFupDT3YA8Pi0qgVnPTFA2nCRyYfCgCkzRb7M4Gym9EaaSrp5gHJGo5uyOh81iXNkJSlyXH1kwc7MAqrLD5gq3jpSF54jciNy0yGtQTNCh98Nz3qeWGw9bT0lOAcSEtnZlvKNc4fvaBFU3c9Js1V4B1pTGFjdZJvVRaEuD065kkORtxR6eaKmo5NBv5qNk32lsxcCaOiuYNMCHFtGbUWGmCKct3rtk6kzh0lGfImYlHzo2xu0IiytAs11FDzUE7fT1yugf3wcJ2GboDol8r12anMgleHZevFx8LI9O3Gf3UgkbIaqHVYc7njTl41r489wte7vuXur2A0dyv5MSR8PJ0TeLdWsSbLVHxfkZ0yYM5HAChnGInCkkgPE3DFfG6ukjQmpu3m3KGK0JMfHqbg1XjA7gVVCFcImZ1iJSbhK77N17fkN8HErNt5Dbqp4tJ74RWy3N1bcKDki3YODeU64fQudHqv4U7EDpy3IIEBChGLXcXEl7ZkJDE7CmY5cbfCCA7zALHdcGVcCU3sW0l1B4coYRqYJPPA1nnLzUdZUwsJoT3GpkfOMdx9tgQcMZVuVDdmqjtbBpkZ1GsqftKY6D3DqavEcj2vjEqN
agentsecretkey: WYGjFRqQPhcBCfmfPoXMjXUNIanxceKw
listenport: 8008
listenhost: ""
installhost: DigVPS.COM # Domain name or IP that agents connect to; change this to your own.
tls: true
location: Asia/Shanghai
enableplainipinnotification: false
enableipchangenotification: false
ipchangenotificationgroupid: 0
cover: 1
ignoredipnotification: ""
ignoredipnotificationserverids: {}
avgpingcount: 2
dnsservers: ""
customcode: ""
customcodedashboard: ""
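
The sample configuration contains concrete jwtsecretkey and agentsecretkey values; copying them verbatim means sharing secrets with every other reader of this page. The sketch below generates replacement values. It assumes that the dashboard accepts any random alphanumeric string of the same length as the samples, which the source does not confirm, and the function name is hypothetical.

```shell
# Print a random alphanumeric string of the requested length (default 32).
gen_secret() {
  LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | head -c "${1:-32}"
  echo
}

# Usage:
#   gen_secret 32     # same length as the sample agentsecretkey
#   gen_secret 1024   # a longer value for jwtsecretkey
```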

Reverse proxy configuration

underscores_in_headers on;
ignore_invalid_headers off;

location /dashboard {
    proxy_pass http://$server:$port;
    proxy_set_header Host $http_host;
    proxy_set_header Upgrade $http_upgrade;
}
# WebSocket
location ~* ^/api/v1/ws/(server|terminal|file)(.*)$ {
    proxy_set_header Host $host;
    proxy_set_header nz-realip $remote_addr;
    proxy_set_header Origin https://$host;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 3600s;
    proxy_send_timeout 3600s;
    proxy_pass http://$server:$port;
}
# gRPC
location ^~ /proto.NezhaService/ {
    grpc_set_header Host $host;
    grpc_set_header nz-realip $remote_addr;
    grpc_set_header client_secret $http_client_secret;
    grpc_set_header client_uuid $http_client_uuid;
    grpc_read_timeout 600s;
    grpc_send_timeout 600s;
    grpc_socket_keepalive on;
    client_max_body_size 10m;
    grpc_buffer_size 4m;
    grpc_pass grpc://$server:$port;
}

IP addresses for latency monitoring

Region          Telecom          Mobile           Unicom           Education
Shanghai        202.96.209.133   221.183.90.237   210.22.97.1      202.120.2.119
Beijing         49.7.37.74       112.34.111.194   111.206.209.44   101.6.15.66
Guangzhou       183.47.126.35    120.233.18.250   157.148.58.29    202.116.64.8
Shenzhen        218.17.11.168    120.196.165.24   58.250.90.114
Hebei           27.185.242.215   111.62.229.100   61.182.138.156
Shanxi          1.71.157.41      183.201.244.91   60.221.18.41
Liaoning        123.184.58.41    36.131.156.145   218.61.211.132
Jilin           123.172.127.217  111.27.127.176   122.143.8.41
Heilongjiang    42.101.84.132    111.42.190.25    113.7.211.140
Jiangsu         58.215.210.220   36.156.92.132    122.96.235.165
Zhejiang        115.220.14.91    117.147.213.41   101.69.194.224
Anhui           223.247.108.251  112.29.198.100   112.132.208.41
Fujian          106.126.10.28    112.50.96.88     36.248.48.139
Jiangxi         106.227.22.132   117.168.150.249  116.153.69.224
Shandong        144.123.160.140  120.220.145.91   112.240.56.143
Henan           171.15.110.220   111.7.99.220     123.6.65.101
Hubei           111.170.8.60     111.47.131.101   122.189.226.138
Hunan           113.240.117.108  120.226.192.91   116.162.28.220
Guangdong       183.36.23.111    183.240.65.191   112.90.211.100
Hainan          124.225.43.220   111.29.29.219    153.0.226.35
Sichuan         118.123.218.220  183.220.151.41   101.206.163.49
Guizhou         58.42.61.132     61.243.18.220    117.187.254.132
Yunnan          222.221.102.220  36.147.44.219    14.204.150.41
Shaanxi         124.115.14.100   111.19.148.100   123.139.127.132
Gansu           118.182.228.91   117.157.16.41    59.81.94.53
Qinghai         223.221.216.219  111.12.152.170   116.177.237.137
Inner Mongolia  110.76.186.70    117.161.76.41    116.114.98.41
Guangxi         222.217.93.55    36.136.112.41    171.39.5.51
Tibet           113.62.176.89    117.180.234.41   43.242.165.35
Ningxia         222.75.44.220    111.51.155.214   116.129.226.28
Xinjiang        110.157.243.45   36.189.208.164   116.178.77.40
Tianjin         42.81.98.35      111.31.236.35    116.78.119.56
Chongqing       119.84.131.101   221.178.81.101   221.7.92.98
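
Before adding one of these addresses as a latency monitor, it can help to confirm from a shell that it answers ping and to see the average round-trip time. The sketch below parses the summary line printed by iputils or BSD ping; the function name is hypothetical, and other ping implementations (for example BusyBox) use a different summary format that it does not handle.

```shell
# Read ping output on stdin and print the average RTT in milliseconds.
# Expects a summary line such as:
#   rtt min/avg/max/mdev = 11.123/12.456/13.789/0.912 ms
ping_avg_ms() {
  awk -F'/' '/^(rtt|round-trip)/ { print $5 }'
}

# Usage (address taken from the Shanghai row):
#   ping -c 4 -q 202.96.209.133 | ping_avg_ms
```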

Copyright © 2025. All rights reserved.