Deploying a Prometheus + Grafana + Alertmanager Monitoring Cluster

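Overview

  • Deployment method: docker-compose
  • Components:
    • Prometheus: scrapes and stores metrics; its range of exporters makes it a practical option for monitoring MySQL, Linux hosts, and similar targets;
    • Grafana: visualizes the data collected by Prometheus; many ready-made dashboard templates are officially available;
    • Alertmanager: pushes notifications for alerts fired by the rules defined in Prometheus; it can deliver them through DingTalk, WeChat, email, custom webhook endpoints, and other channels;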

Monitoring targets

  • MySQL: mysqld-exporter, currently the mainstream choice;
  • Linux [platform servers]: node-exporter collects server metrics such as CPU, memory, and network usage;

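Deployment steps

  • Create a web-project folder in the server's root directory, which is also used to hold the MySQL high-availability deployment, then create prometheus_grafana.yml in it for the docker-compose cluster
## Switch to the root directory [the top-level directory of the server, where folders such as etc live]
cd /
## Create the web-project folder
mkdir -p web-project
## Switch to the web-project directory
cd web-project
## Create the docker-compose file for the cluster
touch prometheus_grafana.yml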

The full prometheus_grafana.yml configuration is as follows:

version: '2'
services:
  # Prometheus service
  prometheus:
    # Docker Hub image
    image: prom/prometheus:latest
    # Container name
    container_name: prometheus
    # Hostname inside the container
    hostname: prometheus
    # Restart the container automatically
    restart: always
    # Container-to-host port mapping
    ports:
      - '9090:9090'
    networks:
      default:
        ipv4_address: 172.24.0.2
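    # Mount the host's config folder at /config in the container, and persist the data directory
    volumes:
      - '/web-project/prometheus/config:/config'
      - '/web-project/prometheus/data/prometheus:/prometheus/data'
    # Startup arguments, including the config file inside the container
    command:
      ## Works around the path issue when proxied by nginx
      - '--web.external-url=prometheus'
      - '--config.file=/config/prometheus.yml'
      # Enable hot reload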
      - '--web.enable-lifecycle'
 
  # Alerting module
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    hostname: alertmanager
    restart: always
    ports:
      - '9093:9093'
    networks:
      default:
        ipv4_address: 172.24.0.3
    volumes:
      - '/web-project/prometheus/config:/config'
      - '/web-project/prometheus/data/alertmanager:/alertmanager/data'
    command:
      - '--config.file=/config/alertmanager.yml'
 
  # Visualization dashboard
  grafana:
    image: grafana/grafana
    container_name: grafana
    hostname: grafana
    restart: always
    ports:
      - '3000:3000'
    networks:
      default:
        ipv4_address: 172.24.0.4
    volumes:
      # Grafana configuration file (e.g. mail server settings)
      - '/web-project/grafana/config/grafana.ini:/etc/grafana/grafana.ini'
      - '/web-project/grafana/data/grafana:/var/lib/grafana'
  # node-exporter for host system monitoring
  node-exporter:
    image: prom/node-exporter
    container_name: node-exporter
    hostname: node-exporter
    restart: always
    ports:
      - '9100:9100'
    networks:
      default:
        ipv4_address: 172.24.0.5
    volumes:
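      # Read-only host mounts; node_exporter only reads from them when started with
      # --path.procfs=/host/proc, --path.sysfs=/host/sys and --path.rootfs=/rootfs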
      - '/proc:/host/proc:ro'
      - '/sys:/host/sys:ro' 
      - '/:/rootfs:ro'
networks:
  ## Bridge network definition (external, created beforehand)
  default:
    external:
      name: pga-net
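
The compose file above bind-mounts several host paths. A minimal sketch for preparing them before the first start is shown below. The helper name `prepare_layout` and its optional base-directory argument are assumptions made for illustration and are not part of the original setup. Creating grafana.ini as an empty file up front matters because Docker creates a directory when the host side of a bind mount does not exist.

```shell
#!/bin/sh
# Hypothetical helper: create the host folders and files that
# prometheus_grafana.yml bind-mounts. The base directory defaults to /web-project.
prepare_layout() {
  base="${1:-/web-project}"
  mkdir -p "$base/prometheus/config" \
           "$base/prometheus/data/prometheus" \
           "$base/prometheus/data/alertmanager" \
           "$base/grafana/config" \
           "$base/grafana/data/grafana"
  # Must exist as a regular file before the first `up`
  touch "$base/grafana/config/grafana.ini"
}

# Usage (as root, since the paths live under /): prepare_layout
```

If a container then fails with a permission error, the ownership of its data directory on the host may need adjusting, since the images do not necessarily run as root.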

A close look at the yaml above shows that the cluster uses an existing network, pga-net. The Docker network therefore has to be created manually before the cluster is started:

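## Create a bridge network named pga-net
docker network create --subnet=172.24.0.0/24 pga-net
## --subnet sets the network's subnet: containers get addresses from 172.24.0.2 to 172.24.0.254, and 172.24.0.1 is the default gateway

## Other basic network operations

## List all networks on this machine
docker network ls
## Remove the pga-net network
docker network rm pga-net
## Show detailed information about the network
docker network inspect pga-net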
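
The size of the pga-net address range follows from the prefix length of the subnet. The small sketch below computes it; the function name `usable_hosts` is an assumption for illustration.

```shell
#!/bin/sh
# Usable host addresses in an IPv4 subnet: 2^(32 - prefix) minus the
# network address and the broadcast address.
usable_hosts() {
  prefix="$1"
  echo $(( (1 << (32 - prefix)) - 2 ))
}

usable_hosts 24   # for 172.24.0.0/24 this prints 254
```

One of those addresses, 172.24.0.1, is taken by the gateway, so the fixed addresses in the compose file should come from 172.24.0.2 to 172.24.0.254.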

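  • Start the Prometheus+Grafana+Alertmanager cluster

Note: startup relies on docker-compose, so make sure both docker and docker-compose are installed on the machine;

## Go to /web-project and confirm that prometheus_grafana.yml exists
cd /web-project
## Start the cluster
docker-compose -f prometheus_grafana.yml up -d
## -f  the compose file the cluster depends on; defaults to docker-compose.yml when omitted
## -d  run in the background

## Restart the cluster
docker-compose -f prometheus_grafana.yml restart

## Start the cluster [when the containers already exist]
docker-compose -f prometheus_grafana.yml start

## Stop the cluster
docker-compose -f prometheus_grafana.yml stop

## Stop and remove everything [removes every container defined in the yaml file; irreversible]
docker-compose -f prometheus_grafana.yml down

## Once the cluster has been created, its containers can be handled like any other docker containers; for a single container, docker restart, start, and stop are more convenient;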

Startup result:

[root@hecs-x-medium-2-linux-20200516093019 web-project]# docker-compose -f prometheus_grafana.yml up -d
node-exporter is up-to-date
grafana is up-to-date
alertmanager is up-to-date
prometheus is up-to-date
### Every `up` checks the existing containers and recreates only those whose configuration or image has changed;
[root@hecs-x-medium-2-linux-20200516093019 web-project]# docker-compose -f prometheus_grafana.yml ps
    Name                   Command               State           Ports         
-------------------------------------------------------------------------------
alertmanager    /bin/alertmanager --config ...   Up      0.0.0.0:9093->9093/tcp
grafana         /run.sh                          Up      0.0.0.0:3000->3000/tcp
node-exporter   /bin/node_exporter               Up      0.0.0.0:9100->9100/tcp
prometheus      /bin/prometheus --web.exte ...   Up      0.0.0.0:9090->9090/tcp

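On the very first start, containers may fail to come up because the mapped local files do not match. In that case, run

docker logs <container name or id>

to view the container's startup log and error messages, and fix the problem accordingly;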
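
Beyond reading the logs, each service can be probed over HTTP once the containers are up. The sketch below is one possible way to do that, not part of the original setup: the function names `health_url` and `check_stack` are hypothetical, the ports are the ones published in the compose file, and the /prometheus prefix assumes that `--web.external-url=prometheus` places the Prometheus routes under that path.

```shell
#!/bin/sh
# Map each service to a URL that answers when the service is healthy.
# Assumption: Prometheus is served under /prometheus because of the
# external URL setting; drop the prefix if that does not apply.
health_url() {
  case "$1" in
    prometheus)    echo "http://localhost:9090/prometheus/-/healthy" ;;
    alertmanager)  echo "http://localhost:9093/-/healthy" ;;
    grafana)       echo "http://localhost:3000/api/health" ;;
    node-exporter) echo "http://localhost:9100/metrics" ;;
    *)             return 1 ;;
  esac
}

# Probe every service with curl and point to the logs of any that fail.
check_stack() {
  for svc in prometheus alertmanager grafana node-exporter; do
    if curl -fsS -o /dev/null "$(health_url "$svc")"; then
      echo "$svc: ok"
    else
      echo "$svc: FAILED (see: docker logs $svc)"
    fi
  done
}

# Usage: check_stack
```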

Basic mapped configuration files

  • prometheus.yml
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
 
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets: ['172.24.0.3:9093']
       # - alertmanager:9093
 
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
   - "*_rules.yml"
 
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  
 
#- job_name: 'prometheus'
#  static_configs:
#     - targets: ['172.24.0.2:9090']
#       labels:
#        instance: 'Monitor-Service-01'
#        platform: 'master'
 
- job_name: '华为Linux'
  static_configs:
     - targets: ['172.24.0.5:9100']
       labels:
        instance: 'linux-server'
        platform: 'huawei-cloud'
 
## Monitor the database
- job_name: '华为MySQL'
  static_configs:
     - targets: ['172.24.0.11:9104']
       labels:
        instance: 'huawei-mysql-main'
        platform: 'huawei-cloud'
  • alertmanager.yml
global:
  resolve_timeout: 1m
  # The smarthost and SMTP sender used for mail notifications.
  #smtp_smarthost: ''
  #smtp_from: ''
  #smtp_auth_username: ''
  #smtp_auth_password: ''
 
route:
  receiver: 'default-receiver'
  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  #group_by: ['alertname']
 
  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first
  # notification.
  # group_wait: 5s
 
  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  # group_interval: 30s
 
  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 1m
receivers:
  - name: 'default-receiver'

Alert delivery channels can be configured in this file, for example a DingTalk bot, a WeChat bot, or a custom webhook endpoint.
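
For instance, a webhook receiver could be added along the following lines. This is a sketch rather than the author's configuration: the helper `write_alertmanager_config` and the URL in the usage comment are placeholders chosen for illustration, while `webhook_configs`, `url`, and `send_resolved` are standard Alertmanager settings.

```shell
#!/bin/sh
# Hypothetical helper: write an alertmanager.yml whose default receiver
# posts alerts to a webhook. $1 = output file, $2 = webhook URL.
write_alertmanager_config() {
  cat > "$1" <<EOF
global:
  resolve_timeout: 1m
route:
  receiver: 'default-receiver'
  repeat_interval: 1m
receivers:
  - name: 'default-receiver'
    webhook_configs:
      - url: '$2'
        send_resolved: true
EOF
}

# Usage (placeholder URL):
# write_alertmanager_config /web-project/prometheus/config/alertmanager.yml http://172.24.0.1:8080/alert
```

After the file changes, restarting the alertmanager container (docker restart alertmanager) makes it pick up the new receiver.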

  • mysql_rules.yml [MySQL alert rules]
groups:
    - name: MySQL cluster health check
      rules:
      - alert: MySQL Server Is Down
        expr: mysql_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} MySQL is down"
          description: "MySQL database is down. This requires immediate action!"
      - alert: IO thread stopped
        expr: mysql_slave_status_slave_io_running != 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} IO thread stopped"
          description: "IO thread has stopped. This is usually because it cannot connect to the Master any more."
      - alert: SQL thread stopped 
        expr: mysql_slave_status_slave_sql_running == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} SQL thread stopped"
          description: "SQL thread has stopped. This is usually because it cannot apply a SQL statement received from the master."
      - alert: Slave lagging behind Master
        # seconds_behind_master is a gauge, so compare it directly instead of taking rate()
        expr: mysql_slave_status_seconds_behind_master > 30
        for: 1m
        labels:
          severity: warning 
        annotations:
          summary: "Instance {{ $labels.instance }} Slave lagging behind Master"
          description: "Slave is lagging behind Master. Please check if Slave threads are running and if there are some performance issues!"
      - alert: Slave is NOT read only(Please ignore this warning indicator.)
        # read_only = 0 means the instance accepts writes, which is what the description below warns about
        expr: mysql_global_variables_read_only == 0
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} Slave is NOT read only"
          description: "Slave is NOT set to read only. You can accidentally manipulate data on the slave and get inconsistencies..."
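  • centos_rules.yml [Linux server alert rules]
groups:
- name: CentOS service health check
  rules:
  - alert: Server is down # alert name
    expr: up == 0
    for: 1m # how long the condition must hold before the alert fires
    labels:
      severity: warning
    annotations: # alert details
      summary: "Node has been down"
      description: "{{ $labels.instance }} has been down"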

  - alert: "Memory usage too high"
    expr: round(100- node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes*100) > 80
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Memory usage too high"
      description: "Current usage: {{ $value }}%"

  - alert: "CPU usage too high"
    expr: round(100 - ((avg by (instance,job)(irate(node_cpu_seconds_total{mode="idle",instance!~'bac-.*'}[5m]))) *100)) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "CPU usage too high"
      description: "Current usage: {{ $value }}%"

  - alert: "Disk usage too high"
    expr: round(100-100*(node_filesystem_avail_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"})) > 80
    for: 15s
    labels:
      severity: warning
    annotations:
      summary: "Disk usage too high"
      description: "Disk {{$labels.mountpoint}} usage: {{ $value }}%"

  - alert: "Partition free space too low"
    expr: round(node_filesystem_avail_bytes{fstype=~"ext4|xfs",instance!~"testnode",mountpoint!~"/boot.*"}/1024/1024/1024) < 10
    for: 15s
    labels:
      severity: warning
    annotations:
      summary: "Partition free space too low"
      description: "Partition {{$labels.mountpoint}} has {{ $value }}GB free"

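  - alert: "Outbound network rate too high"
    # Outbound traffic is node_network_transmit_bytes_total; the receive counter measures inbound traffic
    expr: round(irate(node_network_transmit_bytes_total{instance!~"data.*",device!~'tap.*|veth.*|br.*|docker.*|vir.*|lo.*|vnet.*'}[1m])/1024) > 2048
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Outbound network rate too high"
      description: "Current rate: {{ $value }}KB/s"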
  • test_rules.yml [MySQL alert rule test]
groups:
    - name: MySQL health check rule example
      rules:
      - alert: MySQL is down
        expr: mysql_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} MySQL is down"
          description: "MySQL database is down. This requires immediate action!"
  • grafana.ini [Grafana initialization file; can be used together with nginx]
[server]
# The full public facing url
root_url = %(protocol)s://%(domain)s:%(http_port)s/grafana
#root_url = %(protocol)s://%(domain)s/grafana/
