Understanding Prometheus Architecture and Monitoring in One Article

connygpt 2024-12-16 11:37

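Introduction

Prometheus is a systems monitoring and alerting toolkit. It is written in Go, was originally built at SoundCloud, and joined the Cloud Native Computing Foundation (CNCF) in 2016 as its second hosted project, after Kubernetes. This means it is now an open-source project maintained independently of any single company. It is well suited to monitoring both on-premises systems and cloud workloads.

In Prometheus we talk about dimensional data: a time series is identified by a metric name and a set of key/value pairs: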

Metric name: Speed 
Label: direction=forward 
Sample: 80

Prometheus includes a flexible query language. Metrics can be visualized with the built-in expression browser or through integrations such as Grafana.

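How does Prometheus work?

Prometheus collects metrics from monitored targets by scraping their metrics HTTP endpoints.
This differs from most other monitoring and alerting systems, which either have metrics pushed to the tool or run custom scripts that perform checks against specific services and systems.

Scraping is also a very efficient mechanism. A single Prometheus server can ingest up to one million samples per second, spread across several million time series.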
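The pull model can be sketched with nothing but Python's standard library. In this minimal illustration, a toy target serves a made-up metric on /metrics and a scrape function plays the role of the Prometheus server. The metric name demo_requests_total and the names MetricsHandler, scrape and scrape_demo are assumptions for the example, not Prometheus internals.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical payload in the Prometheus text exposition format.
METRICS_BODY = (
    "# TYPE demo_requests_total counter\n"
    'demo_requests_total{path="/"} 42\n'
)


class MetricsHandler(BaseHTTPRequestHandler):
    """Toy target: answers GET /metrics with plain-text metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = METRICS_BODY.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the example quiet


def scrape(url: str) -> str:
    """Pull the current metrics from a target, as a server would on each interval."""
    # An empty ProxyHandler bypasses any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def scrape_demo() -> str:
    """Start the toy target on a free local port, scrape it once, shut it down."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0 = any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        port = server.server_address[1]
        return scrape(f"http://127.0.0.1:{port}/metrics")
    finally:
        server.shutdown()
        server.server_close()
```

The point of the design is that the target only has to answer an HTTP GET; deciding when and how often to collect stays entirely with the scraper.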

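Prometheus basic concepts

All data is stored as time series, each identified by a metric name and a set of key/value pairs called labels.

go_memstats_alloc_bytes{instance="localhost", job="prometheus"} 20

The metric above has two labels, instance and job, and the value 20.0. Each stored data point is called a sample and consists of a float64 value and a millisecond-precision timestamp.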
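To make the "metric name plus labels" identity concrete, here is a small sketch that parses a line shaped like the one above into its parts. It is a simplification under stated assumptions: it ignores escaped quotes inside label values and optional timestamps, and the names parse_sample and series_id are made up for this example.

```python
import re
from typing import Dict, FrozenSet, Tuple

# metric_name{label="value", ...} value
SAMPLE_RE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)"
    r"(?:\{(?P<labels>.*)\})?"
    r"\s+(?P<value>\S+)$"
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line: str) -> Tuple[str, Dict[str, str], float]:
    """Split one exposition-style line into (metric name, labels, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


def series_id(name: str, labels: Dict[str, str]) -> Tuple[str, FrozenSet[Tuple[str, str]]]:
    """A time series is identified by its name and its full label set, in any order."""
    return (name, frozenset(labels.items()))
```

Two lines with the same name and the same labels in a different order map to the same series_id; changing any label value produces a different series.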

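Prometheus configuration

The configuration is stored in the Prometheus configuration file in YAML format. The file can be changed and applied without restarting Prometheus, which is useful in some scenarios. A reload can be triggered at runtime by running kill -SIGHUP <pid>.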

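Parameters (flags) can also be passed to ./prometheus at startup, but these cannot be changed on the fly, so the service has to be restarted for such changes to take effect.

The configuration file is passed at startup with the --config.file flag. The default configuration looks like this: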

# my global config
global:
    scrape_interval:     15s # Set the scrape interval to every 15s. Default: 60s
    evaluation_interval: 15s # Evaluate rules every 15s. Default: 60s
    # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
    alertmanagers:
    - static_configs:
        - targets:
        # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global `evaluation_interval`.
rule_files:
    # - "first_rules.yml"
    # - "second_rules.yml"

Prometheus target

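To scrape metrics, we need to add a scrape configuration to the Prometheus configuration file. For example, to scrape metrics from Prometheus itself, the following block is included by default (it is part of the default configuration file):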

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.

scrape_configs:
    # The job name is added as a label `job=<job_name>` to any time-series scraped from this config.
    - job_name: 'prometheus'

      # metrics_path defaults to `/metrics`
      # scheme defaults to `http`.

      static_configs:
        - targets: ['localhost:9090']

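Monitoring nodes with Prometheus

To monitor a node we need to install node_exporter. A node with the exporter installed exposes machine metrics, such as CPU and memory usage on Linux/*nix machines. We can use it to monitor the machine and then create alerts based on the ingested metrics.

For Windows, we can use the WMI exporter (since renamed windows_exporter).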

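Prometheus architecture

Monitoring

Before we can monitor our code, we have to instrument it. Prometheus has many official and unofficial client libraries. Even when no library exists for the language we use, metrics can still be exposed in a simple text-based format.

OK, back to monitoring. There are 4 types of metrics:

Counter - a value that only goes up, apart from resetting to zero on restart (e.g. visits to a website)
Gauge - a single numeric value that can go up and down (e.g. CPU load, temperature)
Histogram - samples observations (e.g. request durations or response sizes) and counts them in buckets. It also exposes _count and _sum. Its main purpose is calculating quantiles.
Summary - similar to a histogram, a summary samples observations (e.g. request durations or response sizes). It also provides a total count of observations and a sum of all observed values, and it calculates configurable quantiles over a sliding time window.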
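The first three metric types can be illustrated with a few lines of plain Python. This is a rough sketch of their semantics only, assuming made-up class names; it is not how the official client libraries are implemented, and it leaves out Summary, whose sliding-window quantiles need more machinery.

```python
import bisect
from typing import List, Sequence


class Counter:
    """A value that only goes up."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """A single value that can go up and down."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value

    def inc(self, amount: float = 1.0) -> None:
        self.value += amount

    def dec(self, amount: float = 1.0) -> None:
        self.value -= amount


class Histogram:
    """Counts observations into buckets and tracks _count and _sum."""

    def __init__(self, upper_bounds: Sequence[float]) -> None:
        self.upper_bounds = sorted(upper_bounds)
        # One extra slot at the end acts as the +Inf bucket.
        self.bucket_counts = [0] * (len(self.upper_bounds) + 1)
        self.count = 0
        self.sum = 0.0

    def observe(self, value: float) -> None:
        # First bucket whose upper bound is >= value ("less than or equal").
        index = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[index] += 1
        self.count += 1
        self.sum += value

    def cumulative_counts(self) -> List[int]:
        """Buckets are exposed cumulatively: each one includes all smaller ones."""
        totals, running = [], 0
        for bucket in self.bucket_counts:
            running += bucket
            totals.append(running)
        return totals
```

For example, a Histogram with upper bounds 0.1, 0.5 and 1.0 places an observation of 0.5 in the second bucket and an observation of 2.0 in the final +Inf bucket.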

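Pushing metrics to Prometheus

As mentioned earlier, Prometheus prefers to pull metrics by default, but sometimes we need to push them, for example when a job does not live long enough to be scraped every x seconds. In that case we can push the metrics to the Pushgateway.

The Pushgateway is an intermediary service that lets you push metrics to it, so that Prometheus can scrape them from there.

Pitfalls:

It is usually a single instance, which makes it a single point of failure (SPOF)
Prometheus's automatic instance health monitoring is not available for jobs that push
The Pushgateway never forgets metrics unless they are deleted through its API

Pushgateway functions (as named in the Python client library):

push_to_gateway - replaces all metrics with the same grouping key
pushadd_to_gateway - replaces only metrics with the same name and grouping key
delete_from_gateway - deletes metrics with the given job and grouping key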
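The difference between the three functions is easiest to see on a toy in-memory model. The sketch below is not the Pushgateway or the client library; ToyGateway and its methods are hypothetical names that only mimic the replace, add and delete semantics listed above.

```python
from typing import Dict, FrozenSet, Optional, Tuple

GroupKey = Tuple[str, FrozenSet[Tuple[str, str]]]


class ToyGateway:
    """In-memory stand-in: metrics are stored per (job, grouping key) group."""

    def __init__(self) -> None:
        self.groups: Dict[GroupKey, Dict[str, float]] = {}

    @staticmethod
    def _key(job: str, grouping_key: Optional[Dict[str, str]] = None) -> GroupKey:
        return (job, frozenset((grouping_key or {}).items()))

    def push(self, job, metrics, grouping_key=None) -> None:
        """Like push_to_gateway: replace every metric in the group."""
        self.groups[self._key(job, grouping_key)] = dict(metrics)

    def pushadd(self, job, metrics, grouping_key=None) -> None:
        """Like pushadd_to_gateway: replace only metrics with the same name."""
        self.groups.setdefault(self._key(job, grouping_key), {}).update(metrics)

    def delete(self, job, grouping_key=None) -> None:
        """Like delete_from_gateway: drop the whole group."""
        self.groups.pop(self._key(job, grouping_key), None)

    def get(self, job, grouping_key=None) -> Dict[str, float]:
        return dict(self.groups.get(self._key(job, grouping_key), {}))
```

Note that nothing in this model ever expires: a group stays until delete is called, which mirrors the "never forgets" pitfall.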

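Querying Prometheus

To query metrics, Prometheus provides an expression language called PromQL. It is a read-only language, so we cannot use it to insert data.

There are 4 value types in Prometheus that queries can work with:

Instant vector - a set of time series, each containing a single sample, all sharing the same timestamp. Example: node_cpu_seconds_total
Range vector - a set of time series, each containing a range of data points over time. Example: node_cpu_seconds_total[5m]
Scalar - a simple numeric floating-point value. Example: -3.14
String - a simple string value; currently unused. Example: foobar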
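The difference between an instant vector and a range vector can be sketched on a tiny in-memory store. This is a simplified model with assumed names (Store, instant_vector, range_vector): it ignores staleness handling, and it treats the window as excluding its left edge, which may differ between Prometheus versions.

```python
from typing import Dict, List, Tuple

Sample = Tuple[float, float]        # (timestamp in seconds, value)
Store = Dict[str, List[Sample]]     # series id -> samples sorted by timestamp


def instant_vector(store: Store, at: float) -> Dict[str, float]:
    """One value per series: the latest sample at or before `at`."""
    result = {}
    for series, samples in store.items():
        older = [s for s in samples if s[0] <= at]
        if older:
            result[series] = older[-1][1]
    return result


def range_vector(store: Store, at: float, window: float) -> Dict[str, List[Sample]]:
    """Several samples per series: everything inside (at - window, at]."""
    result = {}
    for series, samples in store.items():
        inside = [s for s in samples if at - window < s[0] <= at]
        if inside:
            result[series] = inside
    return result
```

An instant vector can be drawn as one point per series at a given time, while a range vector first has to be reduced by a function such as rate before it can be graphed.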

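Prometheus operators

To work with metrics, Prometheus also provides a set of operators. They help us understand what is happening in the monitored application and let us aggregate metrics and combine them into more complex data: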

* Arithmetic binary operators
    - `+` (addition)
    - `-` (subtraction)
    - `*` (multiplication)
    - `/` (division)
    - `%` (modulo)
    - `^` (power/exponentiation)
* Comparison binary operators
    - `==`,`!=`
    - `<`, `<=`, `>=`, `>`
* Logical/set binary operators
    - `and` (intersection)
    - `or` (union)
    - `unless` (complement)
* Aggregation operators
    - `sum` (calculate sum over dimensions)
    - `min` (select minimum over dimensions)
    - `max` (select maximum over dimensions)
    - `avg` (calculate average over dimensions)
    - `stddev` (calculate population standard deviation over dimensions)
    - `stdvar` (calculate population standard variance over dimensions)
    - `count` (count number of elements in the vector)
    - `count_values` (count number of elements with the same value)
    - `bottomk` (smallest k elements by sample value)
    - `topk` (largest k elements by sample value)
    - `quantile` (calculate φ-quantile (0 ≤ φ ≤ 1) over dimensions)
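The logical/set operators are the least intuitive of the list, so here is a minimal sketch of how they combine two instant vectors. It assumes that elements are matched on their complete label sets (metric name ignored), and the helper names labels, vec_and, vec_or and vec_unless are invented for this example.

```python
from typing import Dict, FrozenSet, Tuple

LabelSet = FrozenSet[Tuple[str, str]]
Vector = Dict[LabelSet, float]      # label set -> sample value


def labels(**pairs: str) -> LabelSet:
    return frozenset(pairs.items())


def vec_and(left: Vector, right: Vector) -> Vector:
    """Intersection: left elements whose label set also appears on the right."""
    return {k: v for k, v in left.items() if k in right}


def vec_or(left: Vector, right: Vector) -> Vector:
    """Union: all left elements, plus right elements with no match on the left."""
    result = dict(left)
    for k, v in right.items():
        result.setdefault(k, v)
    return result


def vec_unless(left: Vector, right: Vector) -> Vector:
    """Complement: left elements whose label set does not appear on the right."""
    return {k: v for k, v in left.items() if k not in right}
```

In all three cases the values come from the left-hand side whenever a label set exists on both sides; the right-hand side only decides which elements survive.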

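Examples

Below are some useful example queries that use several of these operators. I find them handy and have used very similar ones in my own work.

up{job="prometheus"} - returns the up time series with the label job="prometheus"
http_requests_total{job=~".*etheus"} - returns the http_requests_total time series whose job label matches the regular expression .*etheus (all values ending in "etheus")
http_requests_total{job="prometheus"}[5m] - returns all values within the last 5 minutes. This is a range vector, so it cannot be graphed directly in the Prometheus expression browser
http_requests_total{code!~"2.."} - returns the time series whose code label does not match the regular expression 2.. and so filters out all HTTP 2XX status codes
rate(http_requests_total[5m]) - returns the per-second rate of all time series named http_requests_total, measured over the last 5 minutes
sum(rate(http_requests_total[5m])) by (job) - returns the 5-minute request rate summed per job label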
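The last two queries can be approximated in a few lines to show what they compute. This is a deliberately simplified sketch: the real rate function also extrapolates to the edges of the window, which is not modelled here, and the names simple_rate and sum_by are assumptions for the example.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

Sample = Tuple[float, float]                 # (timestamp in seconds, counter value)
LabelSet = FrozenSet[Tuple[str, str]]


def simple_rate(samples: List[Sample]) -> Optional[float]:
    """Per-second increase of a counter across the given samples."""
    if len(samples) < 2:
        return None
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        # A drop means the counter was reset (e.g. a restart): count from zero.
        increase += value if value < previous else value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed


def sum_by(rates: Dict[LabelSet, float], label: str) -> Dict[str, float]:
    """Add up values of all series that share the same value for `label`."""
    totals: Dict[str, float] = {}
    for label_set, value in rates.items():
        key = dict(label_set).get(label, "")
        totals[key] = totals.get(key, 0.0) + value
    return totals
```

Applying simple_rate to each series of a range vector and then sum_by with the label job mirrors the shape of sum(rate(http_requests_total[5m])) by (job).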

Prometheus Exporters

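In Prometheus there is also the concept of exporters. An exporter collects statistics from an existing system, such as memory usage or CPU usage, and converts these third-party metrics into Prometheus metrics. Exporters are used when Prometheus cannot scrape metrics from a system directly (for example HAProxy or Linux system statistics).

Many exporters are already available, for example the MySQL server exporter and the Redis exporter.

All available exporters are listed in the official documentation.

To use an exporter, we need to add it as a scrape target in the prometheus.yml file.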
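At its core, an exporter is a translator from some existing statistics format into Prometheus metrics. The sketch below shows that translation step for text shaped like Linux /proc/meminfo. It is only an illustration: the demo_memory_ prefix and the function name meminfo_to_metrics are invented here and do not match what node_exporter actually exposes, and only fields reported in kB are converted.

```python
import re


def meminfo_to_metrics(meminfo_text: str) -> str:
    """Convert 'Name:   12345 kB' lines into gauge metrics in text format."""
    lines = []
    for raw in meminfo_text.splitlines():
        name, _, rest = raw.partition(":")
        parts = rest.split()
        # Only handle fields reported in kB; skip plain counts and blank lines.
        if len(parts) != 2 or parts[1] != "kB":
            continue
        # Metric names may only contain letters, digits and underscores here.
        safe_name = re.sub(r"[^a-zA-Z0-9_]", "_", name.strip())
        metric = f"demo_memory_{safe_name}_bytes"
        lines.append(f"# TYPE {metric} gauge")
        lines.append(f"{metric} {int(parts[0]) * 1024}")
    return "\n".join(lines) + "\n"
```

A real exporter would serve this text over HTTP on a /metrics path so that Prometheus can scrape it like any other target.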

Prometheus Alertmanager

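Alerting in Prometheus is split into two parts:

Alerting rules in the Prometheus server
Alertmanager, which is a separate component.

Alerting rules define when the Prometheus server should send an alert to Alertmanager. In Alertmanager we define routes and receivers. A route decides which receiver (for example email, Slack, or Microsoft Teams) is notified.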

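Alerting rules

Rules live in the Prometheus server configuration. They define when we want to raise an alert, along with other settings such as severity and description. Triggered alerts are sent on to an external service. Best practice is to keep alerting rules in files separate from the main Prometheus configuration and reference them from prometheus.yml, for example: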

rule_files:
- "/etc/prometheus/alert.rules"

The general shape of an alerting rule, shown here in the legacy Prometheus 1.x syntax, is:

ALERT <alert name>
    IF <expression>
    [ FOR <duration> ]
    [ LABELS <label set> ]
    [ ANNOTATIONS <label set> ]

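For example, the rule below, written in the YAML rule format used since Prometheus 2.0, defines an alert for CPU usage. It fires a critical alert when the machine has been under heavy load, above 90%, for 1 minute: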

groups:
- name: example
  rules:
  - alert: cpuUsage
    expr: cpu_percentage > 90
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: Machine under heavy load

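Alertmanager

Alertmanager handles the alerts sent by the Prometheus server. It takes care of deduplicating, grouping, and routing them. It delivers alerts to receivers such as MS Teams, email, or Slack, so that we are notified whenever something goes wrong (that is, when an alert fires).

The Alertmanager configuration is defined in the /etc/alertmanager/alertmanager.yml file. Here is an example: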

# The root route with all parameters, which are inherited by the child
# routes if they are not overwritten.
route:
  receiver: 'default-receiver'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: [cluster, alertname]
  # All alerts that do not match the following child routes
  # will remain at the root node and be dispatched to 'default-receiver'.
  routes:
  # All alerts with service=mysql or service=cassandra
  # are dispatched to the database pager.
  - receiver: 'database-pager'
    group_wait: 10s
    matchers:
    - service=~"mysql|cassandra"
  # All alerts with the team=frontend label match this sub-route.
  # They are grouped by product and environment rather than cluster
  # and alertname.
  - receiver: 'frontend-pager'
    group_by: [product, environment]
    matchers:
    - team="frontend"
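The routing tree above can be read as a small recursive lookup. The following sketch mirrors that reading under simplifying assumptions: every matcher is treated as an anchored regular expression, the first matching child route wins, and options such as continue, group_wait or group_by are ignored. Route and pick_receiver are names made up for this example.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Route:
    receiver: str
    matchers: Dict[str, str] = field(default_factory=dict)  # label -> anchored regex
    routes: List["Route"] = field(default_factory=list)

    def matches(self, labels: Dict[str, str]) -> bool:
        # A missing label is treated as the empty string.
        return all(
            re.fullmatch(pattern, labels.get(name, "")) is not None
            for name, pattern in self.matchers.items()
        )


def pick_receiver(route: Route, labels: Dict[str, str]) -> str:
    """Descend into the first matching child; otherwise stay at this node."""
    for child in route.routes:
        if child.matches(labels):
            return pick_receiver(child, labels)
    return route.receiver


# The same tree as in the configuration above.
ROOT = Route(
    receiver="default-receiver",
    routes=[
        Route(receiver="database-pager", matchers={"service": "mysql|cassandra"}),
        Route(receiver="frontend-pager", matchers={"team": "frontend"}),
    ],
)
```

With this tree, an alert labelled service=mysql goes to database-pager, one labelled team=frontend goes to frontend-pager, and anything else stays with default-receiver.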

Alertmanager is not installed by default, and the Prometheus instance has to be told that it exists. To do that, we change the prometheus.yml file by adding the following lines:

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093

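Alertmanager supports grouping similar alerts into a single notification. If our application fires an alert about an unprocessed entity 10 times within a period, we receive just one notification saying that the application failed 10 times. This really helps with frequently firing alerts: without grouping, the notification channel would be flooded with many near-identical messages.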
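Grouping boils down to bucketing alerts by the values of a few chosen labels. The sketch below shows that idea only; it leaves out timing options such as group_wait and group_interval, and the function names group_alerts and summarize are assumptions for the example.

```python
from typing import Dict, List, Sequence, Tuple

Alert = Dict[str, str]   # an alert is represented by its labels


def group_alerts(alerts: Sequence[Alert], group_by: Sequence[str]) -> Dict[Tuple[str, ...], List[Alert]]:
    """Put alerts that share the same values for the group_by labels together."""
    groups: Dict[Tuple[str, ...], List[Alert]] = {}
    for alert in alerts:
        key = tuple(alert.get(name, "") for name in group_by)
        groups.setdefault(key, []).append(alert)
    return groups


def summarize(alerts: Sequence[Alert], group_by: Sequence[str]) -> List[str]:
    """One notification line per group instead of one per alert."""
    lines = []
    for key, members in group_alerts(alerts, group_by).items():
        described = ", ".join(f"{n}={v}" for n, v in zip(group_by, key))
        lines.append(f"{len(members)} alert(s) for {described}")
    return lines
```

Ten alerts that differ only in a label outside group_by, such as instance, collapse into a single line with a count of 10.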

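We can also inhibit other alerts when a specific alert is already firing. If our database is down, we do not need a notification about it every minute! In addition, if we have a planned maintenance window, we can silence alerts for its duration.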
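Inhibition can be sketched as a simple check: a target alert is muted when some other firing alert matches the "source" side of a rule and both agree on a set of labels. The model below is loosely based on the source matchers, target matchers and equal labels of Alertmanager's inhibit rules, reduced to exact label equality; is_inhibited and its parameters are hypothetical names.

```python
from typing import Dict, Sequence

Alert = Dict[str, str]


def is_inhibited(
    target: Alert,
    firing: Sequence[Alert],
    source_match: Dict[str, str],
    target_match: Dict[str, str],
    equal: Sequence[str],
) -> bool:
    """True if `target` should be muted because a matching source alert is firing."""
    # The rule only applies to alerts that match the target side.
    if any(target.get(k) != v for k, v in target_match.items()):
        return False
    for source in firing:
        if source is target:
            continue
        source_ok = all(source.get(k) == v for k, v in source_match.items())
        same_scope = all(source.get(name) == target.get(name) for name in equal)
        if source_ok and same_scope:
            return True
    return False
```

A typical use is muting warning-level alerts for a cluster while a critical alert for the same cluster is firing.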

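An alert can be in one of three states:

Inactive - the rule condition is not met
Pending - the rule condition is met, but has not yet held for the configured for duration
Firing - the condition has held long enough, and the alert is sent to Alertmanager and on to the configured channels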
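The three states follow directly from the rule's condition and its for duration, which a short sketch can make explicit. This is an illustrative model rather than Prometheus code: alert_state is a made-up function that replays a list of evaluation results and returns the state after the last one.

```python
from typing import List, Tuple


def alert_state(evaluations: List[Tuple[float, bool]], for_seconds: float) -> str:
    """Replay (timestamp, condition_met) evaluations in time order."""
    active_since = None
    state = "inactive"
    for timestamp, condition_met in evaluations:
        if not condition_met:
            # The condition stopped holding: the alert resets completely.
            active_since = None
            state = "inactive"
            continue
        if active_since is None:
            active_since = timestamp
        held_for = timestamp - active_since
        state = "firing" if held_for >= for_seconds else "pending"
    return state
```

With a for duration of 60 seconds and evaluations every 15 seconds, the alert is pending for the first four evaluations and fires on the fifth, provided the condition holds throughout.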

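I hope that after reading this article you know more about Prometheus. You have seen that it can be used to monitor any application and to raise alerts when something goes wrong, and you have seen what its basic configuration looks like.

If you find any of my articles helpful or useful, please like or share them. Thank you!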
