Installing the Prometheus monitoring system on Linux

Introduction

I will not repeat the reference material on the four components Grafana, Prometheus, Exporter and cAdvisor. This section mainly covers how the four relate to one another.

3.1 Background

When writing an application, you usually write a log for later analysis. Often you only read the log after a problem has occurred, which is post-mortem, static analysis. Just as often, though, you need to understand the state of the whole system right now or at a given moment: how many services the system currently exposes, what their response times are, how these values change over time, and how frequently errors occur. This dynamic, near-real-time information is essential for monitoring the health of the whole system.

This is what metrics are for. An example of what they look like is at https://monitor.lucien.ink/metrics.
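To make the shape of such a metrics page concrete, here is a minimal sketch that parses a few lines of the Prometheus text exposition format. The sample payload and the metric names in it are made up for illustration and are not taken from the URL above; the parser only handles simple sample lines.

```python
import re

# One sample line of the text exposition format: name, optional {labels}, value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)')


def parse_metrics(text):
    """Return {"name{labels}": float_value} for every sample line in `text`."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        match = SAMPLE_RE.match(line)
        if match:
            name, labels, value = match.groups()
            samples[name + (labels or "")] = float(value)
    return samples


# Hypothetical payload, shaped like what a /metrics endpoint returns.
EXAMPLE = """\
# HELP http_requests_total Total HTTP requests served.
# TYPE http_requests_total counter
http_requests_total{code="200"} 1027
http_requests_total{code="500"} 3
process_resident_memory_bytes 2.5e+07
"""

samples = parse_metrics(EXAMPLE)
```

Each non-comment line is one sample: a metric name, an optional label set in braces, and a numeric value.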

3.2 Relationships

The main job of an Exporter is to expose metrics. Raw metrics are hard for most people to read, so Prometheus provides the Prometheus Query Language (PromQL) for data in this format. PromQL supports joins, filtering and similar operations, much like a database query language, so we can extract what we need, for example memory usage or load. The overall flow is: collect metrics from remote endpoints (there may be several), then refine them locally with PromQL queries. PromQL is powerful, but it has a steep learning curve for most people. Grafana therefore wraps PromQL queries and renders their results as charts. The whole thing is roughly a pipeline of production, processing and secondary processing.
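As a rough illustration of how a PromQL expression reaches Prometheus, the sketch below builds a URL for the instant-query HTTP endpoint /api/v1/query. The host localhost:9090 and the expression up == 0 are assumptions for the example, and nothing is actually sent.

```python
from urllib.parse import urlencode


def build_query_url(base_url, promql):
    """Build a URL for the Prometheus instant-query endpoint /api/v1/query."""
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": promql})


# Assumed address of a local Prometheus server; adjust for your setup.
url = build_query_url("http://localhost:9090", "up == 0")
```

Grafana does essentially the same thing on your behalf: it sends the PromQL text to Prometheus and draws the returned series.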

Of course, Prometheus and Grafana can do much more than this, alerting being one of the more powerful features, but that is outside the scope of this article.

3.3 Exporter

Note that Exporter is a category of component rather than a single program; its main function is to expose metrics for later processing and refinement.

Some components expose metrics themselves, for example Grafana, Prometheus and etcd. The metrics shown in this article are generated by Grafana itself.

Some components do not expose metrics, for example programs we have written ourselves.

And some targets are not components at all, for example the Linux system itself.

3.4 cAdvisor

cAdvisor is an open-source tool developed by Google for analyzing the resource usage and performance characteristics of running containers. It runs as a daemon that collects, aggregates, processes and exports information about running containers.

Introduction

Modern, new solutions are always interesting to study and apply, especially when you are used to classic Zabbix. So the arrival of Prometheus, and of Grafana with its lively dashboards, made me want to take a first look.

Prometheus is a time-series database (like InfluxDB, for example) with additional monitoring tooling. Compared with classic relational databases, such databases also store data in tables, but every table is keyed by time, which lets them run time-based queries faster. They are a good fit for storing timestamped metrics and retrieving them quickly.

The real case that pushed me to try Prometheus is a fairly old server running RHEL 6 and MySQL 5.6. At unpredictable intervals, but at roughly the same time of day, memory leaks and the mysqld daemon is killed by the OOM killer. Various tuning practices were applied to both MySQL and the server; the problem became rarer but still occurs. To investigate further and better understand what the system's processes are doing, I decided to use process-exporter.

You could go the old-fashioned way and write a bash script that stores the output of ps aux every minute, or you could use a time-series database, which is what the rest of this article covers.

Prometheus

Prometheus is not installed from a repository, and its installation is relatively involved. You need to download the release archive, create a user, copy the required files by hand, set ownership and create a unit for autostart.

Download

Go to the official download page and copy the link to the Linux package:

... and use it to download the package on the Linux host:

wget https://github.com/prometheus/prometheus/releases/download/v2.20.1/prometheus-2.20.1.linux-amd64.tar.gz

* if the system returns an error, install the wget package.

Installation (copying files)

After downloading the Prometheus archive, unpack it and copy its contents into several directories.

First, create the directories that will hold the Prometheus files:

mkdir /etc/prometheus

mkdir /var/lib/prometheus

Unpack the archive:

tar zxvf prometheus-*.linux-amd64.tar.gz

... and change into the directory with the unpacked files:

cd prometheus-*.linux-amd64

Distribute the files into the directories:

cp prometheus promtool /usr/local/bin/

cp -r console_libraries consoles prometheus.yml /etc/prometheus

Setting permissions

Create the user that will run the monitoring system:

useradd --no-create-home --shell /bin/false prometheus

* we created the user prometheus with no home directory and no ability to log in to the server console.

Set the owner of the directories created in the previous step:

chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus

Set the owner of the copied files:

chown prometheus:prometheus /usr/local/bin/{prometheus,promtool}

Start and verify

Start Prometheus with the command:

/usr/local/bin/prometheus --config.file /etc/prometheus/prometheus.yml --storage.tsdb.path /var/lib/prometheus/ --web.console.templates=/etc/prometheus/consoles --web.console.libraries=/etc/prometheus/console_libraries

... the startup log appears, ending with "Server is ready to receive web requests":

level=info ts=2019-08-07T07:39:06.849Z caller=main.go:621 msg="Server is ready to receive web requests."

Open a web browser and go to http://<server IP address>:9090 to load the Prometheus console:

Installation is complete.

Autostart

The monitoring server is installed, but it has to be started manually, which is not suitable for a server. To start Prometheus automatically, create a new systemd unit.

Return to the server console and stop Prometheus with Ctrl + C. Create the file prometheus.service:

vi /etc/systemd/system/prometheus.service

[Unit]
Description=Prometheus Service
After=network.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
    --config.file /etc/prometheus/prometheus.yml \
    --storage.tsdb.path /var/lib/prometheus/ \
    --web.console.templates=/etc/prometheus/consoles \
    --web.console.libraries=/etc/prometheus/console_libraries
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure

[Install]
WantedBy=multi-user.target

Reload the systemd configuration:

systemctl daemon-reload

Enable autostart:

systemctl enable prometheus

The manual test run above may have changed ownership of files in the data directory, so set its owner again:

chown -R prometheus:prometheus /var/lib/prometheus

Start the service:

systemctl start prometheus

... and check that it started correctly:

systemctl status prometheus

Instrumenting an application

There are two approaches to instrumentation, and both involve exposing an HTTP/S endpoint. By default the endpoint is /metrics, but the path can be configured in prometheus.yml. Prometheus scrapes metrics from this endpoint at regular intervals, for example every 5 or 30 seconds.

Do I need to change the application code?

You can make the /metrics endpoint part of the code of an existing application. The application then already has the secrets and credentials it needs to talk to the business, payment or database layers. The downside is that you have to add a new library, endpoint and dependency to your product or project.
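A minimal sketch of the first approach, using only the Python standard library rather than an official client library: the application serves /metrics itself from a background thread. The metric name app_metrics_requests_total, the default port 8000 and the helper scrape_once are all hypothetical and exist only for this example.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUESTS_SERVED = 0          # hypothetical application counter
_LOCK = threading.Lock()


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        global REQUESTS_SERVED
        if self.path != "/metrics":
            self.send_error(404)
            return
        with _LOCK:
            REQUESTS_SERVED += 1
            count = REQUESTS_SERVED
        body = (
            "# HELP app_metrics_requests_total Requests served by /metrics.\n"
            "# TYPE app_metrics_requests_total counter\n"
            f"app_metrics_requests_total {count}\n"
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):  # keep the example quiet
        pass


def start_metrics_server(port=8000):
    """Serve /metrics on 127.0.0.1:`port` in a background thread; return the server."""
    server = HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def scrape_once(port):
    """Fetch /metrics once, roughly what Prometheus does on every scrape."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", "/metrics")
        return conn.getresponse().read().decode("utf-8")
    finally:
        conn.close()
```

In a real application you would normally use a Prometheus client library instead of formatting the text by hand; the point here is only that the endpoint is an ordinary HTTP handler living inside the application process.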

Is there another option?

You can also write a separate process that acts as a wrapper and exposes information from your application or environment. Ed's Docker Hub exporter, for example, exposes data from an external API that he does not control. This option is useful when you cannot get permission to change the existing application.

The advantage of a separate process is that you can update what you control without redeploying your applications.
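The core of such a wrapper process is a translation step: take numbers obtained from somewhere else and render them as metrics. The sketch below shows only that step. The function name render_gauges, the dockerhub_repo prefix and the statistics are hypothetical and are not taken from any real exporter.

```python
def render_gauges(prefix, values, help_text="Value reported by the wrapped service."):
    """Render {name: number} as gauge samples in the text exposition format."""
    lines = []
    for name in sorted(values):
        metric = f"{prefix}_{name}"
        lines.append(f"# HELP {metric} {help_text}")
        lines.append(f"# TYPE {metric} gauge")
        lines.append(f"{metric} {float(values[name])}")
    return "\n".join(lines) + "\n"


# Hypothetical numbers, as if fetched from an external API the exporter does not control.
external_stats = {"pull_count": 1200, "star_count": 15}
payload = render_gauges("dockerhub_repo", external_stats)
```

A complete exporter would serve this payload over HTTP and refresh the numbers periodically or on each scrape.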

Digital Ocean Setup (Optional)

Installation & Configuration

For a one click install experience run the following command:

At this point you’ll have automagically deployed the entire Grafana and Prometheus stack. You can now access the Grafana dashboard at Username: , Password: . Note: before the dashboards will work you need to follow the .

Here’s a list of all the services that are created:

Service	Port	Description	Notes
Prometheus	:9090	Data Aggregator
Alert Manager	:9093	Adds Alerting for Prometheus Checks
Grafana	:3000	UI To Show Prometheus Data	Username: , Password:
Node Exporter	:9100	Data Collector for Computer Stats
cAdvisor	:8080	Collects resource usage of the Docker containers
Blackbox Exporter	:9115	Data Collector for Ping & Uptime

Utility Scripts

We’ve provided some utility scripts in the folder.

Script	Args	Description	Example
docker-log.sh	service	List the logs of a docker service by name	./util/docker-log.sh grafana
docker-nuke.sh	service	Removes docker services and volumes created by this project	./util/docker-nuke.sh
docker-ssh.sh	service	SSH into a service container	./util/docker-ssh.sh grafana
high-load.sh		Simulate high CPU load on the current computer	./util/high-load.sh
restart.sh		Restart all services	./util/restart.sh
start.sh		Start all services	./util/start.sh
status.sh		Print status of all services	./util/status.sh
stop.sh		Stop all services	./util/stop.sh

Define alerts

Three alert groups have been set up within the alert.rules configuration file:

Monitoring services alerts
Docker Host alerts
Docker Containers alerts

You can modify the alert rules and reload them by making an HTTP POST call to Prometheus:
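As one possible illustration, the sketch below builds such a POST request against the /-/reload endpoint, which Prometheus serves when it is started with --web.enable-lifecycle. The address localhost:9090 is an assumption, and the request is only constructed here, not sent.

```python
import urllib.request


def build_reload_request(base_url="http://localhost:9090"):
    """Build (but do not send) a POST request for the Prometheus /-/reload endpoint."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/-/reload", data=b"", method="POST"
    )


request = build_reload_request()
# To actually trigger the reload: urllib.request.urlopen(request)
```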

Monitoring services alerts

Trigger an alert if any of the monitoring targets (node-exporter and cAdvisor) are down for more than 30 seconds:

- alert: monitor_service_down
  expr: up == 0
  for: 30s
  labels:
    severity: critical
  annotations:
    summary: "Monitor service non-operational"
    description: "Service {{ $labels.instance }} is down."

Docker Host alerts

Trigger an alert if the Docker host CPU is under high load for more than 30 seconds:

- alert: high_cpu_load
  expr: node_load1 > 1.5
  for: 30s
  labels:
    severity: warning
  annotations:
    summary: "Server under high load"
    description: "Docker host is under high load, the avg load 1m is at {{ $value}}. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}."

Modify the load threshold based on your CPU cores.
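One way to derive a threshold from the core count is sketched below. The per-core ratio of 0.75 is an assumption, chosen only so that a 2-core host reproduces the 1.5 used in the rule above; pick whatever ratio fits your workload.

```python
import os


def load_threshold(cores, per_core=0.75):
    """Return a node_load1 alert threshold scaled to the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return cores * per_core


cores = os.cpu_count() or 1  # os.cpu_count() may return None
print(f"expr: node_load1 > {load_threshold(cores)}")
```

Run it on the Docker host itself so that the core count matches the machine being monitored.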

Trigger an alert if the Docker host memory is almost full:

- alert: high_memory_load
  expr: (sum(node_memory_MemTotal_bytes) - sum(node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes) ) / sum(node_memory_MemTotal_bytes) * 100 > 85
  for: 30s
  labels:
    severity: warning
  annotations:
    summary: "Server memory is almost full"
    description: "Docker host memory usage is {{ humanize $value}}%. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}."
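To see what the memory expression computes, here is the same arithmetic as a small sketch. The host figures are hypothetical; the function mirrors the rule for a single host rather than summing over several.

```python
def memory_used_percent(total, free, buffers, cached):
    """Mirror the alert expression: (total - (free + buffers + cached)) / total * 100."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (total - (free + buffers + cached)) / total * 100


# Hypothetical host: 8 GiB total, 512 MiB free, 256 MiB buffers, 256 MiB cached.
MIB = 1024 * 1024
usage = memory_used_percent(8192 * MIB, 512 * MIB, 256 * MIB, 256 * MIB)
fires = usage > 85  # same comparison as the alert rule
```

Buffers and cache are subtracted because the kernel can reclaim them, so they are not counted as used memory.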

Trigger an alert if the Docker host storage is almost full:

- alert: high_storage_load
  expr: (node_filesystem_size_bytes{fstype="aufs"} - node_filesystem_free_bytes{fstype="aufs"}) / node_filesystem_size_bytes{fstype="aufs"}  * 100 > 85
  for: 30s
  labels:
    severity: warning
  annotations:
    summary: "Server storage is almost full"
    description: "Docker host storage usage is {{ humanize $value}}%. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}."

Docker Containers alerts

Trigger an alert if a container is down for more than 30 seconds:

- alert: jenkins_down
  expr: absent(container_memory_usage_bytes{name="jenkins"})
  for: 30s
  labels:
    severity: critical
  annotations:
    summary: "Jenkins down"
    description: "Jenkins container is down for more than 30 seconds."

Trigger an alert if a container is using more than 10% of total CPU cores for more than 30 seconds:

- alert: jenkins_high_cpu
  expr: sum(rate(container_cpu_usage_seconds_total{name="jenkins"}[1m])) / count(node_cpu_seconds_total{mode="system"}) * 100 > 10
  for: 30s
  labels:
    severity: warning
  annotations:
    summary: "Jenkins high CPU usage"
    description: "Jenkins CPU usage is {{ humanize $value}}%."

Trigger an alert if a container is using more than 1.2GB of RAM for more than 30 seconds:

- alert: jenkins_high_memory
  expr: sum(container_memory_usage_bytes{name="jenkins"}) > 1200000000
  for: 30s
  labels:
    severity: warning
  annotations:
    summary: "Jenkins high memory usage"
    description: "Jenkins memory consumption is at {{ humanize $value}}."

Testing

Run to start the target Docker container on your local engine.
Use to log in to the running container.
Edit the role files.
Add other required roles (external) in the molecule/default/requirements.yml file.
Edit the molecule/default/playbook.yml.
Define infra tests under the molecule/default/tests folder using the goss verifier.
When ready, use to run the Ansible Playbook and to execute the test suite.
Note that the converge process starts performing a syntax check of the role.
Destroy the Docker container with the command .

To run all the steps with just one command, run .

In order to run the role targeting a VM, use the playbook_deploy.yml file for example with the following command: .

Deploy blackbox_exporter

config file https://github.com/prometheus/blackbox_exporter/blob/master/blackbox.yml

Set the prometheus_exporter_custom_conf_destination variable to deploy the configuration file to a specific location

default-value: "{{ prometheus_exporters_common_root_dir }}/{{ prometheus_exporter_name }}_current"

prometheus_exporter_conf_main config file location in playbook dir:

example:

prometheus_exporter_conf_main: black_box_exporter_example_config.yaml

file location:

$PLAYBOOKPATH/black_box_exporter_example_config.yaml

prometheus_exporter_conf_main: prometheus_cof/black_boxexporter/black_box_exporter_example_config.yaml

file location:

$PLAYBOOKPATH/prometheus_cof/black_boxexporter/black_box_exporter_example_config.yaml

prometheus_exporter_conf_main: black_box_exporter_example_config.yaml

Playbook

- name: install blackbox_exporter on group
  hosts: blackbox_exporter
  roles:
    - role: ansible-prometheus-exporter
      prometheus_exporter_name: blackbox_exporter
      prometheus_exporter_version: 0.12.0
      # path to playbookpath/`prometheus_exporter_conf_main` custom path
      prometheus_exporter_conf_main: black_box_exporter_example_config.yaml
      prometheus_exporter_config_flags:
        "--config.file": "{{ prometheus_exporter_custom_conf_destination }}/black_box_exporter_example_config.yaml"

Setup Grafana

Navigate to and login with user admin password admin. You can change the credentials in the compose file or by supplying the and environment variables via .env file on compose up. The config file can be added directly in grafana part like this

and the config file format should have this content

If you want to change the password, you have to remove this entry, otherwise the change will not take effect

Grafana is preconfigured with dashboards and Prometheus as the default data source:

Name: Prometheus
Type: Prometheus
Access: proxy

Docker Host Dashboard

The Docker Host Dashboard shows key metrics for monitoring the resource usage of your server:

Server uptime, CPU idle percent, number of CPU cores, available memory, swap and storage
System load average graph, running and blocked by IO processes graph, interrupts graph
CPU usage graph by mode (guest, idle, iowait, irq, nice, softirq, steal, system, user)
Memory usage graph by distribution (used, free, buffers, cached)
IO usage graph (read Bps, write Bps and IO time)
Network usage graph by device (inbound Bps, Outbound Bps)
Swap usage and activity graphs

For storage and particularly Free Storage graph, you have to specify the fstype in grafana graph request.
You can find it in , at line 480 :

I work on BTRFS, so i need to change to .

You can find right value for your system in Prometheus launching this request :

Docker Containers Dashboard

The Docker Containers Dashboard shows key metrics for monitoring running containers:

Total containers CPU load, memory and storage usage
Running containers graph, system load graph, IO usage graph
Container CPU usage graph
Container memory usage graph
Container cached memory usage graph
Container network inbound usage graph
Container network outbound usage graph

Note that this dashboard doesn’t show the containers that are part of the monitoring stack.

Monitor Services Dashboard

The Monitor Services Dashboard shows key metrics for monitoring the containers that make up the monitoring stack:

Prometheus container uptime, monitoring stack total memory usage, Prometheus local storage memory chunks and series
Container CPU usage graph
Container memory usage graph
Prometheus chunks to persist and persistence urgency graphs
Prometheus chunks ops and checkpoint duration graphs
Prometheus samples ingested rate, target scrapes and scrape duration graphs
Prometheus HTTP requests graph
Prometheus alerts graph

Login to Grafana and Visualize Metrics

Grafana is an Open Source visualization tool for the metrics collected with Prometheus. Next, open Grafana to view the Traefik Dashboards.
Note: Firefox doesn't work properly with the URLs below; please use Chrome.

Username: admin
Password: foobar

Open the Traefik Dashboard and select the different backends available

Note: In the upper right-hand corner of Grafana, switch the default 1 hour time range down to 5 minutes. Refresh a couple of times and you should see data start flowing.

Production Security:

Here are just a couple security considerations for this stack to help you get started.

Remove the published ports from the Prometheus and Alerting services and only allow Grafana to be accessed
Terminate all services/containers via HTTPS/SSL/TLS

Troubleshooting

It appears some people have reported no data appearing in Grafana. If this is happening to you be sure to check the time range being queried within Grafana to ensure it is using Today’s date with current time.

node-exporter

To make things easier, first create a directory to hold the files for this experiment.

The official documentation for node-exporter (hereafter "the exporter") does not recommend installing the exporter with Docker. For convenience, this article tests it by running the exporter inside a CentOS container. In production, the exporter can be installed directly on the host.

To run the exporter with Docker, you need to install docker and docker-compose.

Running the exporter

The files are as follows:

Start two CentOS containers, each running an exporter that generates the host metrics Prometheus needs.

The exporter listens on port 9100 by default. For external access, we map the two exporters to ports 9101 and 9102 respectively.

To run the exporter inside a CentOS container, we use a bind mount.

Once the containers have started successfully, open a browser and enter the exporter address to see the metrics output.

Example Playbook

Prometheus server

The following example installs Prometheus (server), alertmanager, blackbox_exporter, and the node_exporter. The Prometheus (server) port and storage retention parameters have been changed from the defaults.

The Prometheus server should be installed only on designated Prometheus server hosts. Prometheus clients should only have select and specific exporters installed.

Class use method:

- hosts: prometheus_servers
  vars:
    prometheus_components:
      - prometheus
      - alertmanager
      - blackbox_exporter
      - node_exporter
    prometheus_port: 10000
    prometheus_extra_opts:
     - '--storage.tsdb.retention=90d'
  roles:
    - mesaguy.prometheus

Longer ‘include_role’ use method:

- hosts: prometheus_servers
  vars:
    prometheus_port: 10000
    prometheus_extra_opts:
     - '--storage.tsdb.retention=90d'
  tasks:
  - name: Prometheus server
    include_role:
      name: mesaguy.prometheus
      tasks_from: '{{ prometheus_component }}'
    loop_control:
      loop_var: prometheus_component
    with_items:
      - prometheus
      - alertmanager
      - blackbox_exporter
      - node_exporter

Getting started

The only things you need to run these examples are and a copy
of this repo. Everything else happens inside the docker containers.

A note about the hideous command lines. In order to make this a modular experiment
I’ve extracted the separate sections of config in to different directories.
While this allows you to spin up a test site with any combination of services
and exporters it does mean you’ll need to add a
argument for each service you want to include in the test. I avoid the pain by
setting an alias:

And then use commands like and . In the README examples I’ll
use the full commands for clarity but you won’t have to.

Creating Prometheus

The first part of the infrastructure you should build, and the one depended on by
all the example service configurations in other directories, is
prometheus-server. This will create
both a prometheus and grafana container. At the moment we’ll have to manually
link these together.

From the root of this repo run the command to create the docker containers.

On the first run this might take a little while as it fetches all the
containers. Once it returns you can confirm the containers are running:

and view their output:

When you’re finished you can remove the containers, but don’t do that yet.

Congratulations! You now have a prometheus and grafana test instance and you can
experiment with making your own scrape backed graphs. You’ll soon want to expand
into data from other services, and an ideal place to start is with
Redis.

Existing Services

This repo currently contains example configurations for the following
services and their respective exporters:

Memcached
Node exporter — just an exporter running against your local
host
PostgreSQL
Prometheus and Grafana
Redis Server

Networking

All the containers are created inside a single docker network and reference each
other by the magic of their service names. They can also be reached from the
host on . This allows easier access to the prometheus and grafana
dashboards and means you can easily add sample data to the graphs by running
command such as in a loop or pointing a load tester at them.

IV - Going further

Mastering Node Exporter is certainly a must-have skill for engineers who want to get started with Prometheus.

However, you can dig a little deeper with Node Exporter.

a - Additional collectors

Not all collectors are enabled by default, and if you run a plain Node Exporter installation, you are most likely not using any of the additional ones.

Here is the list of additional collectors

To activate them, add a --collector.<name> flag when starting Node Exporter, for example:

ExecStart=/usr/local/bin/node_exporter --collector.processes --collector.ntp

This should activate the processes and ntp collectors.

b - The textfile collector

A complete guide to Node Exporter would not be complete without at least a short section on the textfile collector.

Similar to the Pushgateway, the textfile collector picks up metrics from text files so that they end up in Prometheus.

It is designed for batch jobs or short-lived jobs that do not expose metrics continuously.

Some examples of using the textfile collector are available here:

Using the textfile collector from a shell script.
Monitoring directory sizes with the textfile collector.
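A minimal sketch of the producing side: a batch job writes its result into a .prom file that the textfile collector can read. The function name, the metric name and the file name are hypothetical. The file is written to a temporary name first and then renamed, so that a scrape does not see a half-written file.

```python
import os
import tempfile


def write_textfile_metric(directory, filename, metric, value, help_text):
    """Atomically write one gauge sample to <directory>/<filename> (use a .prom suffix)."""
    body = (
        f"# HELP {metric} {help_text}\n"
        f"# TYPE {metric} gauge\n"
        f"{metric} {value}\n"
    )
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(body)
    os.chmod(tmp_path, 0o644)           # mkstemp creates the file as 0600
    final_path = os.path.join(directory, filename)
    os.replace(tmp_path, final_path)    # rename within the same directory
    return final_path
```

The directory must be the one passed to Node Exporter with --collector.textfile.directory; a backup script, for example, could call this function as its last step with the current Unix time as the value.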

Architecture and layout

One of the key goals in this experiment is to keep it as modular as possible
and allow you to create container networks of whichever combination you need.
Does your application use PostgreSQL and redis? Add a new for your application itself and just include and on the
command line to create those backing services and collect metrics on them all.

To implement this we have a subdirectory for each different thing we
want to collect metrics for. This contains the prometheus target
configuration file, mostly in and a file that defines how to run the service inside a
container. Critically, the compose file contains
an additional service definition.

Docker compose has a wonderful feature that ensures additional values for a
service, even one defined in a separate docker-compose file, are
merged to create a configuration that contains all encountered keys. In
the case of this repo it means we can define the basic prometheus checks
in the base file and add the additional checks as we
include the services they target.

Local Testing

The preferred way of locally testing the role is to use Docker and molecule (v2.x). You will have to install Docker on your system. See "Get started" for a Docker package suitable for your system.
We are using tox to simplify the process of testing on multiple Ansible versions. To install tox, execute:

pip3 install tox

To run tests on all ansible versions (WARNING: this can take some time)

tox

To run a custom molecule command on a custom environment with only the default test scenario:

tox -e py35-ansible28 -- molecule test -s default

If you would like to run tests on remote docker host just specify variable before running tox tests.

Monitoring Linux services

To monitor services with Prometheus, we will set up metric collection and alert display.

Collecting metrics with node_exporter

Open the service unit created for node_exporter:

vi /etc/systemd/system/node_exporter.service

... and add to ExecStart:

...
ExecStart=/usr/local/bin/node_exporter --collector.systemd
...

* this option tells the exporter to monitor the state of every service.

If needed, we can either monitor only selected services by adding the collector.systemd.unit-whitelist option:

ExecStart=/usr/local/bin/node_exporter --collector.systemd --collector.systemd.unit-whitelist="(chronyd|mariadb|nginx).service"

* in this example only the chronyd, mariadb and nginx services are monitored.

... or, conversely, monitor all services except selected ones:

ExecStart=/usr/local/bin/node_exporter --collector.systemd --collector.systemd.unit-blacklist="(auditd|dbus|kdump).service"

* with this setting, the auditd, dbus and kdump services are excluded from monitoring. Newer node_exporter releases name these options collector.systemd.unit-include and collector.systemd.unit-exclude.
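The value of these options is a regular expression matched against unit names. The sketch below shows which units such a pattern selects, assuming the whole unit name has to match (an assumption about how the exporter anchors the pattern) and using Python's re module in place of the exporter's own regex engine. The list of units is made up.

```python
import re


def matching_units(pattern, units):
    """Return the units whose full name matches `pattern`."""
    compiled = re.compile(pattern)
    return [unit for unit in units if compiled.fullmatch(unit)]


# Hypothetical units present on a host.
units = ["nginx.service", "mariadb.service", "sshd.service", "nginx-debug.service"]
selected = matching_units(r"(chronyd|mariadb|nginx).service", units)
```

Note that the unescaped dot matches any character; write \.service if you want to match a literal dot only.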

To apply the settings, reload the systemd configuration:

systemctl daemon-reload

Restart node_exporter:

systemctl restart node_exporter

Displaying alerts

Set up monitoring for the NGINX service.

Create a file with the rule:

vi /etc/prometheus/services.rules.yml

groups:
- name: services.rules
  rules:
  - alert: nginx_service
    expr: node_systemd_unit_state{name="nginx.service",state="active"} == 0
    for: 1s
    annotations:
      summary: "Instance {{ $labels.instance }} is down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} is down."

Include the rules file in the Prometheus configuration file:

vi /etc/prometheus/prometheus.yml

...
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - "alert.rules.yml"
  - "services.rules.yml"
...

* in this example we added our services.rules.yml to the rule_files section, next to the previously added alert.rules.yml.

Restart Prometheus:

systemctl restart prometheus

To test, stop the service:

systemctl stop nginx

In the Prometheus console, the Alerts section should now show the alert:

Sending notifications

Now connect Prometheus to Alertmanager to send notifications by email.

Configure alertmanager:

vi /etc/alertmanager/alertmanager.yml

Add to the global section:

global:
  ...
  smtp_from: monitoring@dmosk.ru

Change the route section to look like this:

... then add one more receiver:

* in this example we send the message to the mailbox alert@dmosk.ru from the local server

Note that to send mail to external addresses, a correctly configured mail server is required (otherwise the mail may end up in spam)

Restart the Alertmanager service:

systemctl restart alertmanager

Now connect Prometheus to Alertmanager. Open the monitoring server's configuration file:

vi /etc/prometheus/prometheus.yml

Change the alerting section to look like this:

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 192.168.0.14:9093

* where 192.168.0.14 is the IP address of the server running alertmanager.

Restart the service:

systemctl restart prometheus

Wait a little and open the Alertmanager web interface, which should show the alert:

... and an email with the alert should arrive in the mailbox.

Preparing the server

Configure a few server settings required for the system to work correctly.

Time

For events to be displayed with the correct time, time synchronization must be configured. To do this, install chrony:

a) on CentOS / Red Hat:

yum install chrony

systemctl enable chronyd

systemctl start chronyd

b) on Ubuntu / Debian:

apt-get install chrony

systemctl enable chrony

systemctl start chrony

Firewall

If a firewall is in use, the following ports must be opened:

TCP 9090: HTTP for the Prometheus server.
TCP 9093: HTTP for Alertmanager.
TCP and UDP 9094: for Alertmanager.
TCP 9100: for node_exporter.
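After opening the ports, a quick way to confirm that they are reachable is to attempt a TCP connection to each of them. The sketch below does that for the TCP ports listed above; the function names are hypothetical, and UDP 9094 is not covered.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_ports(host, ports=(9090, 9093, 9094, 9100)):
    """Map each TCP port to whether it accepts connections on `host`."""
    return {port: port_is_open(host, port) for port in ports}
```

Run check_ports from another machine with the monitoring server's address to test the firewall rather than just the local listeners.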

a) with firewalld:

firewall-cmd --permanent --add-port=9090/tcp --add-port=9093/tcp --add-port=9094/{tcp,udp} --add-port=9100/tcp

firewall-cmd --reload

b) with iptables:

iptables -I INPUT 1 -p tcp --match multiport --dports 9090,9093,9094,9100 -j ACCEPT

iptables -A INPUT -p udp --dport 9094 -j ACCEPT

c) with ufw:

ufw allow 9090,9093,9094,9100/tcp

ufw allow 9094/udp

ufw reload

SELinux

SELinux is enabled by default on Red Hat-based operating systems. Check whether it is active on the system:

getenforce

If the response is:

Enforcing

... disable it with the commands:

setenforce 0

sed -i 's/^SELINUX=.*/SELINUX=disabled/g' /etc/selinux/config

* if the response is The program 'getenforce' is currently not installed, SELinux is not installed on the system.

What you will learn

Before starting the full tutorial, let's look at the topics you will cover today.

Existing ways to monitor your Linux system: you will learn about the free and paid tools you can use to monitor your infrastructure quickly.
What Node Exporter is and how to install it properly as a service
How to connect Node Exporter to Prometheus and start collecting system metrics
How to use ready-made Grafana dashboards to build 100+ panels in one click

By the end of this tutorial, you will be able to build your own monitoring infrastructure and add many more exporters to it.