Add Docker host disk pressure monitoring

pull/1719/head
Carlo Costanzo 1 month ago
parent 84cec3fff9
commit 0b073c156a

@ -6,7 +6,7 @@
# Logbook Configuration - Activity/Logbook display controls
# Defines what is hidden from the Activity/logbook view to keep noise down.
# -------------------------------------------------------------------
-# Notes: Filters vcloudinfo availability chatter plus location/weather noise.
+# Notes: Filters vcloudinfo availability chatter plus location/weather noise and raw Glances host telemetry.
######################################################################
exclude:
@ -35,6 +35,11 @@ exclude:
- sensor.*_activity
- sensor.*_bssid
- sensor.*_wifi_signal_strength
+- sensor.192_168_10_17_*
+- sensor.docker14_*
+- sensor.docker69_*
+- sensor.docker_*_disk_used_percentage
+- input_text.docker_*_disk_pressure_band
- switch.*_container
- "*alarm_panel_1*"
- "*alarm_panel_2*"
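Review note: the globs added in this hunk, and the three raw Glances globs added to the recorder exclude list later in this commit, can be sanity-checked offline. The sketch below assumes the filters use shell-style (fnmatch) glob matching; `is_excluded` and the two list names are illustrative helpers, not part of the repository.

```python
from fnmatch import fnmatchcase

# Globs this commit adds to the logbook exclude list.
LOGBOOK_GLOBS = [
    "sensor.192_168_10_17_*",
    "sensor.docker14_*",
    "sensor.docker69_*",
    "sensor.docker_*_disk_used_percentage",
    "input_text.docker_*_disk_pressure_band",
]

# Globs this commit adds to the recorder exclude list (raw Glances only).
RECORDER_GLOBS = LOGBOOK_GLOBS[:3]


def is_excluded(entity_id, globs):
    """Return True when any glob matches the full entity id (case-sensitive)."""
    return any(fnmatchcase(entity_id, glob) for glob in globs)


if __name__ == "__main__":
    for entity in (
        "sensor.docker14_disk_usage",
        "sensor.docker_14_disk_used_percentage",
        "input_text.docker_14_disk_pressure_band",
    ):
        print(entity, is_excluded(entity, LOGBOOK_GLOBS), is_excluded(entity, RECORDER_GLOBS))
```

Under that assumption, the normalized `sensor.docker_*_disk_used_percentage` entities are hidden from the logbook but still recorded, because the recorder list excludes only the raw Glances prefixes.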

@ -46,11 +46,11 @@ Live collection of plug-and-play Home Assistant packages. Each YAML file in this
| [lightning.yaml](lightning.yaml) | Blitzortung lightning counter monitoring with snoozeable push actions. | `sensor.blitzortung_lightning_counter`, `input_boolean.snooze_lightning`, notify engine actions |
| [logbook_activity_feed.yaml](logbook_activity_feed.yaml) | Dummy `sensor.activity_feed` + helper to write clean Activity entries (Issue #1550). | `sensor.activity_feed`, `script.send_to_logbook` |
| [mariadb_monitoring.yaml](mariadb_monitoring.yaml) | MariaDB health sensors and Lovelace dashboard snippet for recorder stats. | `sensor.mariadb_status`, `sensor.database_size` |
-| [docker_infrastructure.yaml](docker_infrastructure.yaml) | Docker host patching telemetry + container/stack Repairs automation, 20-minute Joanna escalation for persistent container outages using stable configured monitor membership, and weekly scheduled prune actions across docker_10/14/17/69. | `sensor.docker_*_apt_status`, `binary_sensor.*_stack_status`, `sensor.docker_stacks_down_count`, `repairs.create`, `script.joanna_dispatch` |
+| [docker_infrastructure.yaml](docker_infrastructure.yaml) | Docker host patching telemetry, container/stack Repairs automation, 20-minute Joanna escalation for persistent container outages using stable configured monitor membership, and weekly scheduled prune actions across docker_10/14/17/69. | `sensor.docker_*_apt_status`, `binary_sensor.*_stack_status`, `sensor.docker_stacks_down_count`, `repairs.create`, `script.joanna_dispatch` |
| [github_watched_repo_scout.yaml](github_watched_repo_scout.yaml) | Nightly Joanna dispatch that reviews unread notifications from watched GitHub repos, recommends HA-config ideas, refreshes strong-candidate issues, and marks processed watched-repo notifications read. | `automation.github_watched_repo_scout_nightly`, `script.joanna_dispatch`, `script.send_to_logbook` |
| [proxmox.yaml](proxmox.yaml) | Proxmox runtime and disk pressure monitoring with Repairs + Joanna dispatch for sustained node degradations, plus nightly Frigate reboot. | `binary_sensor.proxmox*_runtime_healthy`, `sensor.proxmox*_disk_used_percentage`, `repairs.create`, `script.joanna_dispatch`, `button.qemu_docker2_101_reboot` |
| [synology_dsm.yaml](synology_dsm.yaml) | Synology DSM integration health normalization for Carlo-NAS01 and Carlo-NVR, with Repairs + Joanna dispatch on sustained integration, security, or storage problems. | `binary_sensor.carlo_*_synology_problem`, `sensor.carlo_*_synology_problem_summary`, `repairs.create`, `script.joanna_dispatch` |
-| [infrastructure.yaml](infrastructure.yaml) | Normalized WAN/DNS/backup/domain/cert health + website uptime/latency SLO signals for Infrastructure dashboards, plus nightly backup verification and monthly Joanna HA log hygiene review with GitHub issue follow-up. | `binary_sensor.infra_website_uptime_slo_breach`, `binary_sensor.infra_website_latency_degraded`, `automation.infra_backup_nightly_verification`, `automation.infra_monthly_log_hygiene_review`, `script.joanna_dispatch` |
+| [infrastructure.yaml](infrastructure.yaml) | Normalized WAN/DNS/backup/domain/cert health, Glances-backed Docker host disk pressure, and website uptime/latency SLO signals for Infrastructure dashboards, plus nightly backup verification and monthly Joanna HA log hygiene review with GitHub issue follow-up. | `sensor.docker_*_disk_used_percentage`, `automation.docker_host_disk_pressure_monitor`, `binary_sensor.infra_website_uptime_slo_breach`, `binary_sensor.infra_website_latency_degraded`, `automation.infra_backup_nightly_verification`, `script.joanna_dispatch` |
| [onenote_indexer.yaml](onenote_indexer.yaml) | OneNote indexer health/status monitoring for Joanna, failure-repair automation, and a daily duplicate-delete maintenance request. | `sensor.onenote_indexer_last_job_status`, `binary_sensor.onenote_indexer_last_job_successful` |
| [mqtt_status.yaml](mqtt_status.yaml) | Command-line MQTT broker reachability probe with Spook Repairs escalation and Joanna troubleshooting dispatch on outage. | `binary_sensor.mqtt_status_raw`, `binary_sensor.mqtt_broker_problem`, `repairs.create`, `rest_command.bearclaw_command` |
| [mariadb.yaml](mariadb.yaml) | MariaDB recorder health and capacity snapshots with hourly live metrics, weekly admin/recorder polling, and stats-ready numeric sensors. | `sensor.mariadb_status`, `sensor.database_size` |

@ -5,7 +5,7 @@
# -------------------------------------------------------------------
# Docker Infrastructure - Host patching and container alerts
# Related Issue: 1632, 1584
-# APT webhook results (docker_10/14/17/69) and container down repairs.
+# APT results and container down repairs.
# -------------------------------------------------------------------
# Notes: Hosts run weekly Wed 12:00 APT job and POST JSON to webhooks.
# Notes: Reboots are handled directly on each host by apt_weekly.sh.
@ -1157,7 +1157,7 @@ automation:
     action:
       - variables:
           down_items: "{{ state_attr('sensor.docker_containers_down_list', 'down_containers') | default([], true) | list }}"
-          down_count: "{{ down_items | count }}"
+          down_count: "{{ states('sensor.docker_containers_down_count') | int(0) }}"
       - service: script.send_to_logbook
         data:
           topic: "DOCKER"
@ -1242,9 +1242,8 @@ automation:
       - platform: time
         at: "03:15:00"
     condition:
-      - condition: time
-        weekday:
-          - sun
+      - condition: template
+        value_template: "{{ now().weekday() == 6 }}"
     action:
       - service: button.press
         target:
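Review note: the replacement condition relies on Python's `datetime.weekday()` numbering, where Monday is 0 and Sunday is 6, so `== 6` keeps the Sunday-only schedule of the removed `weekday: sun` condition. A minimal check, with `is_sunday` as an illustrative helper name:

```python
from datetime import datetime


def is_sunday(moment: datetime) -> bool:
    """Mirror the template test: weekday() counts Monday=0 through Sunday=6."""
    return moment.weekday() == 6


if __name__ == "__main__":
    # 2024-01-07 was a Sunday, 2024-01-08 a Monday.
    print(is_sunday(datetime(2024, 1, 7, 3, 15)))
    print(is_sunday(datetime(2024, 1, 8, 3, 15)))
```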

@ -3,8 +3,8 @@
# For more info visit https://www.vcloudinfo.com/click-here
# Original Repo : https://github.com/CCOSTAN/Home-AssistantConfig
# -------------------------------------------------------------------
-# Infrastructure - Observability and Joanna review workflows
-# WAN/DNS/website/domain/cert state normalized for dashboards, plus scheduled infrastructure reviews.
+# Infrastructure - Observability, disk pressure, and Joanna review workflows
+# WAN/DNS/website/domain/cert/Docker host state normalized for dashboards, plus scheduled infrastructure reviews.
# -------------------------------------------------------------------
# Related Issue: 1584
# Notes: Home dashboard consumes `infra_*` entities for exceptions-only alerts.
@ -12,8 +12,20 @@
# Notes: Nightly Duplicati verification is performed by codex_appliance against the Duplicati API because HA backup entities are not available.
# Notes: Monthly HA log hygiene review requests Telegram + GitHub issue follow-up only; Joanna must wait for approval before any changes.
# Notes: Numeric WAN telemetry exposes state_class so recorder can keep long-term statistics.
+# Notes: Docker host root disk usage uses Glances-backed normalized sensors; raw Glances sensors are recorder/logbook-filtered.
######################################################################
+input_text:
+  docker_17_disk_pressure_band:
+    name: "docker_17 disk pressure band"
+    max: 20
+  docker_14_disk_pressure_band:
+    name: "docker_14 disk pressure band"
+    max: 20
+  docker_69_disk_pressure_band:
+    name: "docker_69 disk pressure band"
+    max: 20
command_line:
- sensor:
name: Infra WAN Packet Loss
@ -58,6 +70,30 @@ template:
{{ fallback }}
{% endif %}
+      - name: "docker_17 Disk Used Percentage"
+        unique_id: docker_17_disk_used_percentage
+        unit_of_measurement: "%"
+        state_class: measurement
+        icon: mdi:harddisk
+        availability: "{{ states('sensor.192_168_10_17_disk_usage') not in ['unknown', 'unavailable', 'none', ''] }}"
+        state: "{{ states('sensor.192_168_10_17_disk_usage') | float(0) | round(1) }}"
+      - name: "docker_14 Disk Used Percentage"
+        unique_id: docker_14_disk_used_percentage
+        unit_of_measurement: "%"
+        state_class: measurement
+        icon: mdi:harddisk
+        availability: "{{ states('sensor.docker14_disk_usage') not in ['unknown', 'unavailable', 'none', ''] }}"
+        state: "{{ states('sensor.docker14_disk_usage') | float(0) | round(1) }}"
+      - name: "docker_69 Disk Used Percentage"
+        unique_id: docker_69_disk_used_percentage
+        unit_of_measurement: "%"
+        state_class: measurement
+        icon: mdi:harddisk
+        availability: "{{ states('sensor.docker69_disk_usage') not in ['unknown', 'unavailable', 'none', ''] }}"
+        state: "{{ states('sensor.docker69_disk_usage') | float(0) | round(1) }}"
- name: "Infra Domain Expiry Min Days"
unique_id: infra_domain_expiry_min_days
unit_of_measurement: "d"
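Review note: each new template sensor pairs an `availability` guard with a `float(0) | round(1)` state. The sketch below models that pairing in plain Python to show why the guard matters; `normalize_disk_pct` and `UNAVAILABLE_STATES` are illustrative names, and returning `None` stands in for the sensor being unavailable.

```python
UNAVAILABLE_STATES = {"unknown", "unavailable", "none", ""}


def normalize_disk_pct(raw_state: str):
    """Return the rounded percentage, or None when the raw sensor has no usable value."""
    # Availability guard: without it, the float(0) fallback would report 0.0 %.
    if raw_state in UNAVAILABLE_STATES:
        return None
    try:
        value = float(raw_state)
    except ValueError:
        value = 0.0  # mirrors the float(0) default for non-numeric states
    return round(value, 1)


if __name__ == "__main__":
    for raw in ("81.27", "unavailable", "n/a"):
        print(repr(raw), normalize_disk_pct(raw))
```

One consequence worth noting: a non-numeric raw state that is not in the guard list (the hypothetical `"n/a"` above) would still surface as 0.0 rather than as unavailable.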
@ -334,6 +370,199 @@ automation:
data:
issue_id: infra_website_latency_degraded
+  - alias: "Docker Host Disk Pressure Monitor"
+    id: docker_host_disk_pressure_monitor
+    description: "Track Docker host root disk pressure from normalized Glances sensors and dispatch Joanna on band changes."
+    mode: queued
+    trigger:
+      - platform: time_pattern
+        minutes: "/15"
+      - platform: state
+        entity_id:
+          - sensor.docker_17_disk_used_percentage
+          - sensor.docker_14_disk_used_percentage
+          - sensor.docker_69_disk_used_percentage
+    variables:
+      host_configs:
+        - host_id: docker_17
+          host_name: docker_17
+          disk_entity: sensor.docker_17_disk_used_percentage
+          raw_entity: sensor.192_168_10_17_disk_usage
+          free_entity: sensor.192_168_10_17_disk_free
+          used_entity: sensor.192_168_10_17_disk_used
+          band_entity: input_text.docker_17_disk_pressure_band
+          issue_id: docker_host_docker_17_disk_pressure
+        - host_id: docker_14
+          host_name: docker_14
+          disk_entity: sensor.docker_14_disk_used_percentage
+          raw_entity: sensor.docker14_disk_usage
+          free_entity: sensor.docker14_disk_free
+          used_entity: sensor.docker14_disk_used
+          band_entity: input_text.docker_14_disk_pressure_band
+          issue_id: docker_host_docker_14_disk_pressure
+        - host_id: docker_69
+          host_name: docker_69
+          disk_entity: sensor.docker_69_disk_used_percentage
+          raw_entity: sensor.docker69_disk_usage
+          free_entity: sensor.docker69_disk_free
+          used_entity: sensor.docker69_disk_used
+          band_entity: input_text.docker_69_disk_pressure_band
+          issue_id: docker_host_docker_69_disk_pressure
+    action:
+      - repeat:
+          for_each: "{{ host_configs }}"
+          sequence:
+            - variables:
+                host_id: "{{ repeat.item.host_id }}"
+                host_name: "{{ repeat.item.host_name }}"
+                disk_entity: "{{ repeat.item.disk_entity }}"
+                raw_entity: "{{ repeat.item.raw_entity }}"
+                free_entity: "{{ repeat.item.free_entity }}"
+                used_entity: "{{ repeat.item.used_entity }}"
+                band_entity: "{{ repeat.item.band_entity }}"
+                issue_id: "{{ repeat.item.issue_id }}"
+                disk_state: "{{ states(disk_entity) }}"
+                disk_pct: "{{ disk_state | float(0) }}"
+                previous_band: "{{ states(band_entity) | lower }}"
+                current_band: >-
+                  {{ 'unavailable' if disk_state in ['unknown', 'unavailable', 'none', '']
+                  else 'critical' if disk_pct >= 90
+                  else 'warning' if disk_pct >= 80
+                  else 'normal' }}
+            - choose:
+                - conditions: "{{ current_band == 'critical' and previous_band != 'critical' }}"
+                  sequence:
+                    - service: repairs.create
+                      data:
+                        issue_id: "{{ issue_id }}"
+                        severity: error
+                        persistent: true
+                        title: "{{ host_name }} disk pressure critical ({{ disk_pct | round(1) }}%)"
+                        description: >-
+                          {{ host_name }} root disk usage is critically high.
+                          Free space or expand the host filesystem before Docker workloads fail.
+                    - service: script.joanna_dispatch
+                      data:
+                        trigger_context: "HA automation docker_host_disk_pressure_monitor (Docker Host Disk Pressure Monitor - Critical)"
+                        source: "home_assistant_automation.docker_host_disk_pressure_monitor.critical"
+                        summary: "{{ host_name }} root disk pressure is critical at {{ disk_pct | round(1) }}%"
+                        entity_ids:
+                          - "{{ disk_entity }}"
+                          - "{{ raw_entity }}"
+                          - "{{ free_entity }}"
+                          - "{{ used_entity }}"
+                        diagnostics: >-
+                          issue_id={{ issue_id }},
+                          host_id={{ host_id }},
+                          disk_entity={{ disk_entity }},
+                          raw_entity={{ raw_entity }},
+                          disk_pct={{ disk_pct | round(1) }},
+                          disk_free={{ states(free_entity) }},
+                          disk_used={{ states(used_entity) }},
+                          threshold=90
+                        request: >-
+                          Investigate critical disk pressure on {{ host_name }} and recommend safe remediation.
+                          Check Docker build cache, image/container volumes, logs, backups, and large files first.
+                          Do not delete data, prune containers, or reboot the host unless explicitly requested.
+                    - service: script.send_to_logbook
+                      data:
+                        topic: "DOCKER"
+                        message: >-
+                          {{ host_name }} disk usage is critical at {{ disk_pct | round(1) }}%.
+                          Repair {{ issue_id }} opened and Joanna investigation requested.
+                    - service: input_text.set_value
+                      target:
+                        entity_id: "{{ band_entity }}"
+                      data:
+                        value: "critical"
+                - conditions: "{{ current_band == 'warning' and previous_band not in ['warning', 'critical'] }}"
+                  sequence:
+                    - service: repairs.create
+                      data:
+                        issue_id: "{{ issue_id }}"
+                        severity: warning
+                        persistent: true
+                        title: "{{ host_name }} disk pressure warning ({{ disk_pct | round(1) }}%)"
+                        description: >-
+                          {{ host_name }} root disk usage is elevated.
+                          Plan cleanup before capacity reaches critical levels.
+                    - service: script.joanna_dispatch
+                      data:
+                        trigger_context: "HA automation docker_host_disk_pressure_monitor (Docker Host Disk Pressure Monitor - Warning)"
+                        source: "home_assistant_automation.docker_host_disk_pressure_monitor.warning"
+                        summary: "{{ host_name }} root disk pressure warning at {{ disk_pct | round(1) }}%"
+                        entity_ids:
+                          - "{{ disk_entity }}"
+                          - "{{ raw_entity }}"
+                          - "{{ free_entity }}"
+                          - "{{ used_entity }}"
+                        diagnostics: >-
+                          issue_id={{ issue_id }},
+                          host_id={{ host_id }},
+                          disk_entity={{ disk_entity }},
+                          raw_entity={{ raw_entity }},
+                          disk_pct={{ disk_pct | round(1) }},
+                          disk_free={{ states(free_entity) }},
+                          disk_used={{ states(used_entity) }},
+                          threshold=80
+                        request: >-
+                          Investigate elevated disk usage on {{ host_name }} and recommend safe cleanup before it becomes critical.
+                          Check Docker build cache, image/container volumes, logs, backups, and large files first.
+                          Do not delete data, prune containers, or reboot the host unless explicitly requested.
+                    - service: script.send_to_logbook
+                      data:
+                        topic: "DOCKER"
+                        message: >-
+                          {{ host_name }} disk usage warning at {{ disk_pct | round(1) }}%.
+                          Repair {{ issue_id }} opened and Joanna investigation requested.
+                    - service: input_text.set_value
+                      target:
+                        entity_id: "{{ band_entity }}"
+                      data:
+                        value: "warning"
+                - conditions: "{{ current_band == 'warning' and previous_band == 'critical' }}"
+                  sequence:
+                    - service: repairs.create
+                      data:
+                        issue_id: "{{ issue_id }}"
+                        severity: warning
+                        persistent: true
+                        title: "{{ host_name }} disk pressure warning ({{ disk_pct | round(1) }}%)"
+                        description: >-
+                          {{ host_name }} root disk usage is elevated but no longer critical.
+                          Continue cleanup before capacity reaches critical levels again.
+                    - service: script.send_to_logbook
+                      data:
+                        topic: "DOCKER"
+                        message: "{{ host_name }} disk usage dropped from critical to warning at {{ disk_pct | round(1) }}%."
+                    - service: input_text.set_value
+                      target:
+                        entity_id: "{{ band_entity }}"
+                      data:
+                        value: "warning"
+                - conditions: "{{ current_band == 'normal' and previous_band in ['warning', 'critical'] }}"
+                  sequence:
+                    - service: repairs.remove
+                      continue_on_error: true
+                      data:
+                        issue_id: "{{ issue_id }}"
+                    - service: script.send_to_logbook
+                      data:
+                        topic: "DOCKER"
+                        message: "{{ host_name }} disk usage recovered to {{ disk_pct | round(1) }}%. Repair {{ issue_id }} cleared."
+                    - service: input_text.set_value
+                      target:
+                        entity_id: "{{ band_entity }}"
+                      data:
+                        value: "normal"
+                - conditions: "{{ current_band == 'normal' and previous_band not in ['normal', 'warning', 'critical'] }}"
+                  sequence:
+                    - service: input_text.set_value
+                      target:
+                        entity_id: "{{ band_entity }}"
+                      data:
+                        value: "normal"
- alias: "Infrastructure - Backup Nightly Verification"
id: infra_backup_nightly_verification
description: "Use codex_appliance to verify the latest Duplicati run and dispatch Joanna only on failure."
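Review note: the Docker Host Disk Pressure Monitor is effectively a small per-host state machine. It classifies the current reading into a band, compares that with the band stored in the `input_text` helper, and runs only the first matching `choose` branch. The sketch below restates that logic in plain Python for review; `classify_band` and `plan_actions` are illustrative names, and the action lists are shorthand for the service calls in each branch.

```python
NO_VALUE = {"unknown", "unavailable", "none", ""}


def classify_band(disk_state: str) -> str:
    """Map a sensor state string to a band using the 80/90 thresholds."""
    if disk_state in NO_VALUE:
        return "unavailable"
    try:
        pct = float(disk_state)
    except ValueError:
        pct = 0.0  # mirrors the float(0) default
    if pct >= 90:
        return "critical"
    if pct >= 80:
        return "warning"
    return "normal"


def plan_actions(previous_band: str, current_band: str):
    """Return (actions, new_band); new_band is None when the helper is left untouched."""
    previous_band = previous_band.lower()
    alert = ["repairs.create", "script.joanna_dispatch", "script.send_to_logbook"]
    if current_band == "critical" and previous_band != "critical":
        return alert, "critical"
    if current_band == "warning" and previous_band not in ("warning", "critical"):
        return alert, "warning"
    if current_band == "warning" and previous_band == "critical":
        # Downgrade: the Repair is rewritten as a warning, no new Joanna dispatch.
        return ["repairs.create", "script.send_to_logbook"], "warning"
    if current_band == "normal" and previous_band in ("warning", "critical"):
        return ["repairs.remove", "script.send_to_logbook"], "normal"
    if current_band == "normal" and previous_band not in ("normal", "warning", "critical"):
        # First run or unknown helper state: seed the helper silently.
        return [], "normal"
    return [], None


if __name__ == "__main__":
    print(plan_actions("normal", classify_band("92.4")))
    print(plan_actions("critical", classify_band("unavailable")))
```

Two properties follow from the branch order: the 15-minute `time_pattern` trigger does not re-alert while a host stays in the same band, and an `unavailable` reading leaves the stored band unchanged, so a host that returns in the same band it left does not raise a second Repair.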

@ -6,7 +6,7 @@
# Recorder Configuration - database retention and exclusions
# Stores HA history while purging noise and controlling DB size.
# -------------------------------------------------------------------
-# Notes: Keeps 180 days (1/2 year); excludes vcloudinfo pings, noisy connectivity telemetry, countdown-style alarm helpers, MariaDB snapshot helpers, and other high-churn entities; MariaDB via recorder_db_url.
+# Notes: Keeps 180 days (1/2 year); excludes vcloudinfo pings, noisy connectivity telemetry, countdown-style alarm helpers, MariaDB snapshot helpers, raw Glances host telemetry, and other high-churn entities; MariaDB via recorder_db_url.
######################################################################
db_url: !secret recorder_db_url
purge_keep_days: 180
@ -60,6 +60,9 @@ exclude:
- sensor.*_temperature_state
- sensor.*_humidity_state
- sensor.*_last_seen*
+- sensor.192_168_10_17_*
+- sensor.docker14_*
+- sensor.docker69_*
- switch.*_do_not_disturb_*
- switch.*_repeat_switch
- input_text.l10s_vacuum_*

@ -61,6 +61,7 @@ Current automations that kick off automated resolutions (via `script.joanna_disp
| `infra_monthly_log_hygiene_review` | Infrastructure - Monthly HA Log Hygiene Review | [../packages/infrastructure.yaml](../packages/infrastructure.yaml) |
| `docker_state_sync_repairs_dynamic` | Docker State Sync - Repairs (Dynamic) | [../packages/docker_infrastructure.yaml](../packages/docker_infrastructure.yaml) |
| `docker_group_reconcile_weekly_joanna_review` | Docker Group Reconcile - Weekly Joanna Review | [../packages/docker_infrastructure.yaml](../packages/docker_infrastructure.yaml) |
+| `docker_host_disk_pressure_monitor` | Docker Host Disk Pressure Monitor | [../packages/infrastructure.yaml](../packages/infrastructure.yaml) |
| `tugtainer_dispatch_joanna_for_available_updates` | Tugtainer - Dispatch Joanna For Available Updates | [../packages/tugtainer_updates.yaml](../packages/tugtainer_updates.yaml) |
| `tugtainer_dispatch_joanna_for_home_assistant_core_digest` | Tugtainer - Dispatch Joanna For Home Assistant Core Digest | [../packages/tugtainer_updates.yaml](../packages/tugtainer_updates.yaml) |
| `unifi_ap_no_clients_repair_combined` | Unifi AP Create Repair Issue after 5m of 0 Clients | [../packages/wireless.yaml](../packages/wireless.yaml) |
