Add Docker host disk pressure monitoring

1 month ago · 0b073c156a
parent 84cec3fff9
commit 0b073c156a
6 changed files with 248 additions and 11 deletions
--- a/config/logbook.yaml
+++ b/config/logbook.yaml
@@ -6,7 +6,7 @@
 # Logbook Configuration - Activity/Logbook display controls
 #  Defines what is hidden from the Activity/logbook view to keep noise down.
 # -------------------------------------------------------------------
-# Notes: Filters vcloudinfo availability chatter plus location/weather noise.
+# Notes: Filters vcloudinfo availability chatter plus location/weather noise and raw Glances host telemetry.
 ######################################################################

 exclude:
@@ -35,6 +35,11 @@ exclude:
    - sensor.*_activity
    - sensor.*_bssid
    - sensor.*_wifi_signal_strength
+    - sensor.192_168_10_17_*
+    - sensor.docker14_*
+    - sensor.docker69_*
+    - sensor.docker_*_disk_used_percentage
+    - input_text.docker_*_disk_pressure_band
    - switch.*_container
    - "*alarm_panel_1*"
    - "*alarm_panel_2*"
--- a/config/packages/README.md
+++ b/config/packages/README.md
@@ -46,11 +46,11 @@ Live collection of plug-and-play Home Assistant packages. Each YAML file in this
 | [lightning.yaml](lightning.yaml) | Blitzortung lightning counter monitoring with snoozeable push actions. | `sensor.blitzortung_lightning_counter`, `input_boolean.snooze_lightning`, notify engine actions |
 | [logbook_activity_feed.yaml](logbook_activity_feed.yaml) | Dummy `sensor.activity_feed` + helper to write clean Activity entries (Issue #1550). | `sensor.activity_feed`, `script.send_to_logbook` |
 | [mariadb_monitoring.yaml](mariadb_monitoring.yaml) | MariaDB health sensors and Lovelace dashboard snippet for recorder stats. | `sensor.mariadb_status`, `sensor.database_size` |
-| [docker_infrastructure.yaml](docker_infrastructure.yaml) | Docker host patching telemetry + container/stack Repairs automation, 20-minute Joanna escalation for persistent container outages using stable configured monitor membership, and weekly scheduled prune actions across docker_10/14/17/69. | `sensor.docker_*_apt_status`, `binary_sensor.*_stack_status`, `sensor.docker_stacks_down_count`, `repairs.create`, `script.joanna_dispatch` |
+| [docker_infrastructure.yaml](docker_infrastructure.yaml) | Docker host patching telemetry, container/stack Repairs automation, 20-minute Joanna escalation for persistent container outages using stable configured monitor membership, and weekly scheduled prune actions across docker_10/14/17/69. | `sensor.docker_*_apt_status`, `binary_sensor.*_stack_status`, `sensor.docker_stacks_down_count`, `repairs.create`, `script.joanna_dispatch` |
 | [github_watched_repo_scout.yaml](github_watched_repo_scout.yaml) | Nightly Joanna dispatch that reviews unread notifications from watched GitHub repos, recommends HA-config ideas, refreshes strong-candidate issues, and marks processed watched-repo notifications read. | `automation.github_watched_repo_scout_nightly`, `script.joanna_dispatch`, `script.send_to_logbook` |
 | [proxmox.yaml](proxmox.yaml) | Proxmox runtime and disk pressure monitoring with Repairs + Joanna dispatch for sustained node degradations, plus nightly Frigate reboot. | `binary_sensor.proxmox*_runtime_healthy`, `sensor.proxmox*_disk_used_percentage`, `repairs.create`, `script.joanna_dispatch`, `button.qemu_docker2_101_reboot` |
 | [synology_dsm.yaml](synology_dsm.yaml) | Synology DSM integration health normalization for Carlo-NAS01 and Carlo-NVR, with Repairs + Joanna dispatch on sustained integration, security, or storage problems. | `binary_sensor.carlo_*_synology_problem`, `sensor.carlo_*_synology_problem_summary`, `repairs.create`, `script.joanna_dispatch` |
-| [infrastructure.yaml](infrastructure.yaml) | Normalized WAN/DNS/backup/domain/cert health + website uptime/latency SLO signals for Infrastructure dashboards, plus nightly backup verification and monthly Joanna HA log hygiene review with GitHub issue follow-up. | `binary_sensor.infra_website_uptime_slo_breach`, `binary_sensor.infra_website_latency_degraded`, `automation.infra_backup_nightly_verification`, `automation.infra_monthly_log_hygiene_review`, `script.joanna_dispatch` |
+| [infrastructure.yaml](infrastructure.yaml) | Normalized WAN/DNS/backup/domain/cert health, Glances-backed Docker host disk pressure, and website uptime/latency SLO signals for Infrastructure dashboards, plus nightly backup verification and monthly Joanna HA log hygiene review with GitHub issue follow-up. | `sensor.docker_*_disk_used_percentage`, `automation.docker_host_disk_pressure_monitor`, `binary_sensor.infra_website_uptime_slo_breach`, `binary_sensor.infra_website_latency_degraded`, `automation.infra_backup_nightly_verification`, `script.joanna_dispatch` |
 | [onenote_indexer.yaml](onenote_indexer.yaml) | OneNote indexer health/status monitoring for Joanna, failure-repair automation, and a daily duplicate-delete maintenance request. | `sensor.onenote_indexer_last_job_status`, `binary_sensor.onenote_indexer_last_job_successful` |
 | [mqtt_status.yaml](mqtt_status.yaml) | Command-line MQTT broker reachability probe with Spook Repairs escalation and Joanna troubleshooting dispatch on outage. | `binary_sensor.mqtt_status_raw`, `binary_sensor.mqtt_broker_problem`, `repairs.create`, `rest_command.bearclaw_command` |
 | [mariadb.yaml](mariadb.yaml) | MariaDB recorder health and capacity snapshots with hourly live metrics, weekly admin/recorder polling, and stats-ready numeric sensors. | `sensor.mariadb_status`, `sensor.database_size` |
--- a/config/packages/docker_infrastructure.yaml
+++ b/config/packages/docker_infrastructure.yaml
@@ -5,7 +5,7 @@
 # -------------------------------------------------------------------
 # Docker Infrastructure - Host patching and container alerts
 # Related Issue: 1632, 1584
-#  APT webhook results (docker_10/14/17/69) and container down repairs.
+#  APT results and container down repairs.
 # -------------------------------------------------------------------
 # Notes: Hosts run weekly Wed 12:00 APT job and POST JSON to webhooks.
 # Notes: Reboots are handled directly on each host by apt_weekly.sh.
@@ -1157,7 +1157,7 @@ automation:
    action:
      - variables:
          down_items: "{{ state_attr('sensor.docker_containers_down_list', 'down_containers') | default([], true) | list }}"
-          down_count: "{{ down_items | count }}"
+          down_count: "{{ states('sensor.docker_containers_down_count') | int(0) }}"
      - service: script.send_to_logbook
        data:
          topic: "DOCKER"
@@ -1242,9 +1242,8 @@ automation:
      - platform: time
        at: "03:15:00"
    condition:
-      - condition: time
-        weekday:
-          - sun
+      - condition: template
+        value_template: "{{ now().weekday() == 6 }}"
    action:
      - service: button.press
        target:
--- a/config/packages/infrastructure.yaml
+++ b/config/packages/infrastructure.yaml
@@ -3,8 +3,8 @@
 # For more info visit https://www.vcloudinfo.com/click-here
 # Original Repo : https://github.com/CCOSTAN/Home-AssistantConfig
 # -------------------------------------------------------------------
-# Infrastructure - Observability and Joanna review workflows
-#  WAN/DNS/website/domain/cert state normalized for dashboards, plus scheduled infrastructure reviews.
+# Infrastructure - Observability, disk pressure, and Joanna review workflows
+#  WAN/DNS/website/domain/cert/Docker host state normalized for dashboards, plus scheduled infrastructure reviews.
 # -------------------------------------------------------------------
 # Related Issue: 1584
 # Notes: Home dashboard consumes `infra_*` entities for exceptions-only alerts.
@@ -12,8 +12,20 @@
 # Notes: Nightly Duplicati verification is performed by codex_appliance against the Duplicati API because HA backup entities are not available.
 # Notes: Monthly HA log hygiene review requests Telegram + GitHub issue follow-up only; Joanna must wait for approval before any changes.
 # Notes: Numeric WAN telemetry exposes state_class so recorder can keep long-term statistics.
+# Notes: Docker host root disk usage uses Glances-backed normalized sensors; raw Glances sensors are recorder/logbook-filtered.
 ######################################################################

+input_text:
+  docker_17_disk_pressure_band:
+    name: "docker_17 disk pressure band"
+    max: 20
+  docker_14_disk_pressure_band:
+    name: "docker_14 disk pressure band"
+    max: 20
+  docker_69_disk_pressure_band:
+    name: "docker_69 disk pressure band"
+    max: 20
+
 command_line:
  - sensor:
      name: Infra WAN Packet Loss
@@ -58,6 +70,30 @@ template:
            {{ fallback }}
          {% endif %}

+      - name: "docker_17 Disk Used Percentage"
+        unique_id: docker_17_disk_used_percentage
+        unit_of_measurement: "%"
+        state_class: measurement
+        icon: mdi:harddisk
+        availability: "{{ states('sensor.192_168_10_17_disk_usage') not in ['unknown', 'unavailable', 'none', ''] }}"
+        state: "{{ states('sensor.192_168_10_17_disk_usage') | float(0) | round(1) }}"
+
+      - name: "docker_14 Disk Used Percentage"
+        unique_id: docker_14_disk_used_percentage
+        unit_of_measurement: "%"
+        state_class: measurement
+        icon: mdi:harddisk
+        availability: "{{ states('sensor.docker14_disk_usage') not in ['unknown', 'unavailable', 'none', ''] }}"
+        state: "{{ states('sensor.docker14_disk_usage') | float(0) | round(1) }}"
+
+      - name: "docker_69 Disk Used Percentage"
+        unique_id: docker_69_disk_used_percentage
+        unit_of_measurement: "%"
+        state_class: measurement
+        icon: mdi:harddisk
+        availability: "{{ states('sensor.docker69_disk_usage') not in ['unknown', 'unavailable', 'none', ''] }}"
+        state: "{{ states('sensor.docker69_disk_usage') | float(0) | round(1) }}"
+
      - name: "Infra Domain Expiry Min Days"
        unique_id: infra_domain_expiry_min_days
        unit_of_measurement: "d"
@@ -334,6 +370,199 @@ automation:
            data:
              issue_id: infra_website_latency_degraded

+  - alias: "Docker Host Disk Pressure Monitor"
+    id: docker_host_disk_pressure_monitor
+    description: "Track Docker host root disk pressure from normalized Glances sensors and dispatch Joanna on band changes."
+    mode: queued
+    trigger:
+      - platform: time_pattern
+        minutes: "/15"
+      - platform: state
+        entity_id:
+          - sensor.docker_17_disk_used_percentage
+          - sensor.docker_14_disk_used_percentage
+          - sensor.docker_69_disk_used_percentage
+    variables:
+      host_configs:
+        - host_id: docker_17
+          host_name: docker_17
+          disk_entity: sensor.docker_17_disk_used_percentage
+          raw_entity: sensor.192_168_10_17_disk_usage
+          free_entity: sensor.192_168_10_17_disk_free
+          used_entity: sensor.192_168_10_17_disk_used
+          band_entity: input_text.docker_17_disk_pressure_band
+          issue_id: docker_host_docker_17_disk_pressure
+        - host_id: docker_14
+          host_name: docker_14
+          disk_entity: sensor.docker_14_disk_used_percentage
+          raw_entity: sensor.docker14_disk_usage
+          free_entity: sensor.docker14_disk_free
+          used_entity: sensor.docker14_disk_used
+          band_entity: input_text.docker_14_disk_pressure_band
+          issue_id: docker_host_docker_14_disk_pressure
+        - host_id: docker_69
+          host_name: docker_69
+          disk_entity: sensor.docker_69_disk_used_percentage
+          raw_entity: sensor.docker69_disk_usage
+          free_entity: sensor.docker69_disk_free
+          used_entity: sensor.docker69_disk_used
+          band_entity: input_text.docker_69_disk_pressure_band
+          issue_id: docker_host_docker_69_disk_pressure
+    action:
+      - repeat:
+          for_each: "{{ host_configs }}"
+          sequence:
+            - variables:
+                host_id: "{{ repeat.item.host_id }}"
+                host_name: "{{ repeat.item.host_name }}"
+                disk_entity: "{{ repeat.item.disk_entity }}"
+                raw_entity: "{{ repeat.item.raw_entity }}"
+                free_entity: "{{ repeat.item.free_entity }}"
+                used_entity: "{{ repeat.item.used_entity }}"
+                band_entity: "{{ repeat.item.band_entity }}"
+                issue_id: "{{ repeat.item.issue_id }}"
+                disk_state: "{{ states(disk_entity) }}"
+                disk_pct: "{{ disk_state | float(0) }}"
+                previous_band: "{{ states(band_entity) | lower }}"
+                current_band: >-
+                  {{ 'unavailable' if disk_state in ['unknown', 'unavailable', 'none', '']
+                     else 'critical' if disk_pct >= 90
+                     else 'warning' if disk_pct >= 80
+                     else 'normal' }}
+            - choose:
+                - conditions: "{{ current_band == 'critical' and previous_band != 'critical' }}"
+                  sequence:
+                    - service: repairs.create
+                      data:
+                        issue_id: "{{ issue_id }}"
+                        severity: error
+                        persistent: true
+                        title: "{{ host_name }} disk pressure critical ({{ disk_pct | round(1) }}%)"
+                        description: >-
+                          {{ host_name }} root disk usage is critically high.
+                          Free space or expand the host filesystem before Docker workloads fail.
+                    - service: script.joanna_dispatch
+                      data:
+                        trigger_context: "HA automation docker_host_disk_pressure_monitor (Docker Host Disk Pressure Monitor - Critical)"
+                        source: "home_assistant_automation.docker_host_disk_pressure_monitor.critical"
+                        summary: "{{ host_name }} root disk pressure is critical at {{ disk_pct | round(1) }}%"
+                        entity_ids:
+                          - "{{ disk_entity }}"
+                          - "{{ raw_entity }}"
+                          - "{{ free_entity }}"
+                          - "{{ used_entity }}"
+                        diagnostics: >-
+                          issue_id={{ issue_id }},
+                          host_id={{ host_id }},
+                          disk_entity={{ disk_entity }},
+                          raw_entity={{ raw_entity }},
+                          disk_pct={{ disk_pct | round(1) }},
+                          disk_free={{ states(free_entity) }},
+                          disk_used={{ states(used_entity) }},
+                          threshold=90
+                        request: >-
+                          Investigate critical disk pressure on {{ host_name }} and recommend safe remediation.
+                          Check Docker build cache, image/container volumes, logs, backups, and large files first.
+                          Do not delete data, prune containers, or reboot the host unless explicitly requested.
+                    - service: script.send_to_logbook
+                      data:
+                        topic: "DOCKER"
+                        message: >-
+                          {{ host_name }} disk usage is critical at {{ disk_pct | round(1) }}%.
+                          Repair {{ issue_id }} opened and Joanna investigation requested.
+                    - service: input_text.set_value
+                      target:
+                        entity_id: "{{ band_entity }}"
+                      data:
+                        value: "critical"
+                - conditions: "{{ current_band == 'warning' and previous_band not in ['warning', 'critical'] }}"
+                  sequence:
+                    - service: repairs.create
+                      data:
+                        issue_id: "{{ issue_id }}"
+                        severity: warning
+                        persistent: true
+                        title: "{{ host_name }} disk pressure warning ({{ disk_pct | round(1) }}%)"
+                        description: >-
+                          {{ host_name }} root disk usage is elevated.
+                          Plan cleanup before capacity reaches critical levels.
+                    - service: script.joanna_dispatch
+                      data:
+                        trigger_context: "HA automation docker_host_disk_pressure_monitor (Docker Host Disk Pressure Monitor - Warning)"
+                        source: "home_assistant_automation.docker_host_disk_pressure_monitor.warning"
+                        summary: "{{ host_name }} root disk pressure warning at {{ disk_pct | round(1) }}%"
+                        entity_ids:
+                          - "{{ disk_entity }}"
+                          - "{{ raw_entity }}"
+                          - "{{ free_entity }}"
+                          - "{{ used_entity }}"
+                        diagnostics: >-
+                          issue_id={{ issue_id }},
+                          host_id={{ host_id }},
+                          disk_entity={{ disk_entity }},
+                          raw_entity={{ raw_entity }},
+                          disk_pct={{ disk_pct | round(1) }},
+                          disk_free={{ states(free_entity) }},
+                          disk_used={{ states(used_entity) }},
+                          threshold=80
+                        request: >-
+                          Investigate elevated disk usage on {{ host_name }} and recommend safe cleanup before it becomes critical.
+                          Check Docker build cache, image/container volumes, logs, backups, and large files first.
+                          Do not delete data, prune containers, or reboot the host unless explicitly requested.
+                    - service: script.send_to_logbook
+                      data:
+                        topic: "DOCKER"
+                        message: >-
+                          {{ host_name }} disk usage warning at {{ disk_pct | round(1) }}%.
+                          Repair {{ issue_id }} opened and Joanna investigation requested.
+                    - service: input_text.set_value
+                      target:
+                        entity_id: "{{ band_entity }}"
+                      data:
+                        value: "warning"
+                - conditions: "{{ current_band == 'warning' and previous_band == 'critical' }}"
+                  sequence:
+                    - service: repairs.create
+                      data:
+                        issue_id: "{{ issue_id }}"
+                        severity: warning
+                        persistent: true
+                        title: "{{ host_name }} disk pressure warning ({{ disk_pct | round(1) }}%)"
+                        description: >-
+                          {{ host_name }} root disk usage is elevated but no longer critical.
+                          Continue cleanup before capacity reaches critical levels again.
+                    - service: script.send_to_logbook
+                      data:
+                        topic: "DOCKER"
+                        message: "{{ host_name }} disk usage dropped from critical to warning at {{ disk_pct | round(1) }}%."
+                    - service: input_text.set_value
+                      target:
+                        entity_id: "{{ band_entity }}"
+                      data:
+                        value: "warning"
+                - conditions: "{{ current_band == 'normal' and previous_band in ['warning', 'critical'] }}"
+                  sequence:
+                    - service: repairs.remove
+                      continue_on_error: true
+                      data:
+                        issue_id: "{{ issue_id }}"
+                    - service: script.send_to_logbook
+                      data:
+                        topic: "DOCKER"
+                        message: "{{ host_name }} disk usage recovered to {{ disk_pct | round(1) }}%. Repair {{ issue_id }} cleared."
+                    - service: input_text.set_value
+                      target:
+                        entity_id: "{{ band_entity }}"
+                      data:
+                        value: "normal"
+                - conditions: "{{ current_band == 'normal' and previous_band not in ['normal', 'warning', 'critical'] }}"
+                  sequence:
+                    - service: input_text.set_value
+                      target:
+                        entity_id: "{{ band_entity }}"
+                      data:
+                        value: "normal"
+
  - alias: "Infrastructure - Backup Nightly Verification"
    id: infra_backup_nightly_verification
    description: "Use codex_appliance to verify the latest Duplicati run and dispatch Joanna only on failure."
--- a/config/recorder.yaml
+++ b/config/recorder.yaml
@@ -6,7 +6,7 @@
 # Recorder Configuration - database retention and exclusions
 #  Stores HA history while purging noise and controlling DB size.
 # -------------------------------------------------------------------
-# Notes: Keeps 180 days (1/2 year); excludes vcloudinfo pings, noisy connectivity telemetry, countdown-style alarm helpers, MariaDB snapshot helpers, and other high-churn entities; MariaDB via recorder_db_url.
+# Notes: Keeps 180 days (1/2 year); excludes vcloudinfo pings, noisy connectivity telemetry, countdown-style alarm helpers, MariaDB snapshot helpers, raw Glances host telemetry, and other high-churn entities; MariaDB via recorder_db_url.
 ######################################################################
 db_url: !secret recorder_db_url
 purge_keep_days: 180
@@ -60,6 +60,9 @@ exclude:
    - sensor.*_temperature_state
    - sensor.*_humidity_state
    - sensor.*_last_seen*
+    - sensor.192_168_10_17_*
+    - sensor.docker14_*
+    - sensor.docker69_*
    - switch.*_do_not_disturb_*
    - switch.*_repeat_switch
    - input_text.l10s_vacuum_*
--- a/config/script/README.md
+++ b/config/script/README.md
@@ -61,6 +61,7 @@ Current automations that kick off automated resolutions (via `script.joanna_disp
 | `infra_monthly_log_hygiene_review` | Infrastructure - Monthly HA Log Hygiene Review | [../packages/infrastructure.yaml](../packages/infrastructure.yaml) |
 | `docker_state_sync_repairs_dynamic` | Docker State Sync - Repairs (Dynamic) | [../packages/docker_infrastructure.yaml](../packages/docker_infrastructure.yaml) |
 | `docker_group_reconcile_weekly_joanna_review` | Docker Group Reconcile - Weekly Joanna Review | [../packages/docker_infrastructure.yaml](../packages/docker_infrastructure.yaml) |
+| `docker_host_disk_pressure_monitor` | Docker Host Disk Pressure Monitor | [../packages/infrastructure.yaml](../packages/infrastructure.yaml) |
 | `tugtainer_dispatch_joanna_for_available_updates` | Tugtainer - Dispatch Joanna For Available Updates | [../packages/tugtainer_updates.yaml](../packages/tugtainer_updates.yaml) |
 | `tugtainer_dispatch_joanna_for_home_assistant_core_digest` | Tugtainer - Dispatch Joanna For Home Assistant Core Digest | [../packages/tugtainer_updates.yaml](../packages/tugtainer_updates.yaml) |
 | `unifi_ap_no_clients_repair_combined` | Unifi AP Create Repair Issue after 5m of 0 Clients | [../packages/wireless.yaml](../packages/wireless.yaml) |
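
Review note: the band logic in `docker_host_disk_pressure_monitor` can be checked outside Home Assistant. The sketch below is a plain-Python mirror of the `current_band` template and the five `choose` conditions from this commit; it is not part of the change itself. The function names and the branch labels (`open_critical`, `open_warning`, and so on) are hypothetical and exist only for illustration. The 80/90 thresholds and the list of unavailable states are copied from the diff.

```python
from typing import Optional

# Values copied from the automation in config/packages/infrastructure.yaml.
UNAVAILABLE_STATES = {"unknown", "unavailable", "none", ""}
WARNING_PCT = 80.0
CRITICAL_PCT = 90.0


def classify_band(disk_state: str) -> str:
    """Mirror of the `current_band` template (hypothetical helper name)."""
    if disk_state in UNAVAILABLE_STATES:
        return "unavailable"
    try:
        pct = float(disk_state)
    except ValueError:
        pct = 0.0  # same fallback as the template's `float(0)`
    if pct >= CRITICAL_PCT:
        return "critical"
    if pct >= WARNING_PCT:
        return "warning"
    return "normal"


def select_branch(previous_band: str, current_band: str) -> Optional[str]:
    """Return an illustrative label for the first matching `choose` branch."""
    previous = previous_band.lower()  # the automation lowercases the stored band
    if current_band == "critical" and previous != "critical":
        return "open_critical"         # repair (error) + Joanna dispatch
    if current_band == "warning" and previous not in ("warning", "critical"):
        return "open_warning"          # repair (warning) + Joanna dispatch
    if current_band == "warning" and previous == "critical":
        return "downgrade_to_warning"  # repair re-created as warning, no dispatch
    if current_band == "normal" and previous in ("warning", "critical"):
        return "clear_repair"          # repair removed, band reset
    if current_band == "normal" and previous not in ("normal", "warning", "critical"):
        return "initialize_normal"     # first run: seed the helper silently
    return None                        # no branch matches; stored band unchanged


if __name__ == "__main__":
    for previous, raw in [("", "42.0"), ("normal", "85.2"), ("warning", "93"),
                          ("critical", "84"), ("warning", "unavailable")]:
        band = classify_band(raw)
        print(f"{previous!r:>12} -> {band:<11} : {select_branch(previous, band)}")
```

Under this reading, a host whose normalized sensor goes `unavailable` matches no branch, so the stored band and any open repair stay as they were until a numeric reading returns. Repeated runs within the same band also match nothing, which is what keeps the 15-minute `time_pattern` trigger from re-dispatching Joanna.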