diff --git a/config/logbook.yaml b/config/logbook.yaml index 91e20d56..dea28fe7 100644 --- a/config/logbook.yaml +++ b/config/logbook.yaml @@ -6,7 +6,7 @@ # Logbook Configuration - Activity/Logbook display controls # Defines what is hidden from the Activity/logbook view to keep noise down. # ------------------------------------------------------------------- -# Notes: Filters vcloudinfo availability chatter plus location/weather noise. +# Notes: Filters vcloudinfo availability chatter plus location/weather noise and raw Glances host telemetry. ###################################################################### exclude: @@ -35,6 +35,11 @@ exclude: - sensor.*_activity - sensor.*_bssid - sensor.*_wifi_signal_strength + - sensor.192_168_10_17_* + - sensor.docker14_* + - sensor.docker69_* + - sensor.docker_*_disk_used_percentage + - input_text.docker_*_disk_pressure_band - switch.*_container - "*alarm_panel_1*" - "*alarm_panel_2*" diff --git a/config/packages/README.md b/config/packages/README.md index e15540e3..8cf57175 100755 --- a/config/packages/README.md +++ b/config/packages/README.md @@ -46,11 +46,11 @@ Live collection of plug-and-play Home Assistant packages. Each YAML file in this | [lightning.yaml](lightning.yaml) | Blitzortung lightning counter monitoring with snoozeable push actions. | `sensor.blitzortung_lightning_counter`, `input_boolean.snooze_lightning`, notify engine actions | | [logbook_activity_feed.yaml](logbook_activity_feed.yaml) | Dummy `sensor.activity_feed` + helper to write clean Activity entries (Issue #1550). | `sensor.activity_feed`, `script.send_to_logbook` | | [mariadb_monitoring.yaml](mariadb_monitoring.yaml) | MariaDB health sensors and Lovelace dashboard snippet for recorder stats. | `sensor.mariadb_status`, `sensor.database_size` | -| [docker_infrastructure.yaml](docker_infrastructure.yaml) | Docker host patching telemetry + container/stack Repairs automation, 20-minute Joanna escalation for persistent container outages using stable configured monitor membership, and weekly scheduled prune actions across docker_10/14/17/69. | `sensor.docker_*_apt_status`, `binary_sensor.*_stack_status`, `sensor.docker_stacks_down_count`, `repairs.create`, `script.joanna_dispatch` | +| [docker_infrastructure.yaml](docker_infrastructure.yaml) | Docker host patching telemetry, container/stack Repairs automation, 20-minute Joanna escalation for persistent container outages using stable configured monitor membership, and weekly scheduled prune actions across docker_10/14/17/69. | `sensor.docker_*_apt_status`, `binary_sensor.*_stack_status`, `sensor.docker_stacks_down_count`, `repairs.create`, `script.joanna_dispatch` | | [github_watched_repo_scout.yaml](github_watched_repo_scout.yaml) | Nightly Joanna dispatch that reviews unread notifications from watched GitHub repos, recommends HA-config ideas, refreshes strong-candidate issues, and marks processed watched-repo notifications read. | `automation.github_watched_repo_scout_nightly`, `script.joanna_dispatch`, `script.send_to_logbook` | | [proxmox.yaml](proxmox.yaml) | Proxmox runtime and disk pressure monitoring with Repairs + Joanna dispatch for sustained node degradations, plus nightly Frigate reboot. | `binary_sensor.proxmox*_runtime_healthy`, `sensor.proxmox*_disk_used_percentage`, `repairs.create`, `script.joanna_dispatch`, `button.qemu_docker2_101_reboot` | | [synology_dsm.yaml](synology_dsm.yaml) | Synology DSM integration health normalization for Carlo-NAS01 and Carlo-NVR, with Repairs + Joanna dispatch on sustained integration, security, or storage problems. | `binary_sensor.carlo_*_synology_problem`, `sensor.carlo_*_synology_problem_summary`, `repairs.create`, `script.joanna_dispatch` | -| [infrastructure.yaml](infrastructure.yaml) | Normalized WAN/DNS/backup/domain/cert health + website uptime/latency SLO signals for Infrastructure dashboards, plus nightly backup verification and monthly Joanna HA log hygiene review with GitHub issue follow-up. | `binary_sensor.infra_website_uptime_slo_breach`, `binary_sensor.infra_website_latency_degraded`, `automation.infra_backup_nightly_verification`, `automation.infra_monthly_log_hygiene_review`, `script.joanna_dispatch` | +| [infrastructure.yaml](infrastructure.yaml) | Normalized WAN/DNS/backup/domain/cert health, Glances-backed Docker host disk pressure, and website uptime/latency SLO signals for Infrastructure dashboards, plus nightly backup verification and monthly Joanna HA log hygiene review with GitHub issue follow-up. | `sensor.docker_*_disk_used_percentage`, `automation.docker_host_disk_pressure_monitor`, `binary_sensor.infra_website_uptime_slo_breach`, `binary_sensor.infra_website_latency_degraded`, `automation.infra_backup_nightly_verification`, `script.joanna_dispatch` | | [onenote_indexer.yaml](onenote_indexer.yaml) | OneNote indexer health/status monitoring for Joanna, failure-repair automation, and a daily duplicate-delete maintenance request. | `sensor.onenote_indexer_last_job_status`, `binary_sensor.onenote_indexer_last_job_successful` | | [mqtt_status.yaml](mqtt_status.yaml) | Command-line MQTT broker reachability probe with Spook Repairs escalation and Joanna troubleshooting dispatch on outage. | `binary_sensor.mqtt_status_raw`, `binary_sensor.mqtt_broker_problem`, `repairs.create`, `rest_command.bearclaw_command` | | [mariadb.yaml](mariadb.yaml) | MariaDB recorder health and capacity snapshots with hourly live metrics, weekly admin/recorder polling, and stats-ready numeric sensors. | `sensor.mariadb_status`, `sensor.database_size` | diff --git a/config/packages/docker_infrastructure.yaml b/config/packages/docker_infrastructure.yaml index ecc70560..a7b4a4fa 100644 --- a/config/packages/docker_infrastructure.yaml +++ b/config/packages/docker_infrastructure.yaml @@ -5,7 +5,7 @@ # ------------------------------------------------------------------- # Docker Infrastructure - Host patching and container alerts # Related Issue: 1632, 1584 -# APT webhook results (docker_10/14/17/69) and container down repairs. +# APT results and container down repairs. # ------------------------------------------------------------------- # Notes: Hosts run weekly Wed 12:00 APT job and POST JSON to webhooks. # Notes: Reboots are handled directly on each host by apt_weekly.sh. @@ -1157,7 +1157,7 @@ automation: action: - variables: down_items: "{{ state_attr('sensor.docker_containers_down_list', 'down_containers') | default([], true) | list }}" - down_count: "{{ down_items | count }}" + down_count: "{{ states('sensor.docker_containers_down_count') | int(0) }}" - service: script.send_to_logbook data: topic: "DOCKER" @@ -1242,9 +1242,8 @@ automation: - platform: time at: "03:15:00" condition: - - condition: time - weekday: - - sun + - condition: template + value_template: "{{ now().weekday() == 6 }}" action: - service: button.press target: diff --git a/config/packages/infrastructure.yaml b/config/packages/infrastructure.yaml index b581ab33..67b91168 100644 --- a/config/packages/infrastructure.yaml +++ b/config/packages/infrastructure.yaml @@ -3,8 +3,8 @@ # For more info visit https://www.vcloudinfo.com/click-here # Original Repo : https://github.com/CCOSTAN/Home-AssistantConfig # ------------------------------------------------------------------- -# Infrastructure - Observability and Joanna review workflows -# WAN/DNS/website/domain/cert state normalized for dashboards, plus scheduled infrastructure reviews. +# Infrastructure - Observability, disk pressure, and Joanna review workflows +# WAN/DNS/website/domain/cert/Docker host state normalized for dashboards, plus scheduled infrastructure reviews. # ------------------------------------------------------------------- # Related Issue: 1584 # Notes: Home dashboard consumes `infra_*` entities for exceptions-only alerts. @@ -12,8 +12,20 @@ # Notes: Nightly Duplicati verification is performed by codex_appliance against the Duplicati API because HA backup entities are not available. # Notes: Monthly HA log hygiene review requests Telegram + GitHub issue follow-up only; Joanna must wait for approval before any changes. # Notes: Numeric WAN telemetry exposes state_class so recorder can keep long-term statistics. +# Notes: Docker host root disk usage uses Glances-backed normalized sensors; raw Glances sensors are recorder/logbook-filtered. ###################################################################### +input_text: + docker_17_disk_pressure_band: + name: "docker_17 disk pressure band" + max: 20 + docker_14_disk_pressure_band: + name: "docker_14 disk pressure band" + max: 20 + docker_69_disk_pressure_band: + name: "docker_69 disk pressure band" + max: 20 + command_line: - sensor: name: Infra WAN Packet Loss @@ -58,6 +70,30 @@ template: {{ fallback }} {% endif %} + - name: "docker_17 Disk Used Percentage" + unique_id: docker_17_disk_used_percentage + unit_of_measurement: "%" + state_class: measurement + icon: mdi:harddisk + availability: "{{ states('sensor.192_168_10_17_disk_usage') not in ['unknown', 'unavailable', 'none', ''] }}" + state: "{{ states('sensor.192_168_10_17_disk_usage') | float(0) | round(1) }}" + + - name: "docker_14 Disk Used Percentage" + unique_id: docker_14_disk_used_percentage + unit_of_measurement: "%" + state_class: measurement + icon: mdi:harddisk + availability: "{{ states('sensor.docker14_disk_usage') not in ['unknown', 'unavailable', 'none', ''] }}" + state: "{{ states('sensor.docker14_disk_usage') | float(0) | round(1) }}" + + - name: "docker_69 Disk Used Percentage" + unique_id: docker_69_disk_used_percentage + unit_of_measurement: "%" + state_class: measurement + icon: mdi:harddisk + availability: "{{ states('sensor.docker69_disk_usage') not in ['unknown', 'unavailable', 'none', ''] }}" + state: "{{ states('sensor.docker69_disk_usage') | float(0) | round(1) }}" + - name: "Infra Domain Expiry Min Days" unique_id: infra_domain_expiry_min_days unit_of_measurement: "d" @@ -334,6 +370,199 @@ automation: data: issue_id: infra_website_latency_degraded + - alias: "Docker Host Disk Pressure Monitor" + id: docker_host_disk_pressure_monitor + description: "Track Docker host root disk pressure from normalized Glances sensors and dispatch Joanna on band changes." + mode: queued + trigger: + - platform: time_pattern + minutes: "/15" + - platform: state + entity_id: + - sensor.docker_17_disk_used_percentage + - sensor.docker_14_disk_used_percentage + - sensor.docker_69_disk_used_percentage + variables: + host_configs: + - host_id: docker_17 + host_name: docker_17 + disk_entity: sensor.docker_17_disk_used_percentage + raw_entity: sensor.192_168_10_17_disk_usage + free_entity: sensor.192_168_10_17_disk_free + used_entity: sensor.192_168_10_17_disk_used + band_entity: input_text.docker_17_disk_pressure_band + issue_id: docker_host_docker_17_disk_pressure + - host_id: docker_14 + host_name: docker_14 + disk_entity: sensor.docker_14_disk_used_percentage + raw_entity: sensor.docker14_disk_usage + free_entity: sensor.docker14_disk_free + used_entity: sensor.docker14_disk_used + band_entity: input_text.docker_14_disk_pressure_band + issue_id: docker_host_docker_14_disk_pressure + - host_id: docker_69 + host_name: docker_69 + disk_entity: sensor.docker_69_disk_used_percentage + raw_entity: sensor.docker69_disk_usage + free_entity: sensor.docker69_disk_free + used_entity: sensor.docker69_disk_used + band_entity: input_text.docker_69_disk_pressure_band + issue_id: docker_host_docker_69_disk_pressure + action: + - repeat: + for_each: "{{ host_configs }}" + sequence: + - variables: + host_id: "{{ repeat.item.host_id }}" + host_name: "{{ repeat.item.host_name }}" + disk_entity: "{{ repeat.item.disk_entity }}" + raw_entity: "{{ repeat.item.raw_entity }}" + free_entity: "{{ repeat.item.free_entity }}" + used_entity: "{{ repeat.item.used_entity }}" + band_entity: "{{ repeat.item.band_entity }}" + issue_id: "{{ repeat.item.issue_id }}" + disk_state: "{{ states(disk_entity) }}" + disk_pct: "{{ disk_state | float(0) }}" + previous_band: "{{ states(band_entity) | lower }}" + current_band: >- + {{ 'unavailable' if disk_state in ['unknown', 'unavailable', 'none', ''] + else 'critical' if disk_pct >= 90 + else 'warning' if disk_pct >= 80 + else 'normal' }} + - choose: + - conditions: "{{ current_band == 'critical' and previous_band != 'critical' }}" + sequence: + - service: repairs.create + data: + issue_id: "{{ issue_id }}" + severity: error + persistent: true + title: "{{ host_name }} disk pressure critical ({{ disk_pct | round(1) }}%)" + description: >- + {{ host_name }} root disk usage is critically high. + Free space or expand the host filesystem before Docker workloads fail. + - service: script.joanna_dispatch + data: + trigger_context: "HA automation docker_host_disk_pressure_monitor (Docker Host Disk Pressure Monitor - Critical)" + source: "home_assistant_automation.docker_host_disk_pressure_monitor.critical" + summary: "{{ host_name }} root disk pressure is critical at {{ disk_pct | round(1) }}%" + entity_ids: + - "{{ disk_entity }}" + - "{{ raw_entity }}" + - "{{ free_entity }}" + - "{{ used_entity }}" + diagnostics: >- + issue_id={{ issue_id }}, + host_id={{ host_id }}, + disk_entity={{ disk_entity }}, + raw_entity={{ raw_entity }}, + disk_pct={{ disk_pct | round(1) }}, + disk_free={{ states(free_entity) }}, + disk_used={{ states(used_entity) }}, + threshold=90 + request: >- + Investigate critical disk pressure on {{ host_name }} and recommend safe remediation. + Check Docker build cache, image/container volumes, logs, backups, and large files first. + Do not delete data, prune containers, or reboot the host unless explicitly requested. + - service: script.send_to_logbook + data: + topic: "DOCKER" + message: >- + {{ host_name }} disk usage is critical at {{ disk_pct | round(1) }}%. + Repair {{ issue_id }} opened and Joanna investigation requested. + - service: input_text.set_value + target: + entity_id: "{{ band_entity }}" + data: + value: "critical" + - conditions: "{{ current_band == 'warning' and previous_band not in ['warning', 'critical'] }}" + sequence: + - service: repairs.create + data: + issue_id: "{{ issue_id }}" + severity: warning + persistent: true + title: "{{ host_name }} disk pressure warning ({{ disk_pct | round(1) }}%)" + description: >- + {{ host_name }} root disk usage is elevated. + Plan cleanup before capacity reaches critical levels. + - service: script.joanna_dispatch + data: + trigger_context: "HA automation docker_host_disk_pressure_monitor (Docker Host Disk Pressure Monitor - Warning)" + source: "home_assistant_automation.docker_host_disk_pressure_monitor.warning" + summary: "{{ host_name }} root disk pressure warning at {{ disk_pct | round(1) }}%" + entity_ids: + - "{{ disk_entity }}" + - "{{ raw_entity }}" + - "{{ free_entity }}" + - "{{ used_entity }}" + diagnostics: >- + issue_id={{ issue_id }}, + host_id={{ host_id }}, + disk_entity={{ disk_entity }}, + raw_entity={{ raw_entity }}, + disk_pct={{ disk_pct | round(1) }}, + disk_free={{ states(free_entity) }}, + disk_used={{ states(used_entity) }}, + threshold=80 + request: >- + Investigate elevated disk usage on {{ host_name }} and recommend safe cleanup before it becomes critical. + Check Docker build cache, image/container volumes, logs, backups, and large files first. + Do not delete data, prune containers, or reboot the host unless explicitly requested. + - service: script.send_to_logbook + data: + topic: "DOCKER" + message: >- + {{ host_name }} disk usage warning at {{ disk_pct | round(1) }}%. + Repair {{ issue_id }} opened and Joanna investigation requested. + - service: input_text.set_value + target: + entity_id: "{{ band_entity }}" + data: + value: "warning" + - conditions: "{{ current_band == 'warning' and previous_band == 'critical' }}" + sequence: + - service: repairs.create + data: + issue_id: "{{ issue_id }}" + severity: warning + persistent: true + title: "{{ host_name }} disk pressure warning ({{ disk_pct | round(1) }}%)" + description: >- + {{ host_name }} root disk usage is elevated but no longer critical. + Continue cleanup before capacity reaches critical levels again. + - service: script.send_to_logbook + data: + topic: "DOCKER" + message: "{{ host_name }} disk usage dropped from critical to warning at {{ disk_pct | round(1) }}%." + - service: input_text.set_value + target: + entity_id: "{{ band_entity }}" + data: + value: "warning" + - conditions: "{{ current_band == 'normal' and previous_band in ['warning', 'critical'] }}" + sequence: + - service: repairs.remove + continue_on_error: true + data: + issue_id: "{{ issue_id }}" + - service: script.send_to_logbook + data: + topic: "DOCKER" + message: "{{ host_name }} disk usage recovered to {{ disk_pct | round(1) }}%. Repair {{ issue_id }} cleared." + - service: input_text.set_value + target: + entity_id: "{{ band_entity }}" + data: + value: "normal" + - conditions: "{{ current_band == 'normal' and previous_band not in ['normal', 'warning', 'critical'] }}" + sequence: + - service: input_text.set_value + target: + entity_id: "{{ band_entity }}" + data: + value: "normal" + - alias: "Infrastructure - Backup Nightly Verification" id: infra_backup_nightly_verification description: "Use codex_appliance to verify the latest Duplicati run and dispatch Joanna only on failure." diff --git a/config/recorder.yaml b/config/recorder.yaml index 9d672cff..c5b8d687 100755 --- a/config/recorder.yaml +++ b/config/recorder.yaml @@ -6,7 +6,7 @@ # Recorder Configuration - database retention and exclusions # Stores HA history while purging noise and controlling DB size. # ------------------------------------------------------------------- -# Notes: Keeps 180 days (1/2 year); excludes vcloudinfo pings, noisy connectivity telemetry, countdown-style alarm helpers, MariaDB snapshot helpers, and other high-churn entities; MariaDB via recorder_db_url. +# Notes: Keeps 180 days (1/2 year); excludes vcloudinfo pings, noisy connectivity telemetry, countdown-style alarm helpers, MariaDB snapshot helpers, raw Glances host telemetry, and other high-churn entities; MariaDB via recorder_db_url. ###################################################################### db_url: !secret recorder_db_url purge_keep_days: 180 @@ -60,6 +60,9 @@ exclude: - sensor.*_temperature_state - sensor.*_humidity_state - sensor.*_last_seen* + - sensor.192_168_10_17_* + - sensor.docker14_* + - sensor.docker69_* - switch.*_do_not_disturb_* - switch.*_repeat_switch - input_text.l10s_vacuum_* diff --git a/config/script/README.md b/config/script/README.md index 41ca5d3f..86a44b41 100755 --- a/config/script/README.md +++ b/config/script/README.md @@ -61,6 +61,7 @@ Current automations that kick off automated resolutions (via `script.joanna_disp | `infra_monthly_log_hygiene_review` | Infrastructure - Monthly HA Log Hygiene Review | [../packages/infrastructure.yaml](../packages/infrastructure.yaml) | | `docker_state_sync_repairs_dynamic` | Docker State Sync - Repairs (Dynamic) | [../packages/docker_infrastructure.yaml](../packages/docker_infrastructure.yaml) | | `docker_group_reconcile_weekly_joanna_review` | Docker Group Reconcile - Weekly Joanna Review | [../packages/docker_infrastructure.yaml](../packages/docker_infrastructure.yaml) | +| `docker_host_disk_pressure_monitor` | Docker Host Disk Pressure Monitor | [../packages/infrastructure.yaml](../packages/infrastructure.yaml) | | `tugtainer_dispatch_joanna_for_available_updates` | Tugtainer - Dispatch Joanna For Available Updates | [../packages/tugtainer_updates.yaml](../packages/tugtainer_updates.yaml) | | `tugtainer_dispatch_joanna_for_home_assistant_core_digest` | Tugtainer - Dispatch Joanna For Home Assistant Core Digest | [../packages/tugtainer_updates.yaml](../packages/tugtainer_updates.yaml) | | `unifi_ap_no_clients_repair_combined` | Unifi AP Create Repair Issue after 5m of 0 Clients | [../packages/wireless.yaml](../packages/wireless.yaml) |