From 9ae1ee0968943a9c0a647be5c40780e204afbeb8 Mon Sep 17 00:00:00 2001
From: Carlo Costanzo
Date: Thu, 28 May 2026 17:03:25 -0400
Subject: [PATCH] Add infrastructure DNS drift and repair cleanup

---
 config/packages/README.md                  |   4 +-
 config/packages/docker_infrastructure.yaml |  38 ++++++-
 config/packages/infrastructure.yaml        | 122 +++++++++++++++++++++
 config/script/README.md                    |   1 +
 4 files changed, 159 insertions(+), 6 deletions(-)

diff --git a/config/packages/README.md b/config/packages/README.md
index c60fd19e..a286db98 100755
--- a/config/packages/README.md
+++ b/config/packages/README.md
@@ -47,10 +47,10 @@ Live collection of plug-and-play Home Assistant packages. Each YAML file in this
 | [logbook_activity_feed.yaml](logbook_activity_feed.yaml) | Dummy `sensor.activity_feed` + helper to write clean Activity entries (Issue #1550). | `sensor.activity_feed`, `script.send_to_logbook` |
 | [mariadb_monitoring.yaml](mariadb_monitoring.yaml) | MariaDB health sensors and Lovelace dashboard snippet for recorder stats. | `sensor.mariadb_status`, `sensor.database_size` |
 | [llmvision.yaml](llmvision.yaml) | Vision-backed garage-can and front-door package checks with rate-limited, downscaled OpenAI calls for package detection. | `input_button.llmvision_*`, `binary_sensor.front_door_packages_present`, `llmvision.stream_analyzer` |
-| [docker_infrastructure.yaml](docker_infrastructure.yaml) | Docker host patching telemetry, container/stack Repairs automation, 20-minute Joanna escalation for persistent container outages using stable configured monitor membership, and weekly scheduled prune actions across docker_10/14/17/69; the dedicated codex_appliance VM is monitored through BearClaw status telemetry. | `sensor.docker_*_apt_status`, `binary_sensor.*_stack_status`, `sensor.docker_stacks_down_count`, `repairs.create`, `script.joanna_dispatch` |
+| [docker_infrastructure.yaml](docker_infrastructure.yaml) | Docker host patching telemetry, container/stack Repairs automation, retired Portainer repair cleanup, 20-minute Joanna escalation for persistent container outages using stable configured monitor membership, and weekly scheduled prune actions across docker_10/14/17/69; the dedicated codex_appliance VM is monitored through BearClaw status telemetry. | `sensor.docker_*_apt_status`, `binary_sensor.*_stack_status`, `sensor.docker_stacks_down_count`, `repairs.create`, `repairs.remove`, `script.joanna_dispatch` |
 | [proxmox.yaml](proxmox.yaml) | Proxmox runtime and disk pressure monitoring with Repairs + Joanna dispatch for sustained node degradations, plus nightly Frigate reboot. | `binary_sensor.proxmox*_runtime_healthy`, `sensor.proxmox*_disk_used_percentage`, `repairs.create`, `script.joanna_dispatch`, `button.qemu_docker2_101_reboot` |
 | [synology_dsm.yaml](synology_dsm.yaml) | Synology DSM integration health normalization for Carlo-NAS01 and Carlo-NVR, with outage-aware Joanna-first handling for lone post-outage volume warnings and Repairs escalation for persistent or non-outage problems. | `binary_sensor.carlo_*_synology_problem`, `sensor.carlo_*_synology_problem_summary`, `binary_sensor.powerwall_grid_status`, `repairs.create`, `script.joanna_dispatch` |
-| [infrastructure.yaml](infrastructure.yaml) | Normalized WAN/DNS/backup/domain/cert health, Nebula Sync primary/backup Pi-hole consistency monitoring with Joanna dispatch, Glances-backed Docker host disk pressure with Joanna-only warning cleanup and critical Repairs, and website uptime/latency SLO signals for Infrastructure dashboards, plus nightly backup verification and monthly Joanna HA log hygiene review with GitHub issue follow-up. | `sensor.infra_nebula_sync_dns_consistency`, `binary_sensor.infra_nebula_sync_degraded`, `sensor.docker_*_disk_used_percentage`, `automation.infra_nebula_sync_health_dispatch`, `automation.docker_host_disk_pressure_monitor`, `binary_sensor.infra_website_uptime_slo_breach`, `binary_sensor.infra_website_latency_degraded`, `automation.infra_backup_nightly_verification`, `script.joanna_dispatch` |
+| [infrastructure.yaml](infrastructure.yaml) | Normalized WAN/DNS/backup/domain/cert health, Nebula Sync and promoted IoT primary/backup Pi-hole consistency monitoring with Joanna dispatch, Glances-backed Docker host disk pressure with Joanna-only warning cleanup and critical Repairs, and website uptime/latency SLO signals for Infrastructure dashboards, plus nightly backup verification and monthly Joanna HA log hygiene review with GitHub issue follow-up. | `sensor.infra_nebula_sync_dns_consistency`, `sensor.infra_pihole_iot_dns_consistency`, `binary_sensor.infra_nebula_sync_degraded`, `binary_sensor.infra_pihole_iot_dns_degraded`, `sensor.docker_*_disk_used_percentage`, `automation.infra_nebula_sync_health_dispatch`, `automation.infra_pihole_iot_dns_drift_dispatch`, `automation.docker_host_disk_pressure_monitor`, `binary_sensor.infra_website_uptime_slo_breach`, `binary_sensor.infra_website_latency_degraded`, `automation.infra_backup_nightly_verification`, `script.joanna_dispatch` |
 | [onenote_indexer.yaml](onenote_indexer.yaml) | Dedicated-appliance OneNote indexer health/status monitoring for Joanna, explicit index-health confirmation, failure-repair automation, and a daily duplicate-delete maintenance request. | `sensor.onenote_indexer_last_job_status`, `binary_sensor.onenote_indexer_last_job_successful`, `binary_sensor.onenote_indexer_index_healthy` |
 | [mqtt_status.yaml](mqtt_status.yaml) | Command-line MQTT broker reachability probe with Spook Repairs escalation and Joanna troubleshooting dispatch on outage. | `binary_sensor.mqtt_status_raw`, `binary_sensor.mqtt_broker_problem`, `repairs.create`, `rest_command.bearclaw_command` |
 | [mariadb.yaml](mariadb.yaml) | MariaDB recorder health and capacity snapshots with hourly live metrics, weekly admin/recorder polling, and stats-ready numeric sensors. | `sensor.mariadb_status`, `sensor.database_size` |
diff --git a/config/packages/docker_infrastructure.yaml b/config/packages/docker_infrastructure.yaml
index 6f90576f..e2753bd8 100644
--- a/config/packages/docker_infrastructure.yaml
+++ b/config/packages/docker_infrastructure.yaml
@@ -20,6 +20,7 @@
 # Notes: Treat telemetry reconnects from unavailable/unknown to a concrete stopped state as actionable outages.
 # Notes: Infra Info was removed; BearClaw Admin is the planning snapshot surface.
 # Notes: codex_appliance moved to a dedicated VM; monitor it through BearClaw status telemetry instead of old docker_17 container switches.
+# Notes: Retired repair cleanup clears old codex_appliance and hashed dozzle Portainer repair IDs.
###################################################################### input_datetime: @@ -661,7 +662,13 @@ script: {% endif %} {{ resolver.state }} telemetry_degraded: "{{ is_state('binary_sensor.docker_container_telemetry_degraded', 'on') }}" - container_name: "{{ state_attr(effective_entity, 'friendly_name') | default(container_key, true) }}" + container_name: >- + {% set switch_name = state_attr(switch_entity, 'friendly_name') | default('', true) %} + {% set switch_alt_name = state_attr(switch_entity_alt, 'friendly_name') | default('', true) %} + {% set effective_name = state_attr(effective_entity, 'friendly_name') | default('', true) %} + {% set raw_name = switch_name if switch_name | trim != '' else (switch_alt_name if switch_alt_name | trim != '' else effective_name) %} + {% set normalized = raw_name | regex_replace('(?i)\\s+container$', '') | trim %} + {{ container_key if normalized | lower in ['', 'status', 'state'] else normalized }} - condition: template value_template: >- {{ effective_state in down_states and @@ -709,7 +716,13 @@ script: {% endfor %} {% endif %} {{ resolver.state }} - container_name: "{{ state_attr(effective_entity, 'friendly_name') | default(container_key, true) }}" + container_name: >- + {% set switch_name = state_attr(switch_entity, 'friendly_name') | default('', true) %} + {% set switch_alt_name = state_attr(switch_entity_alt, 'friendly_name') | default('', true) %} + {% set effective_name = state_attr(effective_entity, 'friendly_name') | default('', true) %} + {% set raw_name = switch_name if switch_name | trim != '' else (switch_alt_name if switch_alt_name | trim != '' else effective_name) %} + {% set normalized = raw_name | regex_replace('(?i)\\s+container$', '') | trim %} + {{ container_key if normalized | lower in ['', 'status', 'state'] else normalized }} - condition: template value_template: >- {{ persistent_effective_state in down_states and @@ -769,7 +782,13 @@ script: {% endfor %} {% endif %} {{ resolver.state }} - 
container_name: "{{ state_attr(effective_entity, 'friendly_name') | default(container_key, true) }}" + container_name: >- + {% set switch_name = state_attr(switch_entity, 'friendly_name') | default('', true) %} + {% set switch_alt_name = state_attr(switch_entity_alt, 'friendly_name') | default('', true) %} + {% set effective_name = state_attr(effective_entity, 'friendly_name') | default('', true) %} + {% set raw_name = switch_name if switch_name | trim != '' else (switch_alt_name if switch_alt_name | trim != '' else effective_name) %} + {% set normalized = raw_name | regex_replace('(?i)\\s+container$', '') | trim %} + {{ container_key if normalized | lower in ['', 'status', 'state'] else normalized }} - condition: template value_template: "{{ effective_state not in down_states }}" - service: repairs.remove @@ -1063,7 +1082,7 @@ automation: - alias: "Docker Repairs Reconcile" id: docker_repairs_reconcile - description: "Reconcile stale container and stack Repairs issues on startup and every 55 minutes." + description: "Reconcile stale container, stack, and retired Portainer Repairs issues on startup and every 55 minutes." 
mode: queued trigger: - platform: homeassistant @@ -1101,6 +1120,17 @@ automation: entity_id: "{{ repeat.item }}" operation: "clear" log_result: false + - repeat: + for_each: + - docker_container_codex_appliance_offline + - user_docker_container_codex_appliance_offline + - docker_container_39394ec38d4c_dozzle_offline + - user_docker_container_39394ec38d4c_dozzle_offline + sequence: + - service: repairs.remove + continue_on_error: true + data: + issue_id: "{{ repeat.item }}" - alias: "Docker Containers Maintenance Prompt" id: docker_containers_maintenance_prompt diff --git a/config/packages/infrastructure.yaml b/config/packages/infrastructure.yaml index b1074585..c6be9e52 100644 --- a/config/packages/infrastructure.yaml +++ b/config/packages/infrastructure.yaml @@ -18,6 +18,7 @@ # Notes: Disk-pressure dispatch allows bounded safe cleanup of disposable caches and old generated backup artifacts, but not live data or restarts. # Notes: Warning-level Docker host disk pressure is Joanna-only; Repairs are reserved for critical pressure. # Notes: Nebula Sync DNS consistency compares primary/backup Pi-hole answers and dispatches Joanna on sustained drift or container loss. +# Notes: Promoted IoT DNS consistency compares primary/backup Pi-hole answers for reserved IoT host records. 
###################################################################### input_text: @@ -33,6 +34,9 @@ input_text: infra_nebula_sync_health_band: name: "Nebula Sync health band" max: 20 + infra_pihole_iot_dns_health_band: + name: "Pi-hole IoT DNS health band" + max: 20 input_boolean: infra_duplicati_backup_repair_active: @@ -86,6 +90,20 @@ command_line: - primary_reverse - secondary_reverse + - sensor: + name: Infra Pihole IoT DNS Consistency + unique_id: infra_pihole_iot_dns_consistency + command: >- + /bin/bash -c 'primary=192.168.10.10; secondary=192.168.10.14; records="rachio.fordst.com=192.168.10.73 econet.fordst.com=192.168.10.92 dreame-vacuum.fordst.com=192.168.10.93 carlo-bed.fordst.com=192.168.10.95 lg-smart-fridge.fordst.com=192.168.10.96 tesla-blackbox-gw.fordst.com=192.168.10.97 bgw210.fordst.com=192.168.10.98"; q(){ dig +time=2 +tries=1 +short @"$1" "$2" A 2>/dev/null | tr -d "\r" | sort | tr "\n" "," | sed "s/,$//"; }; status=ok; checked=0; mismatch_count=0; mismatches=""; for record in $records; do host=${record%%=*}; ip=${record#*=}; p=$(q "$primary" "$host"); s=$(q "$secondary" "$host"); checked=$((checked+1)); if [ "$p" != "$ip" ] || [ "$s" != "$ip" ]; then status=mismatch; mismatch_count=$((mismatch_count+1)); mismatches="${mismatches}${host}:expected=${ip},primary=${p:-none},secondary=${s:-none};"; fi; done; if [ -z "$mismatches" ]; then mismatches=none; fi; printf "{\"status\":\"%s\",\"checked_records\":%s,\"mismatch_count\":%s,\"mismatches\":\"%s\",\"primary_dns\":\"%s\",\"backup_dns\":\"%s\"}\n" "$status" "$checked" "$mismatch_count" "$mismatches" "$primary" "$secondary"' + scan_interval: 300 + value_template: "{{ value_json.status | default('unknown') }}" + json_attributes: + - checked_records + - mismatch_count + - mismatches + - primary_dns + - backup_dns + template: - sensor: - name: "Infra External IP" @@ -270,6 +288,19 @@ template: pihole_secondary_status: "{{ states('binary_sensor.pihole_secondary_status') }}" pihole_secondary_status_2: 
"{{ states('binary_sensor.pihole_secondary_status_2') }}" + - name: "Infra Pihole IoT DNS Degraded" + unique_id: infra_pihole_iot_dns_degraded + device_class: problem + state: >- + {{ states('sensor.infra_pihole_iot_dns_consistency') | lower != 'ok' }} + attributes: + dns_consistency: "{{ states('sensor.infra_pihole_iot_dns_consistency') }}" + checked_records: "{{ state_attr('sensor.infra_pihole_iot_dns_consistency', 'checked_records') }}" + mismatch_count: "{{ state_attr('sensor.infra_pihole_iot_dns_consistency', 'mismatch_count') }}" + mismatches: "{{ state_attr('sensor.infra_pihole_iot_dns_consistency', 'mismatches') }}" + primary_dns: "{{ state_attr('sensor.infra_pihole_iot_dns_consistency', 'primary_dns') }}" + backup_dns: "{{ state_attr('sensor.infra_pihole_iot_dns_consistency', 'backup_dns') }}" + - name: "Infra UPS On Battery" unique_id: infra_ups_on_battery device_class: problem @@ -553,6 +584,97 @@ automation: data: value: normal + - alias: "Infrastructure - Pi-hole IoT DNS Drift Dispatch" + id: infra_pihole_iot_dns_drift_dispatch + description: "Dispatch Joanna when promoted IoT Pi-hole DNS records drift across primary and backup resolvers." 
+ mode: queued + trigger: + - platform: state + entity_id: binary_sensor.infra_pihole_iot_dns_degraded + to: "on" + for: "00:10:00" + id: degraded + - platform: state + entity_id: binary_sensor.infra_pihole_iot_dns_degraded + to: "off" + for: "00:02:00" + id: recovered + - platform: event + event_type: homeassistant_started + id: reconcile + - platform: time_pattern + minutes: "7" + id: reconcile + variables: + issue_id: infra_pihole_iot_dns_degraded + dns_state: "{{ states('sensor.infra_pihole_iot_dns_consistency') }}" + previous_band: "{{ states('input_text.infra_pihole_iot_dns_health_band') | lower }}" + degraded: "{{ is_state('binary_sensor.infra_pihole_iot_dns_degraded', 'on') }}" + action: + - choose: + - conditions: "{{ degraded and previous_band != 'warning' }}" + sequence: + - service: repairs.remove + continue_on_error: true + data: + issue_id: "{{ issue_id }}" + - service: script.joanna_dispatch + data: + trigger_context: "HA automation infra_pihole_iot_dns_drift_dispatch (Infrastructure - Pi-hole IoT DNS Drift Dispatch)" + source: "home_assistant_automation.infra_pihole_iot_dns_drift_dispatch.warning" + summary: "Promoted IoT Pi-hole DNS records drifted across primary and backup resolvers" + entity_ids: + - sensor.infra_pihole_iot_dns_consistency + - binary_sensor.infra_pihole_iot_dns_degraded + diagnostics: >- + issue_id={{ issue_id }}, + dns_consistency={{ dns_state }}, + checked_records={{ state_attr('sensor.infra_pihole_iot_dns_consistency', 'checked_records') }}, + mismatch_count={{ state_attr('sensor.infra_pihole_iot_dns_consistency', 'mismatch_count') }}, + mismatches={{ state_attr('sensor.infra_pihole_iot_dns_consistency', 'mismatches') }}, + primary_dns={{ state_attr('sensor.infra_pihole_iot_dns_consistency', 'primary_dns') }}, + backup_dns={{ state_attr('sensor.infra_pihole_iot_dns_consistency', 'backup_dns') }} + request: >- + Investigate primary/backup Pi-hole DNS drift for promoted IoT reservations. 
+ Verify both Pi-holes answer rachio, econet, dreame-vacuum, carlo-bed, lg-smart-fridge, tesla-blackbox-gw, and bgw210 FQDNs with the expected reserved IPs. + Check primary and backup pihole.toml local DNS host records, Nebula Sync behavior, and generated custom.list files. + Do not change DHCP/custom DNS records unless diagnostics prove drift and the action is safe. + Reply with resolved=true/false, root_cause, action_taken, verification, and next_action_required=true/false. + domain_hint: ops + lane_hint: joanna.ops + - service: script.send_to_logbook + data: + topic: "DNS" + message: >- + Promoted IoT Pi-hole DNS consistency is degraded ({{ dns_state }}); Joanna investigation requested without opening a Repair. + - service: input_text.set_value + target: + entity_id: input_text.infra_pihole_iot_dns_health_band + data: + value: warning + - conditions: "{{ not degraded and previous_band in ['warning', 'unavailable'] }}" + sequence: + - service: repairs.remove + continue_on_error: true + data: + issue_id: "{{ issue_id }}" + - service: script.send_to_logbook + data: + topic: "DNS" + message: "Promoted IoT Pi-hole DNS consistency recovered; Joanna-only warning state cleared." + - service: input_text.set_value + target: + entity_id: input_text.infra_pihole_iot_dns_health_band + data: + value: normal + - conditions: "{{ not degraded and previous_band not in ['normal', 'warning', 'unavailable'] }}" + sequence: + - service: input_text.set_value + target: + entity_id: input_text.infra_pihole_iot_dns_health_band + data: + value: normal + - alias: "Docker Host Disk Pressure Monitor" id: docker_host_disk_pressure_monitor description: "Track Docker host root disk pressure from normalized Glances sensors and dispatch Joanna on band changes." 
diff --git a/config/script/README.md b/config/script/README.md
index 361bc7d4..6a8c849e 100755
--- a/config/script/README.md
+++ b/config/script/README.md
@@ -60,6 +60,7 @@ Current automations that kick off automated resolutions (via `script.joanna_disp
 | `infra_backup_nightly_verification` | Infrastructure - Backup Nightly Verification | [../packages/infrastructure.yaml](../packages/infrastructure.yaml) |
 | `infra_monthly_log_hygiene_review` | Infrastructure - Monthly HA Log Hygiene Review | [../packages/infrastructure.yaml](../packages/infrastructure.yaml) |
 | `infra_nebula_sync_health_dispatch` | Infrastructure - Nebula Sync Health Dispatch | [../packages/infrastructure.yaml](../packages/infrastructure.yaml) |
+| `infra_pihole_iot_dns_drift_dispatch` | Infrastructure - Pi-hole IoT DNS Drift Dispatch | [../packages/infrastructure.yaml](../packages/infrastructure.yaml) |
 | `docker_state_sync_repairs_dynamic` | Docker State Sync - Repairs (Dynamic) | [../packages/docker_infrastructure.yaml](../packages/docker_infrastructure.yaml) |
 | `docker_group_reconcile_weekly_joanna_review` | Docker Group Reconcile - Weekly Joanna Review | [../packages/docker_infrastructure.yaml](../packages/docker_infrastructure.yaml) |
 | `docker_host_disk_pressure_monitor` | Docker Host Disk Pressure Monitor | [../packages/infrastructure.yaml](../packages/infrastructure.yaml) |