Add infrastructure DNS drift and repair cleanup

master
Carlo Costanzo 7 days ago
parent 68e14fc1b5
commit 9ae1ee0968

@ -47,10 +47,10 @@ Live collection of plug-and-play Home Assistant packages. Each YAML file in this
| [logbook_activity_feed.yaml](logbook_activity_feed.yaml) | Dummy `sensor.activity_feed` + helper to write clean Activity entries (Issue #1550). | `sensor.activity_feed`, `script.send_to_logbook` |
| [mariadb_monitoring.yaml](mariadb_monitoring.yaml) | MariaDB health sensors and Lovelace dashboard snippet for recorder stats. | `sensor.mariadb_status`, `sensor.database_size` |
| [llmvision.yaml](llmvision.yaml) | Vision-backed garage-can and front-door package checks with rate-limited, downscaled OpenAI calls for package detection. | `input_button.llmvision_*`, `binary_sensor.front_door_packages_present`, `llmvision.stream_analyzer` |
| [docker_infrastructure.yaml](docker_infrastructure.yaml) | Docker host patching telemetry, container/stack Repairs automation, 20-minute Joanna escalation for persistent container outages using stable configured monitor membership, and weekly scheduled prune actions across docker_10/14/17/69; the dedicated codex_appliance VM is monitored through BearClaw status telemetry. | `sensor.docker_*_apt_status`, `binary_sensor.*_stack_status`, `sensor.docker_stacks_down_count`, `repairs.create`, `script.joanna_dispatch` |
| [docker_infrastructure.yaml](docker_infrastructure.yaml) | Docker host patching telemetry, container/stack Repairs automation, retired Portainer repair cleanup, 20-minute Joanna escalation for persistent container outages using stable configured monitor membership, and weekly scheduled prune actions across docker_10/14/17/69; the dedicated codex_appliance VM is monitored through BearClaw status telemetry. | `sensor.docker_*_apt_status`, `binary_sensor.*_stack_status`, `sensor.docker_stacks_down_count`, `repairs.create`, `repairs.remove`, `script.joanna_dispatch` |
| [proxmox.yaml](proxmox.yaml) | Proxmox runtime and disk pressure monitoring with Repairs + Joanna dispatch for sustained node degradations, plus nightly Frigate reboot. | `binary_sensor.proxmox*_runtime_healthy`, `sensor.proxmox*_disk_used_percentage`, `repairs.create`, `script.joanna_dispatch`, `button.qemu_docker2_101_reboot` |
| [synology_dsm.yaml](synology_dsm.yaml) | Synology DSM integration health normalization for Carlo-NAS01 and Carlo-NVR, with outage-aware Joanna-first handling for lone post-outage volume warnings and Repairs escalation for persistent or non-outage problems. | `binary_sensor.carlo_*_synology_problem`, `sensor.carlo_*_synology_problem_summary`, `binary_sensor.powerwall_grid_status`, `repairs.create`, `script.joanna_dispatch` |
| [infrastructure.yaml](infrastructure.yaml) | Normalized WAN/DNS/backup/domain/cert health, Nebula Sync primary/backup Pi-hole consistency monitoring with Joanna dispatch, Glances-backed Docker host disk pressure with Joanna-only warning cleanup and critical Repairs, and website uptime/latency SLO signals for Infrastructure dashboards, plus nightly backup verification and monthly Joanna HA log hygiene review with GitHub issue follow-up. | `sensor.infra_nebula_sync_dns_consistency`, `binary_sensor.infra_nebula_sync_degraded`, `sensor.docker_*_disk_used_percentage`, `automation.infra_nebula_sync_health_dispatch`, `automation.docker_host_disk_pressure_monitor`, `binary_sensor.infra_website_uptime_slo_breach`, `binary_sensor.infra_website_latency_degraded`, `automation.infra_backup_nightly_verification`, `script.joanna_dispatch` |
| [infrastructure.yaml](infrastructure.yaml) | Normalized WAN/DNS/backup/domain/cert health, Nebula Sync and promoted IoT primary/backup Pi-hole consistency monitoring with Joanna dispatch, Glances-backed Docker host disk pressure with Joanna-only warning cleanup and critical Repairs, and website uptime/latency SLO signals for Infrastructure dashboards, plus nightly backup verification and monthly Joanna HA log hygiene review with GitHub issue follow-up. | `sensor.infra_nebula_sync_dns_consistency`, `sensor.infra_pihole_iot_dns_consistency`, `binary_sensor.infra_nebula_sync_degraded`, `binary_sensor.infra_pihole_iot_dns_degraded`, `sensor.docker_*_disk_used_percentage`, `automation.infra_nebula_sync_health_dispatch`, `automation.infra_pihole_iot_dns_drift_dispatch`, `automation.docker_host_disk_pressure_monitor`, `binary_sensor.infra_website_uptime_slo_breach`, `binary_sensor.infra_website_latency_degraded`, `automation.infra_backup_nightly_verification`, `script.joanna_dispatch` |
| [onenote_indexer.yaml](onenote_indexer.yaml) | Dedicated-appliance OneNote indexer health/status monitoring for Joanna, explicit index-health confirmation, failure-repair automation, and a daily duplicate-delete maintenance request. | `sensor.onenote_indexer_last_job_status`, `binary_sensor.onenote_indexer_last_job_successful`, `binary_sensor.onenote_indexer_index_healthy` |
| [mqtt_status.yaml](mqtt_status.yaml) | Command-line MQTT broker reachability probe with Spook Repairs escalation and Joanna troubleshooting dispatch on outage. | `binary_sensor.mqtt_status_raw`, `binary_sensor.mqtt_broker_problem`, `repairs.create`, `rest_command.bearclaw_command` |
| [mariadb.yaml](mariadb.yaml) | MariaDB recorder health and capacity snapshots with hourly live metrics, weekly admin/recorder polling, and stats-ready numeric sensors. | `sensor.mariadb_status`, `sensor.database_size` |

@ -20,6 +20,7 @@
# Notes: Treat telemetry reconnects from unavailable/unknown to a concrete stopped state as actionable outages.
# Notes: Infra Info was removed; BearClaw Admin is the planning snapshot surface.
# Notes: codex_appliance moved to a dedicated VM; monitor it through BearClaw status telemetry instead of old docker_17 container switches.
# Notes: Retired repair cleanup clears old codex_appliance and hashed dozzle Portainer repair IDs.
######################################################################
input_datetime:
@ -661,7 +662,13 @@ script:
{% endif %}
{{ resolver.state }}
telemetry_degraded: "{{ is_state('binary_sensor.docker_container_telemetry_degraded', 'on') }}"
container_name: "{{ state_attr(effective_entity, 'friendly_name') | default(container_key, true) }}"
container_name: >-
{% set switch_name = state_attr(switch_entity, 'friendly_name') | default('', true) %}
{% set switch_alt_name = state_attr(switch_entity_alt, 'friendly_name') | default('', true) %}
{% set effective_name = state_attr(effective_entity, 'friendly_name') | default('', true) %}
{% set raw_name = switch_name if switch_name | trim != '' else (switch_alt_name if switch_alt_name | trim != '' else effective_name) %}
{% set normalized = raw_name | regex_replace('(?i)\\s+container$', '') | trim %}
{{ container_key if normalized | lower in ['', 'status', 'state'] else normalized }}
- condition: template
value_template: >-
{{ effective_state in down_states and
@ -709,7 +716,13 @@ script:
{% endfor %}
{% endif %}
{{ resolver.state }}
container_name: "{{ state_attr(effective_entity, 'friendly_name') | default(container_key, true) }}"
container_name: >-
{% set switch_name = state_attr(switch_entity, 'friendly_name') | default('', true) %}
{% set switch_alt_name = state_attr(switch_entity_alt, 'friendly_name') | default('', true) %}
{% set effective_name = state_attr(effective_entity, 'friendly_name') | default('', true) %}
{% set raw_name = switch_name if switch_name | trim != '' else (switch_alt_name if switch_alt_name | trim != '' else effective_name) %}
{% set normalized = raw_name | regex_replace('(?i)\\s+container$', '') | trim %}
{{ container_key if normalized | lower in ['', 'status', 'state'] else normalized }}
- condition: template
value_template: >-
{{ persistent_effective_state in down_states and
@ -769,7 +782,13 @@ script:
{% endfor %}
{% endif %}
{{ resolver.state }}
container_name: "{{ state_attr(effective_entity, 'friendly_name') | default(container_key, true) }}"
container_name: >-
{% set switch_name = state_attr(switch_entity, 'friendly_name') | default('', true) %}
{% set switch_alt_name = state_attr(switch_entity_alt, 'friendly_name') | default('', true) %}
{% set effective_name = state_attr(effective_entity, 'friendly_name') | default('', true) %}
{% set raw_name = switch_name if switch_name | trim != '' else (switch_alt_name if switch_alt_name | trim != '' else effective_name) %}
{% set normalized = raw_name | regex_replace('(?i)\\s+container$', '') | trim %}
{{ container_key if normalized | lower in ['', 'status', 'state'] else normalized }}
- condition: template
value_template: "{{ effective_state not in down_states }}"
- service: repairs.remove
@ -1063,7 +1082,7 @@ automation:
- alias: "Docker Repairs Reconcile"
id: docker_repairs_reconcile
description: "Reconcile stale container and stack Repairs issues on startup and every 55 minutes."
description: "Reconcile stale container, stack, and retired Portainer Repairs issues on startup and every 55 minutes."
mode: queued
trigger:
- platform: homeassistant
@ -1101,6 +1120,17 @@ automation:
entity_id: "{{ repeat.item }}"
operation: "clear"
log_result: false
- repeat:
for_each:
- docker_container_codex_appliance_offline
- user_docker_container_codex_appliance_offline
- docker_container_39394ec38d4c_dozzle_offline
- user_docker_container_39394ec38d4c_dozzle_offline
sequence:
- service: repairs.remove
continue_on_error: true
data:
issue_id: "{{ repeat.item }}"
- alias: "Docker Containers Maintenance Prompt"
id: docker_containers_maintenance_prompt
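
Reviewer note: the `container_name` template added in the three script hunks above picks the first non-blank friendly name (switch, then alternate switch, then effective entity), strips a trailing " container" suffix, and falls back to `container_key` when the result is empty or a generic word such as "status". A minimal Python sketch of that selection logic follows, for reading alongside the Jinja. The function name `resolve_container_name` and its keyword arguments are hypothetical and not part of the commit.

```python
import re

# Names that say nothing about which container is meant.
GENERIC_NAMES = {"", "status", "state"}


def resolve_container_name(container_key, switch_name=None,
                           switch_alt_name=None, effective_name=None):
    """Pick the first non-blank friendly name, strip a trailing
    " container" suffix, and fall back to the key for generic names."""
    raw_name = ""
    for candidate in (switch_name, switch_alt_name, effective_name):
        # Mirrors `| default('', true)`: None and empty both count as missing.
        candidate = candidate or ""
        if candidate.strip() != "":
            raw_name = candidate
            break
    # Mirrors regex_replace('(?i)\\s+container$', '') followed by trim.
    normalized = re.sub(r"(?i)\s+container$", "", raw_name).strip()
    return container_key if normalized.lower() in GENERIC_NAMES else normalized


if __name__ == "__main__":
    print(resolve_container_name("dozzle", switch_name="Dozzle Container"))
    print(resolve_container_name("dozzle", effective_name="Status"))
```

Because the same seven-line template is repeated in three places, a sketch like this may also be a convenient way to check edge cases before touching all three copies.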

@ -18,6 +18,7 @@
# Notes: Disk-pressure dispatch allows bounded safe cleanup of disposable caches and old generated backup artifacts, but not live data or restarts.
# Notes: Warning-level Docker host disk pressure is Joanna-only; Repairs are reserved for critical pressure.
# Notes: Nebula Sync DNS consistency compares primary/backup Pi-hole answers and dispatches Joanna on sustained drift or container loss.
# Notes: Promoted IoT DNS consistency compares primary/backup Pi-hole answers for reserved IoT host records.
######################################################################
input_text:
@ -33,6 +34,9 @@ input_text:
infra_nebula_sync_health_band:
name: "Nebula Sync health band"
max: 20
infra_pihole_iot_dns_health_band:
name: "Pi-hole IoT DNS health band"
max: 20
input_boolean:
infra_duplicati_backup_repair_active:
@ -86,6 +90,20 @@ command_line:
- primary_reverse
- secondary_reverse
- sensor:
name: Infra Pihole IoT DNS Consistency
unique_id: infra_pihole_iot_dns_consistency
command: >-
/bin/bash -c 'primary=192.168.10.10; secondary=192.168.10.14; records="rachio.fordst.com=192.168.10.73 econet.fordst.com=192.168.10.92 dreame-vacuum.fordst.com=192.168.10.93 carlo-bed.fordst.com=192.168.10.95 lg-smart-fridge.fordst.com=192.168.10.96 tesla-blackbox-gw.fordst.com=192.168.10.97 bgw210.fordst.com=192.168.10.98"; q(){ dig +time=2 +tries=1 +short @"$1" "$2" A 2>/dev/null | tr -d "\r" | sort | tr "\n" "," | sed "s/,$//"; }; status=ok; checked=0; mismatch_count=0; mismatches=""; for record in $records; do host=${record%%=*}; ip=${record#*=}; p=$(q "$primary" "$host"); s=$(q "$secondary" "$host"); checked=$((checked+1)); if [ "$p" != "$ip" ] || [ "$s" != "$ip" ]; then status=mismatch; mismatch_count=$((mismatch_count+1)); mismatches="${mismatches}${host}:expected=${ip},primary=${p:-none},secondary=${s:-none};"; fi; done; if [ -z "$mismatches" ]; then mismatches=none; fi; printf "{\"status\":\"%s\",\"checked_records\":%s,\"mismatch_count\":%s,\"mismatches\":\"%s\",\"primary_dns\":\"%s\",\"backup_dns\":\"%s\"}\n" "$status" "$checked" "$mismatch_count" "$mismatches" "$primary" "$secondary"'
scan_interval: 300
value_template: "{{ value_json.status | default('unknown') }}"
json_attributes:
- checked_records
- mismatch_count
- mismatches
- primary_dns
- backup_dns
template:
- sensor:
- name: "Infra External IP"
@ -270,6 +288,19 @@ template:
pihole_secondary_status: "{{ states('binary_sensor.pihole_secondary_status') }}"
pihole_secondary_status_2: "{{ states('binary_sensor.pihole_secondary_status_2') }}"
- name: "Infra Pihole IoT DNS Degraded"
unique_id: infra_pihole_iot_dns_degraded
device_class: problem
state: >-
{{ states('sensor.infra_pihole_iot_dns_consistency') | lower != 'ok' }}
attributes:
dns_consistency: "{{ states('sensor.infra_pihole_iot_dns_consistency') }}"
checked_records: "{{ state_attr('sensor.infra_pihole_iot_dns_consistency', 'checked_records') }}"
mismatch_count: "{{ state_attr('sensor.infra_pihole_iot_dns_consistency', 'mismatch_count') }}"
mismatches: "{{ state_attr('sensor.infra_pihole_iot_dns_consistency', 'mismatches') }}"
primary_dns: "{{ state_attr('sensor.infra_pihole_iot_dns_consistency', 'primary_dns') }}"
backup_dns: "{{ state_attr('sensor.infra_pihole_iot_dns_consistency', 'backup_dns') }}"
- name: "Infra UPS On Battery"
unique_id: infra_ups_on_battery
device_class: problem
@ -553,6 +584,97 @@ automation:
data:
value: normal
- alias: "Infrastructure - Pi-hole IoT DNS Drift Dispatch"
id: infra_pihole_iot_dns_drift_dispatch
description: "Dispatch Joanna when promoted IoT Pi-hole DNS records drift across primary and backup resolvers."
mode: queued
trigger:
- platform: state
entity_id: binary_sensor.infra_pihole_iot_dns_degraded
to: "on"
for: "00:10:00"
id: degraded
- platform: state
entity_id: binary_sensor.infra_pihole_iot_dns_degraded
to: "off"
for: "00:02:00"
id: recovered
- platform: event
event_type: homeassistant_started
id: reconcile
- platform: time_pattern
minutes: "7"
id: reconcile
variables:
issue_id: infra_pihole_iot_dns_degraded
dns_state: "{{ states('sensor.infra_pihole_iot_dns_consistency') }}"
previous_band: "{{ states('input_text.infra_pihole_iot_dns_health_band') | lower }}"
degraded: "{{ is_state('binary_sensor.infra_pihole_iot_dns_degraded', 'on') }}"
action:
- choose:
- conditions: "{{ degraded and previous_band != 'warning' }}"
sequence:
- service: repairs.remove
continue_on_error: true
data:
issue_id: "{{ issue_id }}"
- service: script.joanna_dispatch
data:
trigger_context: "HA automation infra_pihole_iot_dns_drift_dispatch (Infrastructure - Pi-hole IoT DNS Drift Dispatch)"
source: "home_assistant_automation.infra_pihole_iot_dns_drift_dispatch.warning"
summary: "Promoted IoT Pi-hole DNS records drifted across primary and backup resolvers"
entity_ids:
- sensor.infra_pihole_iot_dns_consistency
- binary_sensor.infra_pihole_iot_dns_degraded
diagnostics: >-
issue_id={{ issue_id }},
dns_consistency={{ dns_state }},
checked_records={{ state_attr('sensor.infra_pihole_iot_dns_consistency', 'checked_records') }},
mismatch_count={{ state_attr('sensor.infra_pihole_iot_dns_consistency', 'mismatch_count') }},
mismatches={{ state_attr('sensor.infra_pihole_iot_dns_consistency', 'mismatches') }},
primary_dns={{ state_attr('sensor.infra_pihole_iot_dns_consistency', 'primary_dns') }},
backup_dns={{ state_attr('sensor.infra_pihole_iot_dns_consistency', 'backup_dns') }}
request: >-
Investigate primary/backup Pi-hole DNS drift for promoted IoT reservations.
Verify both Pi-holes answer rachio, econet, dreame-vacuum, carlo-bed, lg-smart-fridge, tesla-blackbox-gw, and bgw210 FQDNs with the expected reserved IPs.
Check primary and backup pihole.toml local DNS host records, Nebula Sync behavior, and generated custom.list files.
Do not change DHCP/custom DNS records unless diagnostics prove drift and the action is safe.
Reply with resolved=true/false, root_cause, action_taken, verification, and next_action_required=true/false.
domain_hint: ops
lane_hint: joanna.ops
- service: script.send_to_logbook
data:
topic: "DNS"
message: >-
Promoted IoT Pi-hole DNS consistency is degraded ({{ dns_state }}); Joanna investigation requested without opening a Repair.
- service: input_text.set_value
target:
entity_id: input_text.infra_pihole_iot_dns_health_band
data:
value: warning
- conditions: "{{ not degraded and previous_band in ['warning', 'unavailable'] }}"
sequence:
- service: repairs.remove
continue_on_error: true
data:
issue_id: "{{ issue_id }}"
- service: script.send_to_logbook
data:
topic: "DNS"
message: "Promoted IoT Pi-hole DNS consistency recovered; Joanna-only warning state cleared."
- service: input_text.set_value
target:
entity_id: input_text.infra_pihole_iot_dns_health_band
data:
value: normal
- conditions: "{{ not degraded and previous_band not in ['normal', 'warning', 'unavailable'] }}"
sequence:
- service: input_text.set_value
target:
entity_id: input_text.infra_pihole_iot_dns_health_band
data:
value: normal
- alias: "Docker Host Disk Pressure Monitor"
id: docker_host_disk_pressure_monitor
description: "Track Docker host root disk pressure from normalized Glances sensors and dispatch Joanna on band changes."
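
Reviewer note: the `command_line` sensor added above packs the whole consistency check into one `bash -c` string, which is hard to read in a diff. The sketch below restates the comparison in Python under the assumption that a resolver lookup is injected as a callable. `check_records` and the `query` parameter are hypothetical names; the commit itself shells out to `dig`. The returned keys match the JSON fields the sensor prints.

```python
import json


def check_records(records, primary, secondary, query):
    """Compare each expected A record against both resolvers.

    `query(server, host)` is a hypothetical callable that returns the answer
    as a string ("" when there is no answer), standing in for q() in the
    shell command.
    """
    mismatches = []
    for host, expected_ip in records.items():
        p = query(primary, host)
        s = query(secondary, host)
        # A record is healthy only when both resolvers return the expected IP.
        if p != expected_ip or s != expected_ip:
            mismatches.append(
                f"{host}:expected={expected_ip},"
                f"primary={p or 'none'},secondary={s or 'none'};"
            )
    return {
        "status": "mismatch" if mismatches else "ok",
        "checked_records": len(records),
        "mismatch_count": len(mismatches),
        "mismatches": "".join(mismatches) or "none",
        "primary_dns": primary,
        "backup_dns": secondary,
    }


if __name__ == "__main__":
    # Two of the records from the commit, answered by a stub instead of dig.
    expected = {
        "rachio.fordst.com": "192.168.10.73",
        "econet.fordst.com": "192.168.10.92",
    }
    stub = lambda server, host: expected[host]
    print(json.dumps(check_records(expected, "192.168.10.10", "192.168.10.14", stub)))
```

Injecting the lookup keeps the comparison testable without network access; the real sensor still depends on `dig` being available inside the Home Assistant container.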

@ -60,6 +60,7 @@ Current automations that kick off automated resolutions (via `script.joanna_disp
| `infra_backup_nightly_verification` | Infrastructure - Backup Nightly Verification | [../packages/infrastructure.yaml](../packages/infrastructure.yaml) |
| `infra_monthly_log_hygiene_review` | Infrastructure - Monthly HA Log Hygiene Review | [../packages/infrastructure.yaml](../packages/infrastructure.yaml) |
| `infra_nebula_sync_health_dispatch` | Infrastructure - Nebula Sync Health Dispatch | [../packages/infrastructure.yaml](../packages/infrastructure.yaml) |
| `infra_pihole_iot_dns_drift_dispatch` | Infrastructure - Pi-hole IoT DNS Drift Dispatch | [../packages/infrastructure.yaml](../packages/infrastructure.yaml) |
| `docker_state_sync_repairs_dynamic` | Docker State Sync - Repairs (Dynamic) | [../packages/docker_infrastructure.yaml](../packages/docker_infrastructure.yaml) |
| `docker_group_reconcile_weekly_joanna_review` | Docker Group Reconcile - Weekly Joanna Review | [../packages/docker_infrastructure.yaml](../packages/docker_infrastructure.yaml) |
| `docker_host_disk_pressure_monitor` | Docker Host Disk Pressure Monitor | [../packages/infrastructure.yaml](../packages/infrastructure.yaml) |
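
Reviewer note: the new `infra_pihole_iot_dns_drift_dispatch` row relies on a small state machine kept in `input_text.infra_pihole_iot_dns_health_band`, so Joanna is dispatched once per degradation rather than on every reconcile tick. A rough Python sketch of the three `choose` branches, assuming the first matching branch wins as in Home Assistant; the function name and branch labels are hypothetical.

```python
def dispatch_step(degraded, previous_band):
    """Return (branch, new_band) for one automation run.

    Returns (None, previous_band) when no `choose` branch matches, which is
    the steady state for both "still degraded" and "still healthy".
    """
    if degraded and previous_band != "warning":
        # First run after drift: dispatch Joanna and remember it.
        return "dispatch_joanna", "warning"
    if not degraded and previous_band in ("warning", "unavailable"):
        # Recovery: log it and clear the warning band.
        return "log_recovery", "normal"
    if not degraded and previous_band not in ("normal", "warning", "unavailable"):
        # Helper holds an unexpected value (for example "unknown"): reset it.
        return "reset_band", "normal"
    return None, previous_band


if __name__ == "__main__":
    band = "unknown"
    for degraded in (False, True, True, False):
        branch, band = dispatch_step(degraded, band)
        print(degraded, branch, band)
```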
