# FreeDMR 2.0 Architecture Decisions

This file records architectural decisions, requirements, assumptions and open
questions that emerged during design discussion. It is intended as source
material for a later formal FreeDMR 2.0 design document.

## Project Philosophy

FreeDMR is open-source, open, intentionally understandable and intentionally
simple enough to encourage community implementation, experimentation and
operation by radio amateurs.

HBLink proved that a DMR server could be written in an open, readable way
without DMR being gatekept by commercial vendors. FreeDMR takes the next step:
it proves that a DMR network can be built this way without central control.
Before HBLink and FreeDMR, DMR server software and server-level network
membership were typically closed, gatekept or dependent on personal/team
approval. FreeDMR exists in part to lower that barrier and give radio amateurs
choice and freedom to experiment with global-scale ROIP networking.

FreeDMR does not need to gatekeep all private experimentation. The project
controls public listing: the process by which servers are shared with Pi-Star
and other HBP hotspots as legitimate public access servers. A sysop can run a
private server under their own DMR ID and arrange gatewaying with an existing
sysop, who effectively vouches for that traffic. Public listing has additional
requirements such as connectivity quality, sysop contactability and basic
operational expectations.

The FreeDMR mesh design is influenced by the late Bob Bruninga's APRS ideas,
Spanning Tree Protocol and related distributed-network approaches. The project
also has a social purpose: bringing together communities and people connected
to earlier amateur-radio networking work. FreeDMR is therefore both a technical
system and a diplomacy project; design choices must respect operational
autonomy, interoperability and trust between independent sysops.

FreeDMR is successful because it works in the amateur-radio sense: it is best
effort, experimental, approachable and deployable on ordinary low-cost systems
such as cheap VPS instances and Raspberry Pi-class hardware. It is not intended
to be a safety-assured commercial system. FreeDMR 2.0 should improve quality,
clarity and scalability without losing the ham-spirit/hacker-philosophy traits
that made the network useful and welcoming.

Design implications:

- Prefer clear, inspectable protocols over opaque mechanisms.
- Keep the implementation understandable by competent sysops and contributors.
- Keep the barrier to compatible implementations low where possible.
- Preserve low-cost deployment and modest hardware requirements.
- Avoid architectural choices that make FreeDMR dependent on heavyweight
  infrastructure for ordinary single-server operation.
- Treat reliability as best-effort resilience appropriate to amateur radio, not
  as commercial safety assurance.
- Preserve server autonomy and local policy.
- Avoid unnecessary central control.
- Distinguish private operation, vouched/gatewayed traffic and public listing.
- Security should protect authenticity and network integrity without hiding
  amateur-radio traffic.

## Protected Model

The protected asset is the FreeDMR operating model, not the old HBLink-derived
object structure.

Preserve:

- packet model and protocol behaviour
- dial-a-TG semantics
- TG/DMR-ID centric routing
- loop control
- source quench
- mesh behaviour
- practical RF/network tolerance learned from live servers and real RF links
- "everything everywhere" principle, subject to documented exceptions

Replace or redesign where useful:

- configured `MASTER` stanza as primary runtime identity
- proxy-mediated client fan-out
- global mutable `BRIDGES` structure as authoritative state
- custom dashboard/reporting socket protocol
- packet-path coupling to dashboard/API/report consumers

## Layer Model

FreeDMR 2.0 should be described as layered:

- **Access layer**: client/server access protocols such as HBP today and
  possible future non-trunk client protocols. Owns login/auth/options/keepalive,
  client sessions, slot state and RF-facing TG presentation.
- **Subscription layer**: talkgroup conference membership. Owns direct TG
  subscriptions, dial-a-TG subscriptions, static/default/user-activated
  subscriptions, expiry and RF-visible TG to conference TG mapping.
- **Mesh layer**: inter-server FBP/OBP/trunk-style behaviour. Owns loop control,
  source quench, hop/version handling and inter-server conference traffic.
- **Reporting layer**: local dashboard, API observers, logs, global lastheard
  export and state snapshots. Reporting is observational and must not steer
  packet handling.

## Reactor and Runtime Migration

Do not replace Twisted as part of the first FreeDMR 2.0 architecture work.

Decision:

- Keep Twisted's single-threaded reactor as a safety boundary initially.
- Extract and test the protocol/routing/subscription core behind deterministic
  interfaces.
- Introduce explicit process/message boundaries only after the state model is
  clear.
- Consider asyncio or another event loop only once Twisted has become a thin
  transport shell around tested core logic.

Rationale:

- The current packet behaviour is subtle and validated through real RF/network
  deployment.
- Replacing the event loop while also replacing the state model would mix too
  many sources of behavioural change.
- Twisted's single-threaded reactor helps preserve current ordering assumptions
  while bridge/subscription and reporting boundaries are made explicit.
- The first migration target is architectural clarity and scalability, not event
  loop novelty.

## Identity Model

The configured master/listener is not the client identity.

FreeDMR 2.0 should move toward:

- listener identity: UDP socket/service instance
- client identity: DMR peer/client ID
- subscription identity: client ID + slot + RF-visible TG + conference TG
- mesh identity: server/peer/network ID

Server identity hierarchy:

- FreeDMR server IDs are 4-digit DMR IDs.
- Server sub-IDs are 5-digit IDs derived from the server ID space.
- Each sysop/server identity may therefore cover up to 10 server sub-IDs for
  backend components, larger deployments, failover or fault-tolerant layouts.
- Identity verification should cover the base server ID and its authorized
  sub-IDs rather than requiring unrelated credentials for each sub-ID.
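
The sub-ID relationship can be sketched as follows. This is a minimal
illustration, assuming a sub-ID is formed by appending one decimal digit to the
4-digit server ID; this document does not fix the derivation rule, and the
function names are hypothetical.

```python
def server_sub_ids(server_id: int) -> range:
    """Return the 10 candidate 5-digit sub-IDs for a 4-digit server ID.

    Assumption: a sub-ID is the server ID with one decimal digit appended.
    """
    if not 1000 <= server_id <= 9999:
        raise ValueError("server ID must be a 4-digit DMR ID")
    return range(server_id * 10, server_id * 10 + 10)


def base_server_id(any_id: int) -> int:
    """Map a server ID or sub-ID back to its base 4-digit server ID."""
    if 1000 <= any_id <= 9999:
        return any_id
    if 10000 <= any_id <= 99999:
        return any_id // 10
    raise ValueError("not a server ID or server sub-ID")


def is_covered(verified_server_id: int, claimed_id: int) -> bool:
    """True if one verified base identity could cover the claimed ID."""
    try:
        return base_server_id(claimed_id) == verified_server_id
    except ValueError:
        return False
```

A real admission check would additionally restrict the result to the sub-IDs
that the signed identity actually authorizes.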

A single master/listener UDP port should serve an arbitrary number of clients
directly, replacing the proxy where possible.

## Talkgroup Subscription Model

Conceptually, each TG is a conference bridge. Clients subscribe to conference
TGs. FreeDMR does not primarily decide where to send user traffic; users choose
the traffic they want to hear by subscription.

Subscriptions can be:

- direct TG: RF-visible TG equals conference TG
- dial-a-TG: RF-visible TG is currently TG9, conference TG is the selected TG
- alias/rewrite: RF-visible TG may be any configured TG, conference TG is the
  FreeDMR network identity

Example:

```python
TalkgroupSubscription(
    client_id=2345001,
    slot=2,
    conference_tg=4400,
    rf_tg=9,
    mode="dial",
    active=True,
)
```

The invariant is:

```text
conference_tg = FreeDMR network/conference identity
rf_tg         = client-facing RF presentation identity
```

This makes arbitrary TG rewrites possible without making TG9 structurally
special.

## Bridge Table Replacement

The legacy `BRIDGES` dict should be replaced internally by subscription-oriented
state and indexes. The `"#"` reflector naming convention does not need to be
preserved internally; it can be a compatibility/export detail.

Recommended hot-path structures:

- `dict` / `set` for O(1)-style local lookups
- `typing.NamedTuple` keys for readable hash keys
- `dataclass(slots=True)` records for mutable subscription/session state
- `heapq` for expiry timers using lazy invalidation

Recommended indexes:

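```python
from __future__ import annotations  # SubscriptionKey is defined elsewhere

# Key and value types are illustrative: IDs as int, timestamps as float.
subscriptions_by_conference_tg: dict[int, set[SubscriptionKey]] = {}
subscription_by_rf: dict[tuple[int, int, int], SubscriptionKey] = {}  # (client_id, slot, rf_tg)
subscriptions_by_client_slot: dict[tuple[int, int], set[SubscriptionKey]] = {}  # (client_id, slot)
expiry_heap: list[tuple[float, int, SubscriptionKey]] = []  # (expires_at, generation, key)
```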

Packet handlers should not scan all subscriptions/bridges to find routing
targets.
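
A minimal sketch of how these structures could fit together, including `heapq`
expiry with lazy invalidation, is shown below. The class and method names are
hypothetical, and the per-client/slot index is omitted for brevity.

```python
import heapq
from collections import defaultdict
from typing import NamedTuple


class SubscriptionKey(NamedTuple):
    client_id: int
    slot: int
    rf_tg: int
    conference_tg: int


class SubscriptionIndex:
    def __init__(self) -> None:
        self.by_conference_tg: dict[int, set[SubscriptionKey]] = defaultdict(set)
        self.by_rf: dict[tuple[int, int, int], SubscriptionKey] = {}
        self.generation: dict[SubscriptionKey, int] = {}
        self.expiry_heap: list[tuple[float, int, SubscriptionKey]] = []

    def add(self, key: SubscriptionKey, expires_at: float) -> None:
        """Insert or refresh a subscription; a refresh bumps its generation."""
        gen = self.generation.get(key, 0) + 1
        self.generation[key] = gen
        self.by_conference_tg[key.conference_tg].add(key)
        self.by_rf[(key.client_id, key.slot, key.rf_tg)] = key
        heapq.heappush(self.expiry_heap, (expires_at, gen, key))

    def remove(self, key: SubscriptionKey) -> None:
        self.generation.pop(key, None)
        self.by_conference_tg[key.conference_tg].discard(key)
        self.by_rf.pop((key.client_id, key.slot, key.rf_tg), None)

    def expire(self, now: float) -> list[SubscriptionKey]:
        """Pop due heap entries, skipping ones made stale by refresh/removal."""
        expired = []
        while self.expiry_heap and self.expiry_heap[0][0] <= now:
            _, gen, key = heapq.heappop(self.expiry_heap)
            if self.generation.get(key) == gen:  # still the live entry
                self.remove(key)
                expired.append(key)
        return expired

    def targets(self, conference_tg: int) -> set[SubscriptionKey]:
        """Hot-path lookup: one dict access, no scan."""
        return self.by_conference_tg.get(conference_tg, set())
```

Refreshing a subscription never searches the heap: the old heap entry stays in
place and is discarded when popped, because its generation no longer matches.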

## Packet Plane vs Control Plane

The packet plane is delay-sensitive.

Packet-plane rules:

- local in-memory hot state only
- no external database round trips
- no blocking API/dashboard/report calls
- no cross-process lock waits
- no dependency on reporting consumers being connected

External stores may be used for:

- config distribution
- API/dashboard state
- control-plane coordination
- snapshots
- global lastheard export
- optional clustering/multi-process coordination

General performance principle:

- Consider offloading expensive processing to separate processes: in standard
  CPython builds the GIL prevents CPU-bound Python code from running in parallel
  across threads.
- Offload is appropriate for reporting fanout, global export, dashboard
  aggregation, historical database writes, heavy analytics, expensive
  transcoding/codec experiments and non-critical maintenance jobs.
- Offload boundaries must be asynchronous from the packet path. If an offload
  worker is slow or unavailable, packet handling must continue with local state.
- Do not offload hot-path routing decisions if doing so would add inter-process,
  network or lock waits to every packet.

## DMR Data Packet Policy

FreeDMR must maintain DMR data packet forwarding support.

Decision:

- FreeDMR should forward supported DMR data packets according to the same
  conference/subscription and mesh principles as other traffic.
- There must be no regression in existing data packet forwarding support.
- FreeDMR core should not become an application-level DMR data processor.
- GPS, SMS and similar application processing should be implemented by systems
  connected via FBP or another mesh/access-adjacent interface.
- `DATA_GATEWAY` is understood as an earlier expression of this model: an FBP
  link that carries data-oriented traffic rather than ordinary voice traffic.
- Existing `SUB_MAP` behaviour is intentional: data addressed to a DMR ID can be
  routed toward the last known HBP/client location for that DMR ID.
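
The last-known-location idea behind `SUB_MAP` can be illustrated with a small
sketch. This is not the existing implementation: the class, its fields and the
staleness limit are assumptions made for illustration.

```python
from typing import NamedTuple, Optional


class Location(NamedTuple):
    client_id: int    # HBP client the subscriber was last heard through
    slot: int
    last_seen: float


class SubscriberMap:
    def __init__(self, max_age: float = 3600.0) -> None:
        self.max_age = max_age  # assumed staleness limit in seconds
        self._last: dict[int, Location] = {}

    def heard(self, dmr_id: int, client_id: int, slot: int, now: float) -> None:
        """Record where traffic from this DMR ID was most recently seen."""
        self._last[dmr_id] = Location(client_id, slot, now)

    def locate(self, dmr_id: int, now: float) -> Optional[Location]:
        """Return the last known location, or None if unknown or stale."""
        loc = self._last.get(dmr_id)
        if loc is None or now - loc.last_seen > self.max_age:
            return None
        return loc
```

Data addressed to a DMR ID would be forwarded toward `locate(dmr_id, now)` when
it returns a location.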

Core FreeDMR may inspect/classify data packets only as needed for:

- packet admission and protocol validation
- routing/subscription decisions
- loop control and source quench
- reporting/logging
- preserving packet bytes and metadata across FBP/HBP boundaries
- maintaining the subscriber location map needed for data-client routing

Possible narrow exceptions:

- dial-a-TG control via DMR SMS
- DMR SMS alerts from a server to a sysop

Any such exceptions must be explicit control-plane features and must not turn
FreeDMR core into a general GPS/SMS application processor.

## Mesh Peer Authentication

FreeDMR should only accept mesh/FBP traffic from servers that can be validated
as legitimate members of the network.

Core principle:

- FreeDMR may sign/authenticate traffic and control messages, but should not
  encrypt amateur-radio traffic or mesh traffic by default.
- Amateur radio is public in most jurisdictions and encryption is often not
  permitted. FreeDMR users may also carry IP backhaul over amateur radio links.
- FreeDMR's security model is authenticity, integrity, membership validation and
  local policy enforcement, not secrecy.
- This follows the existing FreeDMR principle, agreed historically by project
  maintainers, that the network has nothing to hide and should remain cleartext.

Identity/listing distinction:

- Signed mesh identity should prove a server/sysop identity or a vouching
  relationship. It should not automatically imply public listing.
- Public listing is a directory/discovery decision for clients and HBP hotspots.
- A public access server may need stronger operational requirements than a
  private or gatewayed server.
- Local sysops may still choose whether to carry/vouch for traffic from private
  servers, even when those servers are not publicly listed.
- If an individual 7-digit DMR ID is used as a server identity, traffic may pass
  when a directly connected/listed sysop chooses to allow and gateway it.
- The vouching sysop is accountable to their peers for traffic they forward. If
  that traffic harms the network, peers may choose to stop peering with the
  vouching server. This preserves a self-policing social mechanism without
  requiring central control for all private experimentation.

Analogue network bridges:

- Analogue ROIP/network bridges commonly connect as if they are DMR clients via
  HBP.
- FreeDMR permits this and is generally more permissive than many other DMR
  networks.
- FreeDMR works with/supports the DVSwitch community on this. DVSwitch provides
  a common mechanism by which analogue networks can be bridged into DMR-style
  access.
- These bridges are operationally sensitive: technical limitations can make
  them effectively listen-only, consuming CPU and bandwidth while adding little
  value if they do not contribute actual two-way user activity.
- Analogue bridges are often implemented using audio mixing/conference style
  behaviour. This is a poor fit for DMR and similar digital modes, which enforce
  one audio source at a time and rely on stream, hang-time and contention
  behaviour rather than mixed audio.
- This mismatch comes partly from analogue repeater heritage: analogue systems
  may maintain a continuous transmit carrier and mix notification sounds such as
  pips, CWID and courtesy tones into the output audio. Analogue systems also
  often have little or no strong source identity, whereas DMR traffic carries a
  DMR ID.
- A common failure mode is that a feed from an analogue repeater keeps the DMR
  stream open between analogue overs, plays courtesy/notification tones and then
  carries the next analogue user in the same held stream. This can hold the TG
  open and prevent a digital station from breaking in until the analogue
  repeater times out and its carrier drops.
- Analogue bridges should therefore be subject to local sysop policy, public
  listing expectations and peer accountability. Permitted does not mean
  automatically valuable or immune from peering/listing consequences.

Other digital network bridges:

- Digital voice networks such as YSF and NXDN are generally a better technical
  match for DMR than analogue networks because they also use AMBE-family vocoder
  audio.
- AMBE-to-AMBE interworking can be lossless at the codec level and avoids
  transcoding artifacts.
- Transcoding from analogue or unlike codecs can degrade audio quality
  significantly and should be treated carefully.

Desired direction:

- Add PKI-backed mesh peer admission to the Bridge Control (`BCXX`) mechanism.
- A peer server presents public identity material signed by a FreeDMR network
  master key or trusted network CA.
- The authenticated identity must bind at least:
  - server ID
  - authorized server sub-IDs
  - public key
  - validity period
  - permitted protocol/features where useful
- Runtime admission should bind the authenticated server identity to the
  observed transport endpoint, including IP address.
- If the observed IP address changes, the FBP peer must perform a new key
  exchange/authentication step before its traffic is forwarded.
- Network membership should be represented by a signed sysop/server key that is
  issued when the sysop/server joins the network and revoked when they leave or
  are compromised. Runtime endpoint/session bindings are renewed separately and
  do not require re-signing the long-lived membership key.
- One successful verification of the signed identity should authorize the
  covered server ID and declared/authorized sub-IDs for that sysop, subject to
  local policy and endpoint/session binding.

Packet-plane rule:

- Expensive signature/certificate validation happens during control-plane
  admission or re-admission, not for every DMR packet.
- Per-packet mesh traffic should use a cached authenticated peer/session state
  check keyed by server ID and endpoint.

Initial conceptual flow:

```text
FBP peer connects/sends keepalive
  -> BC auth exchange presents signed server identity/public key
  -> FreeDMR validates signature against trusted network key
  -> FreeDMR binds server_id + endpoint + protocol features to peer session
  -> DMR traffic is accepted only while that authenticated binding is valid
```

Security requirements:

- Reject unauthenticated FBP traffic by default once this mode is enabled.
- Reject traffic where server ID, key identity and source endpoint do not match
  the authenticated binding.
- Expire authenticated bindings and require renewal.
- Support soft renewal: when an authenticated binding reaches its renewal
  timestamp, schedule asynchronous re-authentication while allowing a bounded
  grace period so in-flight voice is not interrupted purely by renewal timing.
- Hard-stop forwarding only for explicit authentication failure, revoked
  identity/key, endpoint mismatch outside policy, expired grace period, or
  policy requiring immediate re-authentication.
- Log authentication failure reasons clearly without leaking private material.
- Provide a controlled transition mode for existing networks while PKI is rolled
  out.

Open questions:

- Whether to use X.509 certificates, raw Ed25519 public keys with signed
  metadata, or another compact identity format.
- How network master keys/CAs are generated, rotated and revoked.
- Whether peer authorization policy should live in config, MQTT/control-plane
  state, or a signed network membership list.
- How to handle legitimate dynamic-IP servers without weakening endpoint
  binding.
- What renewal and grace-period defaults best preserve voice continuity without
  weakening mesh admission.

### Distributed Key Gossip Option

FreeDMR may also use a peer-to-peer signed-key dissemination mechanism over the
Bridge Control (`BCXX`) out-of-band channel.

Concept:

- Each server periodically advertises the signed server public keys/membership
  documents it knows to its direct FBP peers.
- Peers validate the signatures and build a local table of legitimate server
  identities as knowledge propagates through the mesh.
- Each server uses its local signed-key table and local policy to decide whether
  to route or reject packets that originated from a given source server, even
  when that source server is not directly connected.

Rationale:

- FreeDMR is a peer network, not hub-and-spoke or master/slave.
- Servers are autonomous and independently operated.
- Direct FBP peers should not be blindly trusted to make correct routing
  decisions on behalf of the local server.
- Open-source, human-readable code deliberately lowers the barrier to
  modification, so each server must be able to protect itself from incorrect or
  malicious upstream forwarding decisions.

Security requirements for key gossip:

- Only signed membership documents are accepted; peers cannot create trust by
  merely repeating a key.
- Membership documents need issuer, subject server ID, public key fingerprint,
  authorized sub-IDs, validity period, serial/version and signature.
- Revocation data must propagate by the same or a stronger mechanism.
- Each server must enforce local policy after validation. A valid signed key
  proves membership, not mandatory carriage.
- Key gossip must be rate-limited and bounded so it cannot become a BCXX flood
  or memory-growth vector.
- Received membership data must be replay-resistant enough to handle expiry,
  superseded serials and revoked keys.
- The packet path must use cached key/policy state; signature validation and
  gossip processing are control-plane work.
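
A bounded, replay-resistant membership table might look like the sketch below.
Signature checking is injected as a callable because the signature format is
still an open question; the class names, the reduced field set and the default
size limit are assumptions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class MembershipDoc:
    server_id: int
    key_fingerprint: str
    serial: int
    not_before: float
    not_after: float
    signature: bytes


class MembershipTable:
    def __init__(self, verify: Callable[[MembershipDoc], bool],
                 max_entries: int = 10000) -> None:
        self._verify = verify          # control-plane signature check
        self._max = max_entries        # bound on memory growth
        self._docs: dict[int, MembershipDoc] = {}
        self._revoked: set[tuple[int, int]] = set()  # (server_id, serial)

    def offer(self, doc: MembershipDoc, now: float) -> bool:
        """Accept a gossiped document only if it is new, valid and signed."""
        current = self._docs.get(doc.server_id)
        if (doc.server_id, doc.serial) in self._revoked:
            return False
        if not doc.not_before <= now < doc.not_after:
            return False
        if current is not None and doc.serial <= current.serial:
            return False               # replayed or superseded serial
        if current is None and len(self._docs) >= self._max:
            return False
        if not self._verify(doc):      # most expensive check goes last
            return False
        self._docs[doc.server_id] = doc
        return True

    def revoke(self, server_id: int, serial: int) -> None:
        self._revoked.add((server_id, serial))
        current = self._docs.get(server_id)
        if current is not None and current.serial == serial:
            del self._docs[server_id]

    def is_member(self, server_id: int, now: float) -> bool:
        doc = self._docs.get(server_id)
        return doc is not None and doc.not_before <= now < doc.not_after
```

Revocation input would itself need to be signed and validated before `revoke`
is called; that step is outside this sketch. Local policy is applied after
`is_member`, since membership does not imply mandatory carriage.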

This complements direct-peer endpoint authentication. Direct-peer auth proves
the connected FBP peer is legitimate for this session; distributed signed-key
knowledge lets the local server make autonomous decisions about traffic whose
source server is elsewhere in the mesh.

## Reporting Protocol Decision

FreeDMR 2.0 should define a structured reporting event protocol and use MQTT as
the preferred external live reporting transport.

Rationale:

- MQTT is already familiar in DMR network dashboard/reporting contexts.
- BrandMeister uses MQTT, providing a useful precedent for dashboard consumers.
- MQTT topics map naturally to server/client/subscription/call state.
- Retained messages are useful for current state snapshots.
- Last Will and Testament can represent server/reporting disconnects.
- MQTT-over-WebSocket allows browser dashboards to subscribe directly when the
  broker supports it.

Constraints:

- MQTT publishing must be asynchronous from the packet worker.
- Packet routing must continue if the MQTT broker/dashboard is down.
- Event generation must be state-change/summary oriented, not per DMR frame.
- The event schema is the compatibility contract; internal Python objects are
  not.
- Local live dashboard and central global lastheard remain separate paths.
- Voice stability takes precedence over reporting completeness. If the system
  must choose between losing reporting events and delaying packet handling, it
  must drop or coalesce reporting events.

Implementation requirement:

```text
packet path -> non-blocking local event queue -> MQTT publisher worker
```

The packet path must not call an MQTT broker synchronously. The local event
queue should be bounded. On overflow, the publisher layer should drop or
coalesce low-priority events and emit a later reporting-health event rather than
blocking packet handling.
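
A minimal sketch of such a queue follows. It assumes the packet path and the
drain call run on the same single-threaded reactor, so no locking is shown; the
class and method names are hypothetical.

```python
from collections import OrderedDict, deque


class ReportingQueue:
    def __init__(self, max_events: int = 1000) -> None:
        self.max_events = max_events
        self._state: OrderedDict[str, dict] = OrderedDict()  # latest per topic
        self._events: deque[tuple[str, dict]] = deque()
        self.dropped = 0  # lets the publisher emit a reporting-health event

    def put_state(self, topic: str, payload: dict) -> None:
        """Coalesce: keep only the newest state for each topic."""
        self._state[topic] = payload
        self._state.move_to_end(topic)

    def put_event(self, topic: str, payload: dict) -> bool:
        """Never blocks; returns False if the event was dropped on overflow."""
        if len(self._events) >= self.max_events:
            self.dropped += 1
            return False
        self._events.append((topic, payload))
        return True

    def drain(self) -> list[tuple[str, dict]]:
        """Called by the publisher worker, never by the packet path."""
        out = list(self._state.items()) + list(self._events)
        self._state.clear()
        self._events.clear()
        return out
```

State updates are bounded by the number of distinct topics, while transient
events are bounded by `max_events`.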

Suggested event priority:

- retain/coalesce latest state: server/client/slot/subscription state
- keep best effort: call start/end summaries
- drop first under pressure: high-volume debug/warning/statistical updates

MQTT publishing should support reconnect with exponential backoff and should
refresh retained state after reconnect so a dashboard can recover even if
transient events were missed.
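
The reconnect schedule can be sketched as a generator of delays. The base, cap
and jitter parameters are illustrative defaults, not project decisions.

```python
import random
from itertools import islice


def backoff_delays(base: float = 1.0, cap: float = 60.0,
                   jitter: float = 0.0, rng=random.random):
    """Yield reconnect delays: base, 2*base, 4*base, ... capped at `cap`.

    `jitter` adds up to that fraction of random extra delay so that many
    servers do not reconnect to a broker in lockstep.
    """
    delay = base
    while True:
        yield min(cap, delay) * (1.0 + jitter * rng())
        delay = min(cap, delay * 2)


# First six delays with no jitter and an 8-second cap.
first = list(islice(backoff_delays(base=1.0, cap=8.0), 6))
```

After a successful reconnect the publisher would restart the generator and
republish its retained state topics.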

Suggested MQTT namespace:

```text
freedmr/v2/{server_id}/state
freedmr/v2/{server_id}/client/{client_id}/state
freedmr/v2/{server_id}/client/{client_id}/slot/{slot}/activity
freedmr/v2/{server_id}/subscription/{subscription_id}/state
freedmr/v2/{server_id}/call/{stream_id}/start
freedmr/v2/{server_id}/call/{stream_id}/end
freedmr/v2/{server_id}/mesh/{peer_id}/state
freedmr/v2/{server_id}/event
```

Use retained messages for current state:

```text
server state
client state
slot activity
subscription state
mesh peer state
```

Use non-retained messages for transient events:

```text
call start/end
loop-control event
source-quench event
packet-rate/loss summary
warnings
```
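
A small helper can keep topic names and the retained flag in one place. The
`kind` labels and the function name below are hypothetical; the topic strings
follow the suggested namespace, and transient report topics under
`freedmr/v2/{server_id}/event` are treated as non-retained.

```python
TOPIC_TEMPLATES = {
    "server_state": "freedmr/v2/{server_id}/state",
    "client_state": "freedmr/v2/{server_id}/client/{client_id}/state",
    "slot_activity": "freedmr/v2/{server_id}/client/{client_id}/slot/{slot}/activity",
    "subscription_state": "freedmr/v2/{server_id}/subscription/{subscription_id}/state",
    "mesh_state": "freedmr/v2/{server_id}/mesh/{peer_id}/state",
    "call_start": "freedmr/v2/{server_id}/call/{stream_id}/start",
    "call_end": "freedmr/v2/{server_id}/call/{stream_id}/end",
    "event": "freedmr/v2/{server_id}/event",
}

# Current-state topics are retained; transient events are not.
RETAINED_KINDS = {
    "server_state", "client_state", "slot_activity",
    "subscription_state", "mesh_state",
}


def topic_for(kind: str, server_id: int, **ids: int) -> tuple[str, bool]:
    """Return (topic, retain) for a reporting message of the given kind."""
    topic = TOPIC_TEMPLATES[kind].format(server_id=server_id, **ids)
    return topic, kind in RETAINED_KINDS
```

For example, `topic_for("client_state", 2340, client_id=2345001)` yields a
retained topic, while `topic_for("call_start", 2340, stream_id=16909060)` does
not.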

Example event:

```json
{
  "version": 2,
  "event_id": 1849281,
  "type": "call.started",
  "timestamp": 1710000000.123,
  "server_id": 2340,
  "client_id": 2345001,
  "slot": 2,
  "conference_tg": 4400,
  "rf_tg": 9,
  "source_id": 2351234,
  "stream_id": 16909060,
  "access": "hbp"
}
```

Dashboard delivery options:

- preferred: dashboard subscribes to MQTT over WebSocket
- alternative: local reporting sidecar translates MQTT to SSE/HTTP
- control actions should use authenticated HTTP APIs unless a future UI needs
  bidirectional streaming

## Local Dashboard vs Global Lastheard

Each FreeDMR server has its own local live dashboard. The global lastheard
service is centrally hosted and non-real-time.

Local dashboard:

- consumes local MQTT live state/events
- displays current client/repeater traffic
- must tolerate reconnects and missed transient events by reloading retained
  state topics

Global lastheard:

- consumes call summaries or batched exports
- should not depend on packet-plane or dashboard delivery
- should tolerate central outage via spool/retry

Possible MQTT global feed:

- Each server publishes local live dashboard topics to a local broker or local
  reporting service.
- Prefer a separate exporter process for the curated global feed. The exporter
  subscribes to the same local real-time MQTT feed as the dashboard, filters and
  summarizes what is needed, then publishes to the network MQTT broker or writes
  to the global collector.
- The exporter publishes only summary topics needed for the 30-day database,
  such as call end summaries, client/server presence, selected mesh health and
  selected subscription changes.
- Raw packet events and high-volume live slot updates should not be exported to
  the global broker by default.
- Central broker, global dashboard or exporter failure must not back up into
  local packet processing or local dashboard state.

Preferred flow:

```text
FreeDMR core -> local MQTT feed -> local dashboard
                              -> global-exporter process -> network MQTT/collector
```

Core publishing invariant:

- FreeDMR core emits each reporting event once to its configured local MQTT
  broker/publisher queue.
- Fanout to dashboards, exporters, automation and global collectors is handled
  by the MQTT broker and separate subscriber processes.
- Adding more reporting consumers must not increase FreeDMR packet-process work
  beyond the single local event emission.

Suggested global MQTT topics:

```text
freedmr/v2/global/{server_id}/call/end
freedmr/v2/global/{server_id}/client/state
freedmr/v2/global/{server_id}/server/state
freedmr/v2/global/{server_id}/mesh/state
```
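
The exporter's filtering step could be sketched as a mapping from local event
types to global topics, with everything else left local. Which event types feed
which global topic is an assumption here, as is the function name.

```python
from typing import Optional

# Local event type -> global topic suffix. Unlisted types are not exported.
_GLOBAL_SUFFIX = {
    "call.ended": "call/end",
    "client.connected": "client/state",
    "client.disconnected": "client/state",
    "server.started": "server/state",
    "server.stopping": "server/state",
    "mesh.peer_up": "mesh/state",
    "mesh.peer_down": "mesh/state",
}


def global_topic(event: dict) -> Optional[str]:
    """Return the global topic for an event, or None to keep it local."""
    suffix = _GLOBAL_SUFFIX.get(event.get("type", ""))
    if suffix is None:
        return None
    return f"freedmr/v2/global/{event['server_id']}/{suffix}"
```

High-volume events such as `call.started` fall through to `None` and never
reach the network broker.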

## Reporting Event Types

Initial event families:

```text
server.started
server.stopping
client.connected
client.disconnected
client.options_changed
subscription.activated
subscription.deactivated
subscription.expired
call.started
call.ended
call.lost
mesh.peer_up
mesh.peer_down
mesh.source_quench
loop.detected
packet.rate_limited
```

## Open Questions

- Which MQTT broker should be packaged by default: Mosquitto, EMQX, NATS MQTT
  compatibility, or another option?
- Should MQTT be mandatory for FreeDMR 2.0 dashboards, or optional with an
  embedded/local fallback?
- What authentication/authorization model should protect MQTT topics and
  dashboard control APIs?
- What retained-topic expiry policy should be used to prevent stale state?
- Should global lastheard consume MQTT directly or use a separate HTTP/queue
  exporter fed from reporting events?
- Should FreeDMR expose a legacy `BRIDGES` compatibility view during migration?