Worker Process Scaling

Decision

FreeDMR 2 should be designed so that selected parts of the system can later be moved into separate worker processes, once the protocol/routing/subscription core has been extracted and tested. The goals are to improve capacity, isolate failures, and avoid the practical single-thread/GIL limits of one Python process.

This is not a first-stage rewrite requirement.

The first stage remains:

  • Keep Twisted initially.
  • Extract the deterministic core.
  • Make state ownership explicit.
  • Preserve packet behaviour through tests.

But the FreeDMR 2 state model must not block a later move to worker processes.

Why Worker Processes

A single CPython process (standard GIL build) executes Python bytecode on one thread at a time, so CPU-bound Python code in one process cannot make use of more than one core.

Twisted's single reactor is useful as an initial safety boundary, but one reactor process should not be assumed to be the final capacity architecture.

Worker processes provide stronger ownership and failure boundaries than shared-memory threads. For FreeDMR routing state, process/message boundaries are safer than mutable dictionaries shared between threads, even on a no-GIL (free-threaded) Python build.

Worker processes fit FreeDMR's need for explicit ownership of routing, stream, subscription, loop-control, and reporting state.

Worker-process scaling must not make ordinary small FreeDMR deployments heavyweight or hard to run. A single-server deployment on a cheap VPS or Raspberry Pi-class system must remain supported.

What Should Be Offloaded First

Low-risk early offload candidates:

  • MQTT/reporting publisher.
  • Global lastheard exporter.
  • Dashboard aggregation.
  • SQL/database writing.
  • Historical analytics.
  • Alias download/refresh.
  • Expensive codec experiments.
  • Packet capture/replay analysis.
  • Non-critical maintenance jobs.
  • Future transcoding/bridge adjuncts.
  • Future network-analysis or observability tools.

These workers must run asynchronously to the packet path.

If they are slow, blocked, crashed, overloaded, or absent, packet routing must continue.

Reporting completeness is secondary to voice stability.
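
One way to keep such workers off the packet path is a bounded hand-off that never blocks: the packet path offers an event and moves on, and a full queue means the event is dropped and counted. The following is a minimal sketch under that assumption, not FreeDMR code. The names `ReportingHandoff`, `offer`, `drain`, and the default queue size are hypothetical.

```python
import queue


class ReportingHandoff:
    """Bounded, non-blocking hand-off from the packet path to a reporting worker.

    Hypothetical sketch: the class name, method names, and default size are
    assumptions for illustration, not part of FreeDMR.
    """

    def __init__(self, maxsize: int = 1024) -> None:
        self._queue: queue.Queue = queue.Queue(maxsize=maxsize)
        self.dropped = 0  # events discarded because the consumer fell behind

    def offer(self, event: dict) -> bool:
        """Called from the packet path. Never blocks; returns False on drop."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, limit: int = 100) -> list:
        """Called by the reporting side. Takes up to `limit` queued events."""
        events = []
        while len(events) < limit:
            try:
                events.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return events
```

The sketch uses the in-process `queue.Queue`. Across a process boundary, `multiprocessing.Queue` offers the same `put_nowait` call and raises the same `queue.Full` exception, so the drop-and-count shape would carry over.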

What Should Not Be Offloaded Early

Do not initially offload hot-path routing decisions if doing so would add IPC, network, database, lock, queue, or back-pressure waits to every DMR packet.

Specifically keep local and deterministic until the model is proven:

  • Stream admission.
  • Duplicate suppression.
  • Loop control.
  • Source quench checks.
  • Dial-a-TG state mutation.
  • Subscription lookup.
  • Slot/TG rewrite decisions.
  • Voice/data classification.
  • Packet mutation/rewrite.
  • HBP RF-facing tolerance logic.
  • Protocol-version-sensitive FBP/OBP metadata handling.

The packet plane must continue to use local in-memory state and must not depend on external databases, MQTT, dashboards, APIs, or reporting consumers.

Possible Long-Term Worker Architecture

Transport/listener Process

Owns:

  • UDP sockets.
  • HBP receive/send.
  • FBP/OBP receive/send.
  • Raw packet admission.
  • Socket identity.
  • Keepalive.
  • Low-level protocol parsing.
  • Forwarding packet events to the owning routing component.

Routing Core Worker

Owns:

  • Subscription state.
  • Stream state.
  • Dial-a-TG state.
  • Loop-control state.
  • Source-quench state.
  • Duplicate suppression.
  • Packet routing decisions.
  • Explicit packet rewrite decisions.
  • Authoritative packet-plane state for its assigned clients/streams/TGs.

Reporting Worker

Owns:

  • MQTT publishing.
  • Retained state refresh.
  • Reporting event queue.
  • Dashboard event fanout.
  • Reporting health events.
  • Drop/coalesce policy under pressure.
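
A drop/coalesce policy could, for example, keep only the newest pending value per key and discard the oldest key once a fixed limit is reached. The sketch below illustrates that idea only. `CoalescingBuffer`, its methods, and the key limit are hypothetical, and the real policy may differ.

```python
from collections import OrderedDict


class CoalescingBuffer:
    """Keeps only the newest pending value per key, up to a fixed key count.

    Hypothetical sketch; names and limits are assumptions for illustration.
    """

    def __init__(self, max_keys: int = 256) -> None:
        self._pending: OrderedDict = OrderedDict()
        self._max_keys = max_keys
        self.dropped = 0  # keys discarded because the buffer was full

    def put(self, key: str, value: dict) -> None:
        if key in self._pending:
            # Coalesce: a newer state for the same key replaces the older one.
            self._pending[key] = value
            self._pending.move_to_end(key)
            return
        if len(self._pending) >= self._max_keys:
            # Under pressure, discard the oldest pending key rather than grow.
            self._pending.popitem(last=False)
            self.dropped += 1
        self._pending[key] = value

    def flush(self) -> list:
        """Return all pending (key, value) pairs, oldest first, and clear."""
        items = list(self._pending.items())
        self._pending.clear()
        return items
```

Because memory use is bounded by `max_keys`, a stalled MQTT connection cannot cause unbounded growth in the reporting worker.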

Global Exporter Worker

Owns:

  • Subscribing to local reporting feed.
  • Filtering and summarising local events.
  • Publishing curated summaries to global lastheard/network collector.
  • Retry/spool policy for central outage.

Control/API Worker or Control-Plane Adapter

Owns:

  • Sysop/API requests.
  • Validating control-plane credentials.
  • Converting API requests into explicit state-change commands.
  • Receiving ControlResult messages.
  • Never directly mutating packet-plane state it does not own.

Optional Future Codec/Transcode/Analysis Workers

Own:

  • Expensive or experimental codec work.
  • Transcoding adjuncts.
  • Packet replay analysis.
  • Offline diagnostics.
  • Future lab features.

These must remain outside the live packet hot path unless explicitly proven safe.

State Ownership Rules

  • Every mutable authoritative state object must have exactly one owner.
  • Other processes may hold snapshots or caches, but only the owner mutates authoritative state.
  • Do not use multiprocessing.Manager().dict() or shared mutable proxy objects as the main architecture.
  • Do not recreate a cross-process global BRIDGES-style mutable structure.
  • Use explicit messages/events instead of pretending cross-process state is a normal Python dict.
  • Packet bytes crossing process boundaries should be immutable.
  • Packet mutation must remain explicit, named, and testable.
  • A process boundary must not hide unclear ownership.
  • State ownership must be visible in tests and documentation.
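
The single-owner rule can be sketched as follows: one object mutates subscription state in response to explicit commands, and everything else receives read-only snapshots. This is an illustration under assumptions, not the FreeDMR 2 design. `SubscriptionOwner`, the fields of `ControlCommand` and `ControlResult`, and the action strings are all hypothetical.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping


@dataclass(frozen=True)
class ControlCommand:
    """Request to change state; sent to the owner, never applied by the sender."""
    action: str      # "activate" or "deactivate" (hypothetical values)
    client_id: int
    slot: int
    talkgroup: int


@dataclass(frozen=True)
class ControlResult:
    ok: bool
    detail: str


class SubscriptionOwner:
    """Sole mutator of subscription state for its assigned clients (sketch)."""

    def __init__(self) -> None:
        # Keyed by (client_id, slot); values are immutable sets of talkgroups.
        self._subs = {}

    def apply(self, cmd: ControlCommand) -> ControlResult:
        key = (cmd.client_id, cmd.slot)
        current = self._subs.get(key, frozenset())
        if cmd.action == "activate":
            self._subs[key] = current | {cmd.talkgroup}
        elif cmd.action == "deactivate":
            self._subs[key] = current - {cmd.talkgroup}
        else:
            return ControlResult(False, f"unknown action: {cmd.action}")
        return ControlResult(True, "applied")

    def snapshot(self) -> Mapping:
        """Read-only copy for other components; later mutations do not leak."""
        return MappingProxyType(dict(self._subs))
```

A holder of a snapshot cannot write to it, and later changes by the owner do not appear in it, so stale reads are possible but hidden shared mutation is not.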

Message Boundary

Likely internal message families:

| Message | Plane |
| --- | --- |
| PacketReceived | packet-plane |
| PacketAccepted | packet-plane |
| PacketDropped | packet-plane |
| RouteDecision | packet-plane |
| PacketToSend | packet-plane |
| PacketMutated | packet-plane |
| StreamStarted | packet-plane |
| StreamEnded | packet-plane |
| StreamLost | packet-plane |
| SubscriptionActivated | control-plane |
| SubscriptionDeactivated | control-plane |
| SubscriptionExpired | packet-plane |
| SourceQuenchReceived | packet-plane |
| SourceQuenchSendRequested | packet-plane |
| StunActivated | control-plane |
| StunCleared | control-plane |
| ReportingEvent | reporting-plane |
| ControlCommand | control-plane |
| ControlResult | control-plane |
| WorkerStarted | worker/supervision-plane |
| WorkerStopping | worker/supervision-plane |
| WorkerHealth | worker/supervision-plane |
| WorkerBackpressure | worker/supervision-plane |
| WorkerCrashed | worker/supervision-plane |
| WorkerRestarted | worker/supervision-plane |

Packet-plane messages must be compact, bounded, and safe for high frequency use.
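
As a rough illustration of "compact and bounded", a packet-plane message could be a frozen dataclass that rejects mutable or oversized payloads at construction. The field names and the `MAX_PACKET_BYTES` value below are assumptions made for this sketch; only the message name `PacketReceived` comes from the list above.

```python
from dataclasses import dataclass

MAX_PACKET_BYTES = 512  # assumed upper bound, for illustration only


@dataclass(frozen=True)
class PacketReceived:
    """Immutable packet-plane message (hypothetical field layout)."""
    source: tuple        # (host, port) of the sending socket
    received_at: float   # timestamp taken by the listener
    payload: bytes       # raw packet bytes, never a mutable bytearray

    def __post_init__(self) -> None:
        if not isinstance(self.payload, bytes):
            raise TypeError("payload must be immutable bytes")
        if len(self.payload) > MAX_PACKET_BYTES:
            raise ValueError("payload exceeds the packet-plane size bound")
```

With `frozen=True`, assigning to a field after construction raises an error, so any rewrite has to produce a new, separately named message rather than altering one in flight.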

Partitioning / Sharding Options

Possible future sharding models, without choosing one prematurely:

  • By client/repeater DMR ID.
  • By listener/access socket.
  • By conference TG.
  • By source server / mesh peer.
  • By stream ID.
  • Hybrid model.

Constraints:

  • All packets for a given live stream must be processed in order by the same stream owner.
  • Dial-a-TG state for one client/slot must have one owner.
  • Subscription state for one client/slot must have one owner.
  • Loop-control/source-quench state must be consistent for a given TG/stream/source path.
  • Cross-worker routing must not reintroduce duplicate packets or loops.
  • Worker assignment must be observable and testable.
  • Worker assignment must not depend on dashboard/reporting state.
  • Sharding must preserve the FreeDMR "everything everywhere" mesh principle, subject to existing source quench, STUN, ACL, policy, and authentication rules.
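
Whichever model is chosen, the first constraint implies that the mapping from key to worker must be deterministic and reproducible in tests. The sketch below shows one possible mapping for the stream-ID option, using CRC32 rather than Python's built-in `hash()`, which is randomised per process for `bytes` and `str`. The function name and the choice of a 4-byte stream ID are assumptions for illustration, and this does not imply that stream-ID sharding is the preferred model.

```python
import zlib


def owner_for_stream(stream_id: bytes, worker_count: int) -> int:
    """Map a stream ID to a worker index, identically in every process.

    Hypothetical sketch: assumes a 4-byte stream ID and a fixed worker count.
    """
    if worker_count < 1:
        raise ValueError("worker_count must be at least 1")
    if len(stream_id) != 4:
        raise ValueError("expected a 4-byte stream ID")
    # zlib.crc32 is stable across processes and runs, unlike hash() on bytes.
    return zlib.crc32(stream_id) % worker_count
```

A plain modulo mapping reassigns most keys when `worker_count` changes, so in this sketch the worker count would have to stay fixed while any stream is live.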

Coordinator Model

FreeDMR 2 may eventually need a lightweight coordinator.

The coordinator may:

  • Assign clients/sessions/TGs to workers.
  • Distribute subscription snapshots.
  • Manage worker health.
  • Restart workers.
  • Publish control-plane updates.
  • Provide routing-worker discovery.
  • Coordinate graceful drain/restart.

The coordinator must not:

  • Synchronously participate in every packet routing decision.
  • Become a single blocking dependency for live voice.
  • Hide packet-plane state in an external database.
  • Make ordinary small deployments require clustered infrastructure.

Single-process/small-server deployment must remain supported. A coordinator should be optional or internal for simple deployments.

Failure Behaviour

  • Reporting worker failure: packet routing continues.
  • Global exporter failure: local service continues.
  • Dashboard aggregation failure: packet routing continues.
  • API/control worker failure: existing packet routing continues, but new control actions may fail.
  • Alias refresh worker failure: current aliases remain in use.
  • Analytics worker failure: packet routing continues.
  • Routing worker failure: affected sessions/streams are dropped or restarted according to explicit policy.
  • Transport/listener failure: affected sockets/sessions are lost until restart.
  • Worker restart must not replay stale DMR packets.
  • Retained/reporting state may be refreshed after recovery.
  • In-flight voice may be lost during worker crash, but failure must not poison the mesh or produce loops.
  • Backpressure from non-packet workers must not propagate into the packet path.
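
The rule against replaying stale packets after a restart could be enforced mechanically by stamping each outbound message with a worker generation number and discarding anything from an earlier generation. This is one possible technique, sketched under assumptions. `WorkerSlot`, the `generation` field, and the use of `PacketToSend` in this shape are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PacketToSend:
    generation: int   # generation of the worker that produced this message
    payload: bytes


class WorkerSlot:
    """Supervisor-side record for one worker (hypothetical sketch)."""

    def __init__(self) -> None:
        self.generation = 0
        self.discarded = 0  # stale messages rejected after a restart

    def restart(self) -> int:
        """Record a restart; messages from earlier generations become stale."""
        self.generation += 1
        return self.generation

    def accept(self, msg: PacketToSend) -> bool:
        """Return True only for messages from the current generation."""
        if msg.generation != self.generation:
            self.discarded += 1
            return False
        return True
```

Messages still queued from before a crash then fail the generation check and are counted instead of being sent.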

Tests Required Before Worker Split

Before moving any packet-plane component into a separate process, require:

  • Deterministic harness coverage of the state machine.
  • UDP black-box coverage of the same behaviour.
  • Message-boundary tests proving packet bytes and route decisions are preserved.
  • Failure-injection tests for worker timeout/crash/restart.
  • Queue-backpressure tests proving reporting/control workers cannot block packets.
  • Tests proving no stale packet replay after worker restart.
  • Tests proving source quench, STUN, loop-control, and duplicate suppression are preserved across the boundary.
  • Live RF validation for protocol-visible behaviour.

Worker split must not be considered complete until deterministic, UDP, and live RF validation agree for protocol-visible paths.
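
A message-boundary test of the kind listed above might look like the sketch below: encode a message, decode it, and assert that every payload byte survives. The frame layout (4-byte stream ID, 2-byte length, payload) and the function names are invented for this example and are not a FreeDMR wire format.

```python
import struct

# Hypothetical internal frame: 4-byte stream ID, 2-byte payload length.
_HEADER = struct.Struct("!4sH")


def encode_frame(stream_id: bytes, payload: bytes) -> bytes:
    """Serialise one message for transfer to another process."""
    if len(stream_id) != 4:
        raise ValueError("expected a 4-byte stream ID")
    return _HEADER.pack(stream_id, len(payload)) + payload


def decode_frame(frame: bytes) -> tuple:
    """Inverse of encode_frame; rejects truncated frames."""
    stream_id, length = _HEADER.unpack_from(frame)
    payload = frame[_HEADER.size:_HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated frame")
    return stream_id, payload


def test_payload_bytes_survive_boundary() -> None:
    # Every byte value, so that no byte is altered by the round trip.
    payload = bytes(range(256))
    stream_id, received = decode_frame(encode_frame(b"\x00\x00\x12\x34", payload))
    assert stream_id == b"\x00\x00\x12\x34"
    assert received == payload
```

The same assertion style would apply to route decisions: compare the decision made in-process with the one received across the boundary, field by field.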

Migration Path

Stage A: Extract pure/deterministic routing/subscription state behind explicit interfaces.

Stage B: Introduce internal message/event objects in-process.

Stage C: Move reporting/MQTT/global export to separate processes.

Stage D: Move slow maintenance, alias refresh, SQL/global lastheard, and analytics work to workers.

Stage E: Experiment with routing core as a child process behind the same message interface.

Stage F: Evaluate multi-routing-worker sharding only after the single routing-worker process model is stable and fully tested.

Stage G: Only after the above, consider whether transport/listener processes should be split by listener, client set, protocol, or deployment role.

Explicit Non-Goals

  • Do not introduce shared mutable cross-process BRIDGES-like state.
  • Do not depend on Redis/Postgres/MQTT for per-packet routing decisions.
  • Do not require heavyweight infrastructure for ordinary single-server deployments.
  • Do not use worker processes to hide unclear state ownership.
  • Do not move protocol-sensitive packet mutation across process boundaries until byte-preservation and rewrite tests prove equivalence.
  • Do not assume no-GIL Python solves FreeDMR's state ownership problem.
  • Do not replace Twisted and introduce worker sharding in the same step.
  • Do not make reporting, dashboard, SQL, global lastheard, or API availability part of the packet routing path.
