Scaling to 1,000 sites with zero downtime
Cell-based architecture, blast radius containment, and offline resilience — how we scaled access control across four continents.
The scaling problem nobody warns you about
Most SaaS platforms scale by adding compute. Access control platforms scale differently because every site has physical hardware — door controllers, readers, intercoms — that maintain persistent connections to the cloud and must continue operating when that connection drops. At 50 sites, you can run a monolith behind a load balancer. At 1,000 sites spanning 14 time zones, you are operating a distributed system where the edge nodes are embedded devices bolted to walls in buildings you have never visited.
When we crossed 200 sites in early 2025, our original architecture started showing stress fractures. A database migration that locked a table for 9 seconds caused 340 controllers to reconnect simultaneously, which cascaded into a thundering herd that saturated our WebSocket tier. Every site was affected. That incident became the catalyst for the cell-based redesign that now powers EntryBit at scale.
Cell-based architecture: isolation by design
The core insight was blast radius containment. A failure in Site A should be invisible to Site B. We achieved this by partitioning the platform into cells — independent, self-contained stacks that each serve a subset of customers.
Each cell runs the full EntryBit stack: API servers, event processors, WebSocket fan-out nodes, a dedicated PostgreSQL cluster, and a Redis layer for real-time state. Cells are deployed in AWS regions closest to their customer concentration. Cell US-East-1 serves 180 sites across the eastern United States. Cell EU-West-1 covers 220 sites in Western Europe. Cell AP-Southeast-1 handles 95 sites in Southeast Asia. We currently operate 7 cells across 4 regions.
Cell assignment is determined at tenant onboarding and is transparent to the customer. A global routing layer, backed by a lightweight Consul-based service mesh, directs each controller’s connection to the correct cell. Cross-cell operations — like a global admin querying events across all sites — are handled by an aggregation service that issues parallel fan-out queries to each cell and merges results, with a 5-second timeout per cell to prevent one slow cell from blocking the entire response.
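The fan-out-and-merge pattern described above can be sketched in a few lines. This is an illustrative model, not the production service: `query_cell` stands in for a real per-cell RPC, and the 5-second budget matches the per-cell timeout from the text.

```python
import asyncio

CELL_TIMEOUT_S = 5.0  # per-cell budget so one slow cell cannot block the response

async def query_cell(cell: str, query: dict) -> list[dict]:
    # Placeholder for a real per-cell RPC; here we simply echo the query.
    await asyncio.sleep(0)
    return [{"cell": cell, **query}]

async def fan_out(cells: list[str], query: dict) -> list[dict]:
    async def bounded(cell: str) -> list[dict]:
        try:
            return await asyncio.wait_for(query_cell(cell, query), CELL_TIMEOUT_S)
        except asyncio.TimeoutError:
            return []  # drop the slow cell's results, keep everyone else's
    per_cell = await asyncio.gather(*(bounded(c) for c in cells))
    return [event for results in per_cell for event in results]

merged = asyncio.run(fan_out(["us-east-1", "eu-west-1"], {"site": "all"}))
```

A timed-out cell simply contributes an empty result set, so the aggregation degrades to a partial answer rather than an error.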
The payoff: in the 14 months since deploying cell-based isolation, we have had 3 cell-level incidents. Each affected fewer than 15% of total sites, and the remaining cells continued serving traffic with zero degradation. Our platform-wide availability over that period is 99.997%.
Controller offline resilience
Physical access control has a non-negotiable requirement that most cloud software ignores: the system must work when the internet is down. If a building loses connectivity, employees must still be able to badge through doors. Locks cannot default to locked-out because a fiber cut happened three blocks away.
EntryBit controllers are designed for autonomous operation. Each controller caches the full access policy for its assigned doors — typically 2,000 to 50,000 credential-to-door mappings stored in 4MB of flash memory. When connectivity drops, the controller evaluates access decisions locally using the cached policy, queues events in a 512KB ring buffer, and continues operating indefinitely.
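A toy model of that autonomous mode, assuming a simple credential-to-doors mapping: local policy evaluation plus a bounded queue that overwrites its oldest entries, analogous to the 512KB ring buffer (the capacity here is tiny purely for illustration).

```python
from collections import deque

RING_CAPACITY = 4  # illustrative; the real buffer is sized in bytes, not entries

class OfflineController:
    """Evaluates badge swipes against a cached policy while disconnected."""

    def __init__(self, policy: dict[str, set[str]]):
        self.policy = policy                        # credential -> allowed doors
        self.events = deque(maxlen=RING_CAPACITY)   # drops oldest when full

    def badge(self, credential: str, door: str) -> bool:
        granted = door in self.policy.get(credential, set())
        self.events.append((credential, door, granted))  # queued for later upload
        return granted

ctrl = OfflineController({"alice": {"front", "lab"}})
decisions = [ctrl.badge("alice", "front"), ctrl.badge("bob", "front")]
```

The key property is that `badge` never consults the network: every decision comes from the cached policy, and every event is retained locally until resync.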
The sync protocol handles reconnection gracefully. When connectivity restores, the controller performs a delta sync: it sends its last-known event sequence number, and the cloud responds with any policy changes that occurred during the disconnection. The controller replays its queued events upstream. The entire resync completes in under 2 seconds for a typical 4-hour outage window. We have tested offline windows of up to 72 hours with zero event loss.
Crucially, policy updates issued during an outage are queued server-side and delivered in order upon reconnection. If an administrator revoked a credential while the building was offline, the revocation takes effect the moment the controller reconnects — not at the next scheduled sync.
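The delta-sync handshake reduces to a sequence-number comparison. In this hypothetical sketch, the cloud keeps an ordered log of policy changes per controller; on reconnect, the controller reports the last sequence number it applied and receives only the changes queued after it, in order.

```python
def delta_sync(server_log: list[dict], last_seq: int) -> list[dict]:
    """Return the ordered policy changes the controller has not yet applied."""
    # server_log entries look like {"seq": n, "op": ..., "credential": ...}
    return [change for change in server_log if change["seq"] > last_seq]

log = [
    {"seq": 1, "op": "grant",  "credential": "alice"},
    {"seq": 2, "op": "revoke", "credential": "bob"},    # issued while offline
    {"seq": 3, "op": "grant",  "credential": "carol"},
]
pending = delta_sync(log, last_seq=1)  # controller had applied through seq 1
```

Because the server-side queue preserves order, the offline revocation at sequence 2 is applied before any later grants the moment the controller reconnects.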
Event replay and the immutable event log
Access control generates an extraordinary volume of events. Across our 1,000-site fleet, we ingest 23 million events per day: door grants, denials, alarms, controller heartbeats, firmware status reports, and policy change records. Every event is immutable and must be retained for 7 years to satisfy enterprise audit requirements and various regional compliance mandates.
Our event pipeline is built on a three-tier storage architecture. The hot tier uses per-cell Kafka clusters with a 72-hour retention window, serving real-time dashboards and alerting. The warm tier is a ClickHouse cluster optimized for analytical queries over the trailing 90 days — this powers the reports, audit log search, and anomaly detection training. The cold tier archives to S3-compatible object storage in Parquet format, partitioned by tenant and date, with AES-256 encryption at rest.
Event replay is a first-class operation. When a customer requests an incident reconstruction, an operator specifies a time range and site, and the replay service reconstitutes the event stream from whichever storage tier holds the data. Replay from the hot tier delivers at 50x real-time speed. Replay from cold storage delivers at 8x speed after an initial 10-15 second retrieval latency for the relevant Parquet partitions.
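The replay service's tier-routing decision can be expressed directly from the retention windows stated above: Kafka for the last 72 hours, ClickHouse for the trailing 90 days, and the Parquet archive beyond that. A minimal sketch, keyed on the age of the requested range:

```python
from datetime import datetime, timedelta, timezone

def replay_tier(range_start: datetime, now: datetime) -> str:
    """Pick the storage tier that still holds data for the requested range."""
    age = now - range_start
    if age <= timedelta(hours=72):
        return "hot"   # per-cell Kafka, 72-hour retention
    if age <= timedelta(days=90):
        return "warm"  # ClickHouse, trailing 90 days
    return "cold"      # S3-compatible Parquet archive

now = datetime(2025, 6, 1, tzinfo=timezone.utc)
tiers = [
    replay_tier(now - timedelta(hours=5), now),
    replay_tier(now - timedelta(days=30), now),
    replay_tier(now - timedelta(days=400), now),
]
```

A range that straddles a tier boundary would need to stitch results from two tiers; that detail is omitted here.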
Database sharding strategy
Each cell’s PostgreSQL cluster uses horizontal sharding by tenant ID. We evaluated several sharding approaches — hash-based, range-based, and directory-based — and chose directory-based sharding with Citus for its operational flexibility.
The shard directory maps each tenant to a specific shard. New tenants are assigned to the shard with the lowest current load, measured by storage size and query throughput. Rebalancing is performed online using logical replication: the source shard streams changes to the destination shard until they converge, at which point the directory is updated atomically. Tenants experience zero downtime during rebalancing.
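Directory-based assignment is, at its core, a lookup table plus a placement rule. A minimal sketch, assuming a single combined load score per shard (the real weighting of storage size versus query throughput is not specified here):

```python
def assign_shard(directory: dict[str, str], loads: dict[str, float], tenant: str) -> str:
    """Place a new tenant on the least-loaded shard and record the mapping."""
    shard = min(loads, key=loads.get)   # lowest current load wins
    directory[tenant] = shard           # the directory is the source of truth
    loads[shard] += 1.0                 # rough placeholder for expected tenant load
    return shard

directory: dict[str, str] = {}
loads = {"shard-1": 12.0, "shard-2": 7.5, "shard-3": 9.0}
chosen = assign_shard(directory, loads, "tenant-acme")
```

The same directory is what rebalancing updates atomically once logical replication has converged, which is why tenants can move shards without downtime.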
Each cell runs 8 to 16 shards depending on tenant density. The largest cell, US-East-1, manages 2.1TB of active data across 16 shards. Query latency at P99 is 12ms for single-tenant lookups and 45ms for cross-shard aggregations used in global reporting.
We learned one hard lesson early: foreign keys across shards do not work. Our schema was redesigned to ensure every query path resolves within a single shard. Cross-tenant queries — used only by internal tools and the global admin aggregation layer — are explicitly routed through the aggregation service, which handles fan-out and merge.
Graceful degradation and load shedding
At scale, you cannot prevent every failure. You can control how failures manifest. EntryBit implements four layers of graceful degradation.
Circuit breakers on every inter-service call trip after 5 consecutive failures or a 60% error rate over a 10-second window. When tripped, the calling service uses cached data or returns a degraded response rather than propagating the failure.

Priority queuing ensures that door access decisions, the most critical operation, are always processed ahead of analytics events, dashboard refreshes, and report generation. Under extreme load, non-critical operations are shed entirely while access decisions continue at full throughput.

Regional failover allows a cell to redirect controller connections to a secondary cell in a neighboring region within 30 seconds, triggered manually or by automated health checks.

Synthetic monitoring runs 1,200 probe transactions per minute across all cells. A probe simulates a full access decision cycle: credential lookup, policy evaluation, event recording, and WebSocket broadcast. Any probe exceeding 200ms triggers an alert; three consecutive failures trigger automated investigation runbooks.
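The consecutive-failure half of the circuit-breaker rule is simple enough to show in full. This toy version trips after 5 failures in a row and then serves the fallback; the 60% error-rate window and half-open recovery are omitted for brevity.

```python
class CircuitBreaker:
    """Trips open after TRIP_AFTER consecutive failures; then serves a fallback."""

    TRIP_AFTER = 5

    def __init__(self):
        self.consecutive_failures = 0
        self.open = False

    def call(self, fn, fallback):
        if self.open:
            return fallback()            # degraded/cached response, no downstream call
        try:
            result = fn()
            self.consecutive_failures = 0  # any success resets the streak
            return result
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.TRIP_AFTER:
                self.open = True         # stop propagating the failure downstream
            return fallback()

breaker = CircuitBreaker()
def failing():
    raise RuntimeError("downstream unavailable")

responses = [breaker.call(failing, lambda: "cached") for _ in range(6)]
```

After the fifth failure the breaker opens, so the sixth call never reaches the failing dependency at all.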
Conclusion
Scaling to 1,000 sites taught us that reliability in physical access control is a fundamentally different problem from reliability in typical SaaS. The edge devices are real, the consequences of downtime are physical, and the data retention requirements are measured in years. Cell-based isolation, controller-native offline resilience, an immutable event pipeline, and aggressive load shedding gave us a platform that absorbs failures at every layer without propagating them to the lock on the door. That is the metric that matters: no matter what breaks in the cloud, the door still works.