Redis Persistence & BullMQ Resilience Runbook
Overview
Agentix uses Railway Redis for:

| Use Case | Module | Data Type |
|---|---|---|
| BullMQ job queues | apps/api/src/lib/queue.ts | Jobs (message-processing, broadcast-sending, import-contacts, audit-processing) |
| Rate-limit counters | apps/api/src/lib/quota.ts | Atomic INCR/INCRBY counters with TTL |
| Audit event buffer | apps/api/src/lib/audit.ts | LPUSH list buffer, flushed by audit worker |
| Conversation locks | Worker heartbeat locks | SET with TTL for per-conversation concurrency control |

Expected impact if Redis data is lost:

- In-flight BullMQ jobs may be lost (stall detection handles this).
- Rate-limit counters reset to 0 (quotas re-accumulate — acceptable).
- Audit event buffer loses last few seconds of events (best-effort by design).
- Conversation locks release (workers re-acquire on next job).

Environment variable: `REDIS_URL`, the connection string for the Railway Redis instance.
1. Railway Redis Persistence
Railway Redis instances use RDB (Redis Database) snapshots by default. RDB creates point-in-time snapshots of the dataset at configured intervals.

Important: Railway Redis does NOT support a custom `redis.conf`. Persistence settings are managed by Railway infrastructure. You cannot switch to AOF or modify save intervals via configuration.
What RDB Provides
- Periodic snapshots (typically every 1-5 minutes depending on write frequency)
- Compact on-disk format
- Fast restart from snapshot
- Data loss window: up to the interval since last snapshot (usually seconds to minutes)
What AOF Would Provide (Not Available on Railway)
- Append-only file logging every write operation
- Near-zero data loss on restart
- Larger disk usage, slightly higher latency
- Railway does not currently expose AOF configuration
2. Verifying Persistence Is Active
Via Redis CLI

Run `redis-cli -u "$REDIS_URL" INFO persistence` and check the following fields:

| Field | Expected Value | Meaning |
|---|---|---|
| `rdb_last_save_time` | Recent Unix timestamp | Last successful RDB snapshot |
| `rdb_changes_since_last_save` | Small number (<1000) | Unsaved changes since last snapshot |
| `rdb_bgsave_in_progress` | 0 or 1 | Whether a snapshot is in progress |
| `aof_enabled` | 0 (RDB only) or 1 (AOF+RDB) | Whether AOF is active |
| `loading` | 0 | Not currently loading a snapshot |
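
The field checks in the table can be automated. The following is a minimal sketch that parses the raw text of `INFO persistence` and flags deviations from the expected values. The function names and the one-hour snapshot age threshold are illustrative assumptions, not part of the codebase or of Railway's configuration.

```typescript
// Parse the text returned by `INFO persistence` into a key/value map.
// Lines starting with '#' are section headers; the rest are `key:value`.
function parseInfo(raw: string): Record<string, string> {
  const fields: Record<string, string> = {};
  for (const line of raw.split(/\r?\n/)) {
    if (line === "" || line.startsWith("#")) continue;
    const sep = line.indexOf(":");
    if (sep > 0) fields[line.slice(0, sep)] = line.slice(sep + 1);
  }
  return fields;
}

interface PersistenceReport {
  healthy: boolean;
  problems: string[];
}

// Thresholds are assumptions for illustration: a snapshot older than
// maxSnapshotAgeSeconds, or maxUnsavedChanges or more pending writes, is flagged.
function checkPersistence(
  raw: string,
  nowSeconds: number,
  maxSnapshotAgeSeconds = 3600,
  maxUnsavedChanges = 1000,
): PersistenceReport {
  const info = parseInfo(raw);
  const problems: string[] = [];

  const lastSave = Number(info["rdb_last_save_time"] ?? "0");
  if (!(lastSave > 0) || nowSeconds - lastSave > maxSnapshotAgeSeconds) {
    problems.push("rdb_last_save_time is missing or stale");
  }
  if (Number(info["rdb_changes_since_last_save"] ?? "0") >= maxUnsavedChanges) {
    problems.push("rdb_changes_since_last_save is high");
  }
  if (info["loading"] !== "0") {
    problems.push("loading is not 0");
  }
  return { healthy: problems.length === 0, problems };
}
```

A caller would pass the string returned by the Redis client's `INFO persistence` call together with the current Unix time in seconds, then log or alert on `problems`.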
Via Health Check
The API health endpoint checks Redis connectivity.

Via Railway Dashboard
- Navigate to Railway Dashboard > Project > Redis service.
- Check the Settings or Variables tab.
- Confirm the Redis service is running and has a volume attached (persistence requires a volume).
3. What Survives a Redis Restart
| Data Type | Survives? | Notes |
|---|---|---|
| BullMQ jobs in waiting state | Yes | Stored as Redis keys, restored from RDB snapshot |
| BullMQ jobs in delayed state | Yes | Stored as sorted set entries |
| BullMQ jobs in active state during crash | Recovered | Moved to stalled by BullMQ stall detection, then retried |
| BullMQ jobs in completed/failed | Yes | Kept per removeOnComplete/removeOnFail config |
| Rate-limit counters (quota:* keys) | Yes | TTL-based keys, restored from snapshot |
| Audit buffer (audit:buffer list) | Partial | May lose last few seconds of buffered events (acceptable) |
| Conversation locks (SET with TTL) | Released | Locks expire naturally; workers re-acquire on next job |
| Rate-limiter counters (request-level) | Yes | TTL-based keys, restored from snapshot |
- BullMQ has built-in stall detection for active jobs.
- Quota counters are approximate by design (see the `quota.ts` header comment).
- Audit is explicitly best-effort.
4. BullMQ Resilience Configuration
Current Queue Configuration
From `apps/api/src/lib/queue.ts`:

| Queue | Attempts | Backoff | Retention |
|---|---|---|---|
| `message-processing` | 3 | Exponential, 1s base | 1000 completed, 5000 failed |
| `broadcast-sending` | 3 | Exponential, 2s base | 5000 completed, 5000 failed |
| `import-contacts` | 1 | None (no retries) | 1 hour completed, 24 hours failed |
| `audit-processing` | 3 | Exponential, 2s base | 100 completed, 500 failed |
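
To make the table concrete, the sketch below expresses the same settings as job option objects and derives the wait before each retry. The interfaces are local stand-ins that mirror the shape of BullMQ's `attempts`, `backoff`, `removeOnComplete`, and `removeOnFail` options so the example runs without dependencies; the actual contents of `queue.ts` may be organized differently. The delay formula assumes exponential backoff that doubles from the base delay on each retry.

```typescript
// Local stand-ins for the BullMQ job option fields used here (assumed shapes).
interface KeepJobs {
  age?: number;   // keep jobs for up to this many seconds
  count?: number; // keep up to this many jobs
}
interface JobOptionsSketch {
  attempts: number;
  backoff?: { type: "exponential"; delay: number }; // delay in milliseconds
  removeOnComplete: KeepJobs;
  removeOnFail: KeepJobs;
}

// Values taken from the table above.
const queueDefaults: Record<string, JobOptionsSketch> = {
  "message-processing": {
    attempts: 3,
    backoff: { type: "exponential", delay: 1000 },
    removeOnComplete: { count: 1000 },
    removeOnFail: { count: 5000 },
  },
  "broadcast-sending": {
    attempts: 3,
    backoff: { type: "exponential", delay: 2000 },
    removeOnComplete: { count: 5000 },
    removeOnFail: { count: 5000 },
  },
  "import-contacts": {
    attempts: 1, // a single attempt, so no retries
    removeOnComplete: { age: 3600 },
    removeOnFail: { age: 86400 },
  },
  "audit-processing": {
    attempts: 3,
    backoff: { type: "exponential", delay: 2000 },
    removeOnComplete: { count: 100 },
    removeOnFail: { count: 500 },
  },
};

// Delay before each retry, assuming base * 2^(retry - 1).
// A job with N attempts has at most N - 1 retries.
function retryDelaysMs(opts: JobOptionsSketch): number[] {
  const delays: number[] = [];
  if (!opts.backoff) return delays;
  for (let retry = 1; retry < opts.attempts; retry++) {
    delays.push(opts.backoff.delay * Math.pow(2, retry - 1));
  }
  return delays;
}
```

Under these assumptions a `message-processing` job that keeps failing is retried after 1 second and again after 2 seconds before it is marked failed, while an `import-contacts` job fails on its first error.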
Redis Client Configuration
From `apps/api/src/lib/redis.ts`:

- `maxRetriesPerRequest: null`: required for BullMQ. Without this, BullMQ operations fail after the default 20 retries. Setting it to `null` means "retry indefinitely", which allows BullMQ to handle its own retry logic.
- `retryStrategy`: exponential backoff reconnection (200ms base, 5s cap, 20 max attempts).
- `reconnectOnError`: automatic reconnect on READONLY errors (Redis failover scenarios).
- `enableReadyCheck: true`: waits for Redis to be fully ready before issuing commands.
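
The reconnection settings above can be sketched as plain functions. This assumes the client is ioredis, where `retryStrategy` receives the 1-based reconnection attempt number and a non-numeric return value stops further attempts. The doubling formula and constant names are assumptions chosen to match the stated 200ms base, 5s cap, and 20 attempts; the real `redis.ts` may compute the delay differently.

```typescript
const BASE_DELAY_MS = 200;
const MAX_DELAY_MS = 5000;
const MAX_ATTEMPTS = 20;

// Called with the reconnection attempt number (1, 2, 3, ...).
// Returns the delay in ms before the next attempt, or null to stop retrying.
function retryStrategy(times: number): number | null {
  if (times > MAX_ATTEMPTS) return null;
  return Math.min(BASE_DELAY_MS * Math.pow(2, times - 1), MAX_DELAY_MS);
}

// Returning true asks the client to reconnect when a command fails with
// this error, for example after a failover leaves it attached to a replica.
function reconnectOnError(err: Error): boolean {
  return err.message.includes("READONLY");
}

// Options object in the shape described above; it would be passed to the
// Redis client constructor.
const redisOptions = {
  maxRetriesPerRequest: null,
  enableReadyCheck: true,
  retryStrategy,
  reconnectOnError,
};
```

With these values the delay grows as 200, 400, 800, 1600, 3200 ms and then stays at the 5000 ms cap until the 20th attempt, after which the client gives up.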
Stall Detection
BullMQ automatically detects stalled jobs (jobs that were `active` when a worker crashed). Default behavior:

- Stall check interval: 30 seconds (default `stalledInterval`)
- Max stall count: 1 (default `maxStalledCount`); a job that stalls more than this many times is moved to `failed`
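
The effect of `maxStalledCount` can be illustrated with a small model. This is a simplified sketch of the decision as understood from BullMQ's documentation (each detected stall increments a per-job counter, and the job returns to the wait state unless the counter exceeds the limit); it is not BullMQ's implementation, and the function name is hypothetical.

```typescript
type StallOutcome = "wait" | "failed";

// Worker options matching the defaults listed above.
const workerStallOptions = { stalledInterval: 30000, maxStalledCount: 1 };

// Simplified model: called each time the stall checker finds the job stalled.
// Returns the updated counter and where the job goes next.
function onStallDetected(
  stalledCount: number,
  maxStalledCount: number = workerStallOptions.maxStalledCount,
): { stalledCount: number; outcome: StallOutcome } {
  const next = stalledCount + 1;
  return {
    stalledCount: next,
    outcome: next > maxStalledCount ? "failed" : "wait",
  };
}
```

Under this model, with the default limit of 1, a job interrupted by one worker crash is put back in the queue and retried, and only a second stall of the same job moves it to `failed`.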
5. If Persistence Is Disabled (Emergency)
Symptoms
- After Redis restart, all queues are empty.
- `INFO persistence` shows `rdb_last_save_time:0` or a very old timestamp.
- Rate-limit counters reset to 0 unexpectedly.
Immediate Actions
- Check Railway Redis volume:
  - Railway Dashboard > Redis service > check if a volume is attached.
  - If no volume: Redis is running in-memory only. All data is lost on restart.
  - Contact Railway support to enable RDB persistence or attach a volume.
- Workarounds while persistence is unavailable:
  - BullMQ: Jobs are resilient by design. Producers re-enqueue on failure. Workers detect stalled jobs.
  - Rate-limit counters: Reset to 0 on restart. Quotas re-accumulate from zero. This is acceptable short-term; tenants may temporarily exceed limits.
  - Audit buffer: Events are lost. No workaround needed (audit is best-effort).
  - Conversation locks: Release on restart. Workers re-acquire automatically.
- Monitor closely: Watch Sentry for Redis connection errors and BullMQ job failures.
6. Monitoring
Health Check
The API health endpoint (`/health`) checks Redis via PING.

Railway uses `/health` as the healthcheck path (configured in `railway.toml`). Failed health checks trigger an automatic container restart (`restartPolicyType: on_failure`, max 3 retries).
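
A PING-based health check of this kind might look like the sketch below. The `ping` function is injected so the example runs without a Redis client; in the API it would wrap the client's PING call. The response shape (`status`, `redis`), the function names, and the 1-second timeout are assumptions for illustration and may not match the real handler.

```typescript
interface HealthReport {
  status: "ok" | "degraded";
  redis: "ok" | "error";
}

// Reject if `work` does not settle within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("timed out")), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Report Redis as healthy only if PING answers "PONG" in time.
// Errors and timeouts are reported as a degraded state instead of throwing.
async function checkHealth(
  ping: () => Promise<string>,
  timeoutMs = 1000,
): Promise<HealthReport> {
  try {
    const reply = await withTimeout(ping(), timeoutMs);
    if (reply === "PONG") return { status: "ok", redis: "ok" };
    return { status: "degraded", redis: "error" };
  } catch {
    return { status: "degraded", redis: "error" };
  }
}
```

The timeout matters because a client configured to retry indefinitely can leave a PING pending while Redis is unreachable; without a bound, the health endpoint would hang instead of reporting an error.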
Connection Event Logging
The Redis client logs connection lifecycle events via Pino:

- `Redis connected`: TCP connection established
- `Redis ready`: Connection authenticated and ready for commands
- `Redis connection error`: Connection error with details
- `Redis connection closed`: Connection lost
- `Redis reconnecting`: Attempting reconnection with retry delay
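
The wiring behind these log lines could be sketched as follows, assuming an ioredis-style client that emits `connect`, `ready`, `error`, `close`, and `reconnecting` events. The two interfaces are minimal stand-ins for the client and a Pino-style logger so the example has no dependencies. The `attachRedisLogging` name and the `info`/`warn` levels are assumptions; only the `error` level for connection errors is stated in this runbook.

```typescript
// Minimal stand-ins for the Redis client and a Pino-style logger (assumed shapes).
interface EventSource {
  on(event: string, listener: (...args: any[]) => void): unknown;
}
interface Logger {
  info(obj: object, msg: string): void;
  warn(obj: object, msg: string): void;
  error(obj: object, msg: string): void;
}

// Map each connection lifecycle event to one log line.
function attachRedisLogging(client: EventSource, log: Logger): void {
  client.on("connect", () => log.info({}, "Redis connected"));
  client.on("ready", () => log.info({}, "Redis ready"));
  client.on("error", (err: Error) => log.error({ err }, "Redis connection error"));
  client.on("close", () => log.warn({}, "Redis connection closed"));
  // The listener receives the delay in ms before the next reconnection attempt.
  client.on("reconnecting", (delay: number) => log.warn({ delay }, "Redis reconnecting"));
}
```

Keeping the message strings fixed, with details in the structured object, makes it straightforward to build alerts that match on the exact message.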
Sentry Alerts
Redis connection errors are logged at `error` level and captured by Sentry. Configure Sentry alerts for:

- `Redis connection error` (immediate alert)
- Health check returning `redis: error` (degraded state)
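Manual Checks

For manual checks, run `INFO persistence` and `INFO memory` through `redis-cli` and review queue lengths, following the maintenance schedule in section 7.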
7. Maintenance Schedule
| Frequency | Activity | Purpose |
|---|---|---|
| Weekly | Check rdb_last_save_time via INFO persistence | Confirm snapshots are running |
| Monthly | Review queue lengths and memory usage | Detect growth trends |
| After deploy | Check health endpoint | Confirm Redis connectivity |
| After Redis restart | Check INFO persistence + queue lengths | Verify data survived |