
Redis Persistence & BullMQ Resilience Runbook

Overview

Agentix uses Railway Redis for:
| Use Case | Module | Data Type |
| --- | --- | --- |
| BullMQ job queues | apps/api/src/lib/queue.ts | Jobs (message-processing, broadcast-sending, import-contacts, audit-processing) |
| Rate-limit counters | apps/api/src/lib/quota.ts | Atomic INCR/INCRBY counters with TTL |
| Audit event buffer | apps/api/src/lib/audit.ts | LPUSH list buffer, flushed by audit worker |
| Conversation locks | Worker heartbeat locks | SET with TTL for per-conversation concurrency control |
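The conversation-lock row above can be sketched in TypeScript as follows. This is a minimal, illustrative version of the SET-with-TTL pattern: the key name, TTL, `RedisLike` interface, and `acquireConversationLock` name are assumptions for this sketch, not the actual worker code.

```typescript
// Minimal sketch of the per-conversation lock pattern (SET NX + TTL).
// Key name, TTL, and the RedisLike interface are illustrative only.
import { randomUUID } from 'node:crypto';

// Subset of the ioredis client surface this sketch needs.
export interface RedisLike {
  set(key: string, value: string, px: 'PX', ttlMs: number, nx: 'NX'): Promise<'OK' | null>;
  eval(script: string, numKeys: number, key: string, token: string): Promise<unknown>;
}

// Try to take the lock; returns a release function, or null if another
// worker already holds it.
export async function acquireConversationLock(
  redis: RedisLike,
  conversationId: string,
  ttlMs = 30_000,
): Promise<(() => Promise<void>) | null> {
  const key = `lock:conversation:${conversationId}`;
  const token = randomUUID(); // ensures we only ever delete our own lock
  if ((await redis.set(key, token, 'PX', ttlMs, 'NX')) !== 'OK') return null;
  // Compare-and-delete: if the lock expired and was re-acquired by another
  // worker, the token no longer matches and we delete nothing.
  return async () => {
    await redis.eval(
      "if redis.call('get', KEYS[1]) == ARGV[1] then return redis.call('del', KEYS[1]) else return 0 end",
      1,
      key,
      token,
    );
  };
}
```

Because the lock value carries a TTL, a crashed worker's lock simply expires, which is why a restart releases locks instead of leaking them.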
Data loss on Redis restart means:
  • In-flight BullMQ jobs may be lost (stall detection handles this).
  • Rate-limit counters reset to 0 (quotas re-accumulate — acceptable).
  • Audit event buffer loses last few seconds of events (best-effort by design).
  • Conversation locks release (workers re-acquire on next job).
Environment variables:
  • REDIS_URL — connection string for the Railway Redis instance
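The quota-counter pattern from the table above (atomic increment plus TTL) can be sketched like this. The key format, window length, and `consumeQuota` name are assumptions for illustration; the authoritative logic lives in apps/api/src/lib/quota.ts.

```typescript
// Illustrative sketch of a TTL-based quota counter (atomic INCRBY + EXPIRE).
// Key format, window, and function name are assumptions, not quota.ts itself.
export interface QuotaRedis {
  incrby(key: string, amount: number): Promise<number>;
  expire(key: string, seconds: number): Promise<number>;
}

// Returns true if the tenant is still within its limit after consuming
// `amount`. The counter is approximate by design: if Redis restarts, the
// key is restored from the last snapshot (or resets to 0 and re-accumulates).
export async function consumeQuota(
  redis: QuotaRedis,
  tenantId: string,
  amount: number,
  limit: number,
  windowSec = 3_600,
): Promise<boolean> {
  const key = `quota:${tenantId}`;
  const used = await redis.incrby(key, amount);
  // The first write in this window created the key; attach the TTL once so
  // the window does not slide on every increment.
  if (used === amount) await redis.expire(key, windowSec);
  return used <= limit;
}
```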

1. Railway Redis Persistence

Railway Redis instances use RDB (Redis Database) snapshots by default. RDB creates point-in-time snapshots of the dataset at configured intervals. Important: Railway Redis does NOT support custom redis.conf. Persistence settings are managed by Railway infrastructure. You cannot switch to AOF or modify save intervals via configuration.

What RDB Provides

  • Periodic snapshots (typically every 1-5 minutes depending on write frequency)
  • Compact on-disk format
  • Fast restart from snapshot
  • Data loss window: up to the interval since last snapshot (usually seconds to minutes)

What AOF Would Provide (Not Available on Railway)

  • Append-only file logging every write operation
  • Near-zero data loss on restart
  • Larger disk usage, slightly higher latency
  • Railway does not currently expose AOF configuration

2. Verifying Persistence Is Active

Via Redis CLI

# Connect to Railway Redis
redis-cli -u "$REDIS_URL" INFO persistence
Key fields to check:
| Field | Expected Value | Meaning |
| --- | --- | --- |
| rdb_last_save_time | Recent Unix timestamp | Last successful RDB snapshot |
| rdb_changes_since_last_save | Small number (<1000) | Unsaved changes since last snapshot |
| rdb_bgsave_in_progress | 0 or 1 | Whether a snapshot is in progress |
| aof_enabled | 0 (RDB only) or 1 (AOF+RDB) | Whether AOF is active |
| loading | 0 | Not currently loading a snapshot |
Example healthy output:
rdb_last_save_time:1774538400
rdb_changes_since_last_save:42
rdb_bgsave_in_progress:0
aof_enabled:0
loading:0
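For automated checks, the INFO persistence output can be parsed and tested for staleness programmatically. The field names below are the real Redis ones; the freshness threshold and function names are assumptions for this sketch.

```typescript
// Parse `INFO persistence` output into key/value pairs and flag a stale
// RDB snapshot. Threshold and function names are illustrative.
export function parseInfo(info: string): Record<string, string> {
  const fields: Record<string, string> = {};
  for (const line of info.split(/\r?\n/)) {
    if (line.startsWith('#') || !line.includes(':')) continue; // skip section headers
    const idx = line.indexOf(':');
    fields[line.slice(0, idx)] = line.slice(idx + 1);
  }
  return fields;
}

// A snapshot is "fresh" if it ran within maxAgeSec of `nowSec`.
export function snapshotIsFresh(
  fields: Record<string, string>,
  nowSec: number,
  maxAgeSec = 600, // 10 minutes: generous vs. the typical 1-5 minute interval
): boolean {
  const lastSave = Number(fields['rdb_last_save_time'] ?? 0);
  return lastSave > 0 && nowSec - lastSave <= maxAgeSec;
}
```

Note that `rdb_last_save_time:0` counts as stale, matching the "persistence is disabled" symptom described in section 5.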

Via Health Check

The API health endpoint checks Redis connectivity:
curl https://api.agentix.app/health
Expected:
{"status":"ok","timestamp":"...","checks":{"db":"ok","redis":"ok"}}
If Redis is down, the health check returns:
{"status":"degraded","timestamp":"...","checks":{"db":"ok","redis":"error"}}

Via Railway Dashboard

  1. Navigate to Railway Dashboard > Project > Redis service.
  2. Check the Settings or Variables tab.
  3. Confirm the Redis service is running and has a volume attached (persistence requires a volume).

3. What Survives a Redis Restart

| Data Type | Survives? | Notes |
| --- | --- | --- |
| BullMQ jobs in waiting state | Yes | Stored as Redis keys, restored from RDB snapshot |
| BullMQ jobs in delayed state | Yes | Stored as sorted set entries |
| BullMQ jobs in active state during crash | Recovered | Moved to stalled by BullMQ stall detection, then retried |
| BullMQ jobs in completed/failed | Yes | Kept per removeOnComplete/removeOnFail config |
| Rate-limit counters (quota:* keys) | Yes | TTL-based keys, restored from snapshot |
| Audit buffer (audit:buffer list) | Partial | May lose last few seconds of buffered events (acceptable) |
| Conversation locks (SET with TTL) | Released | Locks expire naturally; workers re-acquire on next job |
| Rate-limiter counters (request-level) | Yes | TTL-based keys, restored from snapshot |
Data loss window: Up to the interval since the last RDB snapshot (typically 1-5 minutes). For Agentix, this is acceptable because:
  • BullMQ has built-in stall detection for active jobs.
  • Quota counters are approximate by design (see quota.ts header comment).
  • Audit is explicitly best-effort.

4. BullMQ Resilience Configuration

Current Queue Configuration

From apps/api/src/lib/queue.ts:
| Queue | Attempts | Backoff | Retention |
| --- | --- | --- | --- |
| message-processing | 3 | Exponential, 1s base | 1000 completed, 5000 failed |
| broadcast-sending | 3 | Exponential, 2s base | 5000 completed, 5000 failed |
| import-contacts | 1 | None (no retries) | 1 hour completed, 24 hours failed |
| audit-processing | 3 | Exponential, 2s base | 100 completed, 500 failed |
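The table above maps onto BullMQ's defaultJobOptions. A sketch of what one queue's setup could look like (the connection import and exact option shapes are assumptions; the authoritative config is apps/api/src/lib/queue.ts):

```typescript
import { Queue } from 'bullmq';
import { redis } from './redis'; // assumed shared ioredis connection

// Illustrative defaults matching the message-processing row above.
export const messageQueue = new Queue('message-processing', {
  connection: redis,
  defaultJobOptions: {
    attempts: 3,                                   // 3 total attempts
    backoff: { type: 'exponential', delay: 1000 }, // 1s, 2s, 4s, ...
    removeOnComplete: { count: 1000 },             // keep last 1000 completed
    removeOnFail: { count: 5000 },                 // keep last 5000 failed
  },
});
```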

Redis Client Configuration

From apps/api/src/lib/redis.ts:
  • maxRetriesPerRequest: null — Required for BullMQ. Without this, BullMQ operations fail after the default 20 retries; null means "retry indefinitely", which lets BullMQ apply its own retry logic.
  • retryStrategy — Exponential backoff reconnection (200ms base, 5s cap, 20 max attempts).
  • reconnectOnError — Automatic reconnect on READONLY errors (Redis failover scenarios).
  • enableReadyCheck: true — Waits for Redis to be fully ready before issuing commands.
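The retry and failover settings can be sketched as small pure functions in the shape ioredis expects. The values come from the bullets above; the function bodies are illustrative, not a copy of redis.ts.

```typescript
// Sketch of the reconnection policy described above (assumed shape; the
// real values live in apps/api/src/lib/redis.ts).
const BASE_MS = 200;
const CAP_MS = 5_000;
const MAX_ATTEMPTS = 20;

// ioredis calls retryStrategy(times) after each failed connection attempt;
// returning a number schedules a reconnect after that many ms, and
// returning null gives up.
export function retryStrategy(times: number): number | null {
  if (times > MAX_ATTEMPTS) return null;
  return Math.min(BASE_MS * 2 ** (times - 1), CAP_MS);
}

// ioredis calls reconnectOnError(err); returning true forces a reconnect.
// READONLY errors appear when writes hit a replica during a failover.
export function reconnectOnError(err: Error): boolean {
  return err.message.includes('READONLY');
}
```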

Stall Detection

BullMQ automatically detects stalled jobs (jobs that were active when a worker crashed). Default behavior:
  • Stall check interval: 30 seconds (default stalledInterval)
  • Max stall count: 1 (default maxStalledCount) — a job is moved to failed after being stalled this many times
Recommended worker settings (apply when creating BullMQ Workers):
new Worker('message-processing', processor, {
  connection: redis,
  stalledInterval: 30_000,   // Check for stalled jobs every 30s
  maxStalledCount: 2,        // Allow up to 2 stall recoveries before failing
});

5. If Persistence Is Disabled (Emergency)

Symptoms

  • After Redis restart, all queues are empty.
  • INFO persistence shows rdb_last_save_time:0 or very old timestamp.
  • Rate-limit counters reset to 0 unexpectedly.

Immediate Actions

  1. Check Railway Redis volume:
    • Railway Dashboard > Redis service > check if a volume is attached.
    • If no volume: Redis is running in-memory only. All data is lost on restart.
  2. Contact Railway support to enable RDB persistence or attach a volume.
  3. Workarounds while persistence is unavailable:
    • BullMQ: Jobs are resilient by design. Producers re-enqueue on failure. Workers detect stalled jobs.
    • Rate-limit counters: Reset to 0 on restart. Quotas re-accumulate from zero. This is acceptable short-term; tenants may temporarily exceed limits.
    • Audit buffer: Events are lost. No workaround needed (audit is best-effort).
    • Conversation locks: Release on restart. Workers re-acquire automatically.
  4. Monitor closely: Watch Sentry for Redis connection errors and BullMQ job failures.

6. Monitoring

Health Check

The API health endpoint (/health) checks Redis via PING:
try { await redis.ping(); checks.redis = 'ok'; }
catch { checks.redis = 'error'; }
Railway uses /health as the healthcheck path (configured in railway.toml). Failed health checks trigger automatic container restart (restartPolicyType: on_failure, max 3 retries).
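The try/catch above generalizes to a small, testable helper. The `buildHealth` name and injected check functions are assumptions for illustration; the real handler lives in the API app.

```typescript
// Sketch of how the /health response could be assembled. Each check is an
// injected async function so the logic is testable without live services.
type CheckFn = () => Promise<void>;

export async function buildHealth(checks: Record<string, CheckFn>) {
  const results: Record<string, 'ok' | 'error'> = {};
  for (const [name, fn] of Object.entries(checks)) {
    try {
      await fn(); // e.g. redis.ping() or a trivial DB query
      results[name] = 'ok';
    } catch {
      results[name] = 'error';
    }
  }
  // Any failing check degrades the overall status, matching the JSON
  // shapes shown in section 2.
  const status = Object.values(results).every((v) => v === 'ok') ? 'ok' : 'degraded';
  return { status, timestamp: new Date().toISOString(), checks: results };
}
```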

Connection Event Logging

The Redis client logs connection lifecycle events via Pino:
  • Redis connected — TCP connection established
  • Redis ready — Connection authenticated and ready for commands
  • Redis connection error — Connection error with details
  • Redis connection closed — Connection lost
  • Redis reconnecting — Attempting reconnection with retry delay

Sentry Alerts

Redis connection errors are logged at error level and captured by Sentry. Configure Sentry alerts for:
  • Redis connection error (immediate alert)
  • Health check returning redis: error (degraded state)

Manual Checks

# Check Redis is responding
redis-cli -u "$REDIS_URL" PING
# Expected: PONG

# Check persistence info
redis-cli -u "$REDIS_URL" INFO persistence

# Check BullMQ queue lengths
redis-cli -u "$REDIS_URL" LLEN bull:message-processing:wait
redis-cli -u "$REDIS_URL" LLEN bull:broadcast-sending:wait
redis-cli -u "$REDIS_URL" LLEN bull:audit-processing:wait

# Check memory usage
redis-cli -u "$REDIS_URL" INFO memory | grep used_memory_human

7. Maintenance Schedule

| Frequency | Activity | Purpose |
| --- | --- | --- |
| Weekly | Check rdb_last_save_time via INFO persistence | Confirm snapshots are running |
| Monthly | Review queue lengths and memory usage | Detect growth trends |
| After deploy | Check health endpoint | Confirm Redis connectivity |
| After Redis restart | Check INFO persistence + queue lengths | Verify data survived |