
Redis Persistence & BullMQ Resilience Runbook

Overview

Agentix uses Railway Redis for:
| Use Case | Module | Data Type |
| --- | --- | --- |
| BullMQ job queues | apps/api/src/lib/queue.ts | Jobs (message-processing, broadcast-sending, import-contacts, audit-processing) |
| Rate-limit counters | apps/api/src/lib/quota.ts | Atomic INCR/INCRBY counters with TTL |
| Audit event buffer | apps/api/src/lib/audit.ts | LPUSH list buffer, flushed by audit worker |
| Conversation locks | Worker heartbeat locks | SET with TTL for per-conversation concurrency control |
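The conversation-lock row above can be sketched in TypeScript as follows. This is a minimal, illustrative version of the SET-with-TTL pattern: the key name, TTL, `RedisLike` interface, and `acquireConversationLock` name are assumptions for this sketch, not the actual worker code.

```typescript
// Minimal sketch of the per-conversation lock pattern (SET NX + TTL).
// Key name, TTL, and the RedisLike interface are illustrative only.
import { randomUUID } from 'node:crypto';

// Subset of the ioredis client surface this sketch needs.
export interface RedisLike {
  set(key: string, value: string, px: 'PX', ttlMs: number, nx: 'NX'): Promise<'OK' | null>;
  eval(script: string, numKeys: number, key: string, token: string): Promise<unknown>;
}

// Try to take the lock; returns a release function, or null if another
// worker already holds it.
export async function acquireConversationLock(
  redis: RedisLike,
  conversationId: string,
  ttlMs = 30_000,
): Promise<(() => Promise<void>) | null> {
  const key = `lock:conversation:${conversationId}`;
  const token = randomUUID(); // ensures we only ever delete our own lock
  if ((await redis.set(key, token, 'PX', ttlMs, 'NX')) !== 'OK') return null;
  // Compare-and-delete: if the lock expired and was re-acquired by another
  // worker, the token no longer matches and we delete nothing.
  return async () => {
    await redis.eval(
      "if redis.call('get', KEYS[1]) == ARGV[1] then return redis.call('del', KEYS[1]) else return 0 end",
      1,
      key,
      token,
    );
  };
}
```

Because the lock value carries a TTL, a crashed worker's lock simply expires, which is why a restart releases locks instead of leaking them.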
Data loss on Redis restart means:
  • In-flight BullMQ jobs may be lost (stall detection handles this).
  • Rate-limit counters reset to 0 (quotas re-accumulate — acceptable).
  • Audit event buffer loses last few seconds of events (best-effort by design).
  • Conversation locks release (workers re-acquire on next job).
Environment variables:
  • REDIS_URL — connection string for the Railway Redis instance
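The quota-counter pattern from the table above (atomic increment plus TTL) can be sketched like this. The key format, window length, and `consumeQuota` name are assumptions for illustration; the authoritative logic lives in apps/api/src/lib/quota.ts.

```typescript
// Illustrative sketch of a TTL-based quota counter (atomic INCRBY + EXPIRE).
// Key format, window, and function name are assumptions, not quota.ts itself.
export interface QuotaRedis {
  incrby(key: string, amount: number): Promise<number>;
  expire(key: string, seconds: number): Promise<number>;
}

// Returns true if the tenant is still within its limit after consuming
// `amount`. The counter is approximate by design: if Redis restarts, the
// key is restored from the last snapshot (or resets to 0 and re-accumulates).
export async function consumeQuota(
  redis: QuotaRedis,
  tenantId: string,
  amount: number,
  limit: number,
  windowSec = 3_600,
): Promise<boolean> {
  const key = `quota:${tenantId}`;
  const used = await redis.incrby(key, amount);
  // The first write in this window created the key; attach the TTL once so
  // the window does not slide on every increment.
  if (used === amount) await redis.expire(key, windowSec);
  return used <= limit;
}
```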

1. Railway Redis Persistence

Railway Redis instances use RDB (Redis Database) snapshots by default. RDB creates point-in-time snapshots of the dataset at configured intervals. Important: Railway Redis does NOT support custom redis.conf. Persistence settings are managed by Railway infrastructure. You cannot switch to AOF or modify save intervals via configuration.

What RDB Provides

  • Periodic snapshots (typically every 1-5 minutes depending on write frequency)
  • Compact on-disk format
  • Fast restart from snapshot
  • Data loss window: up to the interval since last snapshot (usually seconds to minutes)

What AOF Would Provide (Not Available on Railway)

  • Append-only file logging every write operation
  • Near-zero data loss on restart
  • Larger disk usage, slightly higher latency
  • Railway does not currently expose AOF configuration

2. Verifying Persistence Is Active

Via Redis CLI

# Connect to Railway Redis
redis-cli -u "$REDIS_URL" INFO persistence
Key fields to check:
| Field | Expected Value | Meaning |
| --- | --- | --- |
| rdb_last_save_time | Recent Unix timestamp | Last successful RDB snapshot |
| rdb_changes_since_last_save | Small number (<1000) | Unsaved changes since last snapshot |
| rdb_bgsave_in_progress | 0 or 1 | Whether a snapshot is in progress |
| aof_enabled | 0 (RDB only) or 1 (AOF+RDB) | Whether AOF is active |
| loading | 0 | Not currently loading a snapshot |
Example healthy output:
rdb_last_save_time:1774538400
rdb_changes_since_last_save:42
rdb_bgsave_in_progress:0
aof_enabled:0
loading:0
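For automated checks, the INFO persistence output can be parsed and tested for staleness programmatically. The field names below are the real Redis ones; the freshness threshold and function names are assumptions for this sketch.

```typescript
// Parse `INFO persistence` output into key/value pairs and flag a stale
// RDB snapshot. Threshold and function names are illustrative.
export function parseInfo(info: string): Record<string, string> {
  const fields: Record<string, string> = {};
  for (const line of info.split(/\r?\n/)) {
    if (line.startsWith('#') || !line.includes(':')) continue; // skip section headers
    const idx = line.indexOf(':');
    fields[line.slice(0, idx)] = line.slice(idx + 1);
  }
  return fields;
}

// A snapshot is "fresh" if it ran within maxAgeSec of `nowSec`.
export function snapshotIsFresh(
  fields: Record<string, string>,
  nowSec: number,
  maxAgeSec = 600, // 10 minutes: generous vs. the typical 1-5 minute interval
): boolean {
  const lastSave = Number(fields['rdb_last_save_time'] ?? 0);
  return lastSave > 0 && nowSec - lastSave <= maxAgeSec;
}
```

Note that `rdb_last_save_time:0` counts as stale, matching the "persistence is disabled" symptom described in section 5.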

Via Health Check

The API health endpoint checks Redis connectivity:
curl https://api.agentix.app/health
Expected:
{"status":"ok","timestamp":"...","checks":{"db":"ok","redis":"ok"}}
If Redis is down, the health check returns:
{"status":"degraded","timestamp":"...","checks":{"db":"ok","redis":"error"}}

Via Railway Dashboard

  1. Navigate to Railway Dashboard > Project > Redis service.
  2. Check the Settings or Variables tab.
  3. Confirm the Redis service is running and has a volume attached (persistence requires a volume).

3. What Survives a Redis Restart

| Data Type | Survives? | Notes |
| --- | --- | --- |
| BullMQ jobs in waiting state | Yes | Stored as Redis keys, restored from RDB snapshot |
| BullMQ jobs in delayed state | Yes | Stored as sorted set entries |
| BullMQ jobs in active state during crash | Recovered | Moved to stalled by BullMQ stall detection, then retried |
| BullMQ jobs in completed/failed | Yes | Kept per removeOnComplete/removeOnFail config |
| Rate-limit counters (quota:* keys) | Yes | TTL-based keys, restored from snapshot |
| Audit buffer (audit:buffer list) | Partial | May lose last few seconds of buffered events (acceptable) |
| Conversation locks (SET with TTL) | Released | Locks expire naturally; workers re-acquire on next job |
| Rate-limiter counters (request-level) | Yes | TTL-based keys, restored from snapshot |
Data loss window: Up to the interval since the last RDB snapshot (typically 1-5 minutes). For Agentix, this is acceptable because:
  • BullMQ has built-in stall detection for active jobs.
  • Quota counters are approximate by design (see quota.ts header comment).
  • Audit is explicitly best-effort.

4. BullMQ Resilience Configuration

Current Queue Configuration

From apps/api/src/lib/queue.ts:
| Queue | Attempts | Backoff | Retention |
| --- | --- | --- | --- |
| message-processing | 3 | Exponential, 1s base | 1000 completed, 5000 failed |
| broadcast-sending | 3 | Exponential, 2s base | 5000 completed, 5000 failed |
| import-contacts | 1 | None (no retries) | 1 hour completed, 24 hours failed |
| audit-processing | 3 | Exponential, 2s base | 100 completed, 500 failed |
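The table above maps onto BullMQ's defaultJobOptions. A sketch of what one queue's setup could look like (the connection import and exact option shapes are assumptions; the authoritative config is apps/api/src/lib/queue.ts):

```typescript
import { Queue } from 'bullmq';
import { redis } from './redis'; // assumed shared ioredis connection

// Illustrative defaults matching the message-processing row above.
export const messageQueue = new Queue('message-processing', {
  connection: redis,
  defaultJobOptions: {
    attempts: 3,                                   // 3 total attempts
    backoff: { type: 'exponential', delay: 1000 }, // 1s, 2s, 4s, ...
    removeOnComplete: { count: 1000 },             // keep last 1000 completed
    removeOnFail: { count: 5000 },                 // keep last 5000 failed
  },
});
```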

Redis Client Configuration

From apps/api/src/lib/redis.ts:
  • maxRetriesPerRequest: null — Required for BullMQ. Without this, BullMQ operations fail after the default 20 retries; null means "retry indefinitely", which lets BullMQ apply its own retry logic.
  • retryStrategy — Exponential backoff reconnection (200ms base, 5s cap, 20 max attempts).
  • reconnectOnError — Automatic reconnect on READONLY errors (Redis failover scenarios).
  • enableReadyCheck: true — Waits for Redis to be fully ready before issuing commands.
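The retry and failover settings can be sketched as small pure functions in the shape ioredis expects. The values come from the bullets above; the function bodies are illustrative, not a copy of redis.ts.

```typescript
// Sketch of the reconnection policy described above (assumed shape; the
// real values live in apps/api/src/lib/redis.ts).
const BASE_MS = 200;
const CAP_MS = 5_000;
const MAX_ATTEMPTS = 20;

// ioredis calls retryStrategy(times) after each failed connection attempt;
// returning a number schedules a reconnect after that many ms, and
// returning null gives up.
export function retryStrategy(times: number): number | null {
  if (times > MAX_ATTEMPTS) return null;
  return Math.min(BASE_MS * 2 ** (times - 1), CAP_MS);
}

// ioredis calls reconnectOnError(err); returning true forces a reconnect.
// READONLY errors appear when writes hit a replica during a failover.
export function reconnectOnError(err: Error): boolean {
  return err.message.includes('READONLY');
}
```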

Stall Detection

BullMQ automatically detects stalled jobs (jobs that were active when a worker crashed). Default behavior:
  • Stall check interval: 30 seconds (default stalledInterval)
  • Max stall count: 1 (default maxStalledCount) — a job is moved to failed after being stalled this many times
Recommended worker settings (apply when creating BullMQ Workers):
new Worker('message-processing', processor, {
  connection: redis,
  stalledInterval: 30_000,   // Check for stalled jobs every 30s
  maxStalledCount: 2,        // Allow up to 2 stall recoveries before failing
});

5. If Persistence Is Disabled (Emergency)

Symptoms

  • After Redis restart, all queues are empty.
  • INFO persistence shows rdb_last_save_time:0 or very old timestamp.
  • Rate-limit counters reset to 0 unexpectedly.

Immediate Actions

  1. Check Railway Redis volume:
    • Railway Dashboard > Redis service > check if a volume is attached.
    • If no volume: Redis is running in-memory only. All data is lost on restart.
  2. Contact Railway support to enable RDB persistence or attach a volume.
  3. Workarounds while persistence is unavailable:
    • BullMQ: Jobs are resilient by design. Producers re-enqueue on failure. Workers detect stalled jobs.
    • Rate-limit counters: Reset to 0 on restart. Quotas re-accumulate from zero. This is acceptable short-term; tenants may temporarily exceed limits.
    • Audit buffer: Events are lost. No workaround needed (audit is best-effort).
    • Conversation locks: Release on restart. Workers re-acquire automatically.
  4. Monitor closely: Watch Sentry for Redis connection errors and BullMQ job failures.

6. Monitoring

Health Check

The API health endpoint (/health) checks Redis via PING:
try { await redis.ping(); checks.redis = 'ok'; }
catch { checks.redis = 'error'; }
Railway uses /health as the healthcheck path (configured in railway.toml). Failed health checks trigger automatic container restart (restartPolicyType: on_failure, max 3 retries).
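The try/catch above generalizes to a small, testable helper. The `buildHealth` name and injected check functions are assumptions for illustration; the real handler lives in the API app.

```typescript
// Sketch of how the /health response could be assembled. Each check is an
// injected async function so the logic is testable without live services.
type CheckFn = () => Promise<void>;

export async function buildHealth(checks: Record<string, CheckFn>) {
  const results: Record<string, 'ok' | 'error'> = {};
  for (const [name, fn] of Object.entries(checks)) {
    try {
      await fn(); // e.g. redis.ping() or a trivial DB query
      results[name] = 'ok';
    } catch {
      results[name] = 'error';
    }
  }
  // Any failing check degrades the overall status, matching the JSON
  // shapes shown in section 2.
  const status = Object.values(results).every((v) => v === 'ok') ? 'ok' : 'degraded';
  return { status, timestamp: new Date().toISOString(), checks: results };
}
```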

Connection Event Logging

The Redis client logs connection lifecycle events via Pino:
  • Redis connected — TCP connection established
  • Redis ready — Connection authenticated and ready for commands
  • Redis connection error — Connection error with details
  • Redis connection closed — Connection lost
  • Redis reconnecting — Attempting reconnection with retry delay

Sentry Alerts

Redis connection errors are logged at error level and captured by Sentry. Configure Sentry alerts for:
  • Redis connection error (immediate alert)
  • Health check returning redis: error (degraded state)

Manual Checks

# Check Redis is responding
redis-cli -u "$REDIS_URL" PING
# Expected: PONG

# Check persistence info
redis-cli -u "$REDIS_URL" INFO persistence

# Check BullMQ queue lengths
redis-cli -u "$REDIS_URL" LLEN bull:message-processing:wait
redis-cli -u "$REDIS_URL" LLEN bull:broadcast-sending:wait
redis-cli -u "$REDIS_URL" LLEN bull:audit-processing:wait

# Check memory usage
redis-cli -u "$REDIS_URL" INFO memory | grep used_memory_human

7. Maintenance Schedule

| Frequency | Activity | Purpose |
| --- | --- | --- |
| Weekly | Check rdb_last_save_time via INFO persistence | Confirm snapshots are running |
| Monthly | Review queue lengths and memory usage | Detect growth trends |
| After deploy | Check health endpoint | Confirm Redis connectivity |
| After Redis restart | Check INFO persistence + queue lengths | Verify data survived |