Database Backup & Restore Runbook

Overview

Agentix uses Railway PostgreSQL as its primary data store. The Railway Pro plan provides automatic daily backups with 7-day retention and point-in-time recovery. This runbook documents how to verify that backups are enabled, take manual backups, and restore either through the Railway UI or with pg_restore.

All persistent data lives in PostgreSQL: tenants, users, workflows, workflow versions, contacts, conversations, messages, events, runs, steps, tools, credentials, and audit logs. Data loss here means loss of customer data and workflow definitions.

Environment variables:
  • DATABASE_URL — connection string for the primary PostgreSQL instance

1. Verifying Backups Are Enabled

Perform this check monthly or after any Railway infrastructure change.
  1. Open the Railway Dashboard.
  2. Navigate to your project > PostgreSQL service.
  3. Click the Backups tab.
  4. Confirm the following:
    • Automatic Backups is enabled (enabled by default on Pro plan).
    • Retention period shows 7 days.
    • Last backup timestamp is within the last 24 hours.
  5. If automatic backups are disabled:
    • Click Enable Backups (Pro plan required).
    • If on a free plan, upgrade to Pro or implement the manual backup procedure below as a workaround.
Expected state:
Automatic Backups: Enabled
Retention: 7 days
Last Backup: [within 24 hours]

2. Manual Backup (pg_dump)

Use this for additional safety before major migrations, schema changes, or deployment of breaking changes.

Prerequisites

  • pg_dump installed locally (ships with PostgreSQL client tools)
  • DATABASE_URL from Railway environment variables

Procedure

  1. Get the connection string:
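    # Option A: Use Railway CLI to list the linked service's variables, including DATABASE_URL
    # (railway connect postgres opens an interactive psql session and does not print the connection string)
    railway variables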
    
    # Option B: Copy DATABASE_URL from Railway dashboard
    # Project > PostgreSQL service > Variables tab > DATABASE_URL
    
  2. Run the backup:
    pg_dump -Fc -d "$DATABASE_URL" > agentix_backup_$(date +%Y%m%d_%H%M%S).dump
    
    • -Fc = custom format (compressed, supports selective restore)
    • Output file example: agentix_backup_20260326_143000.dump
  3. Verify the dump file:
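    pg_restore --list agentix_backup_20260326_143000.dump | head -20
    
    The listing should include table entries (tenants, users, workflows, etc.). Name the dump file explicitly: pg_restore accepts a single archive, so a wildcard fails when several dumps match.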
  4. Store the backup off-site:
    # Example: upload to S3
    aws s3 cp agentix_backup_*.dump s3://agentix-backups/$(date +%Y/%m)/
    
    # Or keep locally in a secure directory
    mv agentix_backup_*.dump ~/backups/agentix/
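
Steps 2 and 3 above can be combined into one helper so that a dump is never stored without first being checked. The following is a minimal sketch, not an existing Agentix script: the function names `backup_filename` and `run_backup` and the output directory are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch of a backup helper. Function names and the output directory are
# illustrative assumptions, not part of the Agentix codebase.

# Build a timestamped file name in the same pattern as the example above.
backup_filename() {
  printf 'agentix_backup_%s.dump\n' "$(date +%Y%m%d_%H%M%S)"
}

# Dump the database at $1 into directory $2, then confirm the archive is
# readable. Prints the dump path on success, returns non-zero on failure.
run_backup() {
  local db_url="$1" out_dir="$2" out_file
  mkdir -p "$out_dir" || return 1
  out_file="$out_dir/$(backup_filename)"
  pg_dump -Fc -d "$db_url" -f "$out_file" || return 1
  # pg_restore --list reads the archive's table of contents and fails on a
  # truncated or corrupt file.
  pg_restore --list "$out_file" > /dev/null || return 1
  printf '%s\n' "$out_file"
}
```

A possible invocation is `dump=$(run_backup "$DATABASE_URL" ~/backups/agentix)`, after which `"$dump"` holds the single verified file to pass to `aws s3 cp`.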
    
Timing                       Trigger
Before any Prisma migration  Manual
Before major deploys         Manual
Weekly (critical periods)    Cron job
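Cron example (weekly, Sunday 3 AM). Cron jobs do not inherit your interactive shell environment, so DATABASE_URL must be set in the crontab itself or loaded by the job:
0 3 * * 0 pg_dump -Fc -d "$DATABASE_URL" > /backups/agentix/agentix_backup_$(date +\%Y\%m\%d).dump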
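
One way to give a cron job access to DATABASE_URL without writing the secret into the crontab is a small wrapper that reads it from a protected env file. This is a sketch under assumptions: the function name `load_database_url` and the file layout (one `KEY=value` pair per line, value optionally in double quotes) are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: load DATABASE_URL from an env file for use in a cron job.
# The function name and the env file format are assumptions.

# Read DATABASE_URL=... from the file at $1 and export it.
# Returns non-zero if the file is unreadable or the key is missing.
load_database_url() {
  local env_file="$1" url
  if [ ! -r "$env_file" ]; then
    echo "cannot read $env_file" >&2
    return 1
  fi
  # Take the last DATABASE_URL line, drop the key, strip optional double quotes.
  url=$(sed -n 's/^DATABASE_URL=//p' "$env_file" | tail -n 1 | sed 's/^"\(.*\)"$/\1/')
  if [ -z "$url" ]; then
    echo "DATABASE_URL missing in $env_file" >&2
    return 1
  fi
  DATABASE_URL="$url"
  export DATABASE_URL
}
```

A wrapper script would call `load_database_url` with the path of the env file and then run the pg_dump command; the crontab entry then invokes the wrapper instead of pg_dump directly. Keep the env file readable only by the user that runs the job.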

3. Restore Procedure (Railway UI)

Use this when Railway automatic backups are available and you need a full restore.

Procedure

  1. Navigate to Railway Dashboard > Project > PostgreSQL service.
  2. Click the Backups tab.
  3. Select the backup by timestamp (choose the most recent backup before the incident).
  4. Click Restore.
    • Railway creates a new database volume with the restored data.
    • The original database is preserved (not overwritten).
  5. Update DATABASE_URL in your Railway project environment variables:
    • Go to API service > Variables tab.
    • Update DATABASE_URL to the new connection string from the restored PostgreSQL instance.
  6. Redeploy the API service:
    # Railway will auto-redeploy on variable change, or trigger manually:
    railway up
    
  7. Verify the restore:
    curl https://api.agentix.app/health
    
    Expected response:
    {"status":"ok","timestamp":"...","checks":{"db":"ok","redis":"ok"}}
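
Because a redeploy can take a moment to become healthy, the check in step 7 may need to be repeated. The sketch below polls the health endpoint and inspects the response shown above. The function names, retry counts, and the use of grep instead of a JSON parser are all assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch: poll the health endpoint after a restore.
# Function names and retry defaults are illustrative assumptions.

# Return 0 if the JSON text in $1 contains "<key>":"ok" for the key in $2.
# A grep match suits the flat response shown above; a JSON parser is more robust.
field_is_ok() {
  printf '%s' "$1" | grep -Eq "\"$2\"[[:space:]]*:[[:space:]]*\"ok\""
}

# Return 0 only if status, db, and redis all report "ok".
health_is_ok() {
  local key
  for key in status db redis; do
    field_is_ok "$1" "$key" || return 1
  done
}

# Poll URL $1 up to $2 times (default 10), sleeping $3 seconds (default 6)
# between attempts. Returns 0 as soon as the endpoint reports healthy.
wait_for_healthy() {
  local url="$1" attempts="${2:-10}" delay="${3:-6}" i=0 body
  while [ "$i" -lt "$attempts" ]; do
    if body=$(curl -fsS --max-time 10 "$url") && health_is_ok "$body"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

For example, `wait_for_healthy https://api.agentix.app/health || echo "restore not healthy"` gives the service about a minute before reporting a failure.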
    

Rollback

If the restored database has issues:
  1. Revert DATABASE_URL to the original connection string.
  2. Redeploy the API service.
  3. Investigate the restore issue before retrying.

4. Restore Procedure (pg_restore from dump)

Use this when restoring from a manual pg_dump backup, or when Railway UI restore is unavailable.

Procedure

  1. Create a new PostgreSQL service in Railway (or use an existing empty instance):
    • Railway Dashboard > Project > New > Database > PostgreSQL.
    • Copy the new DATABASE_URL from the Variables tab.
  2. Restore the dump:
    pg_restore -d "$NEW_DATABASE_URL" --clean --if-exists agentix_backup_YYYYMMDD_HHMMSS.dump
    
    • --clean drops existing objects before recreating.
    • --if-exists prevents errors if objects do not exist yet.
  3. Run Prisma migrations to ensure schema is up to date:
    DATABASE_URL="$NEW_DATABASE_URL" npx prisma migrate deploy
    
  4. Verify table counts match expected values:
    psql "$NEW_DATABASE_URL" -c "
      SELECT 'tenants' AS tbl, COUNT(*) FROM tenants
      UNION ALL SELECT 'users', COUNT(*) FROM \"user\"
      UNION ALL SELECT 'workflows', COUNT(*) FROM workflows
      UNION ALL SELECT 'contacts', COUNT(*) FROM contacts
      UNION ALL SELECT 'conversations', COUNT(*) FROM conversations
      UNION ALL SELECT 'messages', COUNT(*) FROM messages;
    "
    
  5. Update DATABASE_URL on the API service to point to the restored instance.
  6. Redeploy and verify health:
    curl https://api.agentix.app/health
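
When the source database is still reachable (for example during a restore test), the counts from step 4 can be compared automatically instead of by eye. The following is a sketch only: the function names are hypothetical, the table list is copied from the query in step 4, and counts will legitimately differ if rows were written to the source after the dump was taken.

```shell
#!/usr/bin/env bash
# Sketch: compare row counts between a source and a restored database.
# Function names are assumptions; the table list mirrors the step 4 query.

TABLES='tenants "user" workflows contacts conversations messages'

# Print the row count of table $2 in the database at $1.
# -A (unaligned) and -t (tuples only) make psql print just the number.
row_count() {
  psql -d "$1" -At -c "SELECT COUNT(*) FROM $2;"
}

# Compare every table in TABLES. Returns 0 if all counts match,
# 1 on any mismatch, 2 if a query fails.
compare_counts() {
  local src="$1" dst="$2" status=0 t a b
  for t in $TABLES; do
    a=$(row_count "$src" "$t") || return 2
    b=$(row_count "$dst" "$t") || return 2
    if [ "$a" = "$b" ]; then
      echo "OK       $t: $a"
    else
      echo "MISMATCH $t: source=$a restored=$b"
      status=1
    fi
  done
  return "$status"
}
```

A possible call is `compare_counts "$DATABASE_URL" "$NEW_DATABASE_URL"`, run before DATABASE_URL on the API service is switched to the restored instance.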
    

5. Post-Restore Checklist

After any restore (Railway UI or pg_restore), verify:
  • Health endpoint returns {"status":"ok","checks":{"db":"ok","redis":"ok"}}
  • Recent workflow runs are present (check /api/runs or Railway logs)
  • Tenant data integrity: spot-check 2-3 tenants via API or Prisma Studio
    DATABASE_URL="$DATABASE_URL" npx prisma studio
    
  • Published workflows load correctly (check workflow_versions table has entries)
  • Contact tags and groups are intact
  • BullMQ workers are processing jobs (check Railway API service logs)
  • Monitor Sentry for errors in the first 30 minutes post-restore
  • Verify webhook processing is working (send a test WhatsApp message)

6. Testing Schedule

Frequency                Activity                               Purpose
Monthly                  Verify Railway backup tab (section 1)  Confirm backups are running
Quarterly                Full restore test to staging database  Validate restore procedure works end-to-end
Before major migrations  Manual pg_dump                         Safety net for schema changes

Quarterly Restore Test Procedure

  1. Create a temporary PostgreSQL instance in Railway.
  2. Restore the latest automatic backup (Railway UI) or most recent manual dump.
  3. Point a local API instance at the restored database.
  4. Run the health check and spot-check 2-3 tenants.
  5. Delete the temporary PostgreSQL instance.
  6. Document the test result and date in this section:
Restore Test Log:
Date        Method                   Result       Notes
YYYY-MM-DD  Railway UI / pg_restore  Pass / Fail  Notes

7. Incident Response

If data loss is detected:
  1. Do not restart or redeploy — this may overwrite recovery options.
  2. Determine the scope: which tables/rows are affected?
  3. Check Railway backup tab for the most recent clean backup.
  4. If Railway backup is available: follow Restore Procedure (Railway UI).
  5. If manual dump is more recent: follow Restore Procedure (pg_restore).
  6. If partial data loss: consider point-in-time recovery, or a selective pg_restore of only the affected tables into a separate instance.
  7. After restore: run the Post-Restore Checklist.
  8. Post-incident: document root cause and update this runbook if needed.
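
For the partial data loss case in step 6, pg_restore can restore individual tables from a custom-format dump with its -t option. The sketch below wraps that in a hypothetical helper, `restore_tables`, and assumes the target is a scratch database rather than production, so the recovered rows can be inspected before anything is copied back.

```shell
#!/usr/bin/env bash
# Sketch: restore only selected tables from a dump into a scratch database.
# The function name is an assumption; restoring into production directly
# may conflict with rows that already exist there.

# Usage: restore_tables <scratch_db_url> <dump_file> <table> [<table> ...]
restore_tables() {
  local db_url="$1" dump="$2" n
  shift 2
  if [ "$#" -eq 0 ]; then
    echo "usage: restore_tables <db_url> <dump_file> <table>..." >&2
    return 2
  fi
  # Rewrite the argument list from "a b" to "-t a -t b" so all tables are
  # restored in a single pg_restore run.
  n=$#
  while [ "$n" -gt 0 ]; do
    set -- "$@" -t "$1"
    shift
    n=$((n - 1))
  done
  pg_restore -d "$db_url" "$@" "$dump"
}
```

For example, `restore_tables "$SCRATCH_DATABASE_URL" agentix_backup_20260326_143000.dump messages contacts` would restore those two tables, where `SCRATCH_DATABASE_URL` is a placeholder for an empty temporary instance. Note that -t does not restore objects the tables depend on, so foreign key targets may need to be included in the list.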