Bike4Mind Backup & Disaster Recovery Runbook
1. Data Inventory & Backup Requirements
Data Type | Storage Location | Criticality | Backup Frequency | Retention Period |
---|---|---|---|---|
User Data | MongoDB | Critical | Continuous | 30 days |
Session Data | MongoDB | High | Daily | 7 days |
Uploaded Files | S3 (fabFilesBucket) | Critical | Versioning + Replication | 30 days |
Generated Images | S3 (generatedImagesBucket) | Medium | Versioning + Replication | 14 days |
Application Files | S3 (appFilesBucket) | High | Versioning + Replication | 30 days |
Payment Information | Stripe (external) | Critical | Managed by Stripe | N/A |
System Configuration | AWS Parameter Store | High | Daily | 90 days |
Infrastructure Code | GitHub | Critical | Per commit | Indefinite |
2. Current Backup Implementation
2.1 MongoDB Backup
MongoDB Atlas provides automated backups with the following configuration:
```ts
// Documentation of existing MongoDB Atlas backup configuration
const mongoBackupConfig = {
  provider: "MongoDB Atlas",
  backupMethod: "Continuous Cloud Backup",
  retentionPeriod: "30 days",
  pointInTimeRecovery: true,
  encryptionAtRest: true,
  location: "AWS us-east-2 (primary)"
};
```
2.2 S3 Bucket Configuration
Update S3 bucket configurations in SST config to enable versioning and replication:
```ts
// Update in sst.config.ts for each critical bucket
const fabFilesBucket = new Bucket(stack, 'fabFilesBucket', {
  cdk: {
    bucket: fabFileBucketName
      ? {
          bucketName: fabFileBucketName,
          removalPolicy: RemovalPolicy.RETAIN,
          versioned: true, // Enable versioning for all production buckets
          encryption: BucketEncryption.S3_MANAGED, // Ensure encryption at rest
        }
      : {
          autoDeleteObjects: true,
          removalPolicy: RemovalPolicy.DESTROY,
          versioned: app.stage === 'production', // Only version in production
          encryption: BucketEncryption.S3_MANAGED,
        },
  },
});
```
For production, add cross-region replication. Replication is not exposed by the L2 `Bucket` construct, so it is configured on the underlying `CfnBucket`. Note that replication requires versioning on both source and destination, and the backup bucket in the secondary region must already exist:

```ts
// For production, add cross-region replication to a pre-created backup bucket
if (app.stage === 'production') {
  const backupBucketArn = `arn:aws:s3:::${fabFileBucketName}-backup`; // us-west-2

  const replicationRole = new iam.Role(stack, 'ReplicationRole', {
    assumedBy: new iam.ServicePrincipal('s3.amazonaws.com'),
  });
  // Permissions S3 needs to read from the source bucket...
  replicationRole.addToPolicy(new iam.PolicyStatement({
    actions: ['s3:GetReplicationConfiguration', 's3:ListBucket'],
    resources: [fabFilesBucket.bucketArn],
  }));
  replicationRole.addToPolicy(new iam.PolicyStatement({
    actions: [
      's3:GetObjectVersionForReplication',
      's3:GetObjectVersionAcl',
      's3:GetObjectVersionTagging',
    ],
    resources: [`${fabFilesBucket.bucketArn}/*`],
  }));
  // ...and to write replicas to the destination bucket
  replicationRole.addToPolicy(new iam.PolicyStatement({
    actions: ['s3:ReplicateObject', 's3:ReplicateDelete', 's3:ReplicateTags'],
    resources: [`${backupBucketArn}/*`],
  }));

  // Set the replication configuration via the L1 escape hatch
  const cfnBucket = fabFilesBucket.cdk.bucket.node.defaultChild as s3.CfnBucket;
  cfnBucket.replicationConfiguration = {
    role: replicationRole.roleArn,
    rules: [
      {
        destination: {
          bucket: backupBucketArn,
          storageClass: 'STANDARD',
        },
        prefix: '',
        status: 'Enabled',
      },
    ],
  };
}
```
2.3 Infrastructure Configuration Backup
```ts
// Add to sst.config.ts for backing up infrastructure configuration
new Cron(stack, 'configBackup', {
  schedule: 'cron(0 0 * * ? *)', // Daily at midnight UTC
  job: {
    function: {
      handler: 'packages/client/server/cron/configBackup.handler',
      // Note: GetParametersByPath is a distinct IAM action from GetParameters
      permissions: ['ssm:GetParametersByPath', 's3:PutObject'],
      environment: {
        CONFIG_BACKUP_BUCKET: 'bike4mind-config-backup',
      },
    },
  },
});
```
Implementation of the config backup handler:
```ts
// packages/client/server/cron/configBackup.ts
import AWS from 'aws-sdk';

export const handler = async () => {
  const ssm = new AWS.SSM();
  const s3 = new AWS.S3();

  // Get all parameters under the path (the API pages results, so follow NextToken)
  const parameters: AWS.SSM.Parameter[] = [];
  let nextToken: string | undefined;
  do {
    const page = await ssm.getParametersByPath({
      Path: '/bike4mind/',
      Recursive: true,
      WithDecryption: true,
      NextToken: nextToken,
    }).promise();
    parameters.push(...(page.Parameters ?? []));
    nextToken = page.NextToken;
  } while (nextToken);

  // Prepare backup object (excluding actual secret values)
  const backupData = {
    timestamp: new Date().toISOString(),
    parameters: parameters.map(param => ({
      name: param.Name,
      type: param.Type,
      version: param.Version,
      lastModified: param.LastModifiedDate,
      // Don't include actual values for SecureString types
      value: param.Type !== 'SecureString' ? param.Value : '[ENCRYPTED]',
    })),
  };

  // Save to S3
  await s3.putObject({
    Bucket: process.env.CONFIG_BACKUP_BUCKET!,
    Key: `config-backup-${new Date().toISOString()}.json`,
    Body: JSON.stringify(backupData, null, 2),
    ServerSideEncryption: 'AES256',
  }).promise();

  return { success: true, timestamp: new Date().toISOString() };
};
```
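The redaction step in the handler can be isolated as a pure helper so it is easy to unit-test without AWS. This is a sketch: `ParameterSummary` and `redactParameters` are illustrative names, not part of the codebase.

```ts
// Shape mirroring the fields the backup keeps from an SSM parameter
interface ParameterSummary {
  name?: string;
  type?: string;
  version?: number;
  value?: string;
}

// Redact SecureString values so backups never contain plaintext secrets
function redactParameters(params: ParameterSummary[]): ParameterSummary[] {
  return params.map(p => ({
    ...p,
    value: p.type === 'SecureString' ? '[ENCRYPTED]' : p.value,
  }));
}

const sample: ParameterSummary[] = [
  { name: '/bike4mind/db-password', type: 'SecureString', version: 3, value: 'hunter2' },
  { name: '/bike4mind/log-level', type: 'String', version: 1, value: 'info' },
];
const redacted = redactParameters(sample);
```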
3. Recovery Objectives
Environment | RTO (Recovery Time Objective) | RPO (Recovery Point Objective) |
---|---|---|
Production | 1 hour | 5 minutes |
Staging | 4 hours | 1 hour |
Development | 8 hours | 24 hours |
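After an incident or DR test, the achieved RTO/RPO should be compared against these objectives (Section 5.4 calls for exactly this measurement). A minimal sketch, assuming a minutes-based encoding of the table above; the names are illustrative:

```ts
// Recovery objectives per environment, in minutes (from the table above)
const objectives: Record<string, { rtoMinutes: number; rpoMinutes: number }> = {
  production: { rtoMinutes: 60, rpoMinutes: 5 },
  staging: { rtoMinutes: 240, rpoMinutes: 60 },
  development: { rtoMinutes: 480, rpoMinutes: 24 * 60 },
};

interface Incident {
  lastBackupAt: Date; // most recent recoverable point before the failure
  failedAt: Date;     // when the outage began
  recoveredAt: Date;  // when service was restored
}

// Compare an incident's actual RTO/RPO against the objectives for an environment
function checkObjectives(env: string, incident: Incident) {
  const target = objectives[env];
  const actualRtoMinutes = (incident.recoveredAt.getTime() - incident.failedAt.getTime()) / 60000;
  const actualRpoMinutes = (incident.failedAt.getTime() - incident.lastBackupAt.getTime()) / 60000;
  return {
    actualRtoMinutes,
    actualRpoMinutes,
    rtoMet: actualRtoMinutes <= target.rtoMinutes,
    rpoMet: actualRpoMinutes <= target.rpoMinutes,
  };
}

// Example: a production outage recovered 42 minutes after failure,
// with the last recoverable point 3 minutes before the failure
const result = checkObjectives('production', {
  lastBackupAt: new Date('2024-01-01T00:00:00Z'),
  failedAt: new Date('2024-01-01T00:03:00Z'),
  recoveredAt: new Date('2024-01-01T00:45:00Z'),
});
```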
4. Recovery Runbook
4.1 MongoDB Recovery Procedure
# MongoDB Recovery Procedure
## Prerequisites
- MongoDB Atlas admin access
- AWS CLI configured with appropriate permissions
- Latest application deployment package
## Procedure
### 1. Assess the Outage
- Determine if this is a temporary connectivity issue or data corruption
- Identify affected databases and collections
### 2. Point-in-Time Recovery (Data Corruption)
1. Log in to MongoDB Atlas dashboard
2. Navigate to "Backup" section for the affected cluster
3. Select "Restore" option
4. Choose the appropriate backup point (timestamp) before corruption
5. Select "Point-in-Time Recovery" for precision
6. Choose destination:
- For testing: Create a new cluster
- For production: Restore to existing cluster
7. Initiate recovery and monitor progress
8. Verify data integrity after restore completes
### 3. Cluster Failover (Region Outage)
1. Log in to MongoDB Atlas dashboard
2. Navigate to "Clusters" section
3. For the affected cluster, click the "..." menu
4. Select "Initiate Failover"
5. Confirm the action
6. Monitor the failover process (typically 30-60 seconds)
7. Verify application connectivity to the new primary
### 4. Complete Cluster Restore (Catastrophic Failure)
1. Log in to MongoDB Atlas dashboard
2. Navigate to "Projects" section
3. Create a new project if needed
4. Click "Build a Cluster"
5. Choose configuration matching the original cluster
6. Once provisioned, go to "Backup" section
7. Select "Restore" and choose the source cluster/backup
8. Select destination as the new cluster
9. Initiate recovery and monitor progress
10. Update application connection string to point to new cluster
### 5. Verification
1. Run database connectivity test: `node packages/scripts/verify-db-connection.js`
2. Check document counts match expected values
3. Verify application functionality through critical user flows
4. Monitor error rates and performance metrics
## Contact Information
- Primary: Database Administrator (dba@bike4mind.com)
- Secondary: Lead Backend Developer
- Escalation: CTO
4.2 S3 Data Recovery Procedure
# S3 Bucket Recovery Procedure
## Prerequisites
- AWS CLI configured with appropriate permissions
- IAM role with S3 access
## Recovery Scenarios
### 1. Accidentally Deleted Objects (Using Versioning)
1. Identify the deleted objects and their versions:
aws s3api list-object-versions --bucket <bucket-name> --prefix <object-key>
2. Restore the previous version by copying it over the deletion marker:
aws s3api copy-object --copy-source <bucket-name>/<object-key>?versionId=<version-id> --bucket <bucket-name> --key <object-key>
3. Alternatively, use the AWS Console to restore versions:
- Navigate to S3 bucket
- Enable "Show versions"
- Select the previous version
- Choose "Download" or "Restore"
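The `list-object-versions` output from step 1 can be post-processed to decide which version to use as the copy-source in step 2. A sketch over the response shape (the field names follow the S3 API; `versionToRestore` is a hypothetical helper, not a packaged script):

```ts
interface ObjectVersion {
  Key: string;
  VersionId: string;
  IsLatest: boolean;
  LastModified: string; // ISO timestamp
}

interface ListVersionsResponse {
  Versions: ObjectVersion[];      // real object versions
  DeleteMarkers: ObjectVersion[]; // delete markers
}

// If the latest entry for a key is a delete marker, return the most recent
// real version (the copy-source for step 2); otherwise null.
function versionToRestore(resp: ListVersionsResponse, key: string): string | null {
  const deleted = resp.DeleteMarkers.some(m => m.Key === key && m.IsLatest);
  if (!deleted) return null;
  const candidates = resp.Versions
    .filter(v => v.Key === key)
    .sort((a, b) => b.LastModified.localeCompare(a.LastModified));
  return candidates.length > 0 ? candidates[0].VersionId : null;
}

// Example response: the key has two real versions and a latest delete marker
const resp: ListVersionsResponse = {
  Versions: [
    { Key: 'uploads/report.pdf', VersionId: 'v1', IsLatest: false, LastModified: '2024-01-01T00:00:00Z' },
    { Key: 'uploads/report.pdf', VersionId: 'v2', IsLatest: false, LastModified: '2024-02-01T00:00:00Z' },
  ],
  DeleteMarkers: [
    { Key: 'uploads/report.pdf', VersionId: 'd1', IsLatest: true, LastModified: '2024-03-01T00:00:00Z' },
  ],
};
```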
### 2. Bucket Corruption or Regional Outage
1. If bucket versioning is enabled, recover using previous versions
2. If cross-region replication is configured, use the replica:
aws s3 sync s3://<backup-bucket-name>/ s3://<primary-bucket-name>/
3. For complete recovery from backup:
aws s3 sync s3://<backup-bucket-name>/ s3://<new-bucket-name>/
4. Update application configuration to point to the new bucket if needed
### 3. Accidental Bucket Deletion
1. Create a new bucket with the same name (if available):
aws s3api create-bucket --bucket <bucket-name> --region <region> --create-bucket-configuration LocationConstraint=<region>
(omit `--create-bucket-configuration` for us-east-1)
2. Apply the same bucket policy and configuration:
aws s3api put-bucket-policy --bucket <bucket-name> --policy file://bucket-policy.json
aws s3api put-bucket-versioning --bucket <bucket-name> --versioning-configuration Status=Enabled
3. Restore data from backup bucket:
aws s3 sync s3://<backup-bucket-name>/ s3://<bucket-name>/
4. Verify data integrity and accessibility
## Verification
Run the verification script to check critical objects:
node packages/scripts/verify-s3-content.js
## Contact Information
- Primary: Cloud Infrastructure Engineer
- Secondary: DevOps Lead
- Escalation: CTO
4.3 Full Application Recovery Procedure
# Full Application Recovery Procedure
## Prerequisites
- AWS administrator access
- SST deployment access
- MongoDB Atlas admin access
- Source code access (GitHub)
## Recovery Procedure
### 1. Assess the Outage
- Identify affected components (database, storage, compute)
- Determine the scope (region-specific or multi-region)
- Assemble recovery team
### 2. Infrastructure Recovery
1. If using a multi-region strategy:
- Update Route53 DNS to point to backup region
- Verify DNS propagation using `dig` or similar tools
2. If rebuilding in the same region:
- Verify AWS service status in the region
- Deploy infrastructure using SST:
npx sst deploy --stage production
### 3. Database Recovery
1. Follow the MongoDB Recovery Procedure (Section 4.1)
2. Verify database connectivity and data integrity
### 4. Storage Recovery
1. Follow the S3 Recovery Procedure (Section 4.2)
2. Update application configuration if necessary
### 5. Application Deployment
1. Deploy application code to the recovered infrastructure:
npx sst deploy --stage production
2. Verify deployment success
### 6. Verification
1. Run comprehensive health check:
node packages/scripts/system-health-check.js --environment production
2. Verify critical user flows:
- Authentication
- File operations
- API functionality
3. Monitor application metrics and logs
### 7. Post-Recovery
1. Document incident details
2. Update recovery procedure if needed
3. Schedule post-mortem analysis
4. Implement preventive measures
## Contact Information
- Incident Commander: CTO
- Database Recovery: Database Administrator
- Infrastructure Recovery: DevOps Lead
- Application Recovery: Lead Developer
5. Recovery Test Plan
5.1 Regular Testing Schedule
Test Type | Frequency | Environment | Notification |
---|---|---|---|
MongoDB Point-in-Time Recovery | Monthly | Staging | 3 business days |
S3 Object Version Recovery | Quarterly | All | 1 business day |
Full DR Simulation | Semi-annually | Production | 2 weeks |
5.2 MongoDB Recovery Test Procedure
```ts
// packages/scripts/test-mongo-recovery.ts
export async function testMongoRecovery() {
  // 1. Create test collection with known data
  // 2. Record timestamp
  // 3. Perform modifying operations (update/delete)
  // 4. Initiate point-in-time recovery to the timestamp
  // 5. Verify data matches the original state
  // 6. Clean up test collection
}
```
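The steps in this stub can be exercised without touching Atlas by modeling point-in-time recovery in memory: snapshots accumulate on each write, and recovery replays the latest snapshot at or before the chosen timestamp. A self-contained sketch (the class is illustrative, not the actual test harness):

```ts
// Minimal in-memory stand-in for point-in-time recovery
class SnapshottedCollection<T> {
  private history: { at: number; docs: Map<string, T> }[] = [];
  private docs = new Map<string, T>();

  // Apply a write at logical time `at` (doc === null means delete)
  write(at: number, id: string, doc: T | null) {
    if (doc === null) this.docs.delete(id);
    else this.docs.set(id, doc);
    this.history.push({ at, docs: new Map(this.docs) });
  }

  // Restore the collection to its state at or before time t
  restoreTo(t: number): Map<string, T> {
    const snapshots = this.history.filter(s => s.at <= t);
    const last = snapshots[snapshots.length - 1];
    this.docs = last ? new Map(last.docs) : new Map();
    return this.docs;
  }
}

// Create known data, "corrupt" it, then recover to the earlier timestamp
const collection = new SnapshottedCollection<string>();
collection.write(1, 'a', 'original');   // step 1: known data at t=1
collection.write(2, 'a', 'corrupted');  // step 3: modifying operation
collection.write(3, 'a', null);         // step 3: delete
const restored = collection.restoreTo(1); // step 4: recover to t=1
```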
5.3 S3 Recovery Test Procedure
```ts
// packages/scripts/test-s3-recovery.ts
export async function testS3Recovery() {
  // 1. Upload test file with known content
  // 2. Overwrite with different content or delete
  // 3. Recover using versioning
  // 4. Verify recovered content matches original
  // 5. Clean up test files
}
```
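Likewise, the versioning semantics this test relies on can be modeled in memory: every put appends a version, a delete appends a delete marker, and restore re-puts the newest real version on top. A sketch (illustrative only, not the S3 API):

```ts
// In-memory model of a versioned bucket
class VersionedBucket {
  private versions = new Map<string, { body: string | null }[]>();

  put(key: string, body: string) {
    const list = this.versions.get(key) ?? [];
    list.push({ body });
    this.versions.set(key, list);
  }

  delete(key: string) {
    const list = this.versions.get(key) ?? [];
    list.push({ body: null }); // delete marker
    this.versions.set(key, list);
  }

  // Returns null if the latest entry is a delete marker (object "deleted")
  get(key: string): string | null {
    const list = this.versions.get(key) ?? [];
    const latest = list[list.length - 1];
    return latest ? latest.body : null;
  }

  // Restore the newest non-delete-marker version by re-putting it on top
  restore(key: string): boolean {
    const list = this.versions.get(key) ?? [];
    for (let i = list.length - 1; i >= 0; i--) {
      if (list[i].body !== null) {
        this.put(key, list[i].body as string);
        return true;
      }
    }
    return false;
  }
}

// Upload, delete, then recover via versioning
const bucket = new VersionedBucket();
bucket.put('test.txt', 'v1');
bucket.put('test.txt', 'v2');
bucket.delete('test.txt');
```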
5.4 Full DR Test Procedure
For the full disaster recovery test, use the following runbook:
# DR Test Runbook
## Preparation
1. Create a test plan including:
- Test timeline
- Success criteria
- Team roles and responsibilities
- Communication channels
2. Notify stakeholders of planned test
3. Prepare rollback plan
## Execution
1. Simulate failure scenario (region outage, data corruption)
2. Initiate DR procedure following Section 4.3
3. Document each step and timing
4. If using a backup region:
- Verify application functionality in backup region
- Test DNS failover
5. If restoring from backups:
- Verify data integrity
- Confirm application functionality
## Evaluation
1. Measure actual RTO and RPO achieved
2. Compare against objectives
3. Document any issues or bottlenecks
4. Update recovery procedures based on findings
## Cleanup
1. Restore primary environment
2. Clean up test resources
3. Return to normal operations
6. Implementation Plan (80/20 Approach)
Backup Configuration
- Enable versioning on all S3 buckets
- Verify MongoDB Atlas backup configuration
- Document current backup procedures
Recovery Testing
- Create test scripts for database recovery
- Test S3 versioning and restoration
- Document recovery procedures
Runbook Finalization
- Define clear RPO/RTO objectives
- Create comprehensive recovery runbooks
- Set up scheduled DR testing plan
7. AWS Service Integration
7.1 AWS Backup Service
Consider implementing AWS Backup for centralized management:
```ts
// Add to sst.config.ts
import { Duration } from 'aws-cdk-lib';
import * as backup from 'aws-cdk-lib/aws-backup';
import * as events from 'aws-cdk-lib/aws-events';

// Create backup vault (encrypted with the AWS-managed key by default;
// pass encryptionKey to use a customer-managed KMS key)
const backupVault = new backup.BackupVault(stack, 'BackupVault', {
  backupVaultName: `${appName}-backup-vault-${app.stage}`,
});

// Create backup plan
const backupPlan = new backup.BackupPlan(stack, 'BackupPlan', {
  backupPlanName: `${appName}-backup-plan-${app.stage}`,
  backupVault,
});

// Add backup rules. Note: AWS Backup requires items moved to cold storage
// to stay there at least 90 days, so cold-storage tiering is omitted for a
// 30-day retention.
backupPlan.addRule(new backup.BackupPlanRule({
  ruleName: 'DailyBackups',
  scheduleExpression: events.Schedule.cron({
    minute: '0',
    hour: '5',
  }),
  deleteAfter: Duration.days(30),
}));

// Add resources to the backup plan. BackupResource also offers fromTag()
// and fromConstruct() helpers for selecting supported resources.
const resourceArns: string[] = [
  // ARNs of S3 buckets and other supported resources
];
backupPlan.addSelection('BackupSelection', {
  resources: resourceArns.map(arn => backup.BackupResource.fromArn(arn)),
});
```
7.2 Cross-Region Replication for Critical Resources
For production, implement automated recovery using Route53 failover:
```ts
// Add to sst.config.ts
import * as route53 from 'aws-cdk-lib/aws-route53';

// Create health check for the primary region's /health endpoint
const healthCheck = new route53.CfnHealthCheck(stack, 'PrimaryHealthCheck', {
  healthCheckConfig: {
    type: 'HTTPS',
    resourcePath: '/health',
    fullyQualifiedDomainName: `app.${domain}`,
    port: 443,
    failureThreshold: 2,
  },
});

// Look up the hosted zone
const hostedZone = route53.HostedZone.fromLookup(stack, 'HostedZone', {
  domainName: domain,
});

// Failover routing is not exposed by the L2 RecordSet construct, so use
// CfnRecordSet with an alias to the CloudFront distribution. A matching
// SECONDARY record pointing at the backup region completes the failover pair.
new route53.CfnRecordSet(stack, 'FailoverRecord', {
  hostedZoneId: hostedZone.hostedZoneId,
  name: `app.${domain}`,
  type: 'A',
  setIdentifier: 'primary',
  failover: 'PRIMARY',
  healthCheckId: healthCheck.attrHealthCheckId,
  aliasTarget: {
    dnsName: site.cdk.distribution.distributionDomainName,
    hostedZoneId: 'Z2FDTNDATAQYW2', // CloudFront's fixed hosted zone ID
  },
});
```
8. Emergency Access Procedures 🚨
8.1 Maintenance Mode Lockout Recovery
Scenario: Admin enables maintenance mode but cannot log back in because maintenance mode blocks the login page.
Symptoms:
- Cannot access login page (shows maintenance page instead)
- JWT tokens expired, requiring re-authentication
- Admin interface inaccessible
Emergency Recovery Procedure:
Step 1: Environment Setup
```bash
# Load environment variables for the target stage
source ~/.zshrc
use-b4m-dev # or use-b4m-prod for production

# Verify MongoDB URI is available
env | grep -i mongo
```
Step 2: Create Emergency Script
Create the file `emergency-disable-maintenance.cjs`:
```js
#!/usr/bin/env node
/**
 * EMERGENCY MAINTENANCE MODE DISABLE SCRIPT
 *
 * This script connects directly to MongoDB and disables maintenance mode.
 * Use this when you're locked out due to maintenance mode being enabled.
 *
 * Usage: node emergency-disable-maintenance.cjs
 */
const { MongoClient } = require('mongodb');

async function disableMaintenanceMode() {
  let client;
  try {
    // Get MongoDB URI from environment
    const mongoUri = process.env.MONGODB_URI;
    if (!mongoUri) {
      console.error('❌ MONGODB_URI environment variable not set!');
      console.log('💡 Set it with: export MONGODB_URI="your-mongodb-connection-string"');
      process.exit(1);
    }

    console.log('🔌 Connecting to MongoDB...');
    client = new MongoClient(mongoUri);
    await client.connect();

    const db = client.db();
    const adminSettings = db.collection('adminsettings');

    console.log('🔍 Checking current server status...');
    const currentSetting = await adminSettings.findOne({ settingName: 'serverStatus' });
    if (currentSetting) {
      console.log(`📊 Current server status: ${currentSetting.settingValue}`);
      if (currentSetting.settingValue === 'live') {
        console.log('✅ Server is already live! No changes needed.');
        return;
      }
    } else {
      console.log('⚠️ No serverStatus setting found. Creating one...');
    }

    console.log('🔧 Setting server status to LIVE...');
    const result = await adminSettings.updateOne(
      { settingName: 'serverStatus' },
      {
        $set: {
          settingValue: 'live',
          updatedAt: new Date()
        }
      },
      { upsert: true }
    );

    if (result.acknowledged) {
      console.log('🎉 SUCCESS! Maintenance mode has been DISABLED');
      console.log('✅ Server status set to LIVE');
      console.log('🚀 You should now be able to access the login page');
      console.log('');
      console.log('💡 Note: You may need to refresh your browser or clear cache');
    } else {
      console.error('❌ Failed to update server status');
    }
  } catch (error) {
    console.error('💥 Error:', error.message);
    console.log('');
    console.log('🆘 TROUBLESHOOTING:');
    console.log('1. Check your MONGODB_URI is correct');
    console.log('2. Ensure MongoDB is accessible');
    console.log('3. Verify network connectivity');
  } finally {
    if (client) {
      await client.close();
      console.log('🔐 MongoDB connection closed');
    }
  }
}

// Show usage info
console.log('🚨 EMERGENCY MAINTENANCE MODE DISABLE SCRIPT 🚨');
console.log('==============================================');
console.log('');

// Run the script
disableMaintenanceMode();
```
Step 3: Install Dependencies & Execute
```bash
# Install MongoDB driver (workspace root)
pnpm add mongodb --save-dev -w

# Get MongoDB URI from SST secrets
npx sst secrets list | grep MONGODB_URI

# Run emergency script with the proper MongoDB URI.
# Replace <stage> with your current stage (erik, dev, production) and use the
# credentials from SST secrets -- never hard-code them in this runbook.
MONGODB_URI="mongodb+srv://<username>:<password>@<cluster-host>/<stage>?retryWrites=true&w=majority" node emergency-disable-maintenance.cjs

# Clean up when done
rm emergency-disable-maintenance.cjs
```
Step 4: Verify Recovery
- Refresh browser or clear cache
- Access login page - should now work
- Log in with admin credentials
- Verify maintenance mode is disabled in admin settings
8.2 Other Emergency Database Operations
Reset User Password
```js
// emergency-reset-password.cjs
const { MongoClient } = require('mongodb');
const bcrypt = require('bcrypt');

async function resetUserPassword(username, newPassword) {
  const client = new MongoClient(process.env.MONGODB_URI);
  await client.connect();
  const db = client.db();
  const users = db.collection('users');

  const hashedPassword = await bcrypt.hash(newPassword, 10);
  const result = await users.updateOne(
    { username: username },
    { $set: { password: hashedPassword, updatedAt: new Date() } }
  );

  console.log(result.acknowledged ? '✅ Password reset successful' : '❌ Password reset failed');
  await client.close();
}
```
Grant Emergency Admin Access
```js
// emergency-grant-admin.cjs
const { MongoClient } = require('mongodb');

async function grantAdminAccess(username) {
  const client = new MongoClient(process.env.MONGODB_URI);
  await client.connect();
  const db = client.db();
  const users = db.collection('users');

  const result = await users.updateOne(
    { username: username },
    { $set: { isAdmin: true, updatedAt: new Date() } }
  );

  console.log(result.acknowledged ? '✅ Admin access granted' : '❌ Failed to grant admin access');
  await client.close();
}
```
Disable Rate Limiting
```js
// emergency-disable-rate-limiting.cjs
const { MongoClient } = require('mongodb');

async function disableRateLimiting() {
  const client = new MongoClient(process.env.MONGODB_URI);
  await client.connect();
  const db = client.db();
  const adminSettings = db.collection('adminsettings');

  // Disable any rate limiting settings
  const result = await adminSettings.updateMany(
    { settingName: { $in: ['enableRateLimit', 'maxRequestsPerMinute'] } },
    { $set: { settingValue: false, updatedAt: new Date() } }
  );

  console.log(`✅ Updated ${result.modifiedCount} rate limiting settings`);
  await client.close();
}
```
8.3 Prevention Strategies
8.3.1 Emergency Admin Access URL 🆘
Implementation: We've implemented a special emergency admin endpoint that completely bypasses maintenance mode.
Emergency Access URL: https://app.yourdomain.com/admin-emergency
Features:
- Completely bypasses maintenance mode - Always accessible regardless of server status
- Separate authentication flow - Independent of normal login system
- Admin-only access - Validates admin credentials and permissions
- Comprehensive audit logging - All emergency access attempts are logged
- Secure design - Hidden from normal users, requires admin credentials
Usage During Emergency:
- Navigate to https://app.yourdomain.com/admin-emergency
- Enter admin username and password
- System validates admin credentials and logs access
- Redirects to admin panel with emergency access flag
- Disable maintenance mode from admin settings
Security Features:
- All emergency access attempts logged with:
  - User ID and username
  - Timestamp and IP address
  - User agent and source
- Header set: `X-Emergency-Access: true`
- Analytics event logged as LOGIN with 'local' strategy
- Console logs for immediate visibility
Implementation Details:
```ts
// Frontend: packages/client/pages/admin-emergency.tsx
AdminEmergencyPage.auth = {
  allowUnauthenticated: true, // Bypasses ALL restrictions
};

// Backend: packages/client/pages/api/admin/emergency-login.ts
// - Validates admin credentials with bcryptjs
// - Checks isAdmin flag and ban status
// - Comprehensive audit logging
// - Returns auth tokens for admin panel access
```
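The backend checks listed above (password match, `isAdmin` flag, ban status) reduce to a small predicate that can be tested in isolation. A sketch only; the field names are assumptions about the user schema, not the actual model:

```ts
interface EmergencyUser {
  isAdmin: boolean;
  bannedAt: Date | null;
  passwordMatches: boolean; // result of the bcryptjs comparison
}

// A user may use the emergency endpoint only with a valid password,
// the admin flag set, and no active ban -- in that order, so the
// rejection reason never leaks more than necessary.
function canUseEmergencyAccess(user: EmergencyUser): { allowed: boolean; reason: string } {
  if (!user.passwordMatches) return { allowed: false, reason: 'invalid credentials' };
  if (!user.isAdmin) return { allowed: false, reason: 'not an admin' };
  if (user.bannedAt !== null) return { allowed: false, reason: 'account banned' };
  return { allowed: true, reason: 'ok' };
}
```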
8.3.2 Future Enhancement: Email Emergency Link (Backlogged)
For future implementation, consider adding emergency email notifications:
```ts
// Send emergency access link when maintenance is enabled
const sendEmergencyAccessEmail = async (adminEmail: string) => {
  const emergencyLink = `https://app.${domain}/admin-emergency`;
  await sendEmail({
    to: adminEmail,
    subject: '🚨 Maintenance Mode Enabled - Emergency Access Available',
    html: `
      <h2>Maintenance Mode Activated</h2>
      <p>The system has been placed in maintenance mode.</p>
      <p>Emergency admin access is available at:</p>
      <a href="${emergencyLink}">Emergency Admin Access</a>
      <p>Use your normal admin credentials to access the system.</p>
    `,
  });
};
```
8.4 Emergency Contact Procedures
8.4.1 Escalation Matrix
Incident Type | Primary Contact | Secondary Contact | Escalation Time |
---|---|---|---|
Maintenance Lockout | Lead Developer | DevOps Engineer | 30 minutes |
Database Access | Database Admin | CTO | 15 minutes |
Security Breach | Security Lead | CTO | Immediate |
Production Outage | On-Call Engineer | Engineering Manager | 15 minutes |
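The escalation windows above can be turned into a concrete deadline for paging the secondary contact, e.g. from an alerting hook. A sketch; the incident-type keys are assumptions chosen to mirror the matrix:

```ts
// Escalation windows from the matrix above, in minutes (0 = immediate)
const escalationMinutes: Record<string, number> = {
  'maintenance-lockout': 30,
  'database-access': 15,
  'security-breach': 0,
  'production-outage': 15,
};

// When should the incident escalate to the secondary contact?
function escalationDeadline(incidentType: string, reportedAt: Date): Date {
  const minutes = escalationMinutes[incidentType] ?? 0; // unknown types escalate immediately
  return new Date(reportedAt.getTime() + minutes * 60000);
}
```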
8.4.2 Emergency Communication Channels
Slack Channels:
- `#alerts-critical` - Immediate response required
- `#ops-emergency` - Operational emergencies
- `#security-incident` - Security-related incidents
Phone Numbers:
- On-Call Engineer: +1-XXX-XXX-XXXX
- CTO: +1-XXX-XXX-XXXX
- Security Lead: +1-XXX-XXX-XXXX
8.5 Emergency Access Audit Log
Create a log entry for all emergency access procedures:
```ts
// packages/client/server/utils/emergencyAuditLog.ts
import type { NextApiRequest } from 'next';

export const logEmergencyAccess = async (
  action: string,
  user: string,
  reason: string,
  req: NextApiRequest // pass the request in so IP and user agent are available
) => {
  const auditEntry = {
    timestamp: new Date().toISOString(),
    action,
    user,
    reason,
    ipAddress: req.headers['x-forwarded-for'] ?? req.socket.remoteAddress,
    userAgent: req.headers['user-agent'],
    type: 'EMERGENCY_ACCESS',
  };

  // Log to multiple destinations
  await Promise.all([
    // Database log
    db.auditLogs.create(auditEntry),
    // Slack notification
    sendSlackAlert('🚨 Emergency Access Used', auditEntry),
    // Email notification
    sendEmailAlert('security@bike4mind.com', 'Emergency Access Alert', auditEntry),
  ]);
};
```
Usage in emergency scripts:
```js
// Add to emergency scripts
console.log('📝 Logging emergency access...');
// Log the emergency action (implement based on your logging system)
```
8.6 Regular Testing Schedule
Test Type | Frequency | Environment | Notes |
---|---|---|---|
Maintenance Mode Recovery | Monthly | Staging | Test admin lockout scenario |
Emergency Script Validation | Quarterly | All | Verify scripts work with current schema |
Admin Bypass Testing | Quarterly | Staging | Test emergency access methods |
Communication Channels | Monthly | N/A | Verify Slack/email notifications work |
⚠️ IMPORTANT SECURITY NOTES:
- Never commit emergency scripts to version control
- Always delete emergency scripts after use
- Log all emergency access for audit purposes
- Rotate emergency codes regularly
- Test emergency procedures in non-production environments first
- Keep MongoDB credentials secure and rotated regularly
Conclusion
This comprehensive backup and recovery plan ensures Bike4Mind can meet its resilience objectives while providing clear, actionable procedures for recovery scenarios. By implementing the automated backup solutions and regularly testing recovery procedures, the system will be well-prepared to handle various failure scenarios with minimal data loss and downtime.
The plan is designed to evolve as the application grows and as recovery requirements change, with clear documentation and testing procedures to ensure ongoing effectiveness.