Bike4Mind Data Classification and InfoSec Framework

1. Data Types and Classification

Data Category | Classification | Source | Storage | Owner | Sensitivity
User Authentication Data | Confidential | User registration, OAuth providers | MongoDB | Company | High
User Content (Sessions) | Proprietary | User-generated | MongoDB | Customer | Medium-High
Uploaded Files (FabFiles) | Proprietary | User uploads | S3 (fabFilesBucket) | Customer | Medium-High
Generated Images | Proprietary | System-generated | S3 (generatedImagesBucket) | Customer | Medium
Application Files | Proprietary | User uploads | S3 (appFilesBucket) | Customer | Medium
Payment Information | Sensitive | Stripe | Stripe (tokenized) | Company | High
Analysis Data/Metadata | Proprietary | System-generated | MongoDB | Shared | Medium
Usage Metrics | Internal | System-generated | MongoDB | Company | Low
AI Training Data | Proprietary | System & user-generated | MongoDB, Vector DB | Shared | Medium-High

2. Data Sources and Flows

2.1 User-Provided Data

  • Registration Data: Email, name, password (hashed)
  • OAuth Data: Profile information from Google, GitHub, Okta
  • Uploaded Content: Documents, files, and other artifacts
  • Session Content: User interactions, prompts, and responses
  • Payment Information: Processed through Stripe

2.2 System-Generated Data

  • Analysis Results: AI-processed content and metadata
  • Generated Images: AI-created images from prompts
  • User Activity Metrics: Usage patterns and engagement data
  • Application Logs: System performance and error logs
  • Vector Embeddings: For semantic search and retrieval

3. Data Storage Locations

3.1 MongoDB Collections

  • User accounts and profiles
  • Sessions and interaction history
  • System settings and configurations
  • Metadata and relationships between entities
  • Activity logs and metrics

3.2 S3 Buckets

  • fabFilesBucket: User-uploaded documents and resources
  • generatedImagesBucket: AI-generated images
  • appFilesBucket: Application-specific files
  • historyImportBucket: Temporary storage for imported history

3.3 External Services

  • Stripe: Payment information (tokenized)
  • OAuth Providers: Authentication tokens

4. Data Ownership

4.1 Customer-Owned Data

  • User-uploaded content and files
  • User-generated sessions and conversations
  • Generated outputs from customer prompts
  • Custom settings and preferences

4.2 Company-Owned Data

  • System configuration and settings
  • Anonymized usage statistics
  • Internal operational metrics
  • System-generated logs

4.3 Shared Ownership

  • Generated insights that build on customer data
  • Improvements to AI models based on usage patterns
  • Metadata derived from user interactions

5. Compliance and Handling Requirements

5.1 Authentication Data

  • Passwords stored using bcrypt hashing
  • JWT tokens with appropriate expiration (access: 1 day, refresh: 2 days)
  • OAuth tokens managed securely with encryption
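
For illustration, a minimal sketch of the hashing and token-issuance pattern described above; the library choices (bcryptjs, jsonwebtoken), constants, and helper names are illustrative, not the exact Bike4Mind implementation:

    import bcrypt from 'bcryptjs';
    import jwt from 'jsonwebtoken';

    // Illustrative TTLs matching the stated policy (access: 1 day, refresh: 2 days)
    const ACCESS_TOKEN_TTL = '1d';
    const REFRESH_TOKEN_TTL = '2d';

    export async function hashPassword(plain: string): Promise<string> {
      // bcrypt generates a per-password salt; cost factor 10 is a common default
      return bcrypt.hash(plain, 10);
    }

    export async function verifyPassword(plain: string, hashed: string): Promise<boolean> {
      return bcrypt.compare(plain, hashed);
    }

    export function issueTokens(userId: string, secret: string) {
      // Short-lived access token plus a slightly longer-lived refresh token
      const accessToken = jwt.sign({ sub: userId }, secret, { expiresIn: ACCESS_TOKEN_TTL });
      const refreshToken = jwt.sign({ sub: userId, type: 'refresh' }, secret, { expiresIn: REFRESH_TOKEN_TTL });
      return { accessToken, refreshToken };
    }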

5.2 Customer Content

  • Access controlled via CASL permission system
  • Sharing permissions explicitly defined and enforced
  • Content access limited to authorized users

5.3 Payment Information

  • Never stored directly in application databases
  • Processed through Stripe's PCI-compliant systems
  • Only tokenized references stored in application
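
A hedged sketch of the tokenized-reference pattern with the Stripe Node SDK: card details are collected client-side by Stripe, and the backend persists only the opaque customer ID returned here (the function name is an assumption):

    import Stripe from 'stripe';

    const stripe = new Stripe(process.env.STRIPE_SECRET_KEY as string);

    // The only payment-related value stored alongside the user record is the
    // returned customer ID; raw card data never reaches the application.
    export async function createStripeCustomerReference(email: string): Promise<string> {
      const customer = await stripe.customers.create({ email });
      return customer.id; // e.g. "cus_..." — store this token, nothing more
    }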

5.4 System Credentials

  • Managed via SST Secrets
  • Environment-specific configurations
  • Never exposed to client applications
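
A sketch of the SST v2 secrets pattern described above (the stack, secret, and function names are assumptions): secrets are declared in the stack, bound to the functions that need them, and read only on the server.

    // sst.config.ts — declare the secret and bind it to a function
    import { Config, Function, StackContext } from 'sst/constructs';

    export function ApiStack({ stack }: StackContext) {
      // Value is set out-of-band, e.g. `npx sst secrets set OPENAI_API_KEY <value>`
      const OPENAI_API_KEY = new Config.Secret(stack, 'OPENAI_API_KEY');

      new Function(stack, 'aiHandler', {
        handler: 'packages/client/server/ai.handler',
        bind: [OPENAI_API_KEY], // injected at runtime, never shipped in client bundles
      });
    }

    // Inside the bound handler (server-side only):
    //   import { Config } from 'sst/node/config';
    //   const apiKey = Config.OPENAI_API_KEY;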

6. Business Processes and Data Lifecycle

6.1 User Registration and Authentication

  • Collects minimal necessary information
  • Supports multiple authentication providers
  • Records login activity for security monitoring

6.2 Content Management

  • User uploads processed through secure channels
  • Files stored with access control enforced
  • Sharing controlled via explicit permissions

6.3 AI Processing

  • User content processed for insights
  • Vector embeddings created for semantic search
  • Generated content attributed to source materials

6.4 Monitoring and Reporting

  • Activity logs for security and performance
  • Error reporting with appropriate data minimization
  • Usage metrics for business intelligence

7. Data Protection Controls

7.1 Access Control

  • Authentication: Multi-strategy authentication system
  • Authorization: CASL-based permission framework
  • Resource Access: Fine-grained control at the document level
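
To make the CASL layer concrete, a minimal sketch of how document-level abilities might be defined and checked; the subject names, roles, and ownership fields are assumptions, not the actual b4m-core ability definitions:

    import { AbilityBuilder, createMongoAbility, subject } from '@casl/ability';

    interface AppUser {
      id: string;
      role: 'admin' | 'member';
    }

    // Build an ability object for the current user; conditions are evaluated per document.
    export function defineAbilityFor(user: AppUser) {
      const { can, cannot, build } = new AbilityBuilder(createMongoAbility);

      if (user.role === 'admin') {
        can('manage', 'all');
      } else {
        // Owner-scoped, fine-grained access at the document level
        can('read', 'Session', { ownerId: user.id });
        can(['update', 'delete'], 'FabFile', { ownerId: user.id });
        cannot('delete', 'User');
      }

      return build();
    }

    // Usage: defineAbilityFor(user).can('read', subject('Session', sessionDoc))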

7.2 Data Transmission

  • HTTPS for all client-server communication
  • Signed URLs for S3 object access
  • Controlled CORS policy for resource sharing
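
For the signed-URL control, a sketch using the AWS SDK v3 presigner; the function name, parameters, and five-minute expiry are placeholders:

    import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3';
    import { getSignedUrl } from '@aws-sdk/s3-request-presigner';

    const s3 = new S3Client({});

    // Issue a short-lived URL for a single object instead of making the bucket public.
    export async function getDownloadUrl(bucket: string, key: string): Promise<string> {
      const command = new GetObjectCommand({ Bucket: bucket, Key: key });
      return getSignedUrl(s3, command, { expiresIn: 300 }); // valid for 5 minutes
    }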

7.3 Data Storage

  • MongoDB with proper access controls
  • S3 buckets with appropriate permissions
  • Temporary data automatically purged
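
The automatic purge of temporary data can also be expressed declaratively; a hedged sketch using an S3 lifecycle rule through SST's CDK escape hatch (the stack name and seven-day window are assumptions):

    import { Bucket, StackContext } from 'sst/constructs';
    import { Duration } from 'aws-cdk-lib';

    export function StorageStack({ stack }: StackContext) {
      // Objects in the history-import bucket expire automatically after 7 days.
      new Bucket(stack, 'historyImportBucket', {
        cdk: {
          bucket: {
            lifecycleRules: [{ expiration: Duration.days(7) }],
          },
        },
      });
    }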

7.4 Monitoring and Logging

  • Security-relevant events logged
  • Error reporting with sensitive data redacted
  • Activity monitoring for anomaly detection

8. Data Lifecycle Management

8.1 Collection

  • Minimal necessary data collected
  • Clear purpose for all data points
  • User consent for data collection

8.2 Processing

  • Processing limited to stated purposes
  • Transformations documented and traceable
  • Generation of derivative data tracked

8.3 Storage

  • Appropriate retention periods for different data types
  • Versioning where appropriate (S3 buckets)
  • Backup and recovery procedures

8.4 Deletion

  • Clear process for data removal
  • User ability to delete their own data
  • Automated cleanup of temporary storage

9. Automated Classification and Controls

9.1 Current Implementation

  • Type-based classification through schema definitions
  • Permission-based controls enforced by CASL
  • Automated content processing pipelines

9.2 Future Enhancements

  • Machine learning-based sensitive content detection
  • Automated PII identification and protection
  • Pattern-based data classification for unstructured content

10. Security Principles and Best Practices

10.1 Least Privilege

  • Users only see data they have explicit permission to access
  • API routes require specific permissions
  • System components have minimal necessary access

10.2 Defense in Depth

  • Multiple security layers (auth, permissions, monitoring)
  • Security at different architectural levels
  • Redundant controls for critical systems

10.3 Secure by Design

  • Security integrated into development process
  • Consistent use of security patterns
  • Regular security reviews and updates

Conclusion

This data classification framework provides a foundation for understanding and protecting data within the Bike4Mind system. By clearly defining data types, ownership, storage locations, and handling requirements, we can implement appropriate controls and ensure compliance with security best practices. This framework should evolve as the system grows and additional data types or compliance requirements emerge.

Yes, your system already implements many security and data classification best practices! The codebase shows:

  • Strong authentication with multiple strategies
  • Fine-grained authorization using CASL
  • Proper secrets management through SST
  • Well-defined storage architecture with appropriate access controls
  • Input validation using Zod schemas
  • Error handling that respects security boundaries

Your JWT implementation, S3 bucket configuration, and permission model are particularly well done. The clear separation between core business logic in b4m-core and the application layers shows good architecture.

You could still enhance a few areas:

  1. More explicit tagging of data categories in your schemas
  2. Automated PII detection in user-generated content
  3. More formalized data retention policies

But overall, you're definitely ahead of the curve on these requirements. The security architecture you've built provides a solid foundation for the controls described in the AWS questionnaire.

Bike4Mind Security Enhancement Hackathon Plan

Focus Areas (80/20 Approach)

For a focused hackathon with two strong developers, we'll target high-impact improvements that can be implemented efficiently:

1. Explicit Data Classification Tagging

High-Impact Implementation:

  • Create a simple DataClassification enum:

    export enum DataClassification {
      PUBLIC,       // No restrictions
      INTERNAL,     // Organization-wide access only
      CONFIDENTIAL, // Restricted access
      SENSITIVE,    // Highly restricted
      PII,          // Personally Identifiable Information
    }
  • Apply classifications to critical models first (a schema-tagging sketch appears at the end of this section):

    • User (highest priority - contains email, auth data)
    • FabFile (customer intellectual property)
    • Session (conversation content)
  • Extend baseApi middleware to log access to classified fields:

    // Add to existing baseApi middleware chain
    router.use(async (req, res, next) => {
      const originalSend = res.send;
      res.send = function (body) {
        // Check if response contains classified data
        if (containsClassifiedData(body)) {
          req.logger.info(`Access to classified data: ${req.method} ${req.url}`, {
            userId: req.user?.id,
            dataTypes: getClassificationTypes(body),
          });
        }
        return originalSend.call(this, body);
      };
      next();
    });
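
As noted above under applying classifications to critical models, a sketch of how classification metadata might sit alongside the existing Zod schemas so middleware like the logger above can discover classified fields; the field names, registry shape, and import path are assumptions:

    import { z } from 'zod';
    import { DataClassification } from './dataClassification'; // hypothetical path to the enum above

    // Central registry mapping schema fields to their classification.
    export const fieldClassifications: Record<string, DataClassification> = {
      'user.email': DataClassification.PII,
      'user.name': DataClassification.PII,
      'user.hashedPassword': DataClassification.SENSITIVE,
      'session.messages': DataClassification.CONFIDENTIAL,
    };

    // Existing-style Zod schema; classification lives in the registry above.
    export const UserSchema = z.object({
      email: z.string().email(),   // PII
      name: z.string(),            // PII
      hashedPassword: z.string(),  // SENSITIVE
    });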

2. LLM-Based PII Detection

High-Impact Implementation:

  • Leverage your existing LLM infrastructure for PII detection:

    // Create a service in b4m-core/packages/core/services
    // Shape expected back from the LLM prompt below
    export interface PIIDetectionResult {
      hasPII: boolean;
      detectedItems: Array<{
        type: 'email' | 'phone' | 'address' | 'name' | 'ssn' | 'other';
        startIndex: number;
        endIndex: number;
        confidence: number;
      }>;
    }

    export class PIIDetectionService {
      // Wired to the existing Claude/AI connection; injection is omitted in this sketch
      private llmService!: {
        complete(prompt: string, opts: { temperature: number; maxTokens: number }): Promise<string>;
      };

      async detectPII(text: string): Promise<PIIDetectionResult> {
        // Use your existing Claude/AI connection
        const prompt = `
          Identify any Personally Identifiable Information (PII) in the following text.
          Return a JSON object with the following structure:
          {
            "hasPII": boolean,
            "detectedItems": [
              {
                "type": "email|phone|address|name|ssn|other",
                "startIndex": number,
                "endIndex": number,
                "confidence": number
              }
            ]
          }

          Text to analyze: ${text}
        `;

        const response = await this.llmService.complete(prompt, {
          temperature: 0,
          maxTokens: 500,
        });

        return JSON.parse(response) as PIIDetectionResult;
      }
    }
  • Integrate at critical points:

    1. File upload pipeline
    2. Message/chat content submission
    3. Bulk content import
  • Create simple middleware wrapper:

    export function piiScanMiddleware(options = { autoRedact: false }) {
      return async (req, res, next) => {
        if (req.body.content && !req.skipPIICheck) {
          const piiService = new PIIDetectionService();
          const result = await piiService.detectPII(req.body.content);

          req.piiDetectionResult = result;

          if (result.hasPII) {
            req.logger.warn('PII detected in request', {
              types: result.detectedItems.map(i => i.type),
              userId: req.user?.id,
            });

            if (options.autoRedact) {
              req.body.content = piiService.redactPII(req.body.content, result);
            }
          }
        }
        next();
      };
    }
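
The middleware above calls piiService.redactPII, which the plan does not define; one possible method to add to PIIDetectionService, assuming the detection result carries the start/end offsets requested in the prompt:

    // Replace each detected span with a typed placeholder, working right-to-left
    // so earlier offsets remain valid as the string changes length.
    redactPII(text: string, result: PIIDetectionResult): string {
      const items = [...result.detectedItems].sort((a, b) => b.startIndex - a.startIndex);
      let redacted = text;
      for (const item of items) {
        redacted =
          redacted.slice(0, item.startIndex) +
          `[REDACTED:${item.type}]` +
          redacted.slice(item.endIndex);
      }
      return redacted;
    }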

3. Data Retention Framework

High-Impact Implementation:

  • Define a simple retention schema:

    const RetentionPolicySchema = z.object({
      dataType: z.string(),
      retentionPeriod: z.number(), // In days
      legalHold: z.boolean().default(false),
    });
  • Add retention metadata to critical models (a helper that stamps this metadata appears at the end of this section):

    {
      // Add to existing schemas
      retentionInfo: {
        expiresAt: { type: Date },
        policyId: { type: String },
        legalHold: { type: Boolean, default: false },
      },
    }
  • Create a daily cleanup Lambda using existing SST config:

    // Add to sst.config.ts
    new Cron(stack, 'dataRetentionJob', {
      schedule: 'cron(0 0 * * ? *)', // Daily at midnight (UTC)
      job: {
        function: {
          handler: 'packages/client/server/cron/dataRetention.handler',
          bind: [MONGODB_URI],
        },
      },
    });
  • Implement handler for the most critical data types first:

    // Simple implementation focused on temporary files first
    export const handler = async () => {
      // Connect to database
      await connectDB();

      // Start with temp files/imports older than 30 days
      const expiredImports = await TempImport.find({
        createdAt: { $lt: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000) },
      });

      // Delete the S3 object, then the tracking document
      for (const item of expiredImports) {
        await filesStorage.deleteObject(item.path);
        await item.deleteOne();
      }

      return { deleted: expiredImports.length };
    };
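
To connect RetentionPolicySchema to the per-document retentionInfo metadata above, a small helper that stamps expiresAt when a record is created; the policy IDs and values are illustrative:

    // Mirrors RetentionPolicySchema defined earlier in this section
    type RetentionPolicy = { dataType: string; retentionPeriod: number; legalHold: boolean };

    // Illustrative policies; real values would come from configuration or a policies collection.
    const policies: Record<string, RetentionPolicy> = {
      tempImport: { dataType: 'tempImport', retentionPeriod: 30, legalHold: false },
      sessionLog: { dataType: 'sessionLog', retentionPeriod: 365, legalHold: false },
    };

    // Compute the retentionInfo block for a newly created document.
    export function buildRetentionInfo(policyId: keyof typeof policies, createdAt = new Date()) {
      const policy = policies[policyId];
      return {
        expiresAt: new Date(createdAt.getTime() + policy.retentionPeriod * 24 * 60 * 60 * 1000),
        policyId,
        legalHold: policy.legalHold,
      };
    }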

Hackathon Structure and Implementation Plan

Day 1: Foundation

  • Morning: Schema updates and classification enum
  • Afternoon: Start PII detection service implementation

Day 2: Integration

  • Morning: Complete PII detection and testing
  • Afternoon: Data retention schema and prototype cleanup job

Day 3: Polish & Documentation

  • Implementation of logging for classified data access
  • Integration tests for all components
  • Documentation for future extension

Implementation Priority

  1. Data Classification Tagging: Fastest win with schema updates
  2. LLM-Based PII Detection: Leverages existing AI capabilities
  3. Basic Retention Framework: Foundation for future policy enforcement

This approach focuses on building critical infrastructure that provides immediate security benefits while laying the groundwork for more sophisticated features later. The LLM-based PII detection is particularly valuable as it leverages your existing AI capabilities rather than building regex-based solutions that would be less effective.