Monitoring and Maintenance Guide

This guide provides comprehensive instructions for monitoring and maintaining the Multi-Tenant SaaS Platform in production environments.

Overview

Effective monitoring and maintenance are crucial for ensuring the reliability, performance, and security of your Multi-Tenant SaaS Platform. This guide covers monitoring tools, maintenance procedures, and best practices for Malaysian SME deployments.

Monitoring Architecture

Components to Monitor

  1. Application Layer: Django backend, React frontend
  2. Database Layer: PostgreSQL with multi-tenant schemas
  3. Cache Layer: Redis for caching and sessions
  4. Infrastructure Layer: Server resources, network, storage
  5. Business Layer: User activity, transactions, performance metrics

Monitoring Stack

  • Prometheus: Metrics collection and storage
  • Grafana: Visualization and dashboards
  • Alertmanager: Alerting and notifications
  • Elasticsearch: Log aggregation and search
  • Kibana: Log visualization and analysis

Quick Setup

1. Install Monitoring Stack

# Create monitoring directory
mkdir -p /opt/monitoring
cd /opt/monitoring

# Create docker-compose.yml for monitoring
cat > docker-compose.yml << 'EOF'
version: '3.8'

services:
  # Prometheus
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert_rules.yml:/etc/prometheus/alert_rules.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      - '--web.enable-lifecycle'
    networks:
      - monitoring

  # Grafana
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/dashboards:/var/lib/grafana/dashboards
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=your-secure-password
    networks:
      - monitoring

  # Alertmanager
  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager_data:/alertmanager
    networks:
      - monitoring

  # Node Exporter
  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      # node_exporter 1.x flag name; $$ escapes $ for docker-compose
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    networks:
      - monitoring

  # PostgreSQL Exporter
  postgres-exporter:
    image: prometheuscommunity/postgres-exporter:latest
    ports:
      - "9187:9187"
    environment:
      # localhost inside the container is the container itself; point the
      # exporter at the Docker host (on Linux, add an extra_hosts entry
      # mapping host.docker.internal to host-gateway, or use the host IP)
      - DATA_SOURCE_NAME=postgresql://multi_tenant_prod_user:your-password@host.docker.internal:5432/multi_tenant_saas_prod?sslmode=disable
    networks:
      - monitoring

  # Redis Exporter
  redis-exporter:
    image: oliver006/redis_exporter:latest
    ports:
      - "9121:9121"
    environment:
      # Same note as above: use the Docker host address, not localhost
      - REDIS_ADDR=redis://host.docker.internal:6379
    networks:
      - monitoring

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:

networks:
  monitoring:
    driver: bridge
EOF

2. Configure Prometheus

# Create Prometheus configuration
cat > prometheus.yml << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # The exporters run on the same Docker network, so address them by
  # service name rather than localhost
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'postgres-exporter'
    static_configs:
      - targets: ['postgres-exporter:9187']

  - job_name: 'redis-exporter'
    static_configs:
      - targets: ['redis-exporter:9121']

  # Django and Nginx run on the host; substitute the host's IP if your
  # Docker setup does not resolve host.docker.internal
  - job_name: 'django-app'
    static_configs:
      - targets: ['host.docker.internal:8000']
    metrics_path: '/metrics'
    scrape_interval: 30s

  # Note: nginx's stub_status page is not Prometheus exposition format;
  # in practice, put nginx-prometheus-exporter in front of it
  - job_name: 'nginx'
    static_configs:
      - targets: ['host.docker.internal:80']
    metrics_path: '/nginx_status'
    scrape_interval: 30s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093
EOF

3. Configure Alertmanager

# Create Alertmanager configuration
cat > alertmanager.yml << 'EOF'
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alerts@your-domain.com'
  smtp_auth_username: 'your-email@domain.com'
  smtp_auth_password: 'your-email-password'

route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'

receivers:
- name: 'web.hook'
  email_configs:
  - to: 'admin@your-domain.com'
    subject: '[ALERT] {{ .GroupLabels.alertname }} - {{ .Status }}'
    body: |
      {{ range .Alerts }}
      Alert: {{ .Annotations.summary }}
      Description: {{ .Annotations.description }}
      Labels: {{ .Labels }}
      {{ end }}
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
EOF

4. Create Alert Rules

# Create alert rules
cat > alert_rules.yml << 'EOF'
groups:
- name: system
  rules:
  - alert: HighCPUUsage
    expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage detected"
      description: "CPU usage is above 80% for more than 5 minutes"

  - alert: HighMemoryUsage
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage detected"
      description: "Memory usage is above 80% for more than 5 minutes"

  - alert: LowDiskSpace
    expr: (node_filesystem_size_bytes{fstype!="tmpfs"} - node_filesystem_free_bytes{fstype!="tmpfs"}) / node_filesystem_size_bytes{fstype!="tmpfs"} * 100 > 85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Low disk space detected"
      description: "Disk usage is above 85% for more than 5 minutes"

- name: database
  rules:
  - alert: PostgreSQLDown
    expr: up{job="postgres-exporter"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "PostgreSQL is down"
      description: "PostgreSQL database is not responding"

  - alert: PostgreSQLLongRunningTransactions
    # pg_stat_database_calls_total is not a postgres_exporter metric;
    # pg_stat_activity_max_tx_duration tracks the longest open transaction
    expr: pg_stat_activity_max_tx_duration > 60
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Long-running PostgreSQL transactions detected"
      description: "A transaction has been open for more than 60 seconds"

  - alert: PostgreSQLConnectionsHigh
    expr: sum(pg_stat_database_numbackends) / sum(pg_settings_max_connections) * 100 > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High PostgreSQL connection usage"
      description: "PostgreSQL connection usage is above 80%"

- name: application
  rules:
  - alert: HighResponseTime
    expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High response time detected"
      description: "95th percentile response time is above 1 second"

  - alert: HighErrorRate
    expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100 > 5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High error rate detected"
      description: "HTTP 5xx error rate is above 5%"

  - alert: ServiceDown
    expr: up{job="django-app"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Application service is down"
      description: "The Django application is not responding"
EOF

5. Start Monitoring Stack

# Start monitoring services
docker-compose up -d

# Verify services are running
docker-compose ps

# Access monitoring dashboards
# Prometheus: http://localhost:9090
# Grafana: http://localhost:3000 (admin/your-secure-password)
# Alertmanager: http://localhost:9093

Application Monitoring

1. Django Application Metrics

# Add to settings.py
INSTALLED_APPS = [
    # ... other apps
    'django_prometheus',
]

MIDDLEWARE = [
    'django_prometheus.middleware.PrometheusBeforeMiddleware',
    # ... other middleware
    'django_prometheus.middleware.PrometheusAfterMiddleware',
]

2. Custom Metrics

# Create metrics.py
from prometheus_client import Counter, Histogram, Gauge

# Business metrics
active_tenants = Gauge('multi_tenant_active_tenants', 'Number of active tenants')
total_users = Gauge('multi_tenant_total_users', 'Total number of users')
total_transactions = Counter('multi_tenant_total_transactions', 'Total transactions')

# Performance metrics
api_response_time = Histogram('multi_tenant_api_response_time', 'API response time')
db_query_time = Histogram('multi_tenant_db_query_time', 'Database query time')

# Error metrics
api_errors = Counter('multi_tenant_api_errors', 'API errors', ['method', 'endpoint'])
db_errors = Counter('multi_tenant_db_errors', 'Database errors', ['operation'])

# Malaysian-specific metrics
malaysian_users = Gauge('multi_tenant_malaysian_users', 'Number of Malaysian users')
sst_transactions = Counter('multi_tenant_sst_transactions', 'SST transactions', ['rate'])
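
A sketch of how these metrics might be driven from a request handler; record_transaction and process_payment are illustrative names, not part of the platform:

# views.py -- illustrative use of the custom metrics
from .metrics import api_errors, api_response_time, total_transactions

def record_transaction(request):
    with api_response_time.time():  # observe request duration
        try:
            process_payment(request)  # hypothetical business logic
            total_transactions.inc()
        except Exception:
            api_errors.labels(method=request.method,
                              endpoint=request.path).inc()
            raise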

3. Database Monitoring

-- Enable PostgreSQL extensions (pg_stat_statements must also be listed in
-- shared_preload_libraries, which requires a server restart)
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Create a schema to hold the monitoring views
CREATE SCHEMA IF NOT EXISTS monitoring;

-- Create monitoring views. Counting through two LEFT JOINs would multiply
-- rows (users x subscriptions), so count DISTINCT ids and sum revenue in a
-- correlated subquery.
CREATE OR REPLACE VIEW monitoring.tenant_stats AS
SELECT
    t.schema_name,
    COUNT(DISTINCT u.id) as user_count,
    COUNT(DISTINCT s.id) as subscription_count,
    (SELECT COALESCE(SUM(s2.amount), 0)
     FROM core_subscription s2
     WHERE s2.tenant_id = t.id) as total_revenue
FROM core_tenant t
LEFT JOIN core_user u ON t.id = u.tenant_id
LEFT JOIN core_subscription s ON t.id = s.tenant_id
GROUP BY t.id, t.schema_name;

-- Performance monitoring (on PostgreSQL 13+ the columns are named
-- mean_exec_time and total_exec_time)
CREATE OR REPLACE VIEW monitoring.query_performance AS
SELECT
    query,
    mean_time,
    calls,
    total_time,
    rows,
    100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 100;
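
The tenant_stats view can feed Prometheus directly. A minimal sketch; the gauge name and the periodic caller are assumptions, not part of the platform:

# tenant_stats_export.py -- bridge the SQL view into Prometheus gauges
from django.db import connection
from prometheus_client import Gauge

# Hypothetical gauge, labelled by tenant schema
tenant_user_count = Gauge('multi_tenant_tenant_users', 'Users per tenant',
                          ['schema_name'])

def export_tenant_stats():
    with connection.cursor() as cursor:
        cursor.execute("SELECT schema_name, user_count FROM monitoring.tenant_stats")
        for schema_name, user_count in cursor.fetchall():
            tenant_user_count.labels(schema_name=schema_name).set(user_count or 0)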

Log Management

1. Centralized Logging with ELK Stack

# Create docker-compose.yml for ELK stack
version: '3.8'

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.17.0
    environment:
      - discovery.type=single-node
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ports:
      - "9200:9200"
    volumes:
      - elasticsearch_data:/usr/share/elasticsearch/data

  logstash:
    image: docker.elastic.co/logstash/logstash:7.17.0
    volumes:
      - ./logstash/pipeline:/usr/share/logstash/pipeline
    ports:
      - "5044:5044"

  kibana:
    image: docker.elastic.co/kibana/kibana:7.17.0
    ports:
      - "5601:5601"
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200

  filebeat:
    image: docker.elastic.co/beats/filebeat:7.17.0
    volumes:
      - ./filebeat.yml:/usr/share/filebeat/filebeat.yml
      - /var/log:/var/log:ro
    depends_on:
      - elasticsearch

volumes:
  elasticsearch_data:

2. Logstash Configuration

# logstash/pipeline/logstash.conf
input {
  beats {
    port => 5044
  }
}

filter {
  if [type] == "django" {
    grok {
      # DATA (non-greedy) for the logger name so the first " - " splits
      # the line correctly; overwrite replaces the raw message field
      match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{DATA:logger} - %{GREEDYDATA:message}" }
      overwrite => [ "message" ]
    }
    date {
      match => [ "timestamp", "ISO8601" ]
      timezone => "Asia/Kuala_Lumpur"
    }
  }

  if [type] == "nginx" {
    grok {
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
    date {
      match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    }
  }

  # Record the Malaysian timezone offset for downstream consumers
  ruby {
    code => "event.set('[@metadata][tz_offset]', '+08:00')"
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}
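
For the django grok pattern above to match, the application must emit lines shaped like 2025-01-01T02:00:00 INFO app.views - message. A minimal LOGGING config that produces that shape (the file path is assumed to match the Filebeat input below):

# settings.py -- log format aligned with the Logstash grok pattern
LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'logstash': {
            'format': '{asctime} {levelname} {name} - {message}',
            'style': '{',
            'datefmt': '%Y-%m-%dT%H:%M:%S',
        },
    },
    'handlers': {
        'file': {
            'class': 'logging.handlers.RotatingFileHandler',
            'filename': '/var/log/multi-tenant-saas/app.log',
            'maxBytes': 10 * 1024 * 1024,
            'backupCount': 5,
            'formatter': 'logstash',
        },
    },
    'root': {'handlers': ['file'], 'level': 'INFO'},
}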

3. Filebeat Configuration

# filebeat.yml
filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /var/log/multi-tenant-saas/*.log
  fields:
    type: django

- type: log
  enabled: true
  paths:
    - /var/log/nginx/*.log
  fields:
    type: nginx

output.logstash:
  hosts: ["logstash:5044"]

processors:
- add_docker_metadata:
    host: "unix:///var/run/docker.sock"

Business Metrics Monitoring

1. Key Performance Indicators (KPIs)

# KPI monitoring
from datetime import datetime

from django.db.models import Q, Sum
from prometheus_client import Counter, Gauge

# Model import paths are illustrative; adjust to your project layout
from core.models import Tenant
from billing.models import PaymentTransaction

class BusinessMetrics:
    def __init__(self):
        self.active_tenants = Gauge('business_active_tenants', 'Active tenant count')
        self.monthly_revenue = Gauge('business_monthly_revenue', 'Monthly revenue')
        self.user_growth = Gauge('business_user_growth', 'User growth rate')
        self.churn_rate = Gauge('business_churn_rate', 'Customer churn rate')

        # Malaysian-specific metrics
        self.malaysian_tenant_percentage = Gauge('business_malaysian_tenant_percentage', 'Percentage of Malaysian tenants')
        self.sst_collected = Counter('business_sst_collected', 'SST amount collected')
        self.local_payment_methods = Counter('business_local_payments', 'Local payment method usage')

    def update_metrics(self):
        # Update active tenants
        active_count = Tenant.objects.filter(is_active=True).count()
        self.active_tenants.set(active_count)

        # Update monthly revenue
        # Filter by year as well; a month filter alone matches that month
        # in every year
        monthly_rev = PaymentTransaction.objects.filter(
            created_at__year=datetime.now().year,
            created_at__month=datetime.now().month,
            status='completed'
        ).aggregate(total=Sum('amount'))['total'] or 0
        self.monthly_revenue.set(monthly_rev)

        # Update Malaysian metrics
        total_tenants = Tenant.objects.count()
        malaysian_tenants = Tenant.objects.filter(
            Q(business_address__country='Malaysia') |
            Q(contact_phone__startswith='+60')
        ).count()
        self.malaysian_tenant_percentage.set(
            (malaysian_tenants / total_tenants * 100) if total_tenants > 0 else 0
        )
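
These gauges only move when update_metrics() runs. A hedged scheduling sketch using Celery beat, assuming Celery is already wired into the project:

# tasks.py -- periodic refresh (assumes Celery is configured)
from celery import shared_task

_metrics = BusinessMetrics()  # module-level: Prometheus metrics register once

@shared_task
def refresh_business_metrics():
    _metrics.update_metrics()

# settings.py (illustrative beat schedule):
# CELERY_BEAT_SCHEDULE = {
#     'refresh-business-metrics': {
#         'task': 'tasks.refresh_business_metrics',
#         'schedule': 300.0,  # every 5 minutes
#     },
# }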

2. Real-time Dashboards

Create Grafana dashboards (by hand in the UI, or pushed through the HTTP API as sketched after this list) for:

  • System health overview
  • Application performance
  • Database performance
  • Business metrics
  • User activity
  • Malaysian market metrics
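
Dashboards can also be provisioned programmatically through Grafana's HTTP API. A minimal sketch; the API key and dashboard body are placeholders, not values from this deployment:

# push_dashboard.py -- illustrative Grafana API call
import requests

GRAFANA_URL = "http://localhost:3000"   # Grafana port from the compose file
API_KEY = "your-grafana-api-key"        # placeholder service-account token

dashboard = {
    "dashboard": {
        "id": None,                     # None lets Grafana assign an id
        "title": "System Health Overview",
        "panels": [],                   # panel JSON omitted for brevity
    },
    "overwrite": True,
}

resp = requests.post(
    f"{GRAFANA_URL}/api/dashboards/db",
    json=dashboard,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=10,
)
resp.raise_for_status()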

Malaysian-Specific Monitoring

1. SST Compliance Monitoring

# SST monitoring
from datetime import datetime

from prometheus_client import Gauge

from billing.models import PaymentTransaction  # illustrative import path

class SSTMonitor:
    def __init__(self):
        self.sst_rate_compliance = Gauge('sst_rate_compliance', 'SST rate compliance')
        self.sst_filing_deadline = Gauge('sst_filing_days_remaining', 'Days until SST filing deadline')
        self.sst_collected_vs_reported = Gauge('sst_collected_vs_reported', 'SST collected vs reported')

    def check_sst_compliance(self):
        # Check if SST rates are correctly applied
        expected_rate = 0.06
        actual_rates = PaymentTransaction.objects.filter(
            created_at__year=datetime.now().year,
            created_at__month=datetime.now().month,
            tax_rate__isnull=False
        ).values_list('tax_rate', flat=True).distinct()

        compliance = all(abs(rate - expected_rate) < 0.001 for rate in actual_rates)
        self.sst_rate_compliance.set(1 if compliance else 0)

        # Check SST filing deadline
        today = datetime.now().date()
        filing_deadline = self.get_sst_filing_deadline(today)
        days_remaining = (filing_deadline - today).days
        self.sst_filing_deadline.set(days_remaining)

        # Alert if deadline is approaching
        if days_remaining <= 7:
            self.trigger_sst_deadline_alert(days_remaining)
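
get_sst_filing_deadline is referenced above but not defined. A hedged sketch, assuming the common SST-02 cycle (two-month taxable periods with the return due by the last day of the following month); verify the actual schedule with the Royal Malaysian Customs Department:

import calendar
from datetime import date

# Method of SSTMonitor; the bi-monthly cycle is an assumption
def get_sst_filing_deadline(self, today: date) -> date:
    # Taxable period is assumed to end on the nearest even month
    period_end_month = today.month if today.month % 2 == 0 else today.month + 1
    # Return assumed due on the last day of the month after the period ends
    deadline_month = period_end_month + 1
    deadline_year = today.year
    if deadline_month > 12:
        deadline_month -= 12
        deadline_year += 1
    last_day = calendar.monthrange(deadline_year, deadline_month)[1]
    return date(deadline_year, deadline_month, last_day)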

2. Malaysian Business Hours Monitoring

# Malaysian business hours monitoring
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

from prometheus_client import Gauge

class BusinessHoursMonitor:
    def __init__(self):
        self.business_hour_activity = Gauge('business_hour_activity', 'Activity during business hours')
        self.off_hour_activity = Gauge('off_hour_activity', 'Activity outside business hours')

    def monitor_activity(self):
        # Malaysian business hours: 9 AM - 6 PM, Monday - Friday.
        # Pin the check to Kuala Lumpur time so it does not depend on the
        # server's local clock.
        now = datetime.now(ZoneInfo("Asia/Kuala_Lumpur"))
        is_business_hour = (
            now.weekday() < 5 and  # Monday - Friday
            9 <= now.hour < 18     # 9 AM - 6 PM
        )

        if is_business_hour:
            self.business_hour_activity.inc()
        else:
            self.off_hour_activity.inc()

3. Malaysian Payment Gateway Monitoring

# Payment gateway monitoring
from prometheus_client import Counter, Gauge, Histogram

class PaymentGatewayMonitor:
    def __init__(self):
        # Declare the 'gateway' label here; the .labels(gateway=...) calls
        # below fail unless the metric is created with label names
        self.payment_success_rate = Gauge('payment_success_rate', 'Payment success rate', ['gateway'])
        self.gateway_response_time = Histogram('gateway_response_time', 'Payment gateway response time', ['gateway'])
        self.gateway_downtime = Counter('gateway_downtime', 'Payment gateway downtime', ['gateway'])

    def monitor_gateways(self):
        gateways = ['touch_n_go', 'grabpay', 'online_banking']

        for gateway in gateways:
            success_rate = self.calculate_success_rate(gateway)
            self.payment_success_rate.labels(gateway=gateway).set(success_rate)

            # Monitor response times
            response_time = self.measure_response_time(gateway)
            self.gateway_response_time.labels(gateway=gateway).observe(response_time)

            # Check for downtime
            if not self.is_gateway_available(gateway):
                self.gateway_downtime.labels(gateway=gateway).inc()
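
The helper methods (calculate_success_rate, measure_response_time, is_gateway_available) are not shown. A hedged sketch of the first, assuming a PaymentTransaction model with gateway and status fields:

    def calculate_success_rate(self, gateway: str) -> float:
        # Hypothetical query; field names depend on your payment models
        from billing.models import PaymentTransaction

        qs = PaymentTransaction.objects.filter(gateway=gateway)
        total = qs.count()
        if total == 0:
            return 100.0  # no recent traffic: report as healthy
        completed = qs.filter(status='completed').count()
        return completed / total * 100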

Maintenance Procedures

1. Daily Maintenance

#!/bin/bash
# daily_maintenance.sh

# Log maintenance
echo "$(date): Starting daily maintenance" >> /var/log/maintenance.log

# Rotate logs
logrotate -f /etc/logrotate.d/multi-tenant-saas

# Clear old logs
find /var/log/multi-tenant-saas -name "*.log.*" -mtime +30 -delete

# Monitor disk space
df -h | awk '$5+0 > 85 {print $6 " is " $5 " full"}' >> /var/log/maintenance.log

# Check service health
systemctl is-active --quiet gunicorn || echo "Gunicorn service is down" >> /var/log/maintenance.log
systemctl is-active --quiet nginx || echo "Nginx service is down" >> /var/log/maintenance.log

# Check database connections
psql -U multi_tenant_prod_user -d multi_tenant_saas_prod -c "SELECT count(*) FROM pg_stat_activity;" >> /var/log/maintenance.log

# Clear cache (FLUSHDB wipes the selected Redis database; select the cache
# DB explicitly so sessions and other data are not flushed with it)
redis-cli -n 1 FLUSHDB >> /var/log/maintenance.log

echo "$(date): Daily maintenance completed" >> /var/log/maintenance.log

2. Weekly Maintenance

#!/bin/bash
# weekly_maintenance.sh

# Database maintenance
echo "$(date): Starting weekly database maintenance" >> /var/log/maintenance.log

# Vacuum and analyze (VACUUM ANALYZE also refreshes planner statistics,
# so a separate ANALYZE pass is unnecessary)
psql -U multi_tenant_prod_user -d multi_tenant_saas_prod -c "VACUUM ANALYZE;" >> /var/log/maintenance.log

# Check table sizes
psql -U multi_tenant_prod_user -d multi_tenant_saas_prod -c "
    SELECT
        schemaname,
        tablename,
        pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size
    FROM pg_tables
    WHERE schemaname = 'public'
    ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
" >> /var/log/maintenance.log

# Index maintenance (REINDEX takes exclusive locks; run it in a maintenance
# window, or use REINDEX ... CONCURRENTLY on PostgreSQL 12+)
psql -U multi_tenant_prod_user -d multi_tenant_saas_prod -c "REINDEX DATABASE multi_tenant_saas_prod;" >> /var/log/maintenance.log

echo "$(date): Weekly database maintenance completed" >> /var/log/maintenance.log

3. Monthly Maintenance

#!/bin/bash
# monthly_maintenance.sh

# Security updates
echo "$(date): Starting monthly security updates" >> /var/log/maintenance.log

# Update system packages
apt-get update && apt-get upgrade -y >> /var/log/maintenance.log

# Update Python packages
source /opt/multi-tenant-saas/venv/bin/activate
pip list --outdated >> /var/log/maintenance.log
pip install --upgrade -r /opt/multi-tenant-saas/requirements.txt >> /var/log/maintenance.log

# Update Node packages
cd /opt/multi-tenant-saas/frontend
npm update >> /var/log/maintenance.log

# Database backup full
/opt/multi-tenant-saas/scripts/backup-database.sh >> /var/log/maintenance.log

# SSL certificate check
openssl x509 -in /etc/letsencrypt/live/your-domain.com/fullchain.pem -text -noout | grep "Not After" >> /var/log/maintenance.log

# Performance review
# Check slow queries
psql -U multi_tenant_prod_user -d multi_tenant_saas_prod -c "
    SELECT query, mean_time, calls
    FROM pg_stat_statements
    ORDER BY mean_time DESC
    LIMIT 10;
" >> /var/log/maintenance.log

echo "$(date): Monthly maintenance completed" >> /var/log/maintenance.log

Automated Scheduling

1. Cron Jobs

# Add to crontab
# Daily maintenance at 2 AM
0 2 * * * /opt/multi-tenant-saas/scripts/daily_maintenance.sh

# Weekly maintenance on Sunday at 3 AM
0 3 * * 0 /opt/multi-tenant-saas/scripts/weekly_maintenance.sh

# Monthly maintenance on 1st of month at 4 AM
0 4 1 * * /opt/multi-tenant-saas/scripts/monthly_maintenance.sh

# Database backup daily at 1 AM
0 1 * * * /opt/multi-tenant-saas/scripts/backup-database.sh

# Log rotation daily at midnight
0 0 * * * /usr/sbin/logrotate -f /etc/logrotate.d/multi-tenant-saas

# SSL certificate renewal check weekly
0 0 * * 0 /opt/multi-tenant-saas/scripts/check-ssl.sh

2. Systemd Timers

# Create systemd timer for daily maintenance
cat > /etc/systemd/system/daily-maintenance.timer << 'EOF'
[Unit]
Description=Daily maintenance tasks
Requires=daily-maintenance.service

[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true

[Install]
WantedBy=timers.target
EOF

# Create systemd service
cat > /etc/systemd/system/daily-maintenance.service << 'EOF'
[Unit]
Description=Daily maintenance tasks

[Service]
Type=oneshot
ExecStart=/opt/multi-tenant-saas/scripts/daily_maintenance.sh
User=root
Group=root
EOF

# Enable timer
systemctl enable daily-maintenance.timer
systemctl start daily-maintenance.timer

Disaster Recovery

1. Backup Verification

#!/bin/bash
# verify_backups.sh

BACKUP_DIR="/opt/multi-tenant-saas/backups"
LOG_FILE="/var/log/backup-verification.log"

echo "$(date): Starting backup verification" >> $LOG_FILE

# Check if backups exist
if [ ! -d "$BACKUP_DIR" ]; then
    echo "Backup directory does not exist" >> $LOG_FILE
    exit 1
fi

# Check latest backup
LATEST_BACKUP=$(ls -t "$BACKUP_DIR"/database_backup_*.sql.gz 2>/dev/null | head -1)
if [ -z "$LATEST_BACKUP" ]; then
    echo "No database backup found" >> $LOG_FILE
    exit 1
fi

# Verify backup integrity
if gzip -t "$LATEST_BACKUP"; then
    echo "Backup integrity verified: $LATEST_BACKUP" >> $LOG_FILE
else
    echo "Backup integrity check failed: $LATEST_BACKUP" >> $LOG_FILE
    exit 1
fi

# Check backup size
BACKUP_SIZE=$(du -h "$LATEST_BACKUP" | cut -f1)
echo "Backup size: $BACKUP_SIZE" >> $LOG_FILE

# Test restore (create test database)
TEST_DB="backup_test_$(date +%Y%m%d)"
createdb -U multi_tenant_prod_user "$TEST_DB"
gunzip -c "$LATEST_BACKUP" | psql -U multi_tenant_prod_user "$TEST_DB"

# Verify data
TABLE_COUNT=$(psql -U multi_tenant_prod_user -d "$TEST_DB" -t -c "SELECT count(*) FROM information_schema.tables WHERE table_schema = 'public';")
echo "Table count in backup: $TABLE_COUNT" >> $LOG_FILE

# Clean up test database
dropdb -U multi_tenant_prod_user "$TEST_DB"

echo "$(date): Backup verification completed successfully" >> $LOG_FILE

2. Failover Procedures

#!/bin/bash
# failover_procedures.sh

PRIMARY_SERVER="primary.your-domain.com"
STANDBY_SERVER="standby.your-domain.com"

# Check primary server health
if ! curl -f http://$PRIMARY_SERVER/health/ > /dev/null 2>&1; then
    echo "$(date): Primary server is down, initiating failover" >> /var/log/failover.log

    # Promote standby (systemctl has no "promote" verb; use pg_ctl.
    # The data directory path is an assumption -- adjust to your install)
    ssh $STANDBY_SERVER "sudo -u postgres pg_ctl promote -D /var/lib/postgresql/data"

    # Update DNS
    # This would integrate with your DNS provider API
    curl -X POST "https://api.dns-provider.com/update" \
         -H "Authorization: Bearer $DNS_API_KEY" \
         -d '{"record":"your-domain.com","value":"'$STANDBY_SERVER'"}'

    # Notify administrators
    echo "Failover completed. Standby server is now primary." | mail -s "Failover Completed" admin@your-domain.com

    echo "$(date): Failover completed" >> /var/log/failover.log
fi

Performance Optimization

1. Database Optimization

-- Create performance monitoring views
CREATE OR REPLACE VIEW monitoring.performance_metrics AS
SELECT
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size,
    pg_stat_get_numscans((quote_ident(schemaname)||'.'||quote_ident(tablename))::regclass) as scans,
    pg_stat_get_tuples_returned((quote_ident(schemaname)||'.'||quote_ident(tablename))::regclass) as tuples_returned,
    pg_stat_get_tuples_fetched((quote_ident(schemaname)||'.'||quote_ident(tablename))::regclass) as tuples_fetched
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;

2. Application Optimization

# Add to Django settings
CACHES = {
    'default': {
        # django-redis backend; CLIENT_CLASS is a django-redis option that
        # Django's built-in RedisCache backend does not accept
        'BACKEND': 'django_redis.cache.RedisCache',
        'LOCATION': 'redis://localhost:6379/1',
        'TIMEOUT': 300,
        'OPTIONS': {
            'CLIENT_CLASS': 'django_redis.client.DefaultClient',
        }
    }
}

# Database connection pooling
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'multi_tenant_saas_prod',
        'USER': 'multi_tenant_prod_user',
        'PASSWORD': 'your-password',
        'HOST': 'localhost',
        'PORT': '5432',
        'CONN_MAX_AGE': 60,
        'OPTIONS': {
            'connect_timeout': 10,
            'options': '-c statement_timeout=30000',
        }
    }
}
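
With the cache configured, read-heavy endpoints can be cached per view. The timeout and view name below are illustrative:

# views.py -- illustrative per-view caching
from django.views.decorators.cache import cache_page

@cache_page(60 * 15)  # cache the rendered response for 15 minutes
def tenant_dashboard(request):
    ...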

Security Monitoring

1. Intrusion Detection

# Install fail2ban
apt-get install fail2ban

# Configure fail2ban for SSH
cat > /etc/fail2ban/jail.local << 'EOF'
[sshd]
enabled = true
port = ssh
filter = sshd
logpath = /var/log/auth.log
maxretry = 3
bantime = 3600
findtime = 600

[nginx-http-auth]
enabled = true
port = http,https
filter = nginx-http-auth
logpath = /var/log/nginx/error.log
maxretry = 5
bantime = 3600
findtime = 600
EOF

# Restart fail2ban
systemctl restart fail2ban

2. File Integrity Monitoring

# Install AIDE
apt-get install aide

# Initialize AIDE and move the freshly built database into place so
# subsequent checks compare against it
aideinit
cp /var/lib/aide/aide.db.new /var/lib/aide/aide.db

# Configure daily checks
cat > /etc/cron.daily/aide << 'EOF'
#!/bin/sh
/usr/bin/aide --check
EOF

chmod +x /etc/cron.daily/aide

Malaysian Compliance Monitoring

1. PDPA Compliance Monitoring

# PDPA compliance monitor
from datetime import datetime, timedelta

from django.contrib.auth import get_user_model
from prometheus_client import Counter, Gauge

User = get_user_model()

class PDPAComplianceMonitor:
    def __init__(self):
        self.data_retention_compliance = Gauge('pdpa_data_retention_compliance', 'PDPA data retention compliance')
        self.consent_management = Gauge('pdpa_consent_management', 'PDPA consent management compliance')
        self.data_breach_incidents = Counter('pdpa_data_breach_incidents', 'PDPA data breach incidents')

    def check_compliance(self):
        # Check data retention policies
        retention_compliance = self.check_data_retention()
        self.data_retention_compliance.set(1 if retention_compliance else 0)

        # Check consent management
        consent_compliance = self.check_consent_management()
        self.consent_management.set(1 if consent_compliance else 0)

        # Monitor for data breaches
        breach_detected = self.detect_data_breaches()
        if breach_detected:
            self.data_breach_incidents.inc()
            self.trigger_breach_alert()

    def check_data_retention(self):
        # Check if personal data is retained beyond required period
        cutoff_date = datetime.now() - timedelta(days=7*365)  # 7 years

        # Count records older than retention period
        old_records = User.objects.filter(
            date_joined__lt=cutoff_date,
            is_active=False
        ).count()

        return old_records == 0
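
check_consent_management and detect_data_breaches are referenced above but not defined. A hedged sketch of the consent check, assuming a hypothetical ConsentRecord model:

    def check_consent_management(self):
        # Hypothetical model; adapt to however consent is actually stored
        from compliance.models import ConsentRecord

        active_users = User.objects.filter(is_active=True).count()
        consented = ConsentRecord.objects.filter(
            user__is_active=True, withdrawn=False
        ).values('user').distinct().count()
        # Compliant when every active user has a live consent record
        return active_users == consented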

Conclusion

This comprehensive monitoring and maintenance guide ensures your Multi-Tenant SaaS Platform remains reliable, performant, and compliant with Malaysian regulations. Regular monitoring, proactive maintenance, and automated alerts will help you maintain high service quality and quickly address any issues that arise.

Remember to:

  • Monitor all system components regularly
  • Set up appropriate alerts for critical issues
  • Perform regular maintenance tasks
  • Keep systems updated and secure
  • Maintain compliance with Malaysian regulations
  • Document all procedures and incidents

For additional support, refer to the main documentation or contact the support team.