# Monitoring and Maintenance Guide
This guide provides comprehensive instructions for monitoring and maintaining the Multi-Tenant SaaS Platform in production environments.
## Overview
Effective monitoring and maintenance are crucial for ensuring the reliability, performance, and security of your Multi-Tenant SaaS Platform. This guide covers monitoring tools, maintenance procedures, and best practices for Malaysian SME deployments.
## Monitoring Architecture
### Components to Monitor
1. **Application Layer**: Django backend, React frontend
2. **Database Layer**: PostgreSQL with multi-tenant schemas
3. **Cache Layer**: Redis for caching and sessions
4. **Infrastructure Layer**: Server resources, network, storage
5. **Business Layer**: User activity, transactions, performance metrics
### Monitoring Stack
- **Prometheus**: Metrics collection and storage
- **Grafana**: Visualization and dashboards
- **Alertmanager**: Alerting and notifications
- **Elasticsearch**: Log aggregation and search
- **Kibana**: Log visualization and analysis
## Quick Setup
### 1. Install Monitoring Stack
```bash
# Create monitoring directory
mkdir -p /opt/monitoring
cd /opt/monitoring

# Create docker-compose.yml for monitoring
cat > docker-compose.yml << 'EOF'
version: '3.8'

services:
  # Prometheus
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      # The rules file referenced by prometheus.yml must be mounted as well
      - ./alert_rules.yml:/etc/prometheus/alert_rules.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      - '--web.enable-lifecycle'
    networks:
      - monitoring

  # Grafana
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/dashboards:/var/lib/grafana/dashboards
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=your-secure-password
    networks:
      - monitoring

  # Alertmanager
  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager_data:/alertmanager
    networks:
      - monitoring

  # Node Exporter
  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      # Renamed from --collector.filesystem.ignored-mount-points in node_exporter 1.x;
      # $$ escapes the $ from docker-compose variable interpolation
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    networks:
      - monitoring

  # PostgreSQL Exporter
  postgres-exporter:
    image: prometheuscommunity/postgres-exporter:latest
    ports:
      - "9187:9187"
    environment:
      # 'localhost' here is the container itself; if PostgreSQL runs on the
      # host, point this at an address reachable from the container
      - DATA_SOURCE_NAME=postgresql://multi_tenant_prod_user:your-password@localhost:5432/multi_tenant_saas_prod?sslmode=disable
    networks:
      - monitoring

  # Redis Exporter
  redis-exporter:
    image: oliver006/redis_exporter:latest
    ports:
      - "9121:9121"
    environment:
      # As above, replace 'localhost' with the Redis host reachable from this container
      - REDIS_ADDR=redis://localhost:6379
    networks:
      - monitoring

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:

networks:
  monitoring:
    driver: bridge
EOF
```
### 2. Configure Prometheus
```bash
# Create Prometheus configuration
cat > prometheus.yml << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # The exporters run on the same compose network, so use their service names
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'postgres-exporter'
    static_configs:
      - targets: ['postgres-exporter:9187']

  - job_name: 'redis-exporter'
    static_configs:
      - targets: ['redis-exporter:9121']

  # The application and Nginx run outside the compose network; replace
  # 'localhost' with an address reachable from the Prometheus container
  - job_name: 'django-app'
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: '/metrics'
    scrape_interval: 30s

  - job_name: 'nginx'
    static_configs:
      - targets: ['localhost:80']
    metrics_path: '/nginx_status'
    scrape_interval: 30s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
EOF
```
### 3. Configure Alertmanager
```bash
# Create Alertmanager configuration
cat > alertmanager.yml << 'EOF'
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alerts@your-domain.com'
  smtp_auth_username: 'your-email@domain.com'
  smtp_auth_password: 'your-email-password'

route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'

receivers:
  - name: 'web.hook'
    email_configs:
      - to: 'admin@your-domain.com'
        # email_configs take the subject via headers and the body via text/html
        headers:
          Subject: '[ALERT] {{ .GroupLabels.alertname }} - {{ .Status }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Labels: {{ .Labels }}
          {{ end }}

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
EOF
```
### 4. Create Alert Rules
```bash
# Create alert rules
cat > alert_rules.yml << 'EOF'
groups:
  - name: system
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage is above 80% for more than 5 minutes"

      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage detected"
          description: "Memory usage is above 80% for more than 5 minutes"

      - alert: LowDiskSpace
        expr: (node_filesystem_size_bytes{fstype!="tmpfs"} - node_filesystem_free_bytes{fstype!="tmpfs"}) / node_filesystem_size_bytes{fstype!="tmpfs"} * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space detected"
          description: "Disk usage is above 85% for more than 5 minutes"

  - name: database
    rules:
      - alert: PostgreSQLDown
        expr: up{job="postgres-exporter"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL is down"
          description: "PostgreSQL database is not responding"

      - alert: PostgreSQLSlowQueries
        expr: rate(pg_stat_database_calls_total[5m]) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High number of slow PostgreSQL queries"
          description: "PostgreSQL is experiencing slow queries"

      - alert: PostgreSQLConnectionsHigh
        expr: sum(pg_stat_database_numbackends) / sum(pg_settings_max_connections) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High PostgreSQL connection usage"
          description: "PostgreSQL connection usage is above 80%"

  - name: application
    rules:
      - alert: HighResponseTime
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time detected"
          description: "95th percentile response time is above 1 second"

      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100 > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "HTTP 5xx error rate is above 5%"

      - alert: ServiceDown
        expr: up{job="django-app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Application service is down"
          description: "The Django application is not responding"
EOF
```
### 5. Start Monitoring Stack
```bash
# Start monitoring services
docker-compose up -d
# Verify services are running
docker-compose ps
# Access monitoring dashboards
# Prometheus: http://localhost:9090
# Grafana: http://localhost:3000 (admin/your-secure-password)
# Alertmanager: http://localhost:9093
```
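Once the containers are up, Prometheus's HTTP API offers a quick way to confirm that every scrape target is actually being reached:
```bash
# Each target should report "health": "up"
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep -E '"(job|health)"'
```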
## Application Monitoring
### 1. Django Application Metrics
```python
# Add to settings.py
INSTALLED_APPS = [
    # ... other apps
    'django_prometheus',
]

MIDDLEWARE = [
    'django_prometheus.middleware.PrometheusBeforeMiddleware',
    # ... other middleware
    'django_prometheus.middleware.PrometheusAfterMiddleware',
]
```
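The middleware collects request metrics, but django_prometheus also needs its URLconf included so the `/metrics` endpoint configured in the Prometheus scrape job above actually exists. A minimal sketch:
```python
# urls.py -- expose /metrics for the 'django-app' scrape job
from django.urls import include, path

urlpatterns = [
    # ... your existing routes
    path('', include('django_prometheus.urls')),  # serves GET /metrics
]
```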
### 2. Custom Metrics
```python
# Create metrics.py
from prometheus_client import Counter, Histogram, Gauge
# Business metrics
active_tenants = Gauge('multi_tenant_active_tenants', 'Number of active tenants')
total_users = Gauge('multi_tenant_total_users', 'Total number of users')
total_transactions = Counter('multi_tenant_total_transactions', 'Total transactions')
# Performance metrics
api_response_time = Histogram('multi_tenant_api_response_time', 'API response time')
db_query_time = Histogram('multi_tenant_db_query_time', 'Database query time')
# Error metrics
api_errors = Counter('multi_tenant_api_errors', 'API errors', ['method', 'endpoint'])
db_errors = Counter('multi_tenant_db_errors', 'Database errors', ['operation'])
# Malaysian-specific metrics
malaysian_users = Gauge('multi_tenant_malaysian_users', 'Number of Malaysian users')
sst_transactions = Counter('multi_tenant_sst_transactions', 'SST transactions', ['rate'])
```
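These are ordinary prometheus_client instruments, so they can be updated from anywhere in the application. A short usage sketch (the `Tenant` import path is an assumption; adjust to your project):
```python
from core.models import Tenant  # hypothetical import path

from metrics import active_tenants, api_errors, db_query_time

def refresh_tenant_gauge():
    # A Gauge is set to an absolute value
    active_tenants.set(Tenant.objects.filter(is_active=True).count())

def fetch_tenant(tenant_id):
    # A Histogram can time a block via its context manager
    with db_query_time.time():
        return Tenant.objects.get(pk=tenant_id)

# A labelled Counter is incremented per label combination
api_errors.labels(method='GET', endpoint='/api/reports/').inc()
```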
### 3. Database Monitoring
```sql
-- Enable PostgreSQL extensions
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Schema for the monitoring views below
CREATE SCHEMA IF NOT EXISTS monitoring;

-- Create monitoring views
CREATE OR REPLACE VIEW monitoring.tenant_stats AS
SELECT
    t.schema_name,
    COUNT(u.id) AS user_count,
    COUNT(s.id) AS subscription_count,
    SUM(s.amount) AS total_revenue
FROM core_tenant t
LEFT JOIN core_user u ON t.id = u.tenant_id
LEFT JOIN core_subscription s ON t.id = s.tenant_id
GROUP BY t.schema_name;

-- Performance monitoring
-- Note: on PostgreSQL 13+ the pg_stat_statements columns are
-- mean_exec_time / total_exec_time instead of mean_time / total_time
CREATE OR REPLACE VIEW monitoring.query_performance AS
SELECT
    query,
    mean_time,
    calls,
    total_time,
    rows,
    100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 100;
```
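Once created, the views can be queried ad hoc or from a monitoring script, for example:
```bash
# Top ten tenants by revenue, using the view defined above
psql -U multi_tenant_prod_user -d multi_tenant_saas_prod \
  -c "SELECT * FROM monitoring.tenant_stats ORDER BY total_revenue DESC NULLS LAST LIMIT 10;"
```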
## Log Management
### 1. Centralized Logging with ELK Stack
```bash
# Create a separate directory and docker-compose.yml for the ELK stack
mkdir -p /opt/elk
cd /opt/elk

cat > docker-compose.yml << 'EOF'
version: '3.8'

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.17.0
    environment:
      - discovery.type=single-node
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ports:
      - "9200:9200"
    volumes:
      - elasticsearch_data:/usr/share/elasticsearch/data

  logstash:
    image: docker.elastic.co/logstash/logstash:7.17.0
    volumes:
      - ./logstash/pipeline:/usr/share/logstash/pipeline
    ports:
      - "5044:5044"

  kibana:
    image: docker.elastic.co/kibana/kibana:7.17.0
    ports:
      - "5601:5601"
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200

  filebeat:
    image: docker.elastic.co/beats/filebeat:7.17.0
    volumes:
      - ./filebeat.yml:/usr/share/filebeat/filebeat.yml
      - /var/log:/var/log:ro
    depends_on:
      - elasticsearch

volumes:
  elasticsearch_data:
EOF
```
### 2. Logstash Configuration
```ruby
# logstash/pipeline/logstash.conf
input {
  beats {
    port => 5044
  }
}

filter {
  if [type] == "django" {
    grok {
      match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:logger} - %{GREEDYDATA:message}" }
    }
    date {
      match => [ "timestamp", "ISO8601" ]
    }
  }

  if [type] == "nginx" {
    grok {
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
    date {
      match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    }
  }

  # Add Malaysian timezone context
  ruby {
    code => "event.set('[@metadata][tz_offset]', '+08:00')"
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}
```
### 3. Filebeat Configuration
```yaml
# filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/multi-tenant-saas/*.log
    fields:
      type: django
    # Place 'type' at the top level so the Logstash [type] conditionals match
    fields_under_root: true
  - type: log
    enabled: true
    paths:
      - /var/log/nginx/*.log
    fields:
      type: nginx
    fields_under_root: true

output.logstash:
  hosts: ["logstash:5044"]

processors:
  - add_docker_metadata:
      host: "unix:///var/run/docker.sock"
```
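Filebeat can validate its configuration and its connection to Logstash before you depend on it:
```bash
# Run inside the filebeat container (or wherever filebeat is installed)
filebeat test config -c /usr/share/filebeat/filebeat.yml
filebeat test output -c /usr/share/filebeat/filebeat.yml
```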
## Business Metrics Monitoring
### 1. Key Performance Indicators (KPIs)
```python
# KPI monitoring
from datetime import datetime

from django.db.models import Q, Sum
from prometheus_client import Counter, Gauge

# Import paths are project-specific; adjust to your app layout
from core.models import PaymentTransaction, Tenant


class BusinessMetrics:
    def __init__(self):
        self.active_tenants = Gauge('business_active_tenants', 'Active tenant count')
        self.monthly_revenue = Gauge('business_monthly_revenue', 'Monthly revenue')
        self.user_growth = Gauge('business_user_growth', 'User growth rate')
        self.churn_rate = Gauge('business_churn_rate', 'Customer churn rate')

        # Malaysian-specific metrics
        self.malaysian_tenant_percentage = Gauge('business_malaysian_tenant_percentage', 'Percentage of Malaysian tenants')
        self.sst_collected = Counter('business_sst_collected', 'SST amount collected')
        self.local_payment_methods = Counter('business_local_payments', 'Local payment method usage')

    def update_metrics(self):
        # Update active tenants
        active_count = Tenant.objects.filter(is_active=True).count()
        self.active_tenants.set(active_count)

        # Update monthly revenue (consider also filtering by year)
        monthly_rev = PaymentTransaction.objects.filter(
            created_at__month=datetime.now().month,
            status='completed'
        ).aggregate(total=Sum('amount'))['total'] or 0
        self.monthly_revenue.set(monthly_rev)

        # Update Malaysian metrics
        total_tenants = Tenant.objects.count()
        malaysian_tenants = Tenant.objects.filter(
            Q(business_address__country='Malaysia') |
            Q(contact_phone__startswith='+60')
        ).count()
        self.malaysian_tenant_percentage.set(
            (malaysian_tenants / total_tenants * 100) if total_tenants > 0 else 0
        )
```
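The gauges only reflect reality if `update_metrics()` runs on a schedule. One option is a periodic Celery task (a sketch assuming Celery is part of your stack; a cron-driven management command works equally well):
```python
# tasks.py -- refresh business KPIs periodically (module name is hypothetical)
from celery import shared_task

from .business_metrics import BusinessMetrics

metrics = BusinessMetrics()

@shared_task
def refresh_business_metrics():
    metrics.update_metrics()

# Schedule it in settings.py, e.g. every 300 seconds:
# CELERY_BEAT_SCHEDULE = {
#     'refresh-business-metrics': {
#         'task': 'tasks.refresh_business_metrics',
#         'schedule': 300,
#     },
# }
```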
### 2. Real-time Dashboards
Create Grafana dashboards for the following (a provisioning sketch follows this list):
- System health overview
- Application performance
- Database performance
- Business metrics
- User activity
- Malaysian market metrics
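Dashboards survive container rebuilds when provisioned from files. A minimal provider definition for the `./grafana/provisioning` mount used in the compose file above (folder name is an assumption):
```yaml
# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1

providers:
  - name: 'multi-tenant-saas'
    orgId: 1
    folder: 'SaaS Platform'
    type: file
    disableDeletion: false
    options:
      # Matches the ./grafana/dashboards volume mount in docker-compose.yml
      path: /var/lib/grafana/dashboards
```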
## Malaysian-Specific Monitoring
### 1. SST Compliance Monitoring
```python
# SST monitoring
from datetime import datetime

from prometheus_client import Gauge

from core.models import PaymentTransaction  # adjust import path to your project


class SSTMonitor:
    def __init__(self):
        self.sst_rate_compliance = Gauge('sst_rate_compliance', 'SST rate compliance')
        self.sst_filing_deadline = Gauge('sst_filing_days_remaining', 'Days until SST filing deadline')
        self.sst_collected_vs_reported = Gauge('sst_collected_vs_reported', 'SST collected vs reported')

    def check_sst_compliance(self):
        # Check if SST rates are correctly applied (6% service tax)
        expected_rate = 0.06
        actual_rates = PaymentTransaction.objects.filter(
            created_at__month=datetime.now().month
        ).values_list('tax_rate', flat=True).distinct()
        # float() guards against mixing Decimal fields with the float expected_rate
        compliance = all(abs(float(rate) - expected_rate) < 0.001 for rate in actual_rates)
        self.sst_rate_compliance.set(1 if compliance else 0)

        # Check SST filing deadline
        today = datetime.now().date()
        filing_deadline = self.get_sst_filing_deadline(today)
        days_remaining = (filing_deadline - today).days
        self.sst_filing_deadline.set(days_remaining)

        # Alert if deadline is approaching
        if days_remaining <= 7:
            self.trigger_sst_deadline_alert(days_remaining)
```
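`get_sst_filing_deadline` is left to the implementer above. A sketch under the common arrangement that SST returns cover two-month taxable periods (Jan-Feb, Mar-Apr, ...) and are due by the last day of the following month; confirm your actual taxable-period assignment with Customs:
```python
import calendar
from datetime import date

def get_sst_filing_deadline(today: date) -> date:
    """Next SST-02 deadline, assuming two-month taxable periods starting in
    January; under that assumption deadlines fall at the end of odd months."""
    year, month = today.year, today.month
    while True:
        if month % 2 == 1:  # odd month -> candidate deadline month
            deadline = date(year, month, calendar.monthrange(year, month)[1])
            if deadline >= today:
                return deadline
        month += 1
        if month > 12:
            month, year = 1, year + 1
```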
### 2. Malaysian Business Hours Monitoring
```python
# Malaysian business hours monitoring
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

from prometheus_client import Gauge


class BusinessHoursMonitor:
    def __init__(self):
        self.business_hour_activity = Gauge('business_hour_activity', 'Activity during business hours')
        self.off_hour_activity = Gauge('off_hour_activity', 'Activity outside business hours')

    def monitor_activity(self):
        # Malaysian business hours: 9 AM - 6 PM, Monday - Friday
        # Use Malaysian local time rather than the server clock
        now = datetime.now(ZoneInfo('Asia/Kuala_Lumpur'))
        is_business_hour = (
            now.weekday() < 5 and  # Monday - Friday
            9 <= now.hour < 18     # 9 AM - 6 PM
        )
        if is_business_hour:
            self.business_hour_activity.inc()
        else:
            self.off_hour_activity.inc()
```
### 3. Malaysian Payment Gateway Monitoring
```python
# Payment gateway monitoring
from prometheus_client import Counter, Gauge, Histogram


class PaymentGatewayMonitor:
    def __init__(self):
        # Label names must be declared for the .labels(gateway=...) calls below
        self.payment_success_rate = Gauge('payment_success_rate', 'Payment success rate', ['gateway'])
        self.gateway_response_time = Histogram('gateway_response_time', 'Payment gateway response time', ['gateway'])
        self.gateway_downtime = Counter('gateway_downtime', 'Payment gateway downtime', ['gateway'])

    def monitor_gateways(self):
        gateways = ['touch_n_go', 'grabpay', 'online_banking']
        for gateway in gateways:
            success_rate = self.calculate_success_rate(gateway)
            self.payment_success_rate.labels(gateway=gateway).set(success_rate)

            # Monitor response times
            response_time = self.measure_response_time(gateway)
            self.gateway_response_time.labels(gateway=gateway).observe(response_time)

            # Check for downtime
            if not self.is_gateway_available(gateway):
                self.gateway_downtime.labels(gateway=gateway).inc()
```
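The helper methods are left open above; a sketch of `calculate_success_rate` against an assumed `PaymentTransaction` model with `gateway` and `status` fields:
```python
from datetime import timedelta

from django.utils import timezone

from core.models import PaymentTransaction  # hypothetical import path

def calculate_success_rate(gateway: str, window_hours: int = 24) -> float:
    """Share of completed transactions for a gateway over a rolling window."""
    since = timezone.now() - timedelta(hours=window_hours)
    qs = PaymentTransaction.objects.filter(gateway=gateway, created_at__gte=since)
    total = qs.count()
    if total == 0:
        return 1.0  # no traffic: treat as healthy rather than alerting
    return qs.filter(status='completed').count() / total
```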
## Maintenance Procedures
### 1. Daily Maintenance
```bash
#!/bin/bash
# daily_maintenance.sh
# Log maintenance
echo "$(date): Starting daily maintenance" >> /var/log/maintenance.log
# Rotate logs
logrotate -f /etc/logrotate.d/multi-tenant-saas
# Clear old logs
find /var/log/multi-tenant-saas -name "*.log.*" -mtime +30 -delete
# Monitor disk space
df -h | awk '$5+0 > 85 {print $6 " is " $5 " full"}' >> /var/log/maintenance.log
# Check service health
systemctl is-active --quiet gunicorn || echo "Gunicorn service is down" >> /var/log/maintenance.log
systemctl is-active --quiet nginx || echo "Nginx service is down" >> /var/log/maintenance.log
# Check database connections
psql -U multi_tenant_prod_user -d multi_tenant_saas_prod -c "SELECT count(*) FROM pg_stat_activity;" >> /var/log/maintenance.log
# Clear cache (FLUSHDB wipes the entire Redis database; make sure only cache data lives in it)
redis-cli FLUSHDB >> /var/log/maintenance.log
echo "$(date): Daily maintenance completed" >> /var/log/maintenance.log
```
### 2. Weekly Maintenance
```bash
#!/bin/bash
# weekly_maintenance.sh
# Database maintenance
echo "$(date): Starting weekly database maintenance" >> /var/log/maintenance.log
# Vacuum and analyze
psql -U multi_tenant_prod_user -d multi_tenant_saas_prod -c "VACUUM ANALYZE;" >> /var/log/maintenance.log
# Update statistics
psql -U multi_tenant_prod_user -d multi_tenant_saas_prod -c "ANALYZE;" >> /var/log/maintenance.log
# Check table sizes
psql -U multi_tenant_prod_user -d multi_tenant_saas_prod -c "
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
" >> /var/log/maintenance.log
# Index maintenance
psql -U multi_tenant_prod_user -d multi_tenant_saas_prod -c "REINDEX DATABASE multi_tenant_saas_prod;" >> /var/log/maintenance.log
echo "$(date): Weekly database maintenance completed" >> /var/log/maintenance.log
```
### 3. Monthly Maintenance
```bash
#!/bin/bash
# monthly_maintenance.sh
# Security updates
echo "$(date): Starting monthly security updates" >> /var/log/maintenance.log
# Update system packages
apt-get update && apt-get upgrade -y >> /var/log/maintenance.log
# Update Python packages
source /opt/multi-tenant-saas/venv/bin/activate
pip list --outdated >> /var/log/maintenance.log
pip install --upgrade -r /opt/multi-tenant-saas/requirements.txt >> /var/log/maintenance.log
# Update Node packages
cd /opt/multi-tenant-saas/frontend
npm update >> /var/log/maintenance.log
# Database backup full
/opt/multi-tenant-saas/scripts/backup-database.sh >> /var/log/maintenance.log
# SSL certificate check
openssl x509 -in /etc/letsencrypt/live/your-domain.com/fullchain.pem -text -noout | grep "Not After" >> /var/log/maintenance.log
# Performance review
# Check slow queries
psql -U multi_tenant_prod_user -d multi_tenant_saas_prod -c "
SELECT query, mean_time, calls
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;
" >> /var/log/maintenance.log
echo "$(date): Monthly maintenance completed" >> /var/log/maintenance.log
```
## Automated Scheduling
### 1. Cron Jobs
```bash
# Add to crontab
# Daily maintenance at 2 AM
0 2 * * * /opt/multi-tenant-saas/scripts/daily_maintenance.sh
# Weekly maintenance on Sunday at 3 AM
0 3 * * 0 /opt/multi-tenant-saas/scripts/weekly_maintenance.sh
# Monthly maintenance on 1st of month at 4 AM
0 4 1 * * /opt/multi-tenant-saas/scripts/monthly_maintenance.sh
# Database backup daily at 1 AM
0 1 * * * /opt/multi-tenant-saas/scripts/backup-database.sh
# Log rotation daily at midnight
0 0 * * * /usr/sbin/logrotate -f /etc/logrotate.d/multi-tenant-saas
# SSL certificate renewal check weekly
0 0 * * 0 /opt/multi-tenant-saas/scripts/check-ssl.sh
```
### 2. Systemd Timers
```bash
# Create systemd timer for daily maintenance
cat > /etc/systemd/system/daily-maintenance.timer << 'EOF'
[Unit]
Description=Daily maintenance tasks
Requires=daily-maintenance.service
[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true
[Install]
WantedBy=timers.target
EOF
# Create systemd service
cat > /etc/systemd/system/daily-maintenance.service << 'EOF'
[Unit]
Description=Daily maintenance tasks
[Service]
Type=oneshot
ExecStart=/opt/multi-tenant-saas/scripts/daily_maintenance.sh
User=root
Group=root
EOF
# Enable timer
systemctl enable daily-maintenance.timer
systemctl start daily-maintenance.timer
```
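Confirm the timer is registered and see its next scheduled run:
```bash
systemctl list-timers daily-maintenance.timer
```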
## Disaster Recovery
### 1. Backup Verification
```bash
#!/bin/bash
# verify_backups.sh
BACKUP_DIR="/opt/multi-tenant-saas/backups"
LOG_FILE="/var/log/backup-verification.log"
echo "$(date): Starting backup verification" >> $LOG_FILE
# Check if backups exist
if [ ! -d "$BACKUP_DIR" ]; then
echo "Backup directory does not exist" >> $LOG_FILE
exit 1
fi
# Check latest backup
LATEST_BACKUP=$(ls -t "$BACKUP_DIR"/database_backup_*.sql.gz 2>/dev/null | head -1)
if [ -z "$LATEST_BACKUP" ]; then
echo "No database backup found" >> $LOG_FILE
exit 1
fi
# Verify backup integrity
if gzip -t "$LATEST_BACKUP"; then
echo "Backup integrity verified: $LATEST_BACKUP" >> $LOG_FILE
else
echo "Backup integrity check failed: $LATEST_BACKUP" >> $LOG_FILE
exit 1
fi
# Check backup size
BACKUP_SIZE=$(du -h "$LATEST_BACKUP" | cut -f1)
echo "Backup size: $BACKUP_SIZE" >> $LOG_FILE
# Test restore (create test database)
TEST_DB="backup_test_$(date +%Y%m%d)"
createdb -U multi_tenant_prod_user "$TEST_DB"
gunzip -c "$LATEST_BACKUP" | psql -U multi_tenant_prod_user "$TEST_DB"
# Verify data
TABLE_COUNT=$(psql -U multi_tenant_prod_user -d "$TEST_DB" -t -c "SELECT count(*) FROM information_schema.tables WHERE table_schema = 'public';")
echo "Table count in backup: $TABLE_COUNT" >> $LOG_FILE
# Clean up test database
dropdb -U multi_tenant_prod_user "$TEST_DB"
echo "$(date): Backup verification completed successfully" >> $LOG_FILE
```
### 2. Failover Procedures
```bash
#!/bin/bash
# failover_procedures.sh
PRIMARY_SERVER="primary.your-domain.com"
STANDBY_SERVER="standby.your-domain.com"

# Check primary server health
if ! curl -f http://$PRIMARY_SERVER/health/ > /dev/null 2>&1; then
    echo "$(date): Primary server is down, initiating failover" >> /var/log/failover.log

    # Promote the standby with pg_ctl; adjust the data directory to your installation
    ssh $STANDBY_SERVER "sudo -u postgres pg_ctl promote -D /var/lib/postgresql/data"

    # Update DNS
    # This would integrate with your DNS provider's API
    curl -X POST "https://api.dns-provider.com/update" \
        -H "Authorization: Bearer $DNS_API_KEY" \
        -d '{"record":"your-domain.com","value":"'$STANDBY_SERVER'"}'

    # Notify administrators
    echo "Failover completed. Standby server is now primary." | mail -s "Failover Completed" admin@your-domain.com
    echo "$(date): Failover completed" >> /var/log/failover.log
fi
```
## Performance Optimization
### 1. Database Optimization
```sql
-- Create performance monitoring views
-- (the per-table statistics functions take a regclass/OID, hence the casts)
CREATE OR REPLACE VIEW monitoring.performance_metrics AS
SELECT
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size,
    pg_stat_get_numscans((quote_ident(schemaname)||'.'||quote_ident(tablename))::regclass) AS scans,
    pg_stat_get_tuples_returned((quote_ident(schemaname)||'.'||quote_ident(tablename))::regclass) AS tuples_returned,
    pg_stat_get_tuples_fetched((quote_ident(schemaname)||'.'||quote_ident(tablename))::regclass) AS tuples_fetched
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
```
### 2. Application Optimization
```python
# Add to Django settings
CACHES = {
    'default': {
        # CLIENT_CLASS is a django-redis option, so use the django-redis backend
        # (Django's built-in RedisCache ignores it)
        'BACKEND': 'django_redis.cache.RedisCache',
        'LOCATION': 'redis://localhost:6379/1',
        'TIMEOUT': 300,
        'OPTIONS': {
            'CLIENT_CLASS': 'django_redis.client.DefaultClient',
        }
    }
}

# Database connection pooling
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'multi_tenant_saas_prod',
        'USER': 'multi_tenant_prod_user',
        'PASSWORD': 'your-password',
        'HOST': 'localhost',
        'PORT': '5432',
        'CONN_MAX_AGE': 60,
        'OPTIONS': {
            'connect_timeout': 10,
            'options': '-c statement_timeout=30000',
        }
    }
}
```
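With the cache in place, expensive per-tenant queries can be memoized through Django's standard cache API; a small sketch (`compute_dashboard_stats` is a hypothetical helper):
```python
from django.core.cache import cache

def tenant_dashboard_stats(tenant_id):
    # Recomputed at most once per 300 s per tenant
    return cache.get_or_set(
        f'dashboard_stats:{tenant_id}',
        lambda: compute_dashboard_stats(tenant_id),
        timeout=300,
    )
```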
## Security Monitoring
### 1. Intrusion Detection
```bash
# Install fail2ban
apt-get install -y fail2ban
# Configure fail2ban for SSH
cat > /etc/fail2ban/jail.local << 'EOF'
[sshd]
enabled = true
port = ssh
filter = sshd
logpath = /var/log/auth.log
maxretry = 3
bantime = 3600
findtime = 600
[nginx-http-auth]
enabled = true
port = http,https
filter = nginx-http-auth
logpath = /var/log/nginx/error.log
maxretry = 5
bantime = 3600
findtime = 600
EOF
# Restart fail2ban
systemctl restart fail2ban
```
### 2. File Integrity Monitoring
```bash
# Install AIDE
apt-get install -y aide
# Initialize AIDE
aideinit
# Configure daily checks
cat > /etc/cron.daily/aide << 'EOF'
#!/bin/sh
/usr/bin/aide --check
EOF
chmod +x /etc/cron.daily/aide
```
## Malaysian Compliance Monitoring
### 1. PDPA Compliance Monitoring
```python
# PDPA compliance monitor
from datetime import datetime, timedelta

from django.contrib.auth import get_user_model
from prometheus_client import Counter, Gauge

User = get_user_model()


class PDPAComplianceMonitor:
    def __init__(self):
        self.data_retention_compliance = Gauge('pdpa_data_retention_compliance', 'PDPA data retention compliance')
        self.consent_management = Gauge('pdpa_consent_management', 'PDPA consent management compliance')
        self.data_breach_incidents = Counter('pdpa_data_breach_incidents', 'PDPA data breach incidents')

    def check_compliance(self):
        # Check data retention policies
        retention_compliance = self.check_data_retention()
        self.data_retention_compliance.set(1 if retention_compliance else 0)

        # Check consent management
        consent_compliance = self.check_consent_management()
        self.consent_management.set(1 if consent_compliance else 0)

        # Monitor for data breaches
        breach_detected = self.detect_data_breaches()
        if breach_detected:
            self.data_breach_incidents.inc()
            self.trigger_breach_alert()

    def check_data_retention(self):
        # Check if personal data is retained beyond the required period
        cutoff_date = datetime.now() - timedelta(days=7 * 365)  # 7 years

        # Count inactive records older than the retention period
        old_records = User.objects.filter(
            date_joined__lt=cutoff_date,
            is_active=False
        ).count()
        return old_records == 0
```
## Conclusion
This monitoring and maintenance guide helps ensure your Multi-Tenant SaaS Platform remains reliable, performant, and compliant with Malaysian regulations. Regular monitoring, proactive maintenance, and automated alerts will help you maintain high service quality and address issues quickly as they arise.
Remember to:
- Monitor all system components regularly
- Set up appropriate alerts for critical issues
- Perform regular maintenance tasks
- Keep systems updated and secure
- Maintain compliance with Malaysian regulations
- Document all procedures and incidents
For additional support, refer to the main documentation or contact the support team.