# Monitoring and Maintenance Guide

This guide provides comprehensive instructions for monitoring and maintaining the Multi-Tenant SaaS Platform in production environments.

## Overview

Effective monitoring and maintenance are crucial for ensuring the reliability, performance, and security of your Multi-Tenant SaaS Platform. This guide covers monitoring tools, maintenance procedures, and best practices for Malaysian SME deployments.
## Monitoring Architecture

### Components to Monitor

1. **Application Layer**: Django backend, React frontend
2. **Database Layer**: PostgreSQL with multi-tenant schemas
3. **Cache Layer**: Redis for caching and sessions
4. **Infrastructure Layer**: Server resources, network, storage
5. **Business Layer**: User activity, transactions, performance metrics

### Monitoring Stack

- **Prometheus**: Metrics collection and storage
- **Grafana**: Visualization and dashboards
- **Alertmanager**: Alerting and notifications
- **Elasticsearch**: Log aggregation and search
- **Kibana**: Log visualization and analysis
## Quick Setup

### 1. Install Monitoring Stack

```bash
# Create monitoring directory
mkdir -p /opt/monitoring
cd /opt/monitoring

# Create docker-compose.yml for monitoring
cat > docker-compose.yml << 'EOF'
version: '3.8'

services:
  # Prometheus
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      - '--web.enable-lifecycle'
    networks:
      - monitoring

  # Grafana
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/dashboards:/var/lib/grafana/dashboards
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=your-secure-password
    networks:
      - monitoring

  # Alertmanager
  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager_data:/alertmanager
    networks:
      - monitoring

  # Node Exporter
  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      # Newer node-exporter releases rename this flag to
      # --collector.filesystem.mount-points-exclude
      - '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc)($|/)'
    networks:
      - monitoring

  # PostgreSQL Exporter
  postgres-exporter:
    image: prometheuscommunity/postgres-exporter:latest
    ports:
      - "9187:9187"
    environment:
      # "localhost" here refers to the exporter container itself; point this
      # at the database host (e.g. host.docker.internal or the server's IP).
      - DATA_SOURCE_NAME=postgresql://multi_tenant_prod_user:your-password@localhost:5432/multi_tenant_saas_prod?sslmode=disable
    networks:
      - monitoring

  # Redis Exporter
  redis-exporter:
    image: oliver006/redis_exporter:latest
    ports:
      - "9121:9121"
    environment:
      # As above, replace "localhost" with the address of your Redis host.
      - REDIS_ADDR=redis://localhost:6379
    networks:
      - monitoring

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:

networks:
  monitoring:
    driver: bridge
EOF
```
### 2. Configure Prometheus

```bash
# Create Prometheus configuration
cat > prometheus.yml << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Containerised exporters are reached by service name on the
  # "monitoring" network, not via localhost.
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'postgres-exporter'
    static_configs:
      - targets: ['postgres-exporter:9187']

  - job_name: 'redis-exporter'
    static_configs:
      - targets: ['redis-exporter:9121']

  # The application and Nginx run on the host; replace "localhost" with an
  # address the Prometheus container can reach (e.g. host.docker.internal).
  - job_name: 'django-app'
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: '/metrics'
    scrape_interval: 30s

  - job_name: 'nginx'
    static_configs:
      - targets: ['localhost:80']
    metrics_path: '/nginx_status'
    scrape_interval: 30s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
EOF
```
### 3. Configure Alertmanager

```bash
# Create Alertmanager configuration
cat > alertmanager.yml << 'EOF'
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alerts@your-domain.com'
  smtp_auth_username: 'your-email@domain.com'
  smtp_auth_password: 'your-email-password'

route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'

receivers:
  - name: 'web.hook'
    email_configs:
      - to: 'admin@your-domain.com'
        subject: '[ALERT] {{ .GroupLabels.alertname }} - {{ .Status }}'
        body: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Labels: {{ .Labels }}
          {{ end }}

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
EOF
```
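Before relying on this in production, it is worth confirming that routing and email delivery actually work. A minimal sketch that fires a synthetic alert through Alertmanager's v2 API (assumes the `requests` package; the alert name and labels are arbitrary test values):

```python
# send_test_alert.py - fire a synthetic alert to exercise the email route.
import datetime

import requests

ALERTMANAGER_URL = "http://localhost:9093"

def send_test_alert():
    now = datetime.datetime.now(datetime.timezone.utc)
    alert = [{
        "labels": {"alertname": "TestAlert", "severity": "warning"},
        "annotations": {
            "summary": "Test alert",
            "description": "Fired manually to verify alert routing",
        },
        "startsAt": now.isoformat(),
        "endsAt": (now + datetime.timedelta(minutes=5)).isoformat(),
    }]
    resp = requests.post(f"{ALERTMANAGER_URL}/api/v2/alerts", json=alert, timeout=5)
    resp.raise_for_status()
    print("Test alert accepted; check the admin inbox for the email.")

if __name__ == "__main__":
    send_test_alert()
```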
### 4. Create Alert Rules

```bash
# Create alert rules
cat > alert_rules.yml << 'EOF'
groups:
  - name: system
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage is above 80% for more than 5 minutes"

      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage detected"
          description: "Memory usage is above 80% for more than 5 minutes"

      - alert: LowDiskSpace
        expr: (node_filesystem_size_bytes{fstype!="tmpfs"} - node_filesystem_free_bytes{fstype!="tmpfs"}) / node_filesystem_size_bytes{fstype!="tmpfs"} * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space detected"
          description: "Disk usage is above 85% for more than 5 minutes"

  - name: database
    rules:
      - alert: PostgreSQLDown
        expr: up{job="postgres-exporter"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL is down"
          description: "PostgreSQL database is not responding"

      # Long-running transactions are used here as a proxy for slow queries;
      # pg_stat_activity_max_tx_duration is a standard postgres_exporter metric.
      - alert: PostgreSQLSlowQueries
        expr: max(pg_stat_activity_max_tx_duration) > 60
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Long-running PostgreSQL transactions"
          description: "PostgreSQL has transactions running for more than 60 seconds"

      - alert: PostgreSQLConnectionsHigh
        expr: sum(pg_stat_database_numbackends) / sum(pg_settings_max_connections) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High PostgreSQL connection usage"
          description: "PostgreSQL connection usage is above 80%"

  - name: application
    rules:
      - alert: HighResponseTime
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time detected"
          description: "95th percentile response time is above 1 second"

      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100 > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "HTTP 5xx error rate is above 5%"

      - alert: ServiceDown
        expr: up{job="django-app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Application service is down"
          description: "The Django application is not responding"
EOF
```
### 5. Start Monitoring Stack

```bash
# Start monitoring services
docker-compose up -d

# Verify services are running
docker-compose ps

# Access monitoring dashboards
# Prometheus: http://localhost:9090
# Grafana: http://localhost:3000 (admin/your-secure-password)
# Alertmanager: http://localhost:9093
```
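With the containers running, confirm that every scrape target is actually healthy, not just that the services started. A small sketch against Prometheus's HTTP API (assumes the `requests` package):

```python
# check_targets.py - report scrape health for every Prometheus target.
import requests

PROMETHEUS_URL = "http://localhost:9090"

def check_targets():
    # /api/v1/targets returns every active scrape target with its health.
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/targets", timeout=5)
    resp.raise_for_status()
    for target in resp.json()["data"]["activeTargets"]:
        job = target["labels"].get("job", "unknown")
        health = target["health"]  # "up", "down", or "unknown"
        print(f"{job:20s} {target['scrapeUrl']:50s} {health}")
        if health != "up":
            print(f"  last error: {target.get('lastError', 'n/a')}")

if __name__ == "__main__":
    check_targets()
```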
## Application Monitoring

### 1. Django Application Metrics

```python
# Add to settings.py (requires the django-prometheus package)
INSTALLED_APPS = [
    # ... other apps
    'django_prometheus',
]

MIDDLEWARE = [
    'django_prometheus.middleware.PrometheusBeforeMiddleware',
    # ... other middleware
    'django_prometheus.middleware.PrometheusAfterMiddleware',
]
```
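The middleware collects request metrics, but they still need to be exposed for the `django-app` scrape job. A minimal sketch wiring up django_prometheus's metrics endpoint in the root `urls.py`:

```python
# urls.py - expose the /metrics endpoint scraped by the 'django-app' job.
from django.urls import include, path

urlpatterns = [
    # ... your other routes
    path('', include('django_prometheus.urls')),  # serves GET /metrics
]
```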
### 2. Custom Metrics

```python
# Create metrics.py
from prometheus_client import Counter, Histogram, Gauge

# Business metrics
active_tenants = Gauge('multi_tenant_active_tenants', 'Number of active tenants')
total_users = Gauge('multi_tenant_total_users', 'Total number of users')
total_transactions = Counter('multi_tenant_total_transactions', 'Total transactions')

# Performance metrics
api_response_time = Histogram('multi_tenant_api_response_time', 'API response time')
db_query_time = Histogram('multi_tenant_db_query_time', 'Database query time')

# Error metrics
api_errors = Counter('multi_tenant_api_errors', 'API errors', ['method', 'endpoint'])
db_errors = Counter('multi_tenant_db_errors', 'Database errors', ['operation'])

# Malaysian-specific metrics
malaysian_users = Gauge('multi_tenant_malaysian_users', 'Number of Malaysian users')
sst_transactions = Counter('multi_tenant_sst_transactions', 'SST transactions', ['rate'])
```
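As a sketch of how these metrics might be recorded in practice (the view and its transaction logic are illustrative, not part of the platform's real code):

```python
# views.py - illustrative use of the custom metrics above.
import time

from django.http import JsonResponse

from .metrics import api_errors, api_response_time, total_transactions

def create_transaction(request):
    start = time.monotonic()
    try:
        # ... create the transaction here ...
        total_transactions.inc()
        return JsonResponse({"status": "ok"})
    except Exception:
        # Counter declared with ['method', 'endpoint'] labels in metrics.py
        api_errors.labels(method=request.method, endpoint=request.path).inc()
        raise
    finally:
        # Record the elapsed time whether the request succeeded or failed
        api_response_time.observe(time.monotonic() - start)
```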
### 3. Database Monitoring

```sql
-- Enable PostgreSQL extensions (pg_stat_statements must also be listed in
-- shared_preload_libraries in postgresql.conf)
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Create a schema for the monitoring views
CREATE SCHEMA IF NOT EXISTS monitoring;

-- Create monitoring views
CREATE OR REPLACE VIEW monitoring.tenant_stats AS
SELECT
    t.schema_name,
    COUNT(DISTINCT u.id) as user_count,
    COUNT(DISTINCT s.id) as subscription_count,
    -- Summing s.amount over the joined rows would double-count whenever a
    -- tenant has both multiple users and multiple subscriptions, so revenue
    -- is aggregated in a correlated subquery instead.
    (SELECT COALESCE(SUM(s2.amount), 0)
       FROM core_subscription s2
      WHERE s2.tenant_id = t.id) as total_revenue
FROM core_tenant t
LEFT JOIN core_user u ON t.id = u.tenant_id
LEFT JOIN core_subscription s ON t.id = s.tenant_id
GROUP BY t.id, t.schema_name;

-- Performance monitoring
-- (on PostgreSQL 13+ the columns are mean_exec_time / total_exec_time)
CREATE OR REPLACE VIEW monitoring.query_performance AS
SELECT
    query,
    mean_time,
    calls,
    total_time,
    rows,
    100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 100;
```
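To surface these views in Prometheus alongside the exporter metrics, one option is to read them from the application and publish per-tenant gauges. A sketch assuming the `monitoring.tenant_stats` view above and Django's database connection:

```python
# tenant_metrics.py - feed the tenant_stats view into Prometheus gauges.
from django.db import connection
from prometheus_client import Gauge

tenant_users = Gauge('multi_tenant_tenant_users', 'Users per tenant', ['schema'])
tenant_revenue = Gauge('multi_tenant_tenant_revenue', 'Revenue per tenant', ['schema'])

def collect_tenant_stats():
    with connection.cursor() as cursor:
        cursor.execute(
            "SELECT schema_name, user_count, total_revenue "
            "FROM monitoring.tenant_stats"
        )
        for schema, users, revenue in cursor.fetchall():
            tenant_users.labels(schema=schema).set(users or 0)
            tenant_revenue.labels(schema=schema).set(float(revenue or 0))
```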
## Log Management

### 1. Centralized Logging with ELK Stack

```yaml
# docker-compose.yml for the ELK stack
version: '3.8'

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.17.0
    environment:
      - discovery.type=single-node
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ports:
      - "9200:9200"
    volumes:
      - elasticsearch_data:/usr/share/elasticsearch/data

  logstash:
    image: docker.elastic.co/logstash/logstash:7.17.0
    volumes:
      - ./logstash/pipeline:/usr/share/logstash/pipeline
    ports:
      - "5044:5044"

  kibana:
    image: docker.elastic.co/kibana/kibana:7.17.0
    ports:
      - "5601:5601"
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200

  filebeat:
    image: docker.elastic.co/beats/filebeat:7.17.0
    volumes:
      - ./filebeat.yml:/usr/share/filebeat/filebeat.yml
      - /var/log:/var/log:ro
    depends_on:
      - elasticsearch

volumes:
  elasticsearch_data:
```
### 2. Logstash Configuration

```ruby
# logstash/pipeline/logstash.conf
input {
  beats {
    port => 5044
  }
}

filter {
  if [type] == "django" {
    grok {
      match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:logger} - %{GREEDYDATA:message}" }
    }
    date {
      match => [ "timestamp", "ISO8601" ]
    }
  }

  if [type] == "nginx" {
    grok {
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
    date {
      match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    }
  }

  # Add Malaysian timezone context
  date {
    match => [ "timestamp", "ISO8601" ]
    target => "@timestamp"
  }

  ruby {
    code => "event.set('[@metadata][tz_offset]', '+08:00')"
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}
```
### 3. Filebeat Configuration

```yaml
# filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/multi-tenant-saas/*.log
    fields:
      type: django
    # Place the custom "type" field at the event root so the Logstash
    # conditionals above ([type] == "django") actually match.
    fields_under_root: true

  - type: log
    enabled: true
    paths:
      - /var/log/nginx/*.log
    fields:
      type: nginx
    fields_under_root: true

output.logstash:
  hosts: ["logstash:5044"]

processors:
  - add_docker_metadata:
      host: "unix:///var/run/docker.sock"
```
## Business Metrics Monitoring

### 1. Key Performance Indicators (KPIs)

```python
# KPI monitoring
from datetime import datetime

from django.db.models import Q, Sum
from prometheus_client import Counter, Gauge

from core.models import PaymentTransaction, Tenant  # adjust to your app layout

class BusinessMetrics:
    def __init__(self):
        self.active_tenants = Gauge('business_active_tenants', 'Active tenant count')
        self.monthly_revenue = Gauge('business_monthly_revenue', 'Monthly revenue')
        self.user_growth = Gauge('business_user_growth', 'User growth rate')
        self.churn_rate = Gauge('business_churn_rate', 'Customer churn rate')

        # Malaysian-specific metrics
        self.malaysian_tenant_percentage = Gauge('business_malaysian_tenant_percentage', 'Percentage of Malaysian tenants')
        self.sst_collected = Counter('business_sst_collected', 'SST amount collected')
        self.local_payment_methods = Counter('business_local_payments', 'Local payment method usage')

    def update_metrics(self):
        # Update active tenants
        active_count = Tenant.objects.filter(is_active=True).count()
        self.active_tenants.set(active_count)

        # Update monthly revenue
        monthly_rev = PaymentTransaction.objects.filter(
            created_at__month=datetime.now().month,
            status='completed'
        ).aggregate(total=Sum('amount'))['total'] or 0
        self.monthly_revenue.set(monthly_rev)

        # Update Malaysian metrics
        total_tenants = Tenant.objects.count()
        malaysian_tenants = Tenant.objects.filter(
            Q(business_address__country='Malaysia') |
            Q(contact_phone__startswith='+60')
        ).count()
        self.malaysian_tenant_percentage.set(
            (malaysian_tenants / total_tenants * 100) if total_tenants > 0 else 0
        )
```
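These KPIs only stay current if something calls `update_metrics()` periodically. A sketch using Celery beat, assuming Celery is already configured for the project (the module paths are illustrative):

```python
# tasks.py - refresh the KPIs on a schedule with Celery beat.
from celery import shared_task

from .metrics import BusinessMetrics

# Instantiated once at import time so the gauges register exactly once.
business_metrics = BusinessMetrics()

@shared_task
def refresh_business_metrics():
    business_metrics.update_metrics()

# In the Celery config, run the task every five minutes, e.g.:
# app.conf.beat_schedule = {
#     'refresh-business-metrics': {
#         'task': 'tasks.refresh_business_metrics',
#         'schedule': 300.0,
#     },
# }
```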
### 2. Real-time Dashboards

Create Grafana dashboards for:
- System health overview
- Application performance
- Database performance
- Business metrics
- User activity
- Malaysian market metrics
## Malaysian-Specific Monitoring

### 1. SST Compliance Monitoring

```python
# SST monitoring
from datetime import datetime

from prometheus_client import Gauge

from core.models import PaymentTransaction  # adjust to your app layout

class SSTMonitor:
    def __init__(self):
        self.sst_rate_compliance = Gauge('sst_rate_compliance', 'SST rate compliance')
        self.sst_filing_deadline = Gauge('sst_filing_days_remaining', 'Days until SST filing deadline')
        self.sst_collected_vs_reported = Gauge('sst_collected_vs_reported', 'SST collected vs reported')

    def check_sst_compliance(self):
        # Check if SST rates are correctly applied. Verify the rate that
        # applies to your services; service tax rates have changed over time.
        expected_rate = 0.06
        actual_rates = PaymentTransaction.objects.filter(
            created_at__month=datetime.now().month
        ).values_list('tax_rate', flat=True).distinct()

        compliance = all(abs(rate - expected_rate) < 0.001 for rate in actual_rates)
        self.sst_rate_compliance.set(1 if compliance else 0)

        # Check SST filing deadline
        today = datetime.now().date()
        filing_deadline = self.get_sst_filing_deadline(today)
        days_remaining = (filing_deadline - today).days
        self.sst_filing_deadline.set(days_remaining)

        # Alert if deadline is approaching
        if days_remaining <= 7:
            self.trigger_sst_deadline_alert(days_remaining)
```
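The monitor above calls `get_sst_filing_deadline()` without defining it. A hedged sketch follows: it assumes bi-monthly taxable periods ending on even calendar months, with the SST-02 return due by the last day of the following month. Actual taxable periods depend on your registration, so verify against current Customs (RMCD) guidance before using this.

```python
import calendar
from datetime import date

def get_sst_filing_deadline(today: date) -> date:
    # End of the current bi-monthly taxable period (Feb, Apr, Jun, ...).
    period_end_month = today.month + (today.month % 2)
    year = today.year
    # Assume the return is due by the last day of the month after the
    # taxable period ends; roll over into January when needed.
    due_month = period_end_month + 1
    if due_month > 12:
        due_month -= 12
        year += 1
    last_day = calendar.monthrange(year, due_month)[1]
    return date(year, due_month, last_day)
```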
### 2. Malaysian Business Hours Monitoring

```python
# Malaysian business hours monitoring
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

from prometheus_client import Gauge

MYT = ZoneInfo("Asia/Kuala_Lumpur")

class BusinessHoursMonitor:
    def __init__(self):
        self.business_hour_activity = Gauge('business_hour_activity', 'Activity during business hours')
        self.off_hour_activity = Gauge('off_hour_activity', 'Activity outside business hours')

    def monitor_activity(self):
        # Malaysian business hours: 9 AM - 6 PM, Monday - Friday.
        # Evaluate in Malaysian time regardless of the server's timezone.
        now = datetime.now(MYT)
        is_business_hour = (
            now.weekday() < 5 and  # Monday - Friday
            9 <= now.hour < 18     # 9 AM - 6 PM
        )

        if is_business_hour:
            self.business_hour_activity.inc()
        else:
            self.off_hour_activity.inc()
```
### 3. Malaysian Payment Gateway Monitoring

```python
# Payment gateway monitoring
from prometheus_client import Counter, Gauge, Histogram

class PaymentGatewayMonitor:
    def __init__(self):
        # Declare the 'gateway' label so .labels(gateway=...) works below.
        self.payment_success_rate = Gauge('payment_success_rate', 'Payment success rate', ['gateway'])
        self.gateway_response_time = Histogram('gateway_response_time', 'Payment gateway response time', ['gateway'])
        self.gateway_downtime = Counter('gateway_downtime', 'Payment gateway downtime', ['gateway'])

    def monitor_gateways(self):
        gateways = ['touch_n_go', 'grabpay', 'online_banking']

        for gateway in gateways:
            success_rate = self.calculate_success_rate(gateway)
            self.payment_success_rate.labels(gateway=gateway).set(success_rate)

            # Monitor response times
            response_time = self.measure_response_time(gateway)
            self.gateway_response_time.labels(gateway=gateway).observe(response_time)

            # Check for downtime
            if not self.is_gateway_available(gateway):
                self.gateway_downtime.labels(gateway=gateway).inc()
```
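A sketch of the availability probe referenced above. The health-check URLs are placeholders; real gateways publish their own status endpoints or SDKs, so substitute the correct ones for your integrations (assumes the `requests` package):

```python
import requests

# Placeholder URLs - replace with each gateway's real status endpoint.
GATEWAY_HEALTH_URLS = {
    'touch_n_go': 'https://gateway.example.com/tng/health',
    'grabpay': 'https://gateway.example.com/grabpay/health',
    'online_banking': 'https://gateway.example.com/fpx/health',
}

def is_gateway_available(gateway: str, timeout: float = 5.0) -> bool:
    url = GATEWAY_HEALTH_URLS.get(gateway)
    if url is None:
        return False
    try:
        # Treat any non-200 response or network error as unavailable.
        return requests.get(url, timeout=timeout).status_code == 200
    except requests.RequestException:
        return False
```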
## Maintenance Procedures

### 1. Daily Maintenance

```bash
#!/bin/bash
# daily_maintenance.sh

# Log maintenance
echo "$(date): Starting daily maintenance" >> /var/log/maintenance.log

# Rotate logs
logrotate -f /etc/logrotate.d/multi-tenant-saas

# Clear old logs
find /var/log/multi-tenant-saas -name "*.log.*" -mtime +30 -delete

# Monitor disk space
df -h | awk '$5+0 > 85 {print $6 " is " $5 " full"}' >> /var/log/maintenance.log

# Check service health
systemctl is-active --quiet gunicorn || echo "Gunicorn service is down" >> /var/log/maintenance.log
systemctl is-active --quiet nginx || echo "Nginx service is down" >> /var/log/maintenance.log

# Check database connections
psql -U multi_tenant_prod_user -d multi_tenant_saas_prod -c "SELECT count(*) FROM pg_stat_activity;" >> /var/log/maintenance.log

# Clear the cache database only (DB 1, matching the Django CACHES setting).
# A bare FLUSHDB on the default DB would also wipe sessions and log out users.
redis-cli -n 1 FLUSHDB >> /var/log/maintenance.log

echo "$(date): Daily maintenance completed" >> /var/log/maintenance.log
```
### 2. Weekly Maintenance

```bash
#!/bin/bash
# weekly_maintenance.sh

# Database maintenance
echo "$(date): Starting weekly database maintenance" >> /var/log/maintenance.log

# Vacuum and analyze (VACUUM ANALYZE also refreshes planner statistics,
# so a separate ANALYZE pass is not needed)
psql -U multi_tenant_prod_user -d multi_tenant_saas_prod -c "VACUUM ANALYZE;" >> /var/log/maintenance.log

# Check table sizes
psql -U multi_tenant_prod_user -d multi_tenant_saas_prod -c "
SELECT
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
" >> /var/log/maintenance.log

# Index maintenance. REINDEX takes heavy locks; run it in a maintenance
# window, or consider REINDEX ... CONCURRENTLY on PostgreSQL 12+.
psql -U multi_tenant_prod_user -d multi_tenant_saas_prod -c "REINDEX DATABASE multi_tenant_saas_prod;" >> /var/log/maintenance.log

echo "$(date): Weekly database maintenance completed" >> /var/log/maintenance.log
```
### 3. Monthly Maintenance

```bash
#!/bin/bash
# monthly_maintenance.sh

# Security updates
echo "$(date): Starting monthly security updates" >> /var/log/maintenance.log

# Update system packages
apt-get update && apt-get upgrade -y >> /var/log/maintenance.log

# Update Python packages (review the pinned versions in requirements.txt
# before upgrading anything in production)
source /opt/multi-tenant-saas/venv/bin/activate
pip list --outdated >> /var/log/maintenance.log
pip install --upgrade -r /opt/multi-tenant-saas/requirements.txt >> /var/log/maintenance.log

# Update Node packages
cd /opt/multi-tenant-saas/frontend
npm update >> /var/log/maintenance.log

# Full database backup
/opt/multi-tenant-saas/scripts/backup-database.sh >> /var/log/maintenance.log

# SSL certificate check
openssl x509 -in /etc/letsencrypt/live/your-domain.com/fullchain.pem -text -noout | grep "Not After" >> /var/log/maintenance.log

# Performance review: check slow queries
# (on PostgreSQL 13+ the columns are mean_exec_time / total_exec_time)
psql -U multi_tenant_prod_user -d multi_tenant_saas_prod -c "
SELECT query, mean_time, calls
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;
" >> /var/log/maintenance.log

echo "$(date): Monthly maintenance completed" >> /var/log/maintenance.log
```
## Automated Scheduling

### 1. Cron Jobs

```bash
# Add to crontab
# Daily maintenance at 2 AM
0 2 * * * /opt/multi-tenant-saas/scripts/daily_maintenance.sh

# Weekly maintenance on Sunday at 3 AM
0 3 * * 0 /opt/multi-tenant-saas/scripts/weekly_maintenance.sh

# Monthly maintenance on 1st of month at 4 AM
0 4 1 * * /opt/multi-tenant-saas/scripts/monthly_maintenance.sh

# Database backup daily at 1 AM
0 1 * * * /opt/multi-tenant-saas/scripts/backup-database.sh

# Log rotation daily at midnight
0 0 * * * /usr/sbin/logrotate -f /etc/logrotate.d/multi-tenant-saas

# SSL certificate renewal check weekly
0 0 * * 0 /opt/multi-tenant-saas/scripts/check-ssl.sh
```
### 2. Systemd Timers

```bash
# Create systemd timer for daily maintenance
cat > /etc/systemd/system/daily-maintenance.timer << 'EOF'
[Unit]
Description=Daily maintenance tasks
Requires=daily-maintenance.service

[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true

[Install]
WantedBy=timers.target
EOF

# Create systemd service
cat > /etc/systemd/system/daily-maintenance.service << 'EOF'
[Unit]
Description=Daily maintenance tasks

[Service]
Type=oneshot
ExecStart=/opt/multi-tenant-saas/scripts/daily_maintenance.sh
User=root
Group=root
EOF

# Enable timer
systemctl enable daily-maintenance.timer
systemctl start daily-maintenance.timer
```
## Disaster Recovery

### 1. Backup Verification

```bash
#!/bin/bash
# verify_backups.sh

BACKUP_DIR="/opt/multi-tenant-saas/backups"
LOG_FILE="/var/log/backup-verification.log"

echo "$(date): Starting backup verification" >> $LOG_FILE

# Check if backups exist
if [ ! -d "$BACKUP_DIR" ]; then
    echo "Backup directory does not exist" >> $LOG_FILE
    exit 1
fi

# Check latest backup
LATEST_BACKUP=$(ls -t $BACKUP_DIR/database_backup_*.sql.gz 2>/dev/null | head -1)
if [ -z "$LATEST_BACKUP" ]; then
    echo "No database backup found" >> $LOG_FILE
    exit 1
fi

# Verify backup integrity
if gzip -t "$LATEST_BACKUP"; then
    echo "Backup integrity verified: $LATEST_BACKUP" >> $LOG_FILE
else
    echo "Backup integrity check failed: $LATEST_BACKUP" >> $LOG_FILE
    exit 1
fi

# Check backup size
BACKUP_SIZE=$(du -h "$LATEST_BACKUP" | cut -f1)
echo "Backup size: $BACKUP_SIZE" >> $LOG_FILE

# Test restore (create test database)
TEST_DB="backup_test_$(date +%Y%m%d)"
createdb -U multi_tenant_prod_user "$TEST_DB"
gunzip -c "$LATEST_BACKUP" | psql -U multi_tenant_prod_user "$TEST_DB"

# Verify data
TABLE_COUNT=$(psql -U multi_tenant_prod_user -d "$TEST_DB" -t -c "SELECT count(*) FROM information_schema.tables WHERE table_schema = 'public';")
echo "Table count in backup: $TABLE_COUNT" >> $LOG_FILE

# Clean up test database
dropdb -U multi_tenant_prod_user "$TEST_DB"

echo "$(date): Backup verification completed successfully" >> $LOG_FILE
```
### 2. Failover Procedures

```bash
#!/bin/bash
# failover_procedures.sh

PRIMARY_SERVER="primary.your-domain.com"
STANDBY_SERVER="standby.your-domain.com"

# Check primary server health
if ! curl -f http://$PRIMARY_SERVER/health/ > /dev/null 2>&1; then
    echo "$(date): Primary server is down, initiating failover" >> /var/log/failover.log

    # Promote the standby. Adjust the data directory and command to your
    # installation; on Debian/Ubuntu: pg_ctlcluster <version> main promote
    ssh $STANDBY_SERVER "sudo -u postgres pg_ctl promote -D /var/lib/postgresql/data"

    # Update DNS
    # This would integrate with your DNS provider API
    curl -X POST "https://api.dns-provider.com/update" \
        -H "Authorization: Bearer $DNS_API_KEY" \
        -d '{"record":"your-domain.com","value":"'$STANDBY_SERVER'"}'

    # Notify administrators
    echo "Failover completed. Standby server is now primary." | mail -s "Failover Completed" admin@your-domain.com

    echo "$(date): Failover completed" >> /var/log/failover.log
fi
```
## Performance Optimization

### 1. Database Optimization

```sql
-- Create performance monitoring views
CREATE OR REPLACE VIEW monitoring.performance_metrics AS
SELECT
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size,
    -- pg_stat_get_* functions take a relation OID, hence the regclass casts
    pg_stat_get_numscans((quote_ident(schemaname)||'.'||quote_ident(tablename))::regclass) as scans,
    pg_stat_get_tuples_returned((quote_ident(schemaname)||'.'||quote_ident(tablename))::regclass) as tuples_returned,
    pg_stat_get_tuples_fetched((quote_ident(schemaname)||'.'||quote_ident(tablename))::regclass) as tuples_fetched
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
```
### 2. Application Optimization

```python
# Add to Django settings
CACHES = {
    'default': {
        # django-redis backend (the CLIENT_CLASS option below belongs to
        # django-redis, not to Django's built-in RedisCache)
        'BACKEND': 'django_redis.cache.RedisCache',
        'LOCATION': 'redis://localhost:6379/1',
        'TIMEOUT': 300,
        'OPTIONS': {
            'CLIENT_CLASS': 'django_redis.client.DefaultClient',
        }
    }
}

# Database connection pooling
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'multi_tenant_saas_prod',
        'USER': 'multi_tenant_prod_user',
        'PASSWORD': 'your-password',
        'HOST': 'localhost',
        'PORT': '5432',
        'CONN_MAX_AGE': 60,  # persistent connections (not a full pooler)
        'OPTIONS': {
            'connect_timeout': 10,
            'options': '-c statement_timeout=30000',
        }
    }
}
```
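With the cache configured, hot per-tenant lookups can be wrapped in `cache.get_or_set`, which computes and stores the value only on a miss. A sketch where the key layout and the `load_tenant_settings_from_db` loader are illustrative:

```python
from django.core.cache import cache

def get_tenant_settings(tenant_id: int) -> dict:
    # Computed once per 300 seconds per tenant; served from Redis otherwise.
    return cache.get_or_set(
        f"tenant:{tenant_id}:settings",
        lambda: load_tenant_settings_from_db(tenant_id),  # hypothetical loader
        timeout=300,  # seconds
    )
```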
## Security Monitoring

### 1. Intrusion Detection

```bash
# Install fail2ban
apt-get install fail2ban

# Configure fail2ban for SSH
cat > /etc/fail2ban/jail.local << 'EOF'
[sshd]
enabled = true
port = ssh
filter = sshd
logpath = /var/log/auth.log
maxretry = 3
bantime = 3600
findtime = 600

[nginx-http-auth]
enabled = true
port = http,https
filter = nginx-http-auth
logpath = /var/log/nginx/error.log
maxretry = 5
bantime = 3600
findtime = 600
EOF

# Restart fail2ban
systemctl restart fail2ban
```
### 2. File Integrity Monitoring

```bash
# Install AIDE
apt-get install aide

# Initialize AIDE
aideinit

# Configure daily checks
cat > /etc/cron.daily/aide << 'EOF'
#!/bin/sh
/usr/bin/aide --check
EOF

chmod +x /etc/cron.daily/aide
```
## Malaysian Compliance Monitoring

### 1. PDPA Compliance Monitoring

```python
# PDPA compliance monitor
from datetime import datetime, timedelta

from django.contrib.auth import get_user_model
from prometheus_client import Counter, Gauge

User = get_user_model()

class PDPAComplianceMonitor:
    def __init__(self):
        self.data_retention_compliance = Gauge('pdpa_data_retention_compliance', 'PDPA data retention compliance')
        self.consent_management = Gauge('pdpa_consent_management', 'PDPA consent management compliance')
        self.data_breach_incidents = Counter('pdpa_data_breach_incidents', 'PDPA data breach incidents')

    def check_compliance(self):
        # Check data retention policies
        retention_compliance = self.check_data_retention()
        self.data_retention_compliance.set(1 if retention_compliance else 0)

        # Check consent management
        consent_compliance = self.check_consent_management()
        self.consent_management.set(1 if consent_compliance else 0)

        # Monitor for data breaches
        breach_detected = self.detect_data_breaches()
        if breach_detected:
            self.data_breach_incidents.inc()
            self.trigger_breach_alert()

    def check_data_retention(self):
        # Flag personal data retained beyond the policy period. PDPA requires
        # that data not be kept longer than necessary; confirm the retention
        # period for your records with legal counsel.
        cutoff_date = datetime.now() - timedelta(days=7*365)  # 7 years

        # Count inactive accounts older than the retention period
        old_records = User.objects.filter(
            date_joined__lt=cutoff_date,
            is_active=False
        ).count()

        return old_records == 0
```
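The monitor also calls `check_consent_management()`, which is not defined above. A hedged sketch follows; the `ConsentRecord` model is hypothetical and should be mapped onto your own schema:

```python
    def check_consent_management(self):
        # Compliant when every active user has a current (non-withdrawn)
        # consent record. ConsentRecord is a hypothetical model.
        active_users = User.objects.filter(is_active=True).count()
        consented_users = (
            ConsentRecord.objects.filter(withdrawn_at__isnull=True)
            .values('user_id').distinct().count()
        )
        return active_users == 0 or consented_users >= active_users
```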
## Conclusion

This monitoring and maintenance guide helps ensure your Multi-Tenant SaaS Platform remains reliable, performant, and compliant with Malaysian regulations. Regular monitoring, proactive maintenance, and automated alerts will help you maintain high service quality and quickly address any issues that arise.

Remember to:
- Monitor all system components regularly
- Set up appropriate alerts for critical issues
- Perform regular maintenance tasks
- Keep systems updated and secure
- Maintain compliance with Malaysian regulations
- Document all procedures and incidents

For additional support, refer to the main documentation or contact the support team.