Active Graph KG - Production Deployment Checklist¶

Current Status: Phase 1+ Complete (90% Production Ready) Target: 100% Production Ready Last Updated: 2025-11-24

Pre-Deployment Verification¶

✅ Phase 1+ Features Verified¶

Run automated verification:

./verify_phase1_plus.sh

Expected output: ✅ ALL CHECKS PASSED - Phase 1+ Complete! (34/34 checks)

Run comprehensive tests:

# Phase 1 MVP tests
python tests/test_phase1_complete.py

# Phase 1+ improvement tests
python tests/test_phase1_plus.py

# E2E smoke test (requires API running)
uvicorn activekg.api.main:app --reload &
sleep 5
python scripts/smoke_test.py

Database Setup (Production)¶

1. PostgreSQL + pgvector Installation¶

Option A: Docker (Recommended for testing)

docker run -d \
  --name activekg-postgres \
  --restart unless-stopped \
  -e POSTGRES_USER=activekg \
  -e POSTGRES_PASSWORD=$(openssl rand -base64 32) \
  -e POSTGRES_DB=activekg \
  -v activekg_data:/var/lib/postgresql/data \
  -p 5432:5432 \
  ankane/pgvector:pg16

Option B: Native Installation (Recommended for production)

# Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y postgresql-16 postgresql-16-pgvector

# macOS (Homebrew)
brew install postgresql@16 pgvector

# Start PostgreSQL
sudo systemctl start postgresql
sudo systemctl enable postgresql

2. Database Initialization¶

# Set connection string
export ACTIVEKG_DSN='postgresql://activekg:YOUR_SECURE_PASSWORD@localhost:5432/activekg'

# Create database and enable pgvector
psql -U postgres -c "CREATE DATABASE activekg;"
psql -U postgres -d activekg -c "CREATE EXTENSION IF NOT EXISTS vector;"

# Create schema
psql $ACTIVEKG_DSN -f db/init.sql

# Enable Row-Level Security (REQUIRED for multi-tenancy)
psql $ACTIVEKG_DSN -f enable_rls_policies.sql

# Create vector index for performance
psql $ACTIVEKG_DSN -f enable_vector_index.sql

3. Database Tuning (Production)¶

Add to postgresql.conf:

# Memory settings (adjust based on available RAM)
shared_buffers = 4GB
effective_cache_size = 12GB
work_mem = 64MB
maintenance_work_mem = 1GB

# Connection settings
max_connections = 200

# Write-ahead log
wal_buffers = 16MB
checkpoint_completion_target = 0.9

# Query planner
random_page_cost = 1.1  # SSD
effective_io_concurrency = 200

# Parallel query
max_parallel_workers_per_gather = 4
max_parallel_workers = 8

Restart PostgreSQL:

sudo systemctl restart postgresql

4. Verify Database¶

# Check pgvector extension
psql $ACTIVEKG_DSN -c "SELECT * FROM pg_extension WHERE extname = 'vector';"

# Check tables exist
psql $ACTIVEKG_DSN -c "\dt"

# Check RLS is enabled
psql $ACTIVEKG_DSN -c "SELECT tablename, rowsecurity FROM pg_tables WHERE schemaname = 'public';"

# Check indexes
psql $ACTIVEKG_DSN -c "\di"

Application Setup¶

1. Create Production virtualenv¶

python3.11 -m venv /opt/activekg/venv
source /opt/activekg/venv/bin/activate

pip install --upgrade pip setuptools wheel
pip install -r requirements.txt

2. Environment Configuration¶

Create /opt/activekg/.env:

# Database
ACTIVEKG_DSN='postgresql://activekg:YOUR_SECURE_PASSWORD@localhost:5432/activekg'

# Embeddings
EMBEDDING_BACKEND='sentence-transformers'
EMBEDDING_MODEL='all-MiniLM-L6-v2'

# API
ACTIVEKG_VERSION='1.0.0'
WORKERS=4

# Security (ADD THESE FOR 100% PRODUCTION READY)
# JWT_SECRET_KEY='your-256-bit-secret-key'
# JWT_ALGORITHM='HS256'
# JWT_EXPIRATION_HOURS=24

# Rate Limiting
# RATE_LIMIT_PER_MINUTE=60
# RATE_LIMIT_PER_HOUR=1000

3. Systemd Service (Linux)¶

Create /etc/systemd/system/activekg.service:

[Unit]
Description=Active Graph KG API Server
After=network.target postgresql.service

[Service]
Type=notify
User=activekg
Group=activekg
WorkingDirectory=/opt/activekg
EnvironmentFile=/opt/activekg/.env
ExecStart=/opt/activekg/venv/bin/uvicorn activekg.api.main:app \
    --host 0.0.0.0 \
    --port 8000 \
    --workers 4 \
    --log-config logging.conf
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable activekg
sudo systemctl start activekg
sudo systemctl status activekg

Security Hardening (Critical for Production)¶

1. Add JWT Authentication (Required for 100%)¶

Create activekg/middleware/auth.py:

from fastapi import HTTPException, Security
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
import jwt
import os

security = HTTPBearer()

def verify_jwt(credentials: HTTPAuthorizationCredentials = Security(security)):
    """Verify JWT token and extract tenant_id."""
    try:
        token = credentials.credentials
        payload = jwt.decode(
            token,
            os.getenv('JWT_SECRET_KEY'),
            algorithms=[os.getenv('JWT_ALGORITHM', 'HS256')]
        )
        return payload  # Should include 'tenant_id', 'user_id', 'scopes'
    except jwt.ExpiredSignatureError:
        raise HTTPException(status_code=401, detail="Token expired")
    except jwt.InvalidTokenError:
        raise HTTPException(status_code=401, detail="Invalid token")

Update activekg/api/main.py:

from activekg.middleware.auth import verify_jwt
from fastapi import Depends

@app.post("/search")
def search_nodes(request: KGSearchRequest, auth: dict = Depends(verify_jwt)):
    # Extract tenant_id from JWT claims
    tenant_id = auth.get('tenant_id')

    # Perform search with tenant isolation
    results = repo.vector_search(
        query_embedding=query_vec,
        tenant_id=tenant_id,  # Use JWT tenant_id
        ...
    )

2. Add Rate Limiting (Required for 100%)¶

Install slowapi:

pip install slowapi

Update activekg/api/main.py:

from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/search")
@limiter.limit("60/minute")  # 60 requests per minute
def search_nodes(request: Request, ...):
    ...

3. Payload Size Limits¶

Update activekg/graph/repository.py:

def _load_from_file(self, file_path: str) -> str:
    """Load text from local file with size limit."""
    MAX_PAYLOAD_SIZE = 10 * 1024 * 1024  # 10MB

    file_size = os.path.getsize(file_path)
    if file_size > MAX_PAYLOAD_SIZE:
        self.logger.warning(f"Payload too large: {file_size} bytes")
        return ''

    with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
        return f.read(MAX_PAYLOAD_SIZE)

4. HTTPS/TLS (Required for production)¶

Option A: Nginx Reverse Proxy

server {
    listen 443 ssl http2;
    server_name activekg.example.com;

    ssl_certificate /etc/letsencrypt/live/activekg.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/activekg.example.com/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}

Option B: Railway/Render (TLS automatic)

# Deploy to Railway
railway login
railway init
railway up

Monitoring Setup¶

1. Prometheus Configuration¶

Create prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'activekg'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: '/prometheus'

  - job_name: 'postgres'
    static_configs:
      - targets: ['localhost:9187']  # postgres_exporter

Start Prometheus:

docker run -d \
  --name prometheus \
  --restart unless-stopped \
  -p 9090:9090 \
  -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus

2. Grafana Dashboards¶

Start Grafana:

docker run -d \
  --name grafana \
  --restart unless-stopped \
  -p 3000:3000 \
  -e GF_SECURITY_ADMIN_PASSWORD=admin \
  grafana/grafana

Add Prometheus data source: 1. Visit http://localhost:3000 2. Configuration → Data Sources → Add Prometheus 3. URL: http://localhost:9090 4. Save & Test

Create dashboard with key metrics: - rate(activekg_refresh_cycles_total[5m]) - Refresh rate - activekg_search_latency{quantile="0.95"} - p95 search latency - rate(activekg_trigger_fired_total[1h]) - Trigger fire rate - activekg_active_nodes - Active node count

3. Alerts Configuration¶

Create alerts.yml:

href="#__codelineno-22-1">groups: - name: activekg_alerts rules: - alert: HighDriftRate expr: rate(activekg_refresh_cycles_total[5m]) > 10 for: 5m labels: severity: warning annotations: summary: "High drift rate detected" description: "Drift rate is {{ $value }} per second" - alert: TriggerStorm expr: rate(activekg_trigger_fired_total[1m]) > 100 for: 2m labels: severity: critical annotations: summary: "Trigger storm detected" description: "Trigger fire rate is {{ $value }} per second" - alert: SlowSearchQueries expr: activekg_search_latency{quantile="0.95"} > 1.0 for: 5m labels: severity: warning annotations: summary: "Search queries are slow" description: "p95 latency is {{ $value }}s" - alert: APIDown expr: up{job="activekg"} == 0 for: 1m labels: severity: critical annotations: summary: "Active Graph KG API is down"

4. Logging¶

Configure structured logging in logging.conf:

[loggers]
keys=root,activekg

[handlers]
keys=console,file

[formatters]
keys=json

[logger_root]
level=INFO
handlers=console

[logger_activekg]
level=INFO
handlers=console,file
qualname=activekg
propagate=0

[handler_console]
class=StreamHandler
level=INFO
formatter=json
args=(sys.stdout,)

[handler_file]
class=handlers.RotatingFileHandler
level=INFO
formatter=json
args=('/var/log/activekg/app.log', 'a', 100*1024*1024, 5)

[formatter_json]
format={"time": "%(asctime)s", "level": "%(levelname)s", "logger": "%(name)s", "message": "%(message)s"}

Backup & Disaster Recovery¶

1. Database Backups¶

Create /opt/activekg/backup.sh:

#!/bin/bash
BACKUP_DIR="/var/backups/activekg"
DATE=$(date +%Y%m%d_%H%M%S)

mkdir -p $BACKUP_DIR

# Full database dump
pg_dump $ACTIVEKG_DSN | gzip > $BACKUP_DIR/activekg_$DATE.sql.gz

# Keep only last 7 days
find $BACKUP_DIR -name "activekg_*.sql.gz" -mtime +7 -delete

echo "Backup completed: activekg_$DATE.sql.gz"

Add cron job:

crontab -e
# Daily backup at 2 AM
0 2 * * * /opt/activekg/backup.sh >> /var/log/activekg/backup.log 2>&1

2. Point-in-Time Recovery¶

Enable WAL archiving in postgresql.conf:

wal_level = replica
archive_mode = on
archive_command = 'cp %p /var/lib/postgresql/wal_archive/%f'

3. Restore Procedure¶

# Stop API
sudo systemctl stop activekg

# Drop and recreate database
psql -U postgres -c "DROP DATABASE activekg;"
psql -U postgres -c "CREATE DATABASE activekg;"

# Restore from backup
gunzip -c /var/backups/activekg/activekg_20251104_020000.sql.gz | psql $ACTIVEKG_DSN

# Start API
sudo systemctl start activekg

Partitioning Strategies (Optional, for large datasets)¶

For multi-tenant datasets with high volume, consider table partitioning to improve vacuum/ANALYZE efficiency and reduce bloat.

Option A: Partition by tenant_id (LIST)¶

-- Create a partitioned copy of nodes
CREATE TABLE nodes_partitioned (LIKE nodes INCLUDING ALL) PARTITION BY LIST (tenant_id);

-- Example tenant partition
CREATE TABLE nodes_tenant_acme PARTITION OF nodes_partitioned FOR VALUES IN ('acme');

-- Migrate data (example)
INSERT INTO nodes_partitioned SELECT * FROM nodes WHERE tenant_id = 'acme';

-- Indexes per partition (as needed)
CREATE INDEX idx_nodes_tenant_acme_embedding ON nodes_tenant_acme USING ivfflat (embedding vector_cosine_ops);

Option B: Partition by created_at (RANGE)¶

-- Create a time-series partitioned table
CREATE TABLE nodes_timeseries (LIKE nodes INCLUDING ALL) PARTITION BY RANGE (created_at);

-- Monthly partitions
CREATE TABLE nodes_2025_11 PARTITION OF nodes_timeseries FOR VALUES FROM ('2025-11-01') TO ('2025-12-01');
CREATE TABLE nodes_2025_12 PARTITION OF nodes_timeseries FOR VALUES FROM ('2025-12-01') TO ('2026-01-01');

-- Indexes per partition
CREATE INDEX idx_nodes_2025_11_embedding ON nodes_2025_11 USING ivfflat (embedding vector_cosine_ops);

Notes: - Keep ANN indexes aligned with SEARCH_DISTANCE (cosine vs l2). - Adjust autovacuum settings per partition based on write frequency. - Consider moving partitions to different tablespaces to balance IO.

Performance Optimization¶

1. Vector Index Tuning¶

For HNSW index (recommended for >100K nodes):

-- Drop IVFFLAT if it exists
DROP INDEX IF EXISTS idx_nodes_embedding;

-- Create HNSW index
CREATE INDEX idx_nodes_embedding ON nodes
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- Analyze for query planner
ANALYZE nodes;

2. Connection Pooling¶

Install pgbouncer:

sudo apt-get install pgbouncer

Configure /etc/pgbouncer/pgbouncer.ini:

[databases]
activekg = host=localhost dbname=activekg

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25

Update application DSN:

ACTIVEKG_DSN='postgresql://activekg:PASSWORD@localhost:6432/activekg'

Production Deployment Checklist¶

Critical (Must-Have for Production)¶

Recommended (Nice-to-Have)¶

Connection pooling (pgbouncer)
Database tuning (shared_buffers, work_mem)
HNSW index for vector search (>100K nodes)
Point-in-time recovery (WAL archiving)
Load balancing (multiple API instances)
CDN for static assets
Error tracking (Sentry, Rollbar)
APM monitoring (New Relic, DataDog)

Optional (Future Enhancements)¶

Post-Deployment Verification¶

1. Health Checks¶

# API health
curl https://activekg.example.com/health

# Prometheus metrics
curl https://activekg.example.com/prometheus | grep activekg_

# Database connectivity
psql $ACTIVEKG_DSN -c "SELECT count(*) FROM nodes;"

2. Load Testing¶

Install locust:

pip install locust

Create locustfile.py:

from locust import HttpUser, task, between
import random

class ActiveKGUser(HttpUser):
    wait_time = between(1, 5)

    @task(3)
    def search(self):
        self.client.post("/search", json={
            "query": "machine learning",
            "top_k": 10
        })

    @task(1)
    def create_node(self):
        self.client.post("/nodes", json={
            "classes": ["Document"],
            "props": {"text": f"Test document {random.randint(1, 1000)}"}
        })

Run load test:

locust --host=https://activekg.example.com --users 100 --spawn-rate 10

3. Security Audit¶

# Check for exposed secrets
grep -r "password\|secret\|key" /opt/activekg/ --exclude-dir=venv

# Check file permissions
ls -la /opt/activekg/.env  # Should be 600 or 400

# Check open ports
sudo netstat -tuln | grep LISTEN

# Check SSL certificate
openssl s_client -connect activekg.example.com:443 -servername activekg.example.com

Rollback Plan¶

If deployment fails:

Stop new API instances
Restore database from latest backup
Revert code to previous version
Restart old API instances
Verify health checks pass
Review logs for root cause

# Emergency rollback script
#!/bin/bash
set -e

echo "Starting rollback..."

# Stop new service
sudo systemctl stop activekg

# Restore database
LATEST_BACKUP=$(ls -t /var/backups/activekg/*.sql.gz | head -1)
gunzip -c $LATEST_BACKUP | psql $ACTIVEKG_DSN

# Revert code
cd /opt/activekg
git checkout main  # Or previous release tag
source venv/bin/activate
pip install -r requirements.txt

# Restart service
sudo systemctl start activekg

echo "Rollback complete"

Support & Escalation¶

Runbooks¶

High Drift Rate → Check if source data changed → Adjust drift thresholds
Trigger Storm → Check pattern thresholds → Temporarily disable problematic triggers
Slow Searches → Check vector index exists → Rebuild index → Add more workers
API Down → Check logs → Restart service → Check database connectivity

On-Call Contacts¶

Primary: [Name] - [Email] - [Phone]
Secondary: [Name] - [Email] - [Phone]
Database Admin: [Name] - [Email] - [Phone]

Status: Ready for Production Deployment (90% → 100% with JWT + rate limiting)

Next Steps: 1. Implement JWT authentication 2. Configure rate limiting 3. Run full test suite 4. Deploy to staging 5. Load test 6. Deploy to production