Proof Points & Observability Guide¶
Overview¶
This guide covers the Active Graph KG proof points framework and observability stack, including automated validation scripts, metrics collection, and visualization.
Proof Points Report¶
Quick Start¶
Generate a comprehensive proof points report:
export TOKEN='<your-admin-jwt>'
export API=http://localhost:8000
# Basic report (metrics only)
make proof-report
# With live timing tests (creates test nodes)
export RUN_PROOFS=1
make proof-report
# View report
cat evaluation/PROOF_POINTS_REPORT.md
Report Sections¶
The proof points report includes:
- Environment - API URL and database connection
- Health - System status, LLM backend configuration
- Embedding Health - Coverage percentage and staleness metrics
- Search/Ask Activity - Request counts by mode and score type
- Latency Snapshot - p50/p95 search latency percentiles
- ANN Snapshot - Index configuration and top similarity scores
- Embedding Coverage by Class - Per-class breakdown (top 5)
- Scheduler Summary - Last run timestamps for scheduled jobs
- Trigger Effectiveness - Total triggers fired, pattern matching status
- Proof Metrics - Live DX timing and ingestion E2E latency (if RUN_PROOFS=1)
Live Proof Tests¶
When RUN_PROOFS=1 is set, the report executes:
- DX Timing Test (scripts/dx_timing.sh)
  - Creates a test node
  - Refreshes to generate embedding
  - Searches until result appears
  - Reports time-to-searchable metric
- Ingestion Pipeline Test (scripts/ingestion_pipeline.sh)
  - Creates a document
  - Forces refresh
  - Waits for searchability
  - Reports end-to-end latency
Note: Live tests create temporary test nodes. Use with caution in production.
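The create, refresh, and poll-until-found flow described above boils down to a timed polling loop. The following is a minimal sketch of that loop, not the contents of scripts/dx_timing.sh: the name time_to_searchable, the search callable, and the timeout and interval defaults are all assumptions made for illustration.

```python
import time
from typing import Callable, Optional


def time_to_searchable(
    search: Callable[[], bool],
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll `search` until it returns True.

    Returns the elapsed seconds when the node first appears in results,
    or None if `timeout_s` passes without a hit. `clock` and `sleep` are
    injectable so the loop can be exercised without real waiting.
    """
    start = clock()
    while True:
        if search():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)
```

In a real run, search would wrap a search request for the freshly created node and return True once that node is among the results.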
Validation Scripts¶
Core Validation Suite¶
| Script | Purpose | Token Required | Creates Test Data |
|---|---|---|---|
| metrics-probe | Scrape Prometheus metrics | No | No |
| live-smoke | CRUD + search validation | Yes | Yes (temporary) |
| live-extended | Lineage, drift, events | Yes | Yes (temporary) |
| proof-report | Generate markdown report | Yes | Optional |
Proof Point Scripts¶
| Script | Metric | Token Required | Creates Test Data |
|---|---|---|---|
| dx-timing | Time to first searchable answer | Yes | Yes |
| ingestion-pipeline | End-to-end ingestion latency | Yes | Yes |
| scheduler-sla | Scheduler inter-run intervals | Yes | No |
| trigger-effectiveness | Pattern matching and firing | Yes | Yes |
| governance-audit | RLS cross-tenant isolation | Yes (2 tokens) | Yes |
| failure-recovery | Graceful degradation checks | Yes | No |
Running Scripts¶
export TOKEN='<your-admin-jwt>'
export API=http://localhost:8000
# Individual scripts
make dx-timing
make ingestion-pipeline
make scheduler-sla
make trigger-effectiveness
# For governance audit (requires second tenant)
export SECOND_TOKEN='<other-tenant-jwt>'
make governance-audit
# Full validation flow
make live-smoke && make live-extended && make proof-report
Admin Endpoints¶
/_admin/metrics_summary¶
Returns runtime configuration and scheduler/trigger snapshots:
{
  "version": "1.0.0",
  "rate_limit_enabled": false,
  "scheduler_enabled": true,
  "triggers": {
    "last_run": "2025-11-14T10:30:00Z",
    "patterns_registered": 5
  },
  "scheduler": {
    "jobs": {
      "refresh_cycle": "2025-11-14T10:29:45Z",
      "trigger_cycle": "2025-11-14T10:29:50Z"
    }
  }
}
Usage:
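A minimal sketch of calling this endpoint from Python. It assumes the admin JWT is sent as an Authorization: Bearer header and that the endpoint answers GET requests; this guide states neither detail explicitly. The helper names admin_request and fetch_json are hypothetical.

```python
import json
import urllib.request


def admin_request(path: str, api: str, token: str) -> urllib.request.Request:
    """Build a GET request for an admin endpoint.

    Assumption: the API accepts the JWT as a bearer token.
    """
    url = api.rstrip("/") + path
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def fetch_json(req: urllib.request.Request, timeout_s: float = 10.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.load(resp)


# Example (requires a running API; API and TOKEN as exported in Quick Start):
#   import os
#   req = admin_request("/_admin/metrics_summary", os.environ["API"], os.environ["TOKEN"])
#   summary = fetch_json(req)
#   print(summary["scheduler"]["jobs"])
```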
/_admin/embed_class_coverage¶
Returns embedding coverage breakdown by node class (top 50):
{
  "classes": [
    {
      "class": "Document",
      "total": 1000,
      "with_embedding": 950,
      "coverage_pct": 95.0
    },
    {
      "class": "Paper",
      "total": 500,
      "with_embedding": 500,
      "coverage_pct": 100.0
    }
  ],
  "count": 2
}
Usage:
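Once the response is parsed, a common follow-up is to flag classes whose coverage falls below a target. The sketch below works on a payload shaped like the sample above; the function name low_coverage_classes and the 98% default threshold are assumptions, not part of the API.

```python
from typing import Any, Dict, List


def low_coverage_classes(payload: Dict[str, Any], min_pct: float = 98.0) -> List[str]:
    """Return the class names whose coverage_pct is below `min_pct`."""
    return [
        entry["class"]
        for entry in payload["classes"]
        if entry["coverage_pct"] < min_pct
    ]


# Payload copied from the sample response in this section.
sample = {
    "classes": [
        {"class": "Document", "total": 1000, "with_embedding": 950, "coverage_pct": 95.0},
        {"class": "Paper", "total": 500, "with_embedding": 500, "coverage_pct": 100.0},
    ],
    "count": 2,
}
print(low_coverage_classes(sample))
```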
/_admin/embed_info¶
Alias to /debug/embed_info for consistency. Returns embedding statistics:
{
  "counts": {
    "total_nodes": 1500,
    "with_embedding": 1450,
    "without_embedding": 50
  },
  "last_refreshed": {
    "age_seconds": {
      "min": 10.5,
      "max": 600.2,
      "avg": 150.3
    }
  }
}
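The counts in this response are enough to derive an overall coverage ratio. The sketch below assumes, based only on the sample above, that with_embedding and without_embedding always sum to total_nodes; the function name embedding_coverage is hypothetical.

```python
from typing import Any, Dict


def embedding_coverage(info: Dict[str, Any]) -> float:
    """Return with_embedding / total_nodes as a ratio in [0, 1].

    Raises ValueError if the counts are inconsistent (an assumption
    inferred from the sample payload, not a documented guarantee).
    """
    counts = info["counts"]
    total = counts["total_nodes"]
    if counts["with_embedding"] + counts["without_embedding"] != total:
        raise ValueError("embedding counts do not sum to total_nodes")
    return counts["with_embedding"] / total if total else 0.0
```

For the sample payload this gives 1450 / 1500, or roughly 96.7% coverage.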
Grafana Dashboard¶
Import Dashboard¶
1. Copy dashboard JSON:
2. In Grafana UI:
   - Navigate to Dashboards → Import
   - Paste JSON or upload observability/grafana-dashboard.json
   - Select Prometheus datasource
   - Click Import
Dashboard Panels¶
The Active Graph KG - Operations Dashboard includes:
- Search Requests Rate (Bar Gauge)
  - Rate of vector/hybrid/text searches
  - Grouped by mode and score type
- Search Latency p50/p95 (Time Series)
  - 50th and 95th percentile latencies
  - Split by search mode
- Embedding Coverage (Gauge)
  - Percentage of nodes with embeddings
  - Per-tenant view
- Max Embedding Staleness (Gauge)
  - Time since least-recently-refreshed node
  - Thresholds: green (<300s), yellow (<600s), red (>600s)
- Triggers Fired (Time Series Bars)
  - Count of triggers fired per pattern
  - 1-hour rolling window
- Trigger Run Latency p50/p95 (Time Series)
  - Trigger execution time percentiles
  - By trigger mode
- Scheduler Runs (Time Series Bars)
  - Count of scheduled job executions
  - Grouped by job_id and kind
- Scheduler Inter-Run Interval p50/p95 (Time Series)
  - Time between consecutive job runs
  - Per-job view for SLA monitoring
- Node Refresh Latency by Result p50/p95 (Time Series)
  - Embedding refresh time distribution
  - Split by ok/error outcomes
- Ask Requests (Time Series Bars)
  - LLM ask endpoint usage
  - Grouped by rejection status
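The Max Embedding Staleness thresholds can be mirrored in a script, for example to colour a value taken from last_refreshed.age_seconds.max in /_admin/embed_info. This is a sketch under one stated assumption: a value of exactly 600s, which the thresholds above leave undefined, is treated as red. The function name staleness_status is hypothetical.

```python
def staleness_status(seconds: float) -> str:
    """Map a staleness value in seconds to the dashboard colour bands."""
    if seconds < 300:
        return "green"
    if seconds < 600:
        return "yellow"
    # Assumption: exactly 600s counts as red.
    return "red"
```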
Prometheus Queries¶
Key metrics exposed at /prometheus:
- activekg_search_requests_total{mode, score_type}
- activekg_search_latency_seconds_bucket{mode, score_type}
- activekg_embedding_coverage_ratio{tenant_id}
- activekg_embedding_max_staleness_seconds{tenant_id}
- activekg_triggers_fired_total{pattern, mode}
- activekg_trigger_run_latency_seconds_bucket{mode}
- activekg_schedule_runs_total{job_id, kind}
- activekg_schedule_inter_run_seconds_bucket{job_id}
- activekg_node_refresh_latency_seconds_bucket{result}
- activekg_ask_requests_total{score_type, rejected}
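For quick checks without a Prometheus server, the text served at /prometheus can be parsed directly. The sketch below is a deliberately small parser for the Prometheus text exposition format that sums one counter by one label. It is not a complete implementation of the format, and the sample values (including the score_type label values) are invented for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, Iterator, Tuple

# metric_name, optional {labels}, then the sample value
SAMPLE_RE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)")
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text: str) -> Iterator[Tuple[str, Dict[str, str], float]]:
    """Yield (name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        yield name, dict(LABEL_RE.findall(raw_labels or "")), float(value)


def sum_by_label(text: str, metric: str, label: str) -> Dict[str, float]:
    """Sum all samples of `metric`, grouped by the value of `label`."""
    totals: Dict[str, float] = defaultdict(float)
    for name, labels, value in parse_samples(text):
        if name == metric:
            totals[labels.get(label, "")] += value
    return dict(totals)


# Hypothetical scrape output; the numbers and score_type values are made up.
SAMPLE = """\
# TYPE activekg_search_requests_total counter
activekg_search_requests_total{mode="vector",score_type="cosine"} 12
activekg_search_requests_total{mode="hybrid",score_type="rrf"} 3
activekg_search_requests_total{mode="vector",score_type="rrf"} 1
"""
print(sum_by_label(SAMPLE, "activekg_search_requests_total", "mode"))
```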
Custom Alerting Rules¶
Example Prometheus alert rules:
groups:
  - name: activekg_slas
    interval: 30s
    rules:
      - alert: HighSearchLatency
        expr: histogram_quantile(0.95, rate(activekg_search_latency_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Search p95 latency above 500ms"
      - alert: LowEmbeddingCoverage
        expr: activekg_embedding_coverage_ratio < 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Embedding coverage below 80%"
      - alert: SchedulerStalled
        expr: time() - activekg_schedule_last_run_timestamp > 600
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Scheduler has not run in 10+ minutes"
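The HighSearchLatency rule relies on histogram_quantile, which estimates a percentile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below is a simplified re-implementation of that idea for classic histograms, written to show why a p95 estimate depends on the bucket boundaries. It is not Prometheus's own code, and the bucket bounds and counts used with it are hypothetical.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    The bucket with the largest bound is expected to be the +Inf bucket,
    whose count is the total number of observations.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            # A rank that lands in the +Inf bucket is reported as the
            # highest finite bound, since there is nothing to interpolate to.
            if math.isinf(bound) or count == prev_count:
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical latency buckets in seconds: (le, cumulative count).
example = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]
p95 = histogram_quantile(0.95, example)
```

Because the estimate can never be finer than the bucket layout, an alert threshold such as 0.5 is most meaningful when it coincides with a bucket boundary.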
CI/CD Integration¶
GitHub Actions Workflows¶
Nightly Proof Report¶
Located at .github/workflows/nightly-proof.yml:
- Schedule: Daily at 3 AM UTC
- Trigger: Manual dispatch also available
- Steps:
  - Spin up pgvector PostgreSQL service
  - Start API with LLM disabled
  - Run metrics-probe
  - Run proof-report
  - Upload report artifact
Setup:
1. Add E2E_ADMIN_TOKEN secret to GitHub repo
2. Workflow will auto-run nightly
3. Download artifacts from Actions tab
Live Validation (Manual)¶
Located at .github/workflows/live-validation.yml:
- Trigger: Manual dispatch with optional inputs
- Inputs:
  - run_smoke: Execute live_smoke.sh
  - run_extended: Execute live_extended.sh
- Steps: Same as nightly, plus optional smoke/extended tests
Usage:
1. Go to Actions → Live Validation (Manual)
2. Click Run workflow
3. Select desired test suites
4. Download proof report artifact
Fetching Artifacts¶
From GitHub UI:
1. Navigate to Actions tab
2. Select workflow run
3. Download proof-points-report or nightly-proof-points-report artifact
From CLI (using gh tool):
# List recent workflow runs
gh run list --workflow=nightly-proof.yml
# Download latest artifact
gh run download --name nightly-proof-points-report
# View report
cat PROOF_POINTS_REPORT.md
Production Recommendations¶
Metrics Retention¶
Configure Prometheus retention based on usage:
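# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
# Retention (adjust based on disk space). Prometheus also exposes these settings
# as the --storage.tsdb.retention.time and --storage.tsdb.retention.size flags.
storage:
  tsdb:
    retention.time: 30d
    retention.size: 50GB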
Dashboard Variables¶
Add tenant/environment template variables for multi-tenant deployments:
{
  "templating": {
    "list": [
      {
        "name": "tenant_id",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(activekg_embedding_coverage_ratio, tenant_id)"
      }
    ]
  }
}
Proof Cadence¶
Recommended validation frequency:
- Nightly: Automated proof report (via GitHub Actions)
- Weekly: Manual live-extended run with RUN_PROOFS=1
- Pre-release: Full suite including governance-audit and failure-recovery
- On-demand: After infrastructure changes or schema migrations
Security Considerations¶
- JWT Tokens: Store as GitHub secrets, rotate regularly
- Test Data: Live proof tests create nodes; ensure cleanup in scripts
- Rate Limits: Proof scripts can generate burst traffic; disable rate limiting for test tokens
- Multi-Tenancy: Use SECOND_TOKEN for governance audit only in isolated test environments
Troubleshooting¶
Common Issues¶
"Missing Authorization header"
- Ensure TOKEN is exported and valid
- Check JWT expiration (exp claim)
- Verify JWT_ENABLED=true in API config
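To check the exp claim without extra tooling, the JWT payload can be decoded locally. The sketch below reads the claim without verifying the signature, so it is only suitable for debugging an expired token, never for authentication decisions. The helper names jwt_exp and is_expired are hypothetical.

```python
import base64
import json
import time
from typing import Optional


def jwt_exp(token: str) -> Optional[float]:
    """Return the `exp` claim (Unix seconds) from a JWT, or None if absent.

    The signature is NOT verified; this only inspects the payload.
    """
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated parts")
    payload_b64 = parts[1]
    # base64url payloads omit padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload.get("exp")


def is_expired(token: str, now: Optional[float] = None) -> bool:
    """True if the token has an `exp` claim at or before `now`."""
    exp = jwt_exp(token)
    if exp is None:
        return False
    current = time.time() if now is None else now
    return current >= exp
```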
"Proof report shows zero metrics"
- Run make live-smoke first to populate metrics
- Check Prometheus endpoint: curl http://localhost:8000/prometheus
- Verify scheduler is running if checking schedule_runs_total
"RUN_PROOFS=1 doesn't execute tests"
- Export variable: export RUN_PROOFS=1
- Don't use RUN_PROOFS=1 make proof-report (shell expansion issue)
- Check script permissions: chmod +x scripts/*.sh
"Grafana dashboard shows no data"
- Verify Prometheus datasource is configured
- Check time range (default: last 1 hour)
- Ensure metrics are being scraped: /prometheus endpoint accessible
Next Steps¶
- Extend Metrics: Add connector-specific metrics (GCS, S3, Drive) using same pattern
- Custom Proofs: Create domain-specific validation scripts in scripts/
- Alerting: Set up Prometheus Alertmanager for proactive monitoring
- Tracing: Integrate OpenTelemetry for distributed tracing
- Chaos Testing: Add /_admin/simulate_failure endpoint (opt-in, guarded)