Commandant

AI Test Automation as a Service (AITaaS) — a QA Engineering Framework

Project Overview

Objective: Build a QA engineering framework that treats large-language-model capability as a first-class test automation primitive — not a novelty — producing measurable coverage gains, actionable PR review, and repeatable quality assessment against real codebases.

Three Enterprise-Grade AI Testing Patterns:

  1. AI Scenario Generation — Claude generates Playwright end-to-end test scenario matrices from user story descriptions, systematically expanding edge-case coverage beyond what manual test design typically produces.
  2. AI Code Review in CI/CD — A Gitea Actions workflow step analyzes pull requests against defined QA standards and posts structured, actionable feedback as PR comments.
  3. Quality Engineering Assessment — Deterministic validators combined with Claude API scoring assess codebases across eight quality dimensions (Architecture, Code Organization, Testing, Security, Documentation, DevOps/CI, Production Readiness, and Overall Quality), producing HTML reports with JSON sidecars and trend tracking.

Platform Basis: Built on pytest + Playwright (Python) with the Anthropic SDK, designed to grow from a single application target into a multi-application AITaaS platform where each target gets its own test module, marker, and CI service step without framework changes.

3 AI Testing Patterns   |   8 Quality Dimensions   |   Python 3.13 Runtime   |   Claude API AI Layer

Framework Architecture

Gitea Actions CI/CD Triggers

  • ci.yml (lint + static)
  • e2e.yml (Playwright suite)
  • ai_review.yml (PR review)

Deterministic Validators (always run first)

  • validators.py: syntax, selectors, markers
  • quality_validators.py: structure, security, credential scans, CI checks

AI-Augmented Layer: Anthropic SDK (claude_client.py), runs only if validators pass

  • Scenario Generation: generate_scenarios.py (story → Playwright tests)
  • PR Review: review_pr.py (diff → Gitea comment)
  • Quality Assessment: assess_quality.py (8-dim scoring + HTML)

Outputs

  • Generated test files (ai_generated/)
  • PR comments via Gitea API
  • HTML + JSON report sidecars

💡 Design Principle: Rule-Based Before AI

Every AI-powered step is preceded by deterministic validation that runs fast, is free, and produces no false negatives on well-formed input. AI augments coverage — it does not replace structural correctness. This keeps AI costs bounded, CI fast when changes are clean, and diagnosis obvious when something fails: if a deterministic validator rejects a diff, the PR author sees a precise, reproducible reason without any model in the loop.

Layer 1

Deterministic Validators — syntax, selectors, credentials, markers

Layer 2

Quality Validators — codebase structure, security, CI hygiene

Layer 3

AI-Augmented — scenario generation, PR review, quality scoring

Layer 4

CI Enforcement — Gitea Actions workflows with quality gates
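
The layering above can be sketched as a simple gate. This is a minimal illustration under stated assumptions, not the framework's actual code: `ValidationResult`, `validate_syntax`, and `run_pipeline` are hypothetical names, and the AI step is a stub callable standing in for a Claude call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ValidationResult:
    validator: str
    passed: bool
    reason: str = ""


def validate_syntax(source: str) -> ValidationResult:
    """Deterministic check: does the source compile as Python?"""
    try:
        compile(source, "<candidate>", "exec")
    except SyntaxError as exc:
        return ValidationResult("syntax", False, f"line {exc.lineno}: {exc.msg}")
    return ValidationResult("syntax", True)


def run_pipeline(
    source: str,
    validators: List[Callable[[str], ValidationResult]],
    ai_step: Callable[[str], str],
) -> Dict[str, object]:
    """Run every validator first; call the AI step only if all of them pass."""
    failures = [r for r in (v(source) for v in validators) if not r.passed]
    if failures:
        # No model in the loop: the author gets a reproducible reason.
        return {"ai_called": False, "failures": failures, "ai_output": None}
    return {"ai_called": True, "failures": [], "ai_output": ai_step(source)}
```

With this shape, a malformed input never reaches the paid step, and the failure list carries the precise reason back to the PR author.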

🤖 AI Scenario Generation

  • Input: a plain-language user story ("User can reset their password via email link")
  • Output: a Playwright test file in commandant_tests/ai_generated/ with happy-path plus enumerated edge cases
  • Prompt design: structured templates in prompts/scenario_generation.py constrain Claude to produce valid pytest syntax, appropriate selectors, and assertions that match the target application's page objects
  • Coverage delta: compare_coverage.py diffs baseline vs AI-augmented runs to surface where AI extends real coverage rather than duplicating existing tests
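
The coverage-delta idea can be illustrated with plain sets. The sketch below is an assumption about the approach, not the contents of compare_coverage.py: it takes two hypothetical mappings of file path to covered line numbers and keeps only lines that the AI-augmented run covers and the baseline does not.

```python
from typing import Dict, Set

# Hypothetical shape: file path -> set of covered line numbers.
Coverage = Dict[str, Set[int]]


def coverage_delta(baseline: Coverage, augmented: Coverage) -> Coverage:
    """Return lines covered by the augmented run but not by the baseline."""
    delta: Coverage = {}
    for path, lines in augmented.items():
        new_lines = lines - baseline.get(path, set())
        if new_lines:
            delta[path] = new_lines
    return delta
```

An empty result means the generated tests only duplicated existing coverage, which is the case the delta is meant to expose.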

📝 AI Code Review in CI/CD

  • Trigger: ai_review.yml on pull_request events
  • Action: review_pr.py fetches the diff, sends it through Claude against defined QA standards, and posts structured feedback as a Gitea PR comment
  • Scope: calls out missing tests, questionable security assumptions, unchecked inputs, and architectural concerns — not style nits
  • Signal/noise: feedback is structured (headings, severity, file/line references) so reviewers can triage quickly rather than wading through paragraph blobs
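
One way to get the structured shape described above is to render findings into severity-grouped Markdown before posting. This is a sketch only: the `Finding` fields, the three severity levels, and `render_comment` are assumptions for illustration, not the format review_pr.py actually emits.

```python
from dataclasses import dataclass
from typing import List

# Assumed severity scale, most urgent first.
SEVERITY_ORDER = ["high", "medium", "low"]


@dataclass
class Finding:
    severity: str
    path: str
    line: int
    message: str


def render_comment(findings: List[Finding]) -> str:
    """Group findings under severity headings with file:line references."""
    if not findings:
        return "No QA findings."
    sections = []
    for severity in SEVERITY_ORDER:
        group = [f for f in findings if f.severity == severity]
        if not group:
            continue
        lines = [f"### {severity.capitalize()} ({len(group)})"]
        for f in sorted(group, key=lambda f: (f.path, f.line)):
            lines.append(f"- `{f.path}:{f.line}` {f.message}")
        sections.append("\n".join(lines))
    return "\n\n".join(sections)
```

Ordering by severity first lets a reviewer stop reading once the remaining sections fall below their triage threshold.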

📊 Quality Engineering Assessment

  • Eight dimensions: Architecture, Code Organization, Testing, Security, Documentation, DevOps/CI, Production Readiness, and Overall Quality
  • Deterministic-only mode: --skip-ai runs the full validator suite without an API key, useful for fast local checks and for users without Anthropic credentials
  • Full mode: deterministic validators run first; then Claude scores each dimension with a rubric-bound prompt in prompts/quality_assessment.py
  • Outputs: self-contained HTML report with JSON sidecar for programmatic consumption; trend tracking across repeat runs; quality gates via --min-score for CI blocking
  • Stack-aware analyzers: stacks/rails.py, stacks/generic.py with an abstract StackAnalyzer base — new stacks plug in without framework changes
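
The plug-in structure could look roughly like the following. Only the `StackAnalyzer` name and the rails/generic split come from the description above; the method names (`matches`, `test_files`), the detection heuristics, and `pick_analyzer` are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List, Set


class StackAnalyzer(ABC):
    """Base class: one subclass per supported stack (methods are assumed)."""

    name = "abstract"

    @abstractmethod
    def matches(self, files: Set[str]) -> bool:
        """Return True if the repository file list looks like this stack."""

    @abstractmethod
    def test_files(self, files: Set[str]) -> List[str]:
        """Return the paths this stack treats as tests."""


class RailsAnalyzer(StackAnalyzer):
    name = "rails"

    def matches(self, files: Set[str]) -> bool:
        return {"Gemfile", "config/routes.rb"} <= files

    def test_files(self, files: Set[str]) -> List[str]:
        return sorted(f for f in files if f.startswith(("spec/", "test/")))


class GenericAnalyzer(StackAnalyzer):
    name = "generic"

    def matches(self, files: Set[str]) -> bool:
        return True  # fallback: accepts any repository

    def test_files(self, files: Set[str]) -> List[str]:
        return sorted(f for f in files if "test" in f.lower())


# Order matters: specific stacks first, the generic fallback last.
ANALYZERS = [RailsAnalyzer(), GenericAnalyzer()]


def pick_analyzer(files: Iterable[str]) -> StackAnalyzer:
    file_set = set(files)
    return next(a for a in ANALYZERS if a.matches(file_set))
```

Under this sketch, supporting a new stack means adding one subclass and one entry in the list, with no change to the calling code.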

🛠 Multi-Application Extension (AITaaS)

  • Adding a target: one module under commandant_tests/, one URL in .env.example, one marker in pytest.ini, one service step in e2e.yml
  • Phase 1 application: the SpinJockey Network Rails 8 platform is the inaugural target, exercising all three AI patterns end-to-end
  • Phase roadmap: baseline hand-authored tests, then AI-generated augmentation, then black-box external assessment of deployed applications
  • Portability: docs/GITLAB_TRANSLATION_GUIDE.md maps the Gitea Actions workflow onto GitLab CI, proving the pipeline is not Gitea-bound

🐍 Python Test Layer

  • pytest: orchestration, fixtures, markers, coverage
  • Playwright (Python): headless Chromium for E2E
  • Anthropic SDK: sync client in claude_client.py
  • conftest.py: shared fixtures mirroring the target app's test conventions

🔧 CLI Tooling

  • generate_scenarios.py: user story → Playwright test
  • review_pr.py: PR diff → Gitea comment
  • assess_quality.py: codebase → scored report
  • compare_coverage.py: baseline vs AI-augmented delta

🔐 Safety & Cost Controls

  • Validators first: AI never runs against malformed input
  • Structured prompts: rubric-bound, deterministic output shape
  • Credential scans: blocks inadvertent secret exfiltration
  • Skip-AI mode: all validators available without API key
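
A credential scan of this kind is typically a set of regular expressions applied line by line before any text leaves the machine. The rules below are illustrative assumptions (an AWS-style access key ID prefix, a PEM private-key header, and a quoted hard-coded secret), not the framework's actual rule set, and a real scanner would need a broader list.

```python
import re
from typing import List, Tuple

# Illustrative rules only; names and patterns are assumptions.
RULES = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
    "hardcoded_secret": re.compile(
        r"(?i)\b(?:api_key|secret|password|token)\b\s*[:=]\s*['\"][^'\"]{8,}['\"]"
    ),
}


def scan_for_credentials(text: str) -> List[Tuple[int, str]]:
    """Return (line number, rule name) for every line that matches a rule."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for rule, pattern in RULES.items():
            if pattern.search(line):
                hits.append((number, rule))
    return hits
```

If the scan returns any hit, the diff would be rejected before it is sent to the model, which is what keeps secrets out of API requests.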

📦 Packaging & Deployment

  • Dockerfile: production CLI image (Python 3.13-slim)
  • Dockerfile.e2e: CI image — Ruby + Python + Playwright + Anthropic
  • docker-compose.prod.yml: portal + shared reports volume
  • docker-compose.test.yml: target app + Postgres for E2E runs

📚 Companion: Commandant Report Portal

Commandant generates assessment reports as HTML + JSON sidecars. The Commandant Report Portal is the companion web application that turns that output stream into an operable dashboard — browse, filter, score-track over time, and trigger new assessments from a UI rather than a shell.

The two projects are deliberately separated: Commandant is a CLI framework that runs in CI and produces artifacts; the Portal is a Flask web app that consumes those artifacts. CI/CD pipelines push reports to the Portal via an authenticated API, and human users browse the history through the web UI. Neither depends on the other at runtime — Commandant can run without the Portal, and the Portal can serve any compatible report directory.

📊 Report Dashboard

  • Sortable index of assessment runs
  • Score badges and severity counts at a glance
  • Project filtering for multi-app scale

📋 Report Detail

  • Per-dimension score tiles
  • Rendered HTML preview + source toggle
  • Export to JSON, HTML, and Print-to-PDF

📈 Trend Visualization

  • Per-project score charts over time (Chart.js)
  • Spot regressions across runs
  • Correlate score movements with commits
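
Spotting a regression across runs reduces to comparing each run with its predecessor. The sketch below assumes a hypothetical `Run` record with a commit identifier and a single score, and a configurable minimum drop; the Portal's actual data model is not shown in this document.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Run:
    commit: str   # hypothetical: short commit identifier for the run
    score: float  # hypothetical: one aggregate score per run


def find_regressions(
    runs: List[Run], min_drop: float = 1.0
) -> List[Tuple[str, str, float]]:
    """Return (previous commit, current commit, drop) for each large drop.

    `runs` is expected in chronological order.
    """
    regressions = []
    for previous, current in zip(runs, runs[1:]):
        drop = previous.score - current.score
        if drop >= min_drop:
            regressions.append((previous.commit, current.commit, drop))
    return regressions
```

Returning the commit pair alongside the drop is what makes it possible to correlate a score movement with a specific change.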

🔒 Role-Based Access

  • Admin and Viewer roles enforced server-side
  • Admin-only user CRUD + assessment trigger
  • Self-service profile / password change for all users

🚀 CI Integration

  • REST API for report upload with API-key auth
  • One-step curl from any CI provider
  • Admin UI to trigger Gitea Actions assessment runs
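
For pipelines that prefer Python over curl, the upload could be built with the standard library. Everything specific here is an assumption for illustration: the `/api/reports` path, the `X-API-Key` header name, and the JSON payload shape are hypothetical and should be replaced with whatever the Portal's API actually documents.

```python
import json
import urllib.request


def build_upload_request(
    portal_url: str, api_key: str, report: dict
) -> urllib.request.Request:
    """Build (but do not send) an authenticated report upload request."""
    body = json.dumps(report).encode("utf-8")
    return urllib.request.Request(
        url=portal_url.rstrip("/") + "/api/reports",  # hypothetical endpoint
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-API-Key": api_key,  # hypothetical header name
        },
    )
```

Sending is then a single `urllib.request.urlopen(request)` call in the CI step, with the key read from a CI secret rather than the repository.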

🛡 Security Posture

  • PBKDF2 password hashing (Werkzeug), signed sessions
  • CSRF protection on all forms (the API upload endpoint is exempt and authenticates by API key instead)
  • Flask-Limiter rate limit on POST /login
  • Account lockout: 15 minutes after 5 failed attempts
  • Forced password change for seeded users on first login
  • HSTS + secure cookies under FORCE_HTTPS=true
  • Path-traversal protection, system-path blocklist
  • File extension blocklist: .db, .sqlite, .key, .pem, .env
  • MIME validation on uploads; sandboxed iframes for HTML preview
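
The lockout rule above (15 minutes after 5 failed attempts) can be sketched as a small state machine. The two constants come from the list; the `LoginState` record, the function names, and the choice to reset the counter when a lock is applied are assumptions, not the Portal's implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

MAX_FAILED_ATTEMPTS = 5
LOCKOUT_WINDOW = timedelta(minutes=15)


@dataclass
class LoginState:
    failed_attempts: int = 0
    locked_until: Optional[datetime] = None


def is_locked(state: LoginState, now: datetime) -> bool:
    return state.locked_until is not None and now < state.locked_until


def record_failure(state: LoginState, now: datetime) -> None:
    """Count a failed login and lock the account on the fifth failure."""
    state.failed_attempts += 1
    if state.failed_attempts >= MAX_FAILED_ATTEMPTS:
        state.locked_until = now + LOCKOUT_WINDOW
        state.failed_attempts = 0  # assumed: counter restarts after a lock


def record_success(state: LoginState) -> None:
    state.failed_attempts = 0
    state.locked_until = None
```

Passing `now` in explicitly, rather than reading the clock inside the functions, keeps the rule testable without sleeping.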

Portal Stack: Flask, SQLAlchemy, Flask-WTF (CSRF), Flask-Limiter (rate limiting), gunicorn (WSGI). Sixty tests cover authentication, user CRUD, role enforcement, profile management, rate limiting, account lockout, CSRF, upload API validation, and security enforcement.

🎯 Engineering Highlights

🧠 AI with Guardrails

  • Deterministic layer before every AI call — predictable behavior, bounded cost
  • Rubric-bound prompts produce structured, parseable output — no regex fishing on free-text
  • Skip-AI mode means the framework is still useful without an Anthropic key
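
Structured output only helps if the reply is validated before use. The following sketch assumes the model is asked to return a JSON object keyed by the eight dimension names with integer scores; the 0 to 10 scale and the `parse_scores` helper are hypothetical, since the actual rubric and schema are not given here.

```python
import json
from typing import Dict

DIMENSIONS = (
    "Architecture", "Code Organization", "Testing", "Security",
    "Documentation", "DevOps/CI", "Production Readiness", "Overall Quality",
)


def parse_scores(reply: str, low: int = 0, high: int = 10) -> Dict[str, int]:
    """Parse a model reply and reject anything outside the expected shape.

    Raises ValueError for invalid JSON, missing or extra keys, or bad scores.
    """
    data = json.loads(reply)  # JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    missing = [d for d in DIMENSIONS if d not in data]
    extra = [k for k in data if k not in DIMENSIONS]
    if missing or extra:
        raise ValueError(f"unexpected keys: missing={missing}, extra={extra}")
    for name, score in data.items():
        is_int = isinstance(score, int) and not isinstance(score, bool)
        if not is_int or not low <= score <= high:
            raise ValueError(f"invalid score for {name}: {score!r}")
    return data
```

A reply that fails validation can be retried or reported as an error, which avoids scraping scores out of free text.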

🔧 QA Engineering Discipline

  • The eight quality dimensions mirror the rubric a human reviewer would apply
  • Coverage delta is measured, not asserted — if AI-generated tests don't expand coverage, the pipeline tells you
  • Stack-aware analyzers share a clean base class, so adding a new stack such as Django or Node is routine

⚙️ CI Integration

  • Three self-contained Gitea Actions workflows: lint, E2E, AI review
  • Each workflow is triggered by the commit lifecycle event that justifies its cost
  • Reports are durable artifacts: HTML for humans, JSON sidecars for downstream tools
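
A downstream tool can consume the JSON sidecar to enforce a quality gate. The sidecar schema is not specified in this document, so the sketch assumes a hypothetical top-level `"dimensions"` object mapping dimension names to numeric scores, and assumes the gate applies to every dimension.

```python
import json
from typing import List


def failing_dimensions(sidecar_text: str, min_score: float) -> List[str]:
    """Return the dimensions whose score is below the threshold."""
    report = json.loads(sidecar_text)
    scores = report["dimensions"]  # hypothetical sidecar key
    return sorted(name for name, score in scores.items() if score < min_score)


def gate_exit_code(sidecar_text: str, min_score: float) -> int:
    """Return 0 when every dimension passes, 1 otherwise (CI-style)."""
    return 1 if failing_dimensions(sidecar_text, min_score) else 0
```

A CI step could pass this value to `sys.exit` so that a score below the threshold blocks the pipeline.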

📦 Extensibility

  • Adding a new application target: four file edits (test module, .env.example, pytest.ini, e2e.yml), zero framework changes
  • Portal + Framework are deployed independently but integrate through a documented contract
  • GitLab CI translation guide demonstrates pipeline portability

🎯 Project Impact

Treats AI as a test-automation primitive, not a demo. The framework is built for the operational reality of CI/CD: it has cost controls, deterministic fallback, structured outputs, and durable artifacts. It doesn't assume infinite API budget or ideal inputs.

Closes the PR-review feedback loop with evidence. Rather than opining about code, it produces rubric-based scores across eight quality dimensions and tracks the scores over time. Teams can see whether quality is trending up or down, not just whether the build is green.

Demonstrates AITaaS architecture end-to-end. The framework + portal combination shows how AI-driven testing should be productized: a CLI that runs in CI, a web UI that consumes the artifacts, RESTful integration between them, role-based access for who can trigger and view, and security hygiene appropriate for a shared tool.