Commandant: AI Test Automation as a Service (AITaaS)

Project Overview

Objective: Build a QA engineering framework that treats large-language-model capability as a first-class test automation primitive — not a novelty — producing measurable coverage gains, actionable PR review, and repeatable quality assessment against real codebases.

Three Enterprise-Grade AI Testing Patterns:

AI Scenario Generation — Claude generates Playwright end-to-end test scenario matrices from user story descriptions, systematically expanding edge-case coverage beyond what manual test design typically produces.
AI Code Review in CI/CD — A Gitea Actions workflow step analyzes pull requests against defined QA standards and posts structured, actionable feedback as PR comments.
Quality Engineering Assessment — Deterministic validators combined with Claude API scoring assess codebases across eight quality dimensions (Architecture, Code Organization, Testing, Security, Documentation, DevOps/CI, Production Readiness, and Overall Quality), producing HTML reports with JSON sidecars and trend tracking.

Platform Basis: Built on pytest + playwright (Python) with the Anthropic SDK, designed to grow from a single application target into a multi-application AITaaS platform where each target gets its own test module, marker, and CI service step without framework changes.

AI Testing Patterns: 3
Quality Dimensions: 8
Runtime: Python 3.13
AI Layer: Claude API

Framework Architecture

💡 Design Principle: Rule-Based Before AI

Every AI-powered step is preceded by deterministic validation that runs fast, is free, and produces no false negatives on well-formed input. AI augments coverage — it does not replace structural correctness. This keeps AI costs bounded, CI fast when changes are clean, and diagnosis obvious when something fails: if a deterministic validator rejects a diff, the PR author sees a precise, reproducible reason without any model in the loop.

Layer 1: Deterministic Validators (syntax, selectors, credentials, markers)
Layer 2: Quality Validators (codebase structure, security, CI hygiene)
Layer 3: AI-Augmented (scenario generation, PR review, quality scoring)
Layer 4: CI Enforcement (Gitea Actions workflows with quality gates)

🤖 AI Scenario Generation

Input: a plain-language user story ("User can reset their password via email link")
Output: a Playwright test file in commandant_tests/ai_generated/ with happy-path plus enumerated edge cases
Prompt design: structured templates in prompts/scenario_generation.py constrain Claude to produce valid pytest syntax, appropriate selectors, and assertions that match the target application's page objects
Coverage delta: compare_coverage.py diffs baseline vs AI-augmented runs to surface where AI extends real coverage rather than duplicating existing tests

📝 AI Code Review in CI/CD

Trigger: ai_review.yml on pull_request events
Action: review_pr.py fetches the diff, sends it through Claude against defined QA standards, and posts structured feedback as a Gitea PR comment
Scope: calls out missing tests, questionable security assumptions, unchecked inputs, and architectural concerns — not style nits
Signal/noise: feedback is structured (headings, severity, file/line references) so reviewers can triage quickly rather than wading through paragraph blobs
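
The structured feedback described above could be rendered along these lines. This is a sketch under assumptions: ReviewFinding, the three severity names, and the comment layout are hypothetical, and posting the text to Gitea is omitted.

```python
from dataclasses import dataclass

# Assumed severity scale; the real review rubric may use different labels.
SEVERITY_ORDER = {"high": 0, "medium": 1, "low": 2}


@dataclass
class ReviewFinding:
    severity: str
    path: str
    line: int
    message: str


def render_comment(findings: list[ReviewFinding]) -> str:
    """Group findings by severity, most severe first, with file/line references."""
    if not findings:
        return "AI review: no findings."
    ordered = sorted(
        findings, key=lambda f: (SEVERITY_ORDER[f.severity], f.path, f.line)
    )
    lines = ["AI review findings"]
    current = None
    for finding in ordered:
        if finding.severity != current:
            current = finding.severity
            lines.append("")
            lines.append(f"[{current.upper()}]")
        lines.append(f"- {finding.path}:{finding.line} {finding.message}")
    return "\n".join(lines)
```

Sorting by severity before rendering puts the findings that need attention first, which is what lets a reviewer triage from the top of the comment.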

📊 Quality Engineering Assessment

Eight dimensions: Architecture, Code Organization, Testing, Security, Documentation, DevOps/CI, Production Readiness, and Overall Quality
Deterministic-only mode: --skip-ai runs the full validator suite without an API key, useful for fast local checks and for users without Anthropic credentials
Full mode: deterministic validators run first; then Claude scores each dimension with a rubric-bound prompt in prompts/quality_assessment.py
Outputs: self-contained HTML report with JSON sidecar for programmatic consumption; trend tracking across repeat runs; quality gates via --min-score for CI blocking
Stack-aware analyzers: stacks/rails.py, stacks/generic.py with an abstract StackAnalyzer base — new stacks plug in without framework changes
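
The plug-in shape of the stack analyzers might look roughly like the following. Only the StackAnalyzer name comes from the description above; the methods matches and test_directories, the detection rule for Rails, and pick_analyzer are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class StackAnalyzer(ABC):
    """Abstract base: each stack decides whether it applies to a codebase."""

    name = "abstract"

    @abstractmethod
    def matches(self, root: Path) -> bool:
        """Return True if this analyzer recognizes the codebase at root."""

    @abstractmethod
    def test_directories(self) -> list[str]:
        """Directories where this stack conventionally keeps its tests."""


class RailsAnalyzer(StackAnalyzer):
    name = "rails"

    def matches(self, root: Path) -> bool:
        # Assumed heuristic: a Gemfile plus config/routes.rb marks a Rails app.
        return (root / "Gemfile").is_file() and (root / "config" / "routes.rb").is_file()

    def test_directories(self) -> list[str]:
        return ["test", "spec"]


class GenericAnalyzer(StackAnalyzer):
    name = "generic"

    def matches(self, root: Path) -> bool:
        return True  # fallback for any codebase

    def test_directories(self) -> list[str]:
        return ["tests", "test"]


def pick_analyzer(root: Path) -> StackAnalyzer:
    """Try specific stacks first, then fall back to the generic analyzer."""
    for analyzer in (RailsAnalyzer(),):
        if analyzer.matches(root):
            return analyzer
    return GenericAnalyzer()
```

Under this shape, supporting another stack means adding one subclass and one entry in the tuple inside pick_analyzer.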

🛠 Multi-Application Extension (AITaaS)

Adding a target: one module under commandant_tests/, one URL in .env.example, one marker in pytest.ini, one service step in e2e.yml
Phase 1 application: the SpinJockey Network Rails 8 platform is the inaugural target, exercising all three AI patterns end-to-end
Phase roadmap: baseline hand-authored tests, then AI-generated augmentation, then black-box external assessment of deployed applications
Portability: docs/GITLAB_TRANSLATION_GUIDE.md maps the Gitea Actions workflow onto GitLab CI, proving the pipeline is not Gitea-bound

🐍 Python Test Layer

pytest: orchestration, fixtures, markers, coverage
Playwright (Python): headless Chromium for E2E
Anthropic SDK: sync client in claude_client.py
conftest.py: shared fixtures mirroring the target app's test conventions

🔧 CLI Tooling

generate_scenarios.py: user story → Playwright test
review_pr.py: PR diff → Gitea comment
assess_quality.py: codebase → scored report
compare_coverage.py: baseline vs AI-augmented delta
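
The coverage delta reduces to a set difference per file. The sketch below assumes coverage is available as a mapping from file path to covered line numbers; that data shape and the coverage_delta name are hypothetical, not the actual interface of compare_coverage.py.

```python
def coverage_delta(
    baseline: dict[str, list[int]], augmented: dict[str, list[int]]
) -> dict[str, list[int]]:
    """Return, per file, the lines covered only by the AI-augmented run."""
    delta: dict[str, list[int]] = {}
    for path, lines in augmented.items():
        new_lines = set(lines) - set(baseline.get(path, ()))
        if new_lines:
            delta[path] = sorted(new_lines)
    return delta
```

An empty result means the generated tests duplicated existing coverage, which is the signal the comparison is meant to surface.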

🔐 Safety & Cost Controls

Validators first: AI never runs against malformed input
Structured prompts: rubric-bound, deterministic output shape
Credential scans: block inadvertent secret exfiltration
Skip-AI mode: all validators available without API key
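
A credential scan of the kind listed above can be approximated with a few regular expressions applied line by line. The patterns here are a small illustrative set chosen for this sketch, not the framework's actual rules, and SECRET_PATTERNS and scan_for_credentials are hypothetical names.

```python
import re

# Illustrative patterns only; a real scanner would carry a longer, tuned list.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "hardcoded_secret": re.compile(
        r"(?i)\b(api[_-]?key|secret|password|token)\b\s*[:=]\s*['\"][^'\"]{8,}['\"]"
    ),
}


def scan_for_credentials(text: str) -> list[tuple[int, str]]:
    """Return (line number, pattern name) for every suspected secret."""
    hits: list[tuple[int, str]] = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits
```

Running the scan before any text is sent to the model is what keeps a secret in a diff from leaving the CI environment.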

📦 Packaging & Deployment

Dockerfile: production CLI image (Python 3.13-slim)
Dockerfile.e2e: CI image — Ruby + Python + Playwright + Anthropic
docker-compose.prod.yml: portal + shared reports volume
docker-compose.test.yml: target app + Postgres for E2E runs

📚 Companion: Commandant Report Portal

Commandant generates assessment reports as HTML + JSON sidecars. The Commandant Report Portal is the companion web application that turns that output stream into an operable dashboard: browse and filter reports, track scores over time, and trigger new assessments from a UI rather than a shell.

The two projects are deliberately separated: Commandant is a CLI framework that runs in CI and produces artifacts; the Portal is a Flask web app that consumes those artifacts. CI/CD pipelines push reports to the Portal via an authenticated API, and human users browse the history through the web UI. Neither depends on the other at runtime — Commandant can run without the Portal, and the Portal can serve any compatible report directory.

📊 Report Dashboard

Sortable index of assessment runs
Score badges and severity counts at a glance
Project filtering for multi-app scale

📋 Report Detail

Per-dimension score tiles
Rendered HTML preview + source toggle
Export to JSON, HTML, and Print-to-PDF

📈 Trend Visualization

Per-project score charts over time (Chart.js)
Spot regressions across runs
Correlate score movements with commits
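
Spotting a regression across runs amounts to comparing each score with the previous one. The sketch below assumes runs arrive oldest first as (commit, score) pairs and uses a 5-point drop as the threshold; both the data shape and the threshold are assumptions for illustration.

```python
def find_regressions(
    runs: list[tuple[str, float]], threshold: float = 5.0
) -> list[tuple[str, float]]:
    """Return (commit, drop) for each run whose score fell by at least threshold."""
    regressions: list[tuple[str, float]] = []
    for (_, previous), (commit, score) in zip(runs, runs[1:]):
        drop = previous - score
        if drop >= threshold:
            regressions.append((commit, drop))
    return regressions
```

Keeping the commit identifier next to each drop is what makes it possible to correlate a score movement with the change that caused it.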

🔒 Role-Based Access

Admin and Viewer roles enforced server-side
Admin-only user CRUD + assessment trigger
Self-service profile / password change for all users

🚀 CI Integration

REST API for report upload with API-key auth
One-step curl from any CI provider
Admin UI to trigger Gitea Actions assessment runs
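
For CI jobs written in Python rather than shell, the upload could be built with the standard library instead of curl. The endpoint path /api/reports, the X-API-Key header, and the JSON payload shape are all assumptions made for this sketch; the Portal's real contract may differ.

```python
import json
import urllib.request


def build_upload_request(
    portal_url: str, api_key: str, project: str, report: dict
) -> urllib.request.Request:
    """Build an authenticated POST carrying one report's JSON sidecar."""
    body = json.dumps({"project": project, "report": report}).encode("utf-8")
    return urllib.request.Request(
        url=f"{portal_url.rstrip('/')}/api/reports",  # hypothetical endpoint
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-API-Key": api_key,  # hypothetical header name
        },
    )
```

The request is sent with urllib.request.urlopen(request, timeout=30). Building it separately from sending keeps the payload and headers testable without a running Portal.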

🛡 Security Posture

PBKDF2 password hashing (Werkzeug), signed sessions
CSRF protection on all forms (the API upload endpoint is exempt and authenticates by API key instead)
Flask-Limiter rate limit on POST /login
Account lockout: 15 minutes after 5 failed attempts
Forced password change for seeded users on first login
HSTS + secure cookies under FORCE_HTTPS=true
Path-traversal protection, system-path blocklist
File extension blocklist: .db, .sqlite, .key, .pem, .env
MIME validation on uploads; sandboxed iframes for HTML preview
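
The path-traversal and extension checks listed above can be sketched with pathlib. The blocked extensions are taken from the list; resolve_report_path and the choice of PermissionError are hypothetical, and the system-path blocklist is not modeled here.

```python
from pathlib import Path

BLOCKED_EXTENSIONS = {".db", ".sqlite", ".key", ".pem", ".env"}


def resolve_report_path(reports_root: Path, requested: str) -> Path:
    """Resolve a requested path and refuse anything outside the reports root."""
    root = reports_root.resolve()
    candidate = (root / requested).resolve()
    # Resolving first collapses "..", so the containment check sees the real target.
    if not candidate.is_relative_to(root):
        raise PermissionError("path escapes the reports directory")
    # A bare ".env" has no suffix in pathlib, so the file name is checked as well.
    if (
        candidate.suffix.lower() in BLOCKED_EXTENSIONS
        or candidate.name.lower() in BLOCKED_EXTENSIONS
    ):
        raise PermissionError("file type is blocked")
    return candidate
```

Path.is_relative_to requires Python 3.9 or later, which the 3.13 runtime satisfies.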

Portal Stack: Flask, SQLAlchemy, Flask-WTF (CSRF), Flask-Limiter (rate limiting), gunicorn (WSGI). Sixty tests cover authentication, user CRUD, role enforcement, profile management, rate limiting, account lockout, CSRF, upload API validation, and security enforcement.
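
The lockout rule stated under Security Posture (15 minutes after 5 failed attempts) can be expressed as a small state machine. This is a sketch of the rule only: LoginThrottle is a hypothetical name, the state is held in memory rather than in the Portal's database, and resetting the counter when the lock is set is an assumption.

```python
from datetime import datetime, timedelta
from typing import Optional

MAX_FAILED_ATTEMPTS = 5
LOCKOUT_WINDOW = timedelta(minutes=15)


class LoginThrottle:
    """Tracks failed logins for one account."""

    def __init__(self) -> None:
        self.failed_attempts = 0
        self.locked_until: Optional[datetime] = None

    def is_locked(self, now: datetime) -> bool:
        return self.locked_until is not None and now < self.locked_until

    def record_failure(self, now: datetime) -> None:
        self.failed_attempts += 1
        if self.failed_attempts >= MAX_FAILED_ATTEMPTS:
            # Lock the account and start counting afresh after the window.
            self.locked_until = now + LOCKOUT_WINDOW
            self.failed_attempts = 0

    def record_success(self) -> None:
        self.failed_attempts = 0
        self.locked_until = None
```

Passing the current time in as an argument keeps the rule deterministic under test, with no sleeping or clock patching.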

🎯 Engineering Highlights

🧠 AI with Guardrails

Deterministic layer before every AI call — predictable behavior, bounded cost
Rubric-bound prompts produce structured, parseable output, so nothing has to be fished out of free text with regexes
Skip-AI mode means the framework is still useful without an Anthropic key

🔧 QA Engineering Discipline

Eight quality dimensions are the same rubric a human reviewer would apply
Coverage delta is measured, not asserted — if AI-generated tests don't expand coverage, the pipeline tells you
Stack-aware analyzers with a clean base class, so adding stacks such as Django or Node alongside Rails is routine

⚙️ CI Integration

Three self-contained Gitea Actions workflows: lint, E2E, AI review
Each workflow is triggered by the commit lifecycle event that justifies its cost
Reports are durable artifacts: HTML for humans, JSON sidecars for downstream tools
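
A downstream tool can turn the JSON sidecar into a CI gate by comparing scores with a minimum and returning a non-zero exit code. The sidecar layout used here (a "dimensions" object mapping names to numeric scores) is an assumption, as is applying the minimum to every dimension; the real --min-score behavior may differ.

```python
import json


def gate_exit_code(sidecar_json: str, min_score: float) -> int:
    """Return 1 if any dimension scores below min_score, else 0."""
    report = json.loads(sidecar_json)
    # Assumed schema: {"dimensions": {"Security": 82, "Testing": 64, ...}}
    failing = {
        name: score
        for name, score in report["dimensions"].items()
        if score < min_score
    }
    for name, score in sorted(failing.items()):
        print(f"FAIL {name}: {score} < {min_score}")
    return 1 if failing else 0
```

A CI step would pass the return value to sys.exit so that a failing dimension blocks the pipeline.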

📦 Extensibility

Adding a new application target: four edits (test module, .env.example URL, pytest.ini marker, e2e.yml service step), zero framework changes
Portal + Framework are deployed independently but integrate through a documented contract
GitLab CI translation guide demonstrates pipeline portability

🎯 Project Impact

Treats AI as a test-automation primitive, not a demo. The framework is built for the operational reality of CI/CD: it has cost controls, deterministic fallback, structured outputs, and durable artifacts. It doesn't assume infinite API budget or ideal inputs.

Closes the PR-review feedback loop with evidence. Rather than opining about code, it produces rubric-based scores across eight quality dimensions and tracks the scores over time. Teams can see whether quality is trending up or down, not just whether the build is green.

Demonstrates AITaaS architecture end-to-end. The framework + portal combination shows how AI-driven testing should be productized: a CLI that runs in CI, a web UI that consumes the artifacts, RESTful integration between them, role-based access for who can trigger and view, and security hygiene appropriate for a shared tool.