Commandant
AI Test Automation as a Service (AITaaS) — a QA Engineering Framework
Project Overview
Objective: Build a QA engineering framework that treats large-language-model capability as a first-class test automation primitive — not a novelty — producing measurable coverage gains, actionable PR review, and repeatable quality assessment against real codebases.
Three Enterprise-Grade AI Testing Patterns:
- AI Scenario Generation — Claude generates Playwright end-to-end test scenario matrices from user story descriptions, systematically expanding edge-case coverage beyond what manual test design typically produces.
- AI Code Review in CI/CD — A Gitea Actions workflow step analyzes pull requests against defined QA standards and posts structured, actionable feedback as PR comments.
- Quality Engineering Assessment — Deterministic validators combined with Claude API scoring assess codebases across eight quality dimensions (Architecture, Code Organization, Testing, Security, Documentation, DevOps/CI, Production Readiness, and Overall Quality), producing HTML reports with JSON sidecars and trend tracking.
Platform Basis: Built on pytest + playwright (Python) with the Anthropic SDK, designed to grow from a single application target into a multi-application AITaaS platform where each target gets its own test module, marker, and CI service step without framework changes.
3.13
API
Framework Architecture
💡 Design Principle: Rule-Based Before AI
Every AI-powered step is preceded by deterministic validation that runs fast, is free, and produces no false negatives on well-formed input. AI augments coverage — it does not replace structural correctness. This keeps AI costs bounded, CI fast when changes are clean, and diagnosis obvious when something fails: if a deterministic validator rejects a diff, the PR author sees a precise, reproducible reason without any model in the loop.
Layer 1
Deterministic Validators — syntax, selectors, credentials, markers
Layer 2
Quality Validators — codebase structure, security, CI hygiene
Layer 3
AI-Augmented — scenario generation, PR review, quality scoring
Layer 4
CI Enforcement — Gitea Actions workflows with quality gates
🤖 AI Scenario Generation
- Input: a plain-language user story ("User can reset their password via email link")
- Output: a Playwright test file in
commandant_tests/ai_generated/with happy-path plus enumerated edge cases - Prompt design: structured templates in
prompts/scenario_generation.pyconstrain Claude to produce valid pytest syntax, appropriate selectors, and assertions that match the target application's page objects - Coverage delta:
compare_coverage.pydiffs baseline vs AI-augmented runs to surface where AI extends real coverage rather than duplicating existing tests
📝 AI Code Review in CI/CD
- Trigger:
ai_review.ymlonpull_requestevents - Action:
review_pr.pyfetches the diff, sends it through Claude against defined QA standards, and posts structured feedback as a Gitea PR comment - Scope: calls out missing tests, questionable security assumptions, unchecked inputs, and architectural concerns — not style nits
- Signal/noise: feedback is structured (headings, severity, file/line references) so reviewers can triage quickly rather than wading through paragraph blobs
📊 Quality Engineering Assessment
- Eight dimensions: Architecture, Code Organization, Testing, Security, Documentation, DevOps/CI, Production Readiness, and Overall Quality
- Deterministic-only mode:
--skip-airuns the full validator suite without an API key, useful for fast local checks and for users without Anthropic credentials - Full mode: deterministic validators run first; then Claude scores each dimension with a rubric-bound prompt in
prompts/quality_assessment.py - Outputs: self-contained HTML report with JSON sidecar for programmatic consumption; trend tracking across repeat runs; quality gates via
--min-scorefor CI blocking - Stack-aware analyzers:
stacks/rails.py,stacks/generic.pywith an abstractStackAnalyzerbase — new stacks plug in without framework changes
🛠 Multi-Application Extension (AITaaS)
- Adding a target: one module under
commandant_tests/, one URL in.env.example, one marker inpytest.ini, one service step ine2e.yml - Phase 1 application: the SpinJockey Network Rails 8 platform is the inaugural target, exercising all three AI patterns end-to-end
- Phase roadmap: baseline hand-authored tests, then AI-generated augmentation, then black-box external assessment of deployed applications
- Portability:
docs/GITLAB_TRANSLATION_GUIDE.mdmaps the Gitea Actions workflow onto GitLab CI, proving the pipeline is not Gitea-bound
🐍 Python Test Layer
- pytest: orchestration, fixtures, markers, coverage
- Playwright (Python): headless Chromium for E2E
- Anthropic SDK: sync client in
claude_client.py - conftest.py: shared fixtures mirroring the target app's test conventions
🔧 CLI Tooling
- generate_scenarios.py: user story → Playwright test
- review_pr.py: PR diff → Gitea comment
- assess_quality.py: codebase → scored report
- compare_coverage.py: baseline vs AI-augmented delta
🔐 Safety & Cost Controls
- Validators first: AI never runs against malformed input
- Structured prompts: rubric-bound, deterministic output shape
- Credential scans: blocks inadvertent secret exfiltration
- Skip-AI mode: all validators available without API key
📦 Packaging & Deployment
- Dockerfile: production CLI image (Python 3.13-slim)
- Dockerfile.e2e: CI image — Ruby + Python + Playwright + Anthropic
- docker-compose.prod.yml: portal + shared reports volume
- docker-compose.test.yml: target app + Postgres for E2E runs
📚 Companion: Commandant Report Portal
Commandant generates assessment reports as HTML + JSON sidecars. The Commandant Report Portal is the companion web application that turns that output stream into an operable dashboard — browse, filter, score-track over time, and trigger new assessments from a UI rather than a shell.
The two projects are deliberately separated: Commandant is a CLI framework that runs in CI and produces artifacts; the Portal is a Flask web app that consumes those artifacts. CI/CD pipelines push reports to the Portal via an authenticated API, and human users browse the history through the web UI. Neither depends on the other at runtime — Commandant can run without the Portal, and the Portal can serve any compatible report directory.
📊 Report Dashboard
- Sortable index of assessment runs
- Score badges and severity counts at a glance
- Project filtering for multi-app scale
📋 Report Detail
- Per-dimension score tiles
- Rendered HTML preview + source toggle
- Export to JSON, HTML, and Print-to-PDF
📈 Trend Visualization
- Per-project score charts over time (Chart.js)
- Spot regressions across runs
- Correlate score movements with commits
🔒 Role-Based Access
- Admin and Viewer roles enforced server-side
- Admin-only user CRUD + assessment trigger
- Self-service profile / password change for all users
🚀 CI Integration
- REST API for report upload with API-key auth
- One-step
curlfrom any CI provider - Admin UI to trigger Gitea Actions assessment runs
🛡 Security Posture
- PBKDF2 password hashing (Werkzeug), signed sessions
- CSRF protection on all forms (API upload exempt, API-key)
- Flask-Limiter rate limit on POST /login
- Account lockout: 15 minutes after 5 failed attempts
- Forced password change for seeded users on first login
- HSTS + secure cookies under
FORCE_HTTPS=true - Path-traversal protection, system-path blocklist
- File extension blocklist:
.db,.sqlite,.key,.pem,.env - MIME validation on uploads; sandboxed iframes for HTML preview
Portal Stack: Flask, SQLAlchemy, Flask-WTF (CSRF), Flask-Limiter (rate limiting), gunicorn (WSGI). Sixty tests cover authentication, user CRUD, role enforcement, profile management, rate limiting, account lockout, CSRF, upload API validation, and security enforcement.
🎯 Engineering Highlights
🧠 AI with Guardrails
- Deterministic layer before every AI call — predictable behavior, bounded cost
- Rubric-bound prompts produce structured, parseable output — no regex fishing on free-text
- Skip-AI mode means the framework is still useful without an Anthropic key
🔧 QA Engineering Discipline
- Eight quality dimensions are the same rubric a human reviewer would apply
- Coverage delta is measured, not asserted — if AI-generated tests don't expand coverage, the pipeline tells you
- Stack-aware analyzers with a clean base class so adding Rails, Django, Node is routine
⚙️ CI Integration
- Three self-contained Gitea Actions workflows: lint, E2E, AI review
- Each workflow is triggered by the commit lifecycle event that justifies its cost
- Reports are durable artifacts: HTML for humans, JSON sidecars for downstream tools
📦 Extensibility
- Adding a new application target: five file edits, zero framework changes
- Portal + Framework are deployed independently but integrate through a documented contract
- GitLab CI translation guide demonstrates pipeline portability
🎯 Project Impact
Treats AI as a test-automation primitive, not a demo. The framework is built for the operational reality of CI/CD: it has cost controls, deterministic fallback, structured outputs, and durable artifacts. It doesn't assume infinite API budget or ideal inputs.
Closes the PR-review feedback loop with evidence. Rather than opining about code, it produces rubric-based scores across eight quality dimensions and tracks the scores over time. Teams can see whether quality is trending up or down, not just whether the build is green.
Demonstrates AITaaS architecture end-to-end. The framework + portal combination shows how AI-driven testing should be productized: a CLI that runs in CI, a web UI that consumes the artifacts, RESTful integration between them, role-based access for who can trigger and view, and security hygiene appropriate for a shared tool.