The Legibility Scorecard analyzes a GitHub repository and scores how easy it is for coding agents to navigate, understand, and contribute to the codebase. It is based on OpenAI’s Agentic Legibility demo and adapted for Twill.

How it works

  1. You submit a repo — paste a public GitHub URL, or select a private repo connected to your workspace.
  2. A hosted shell runs the audit — Twill sends the request to the OpenAI Responses API, which provisions a sandboxed Linux container. The container clones your repo and runs a deterministic Python scoring script.
  3. The scorer analyzes file patterns — The script (score_repo.py) walks the repo tree and checks for specific files, patterns, and conventions across seven metrics. No dependencies are installed, no code is executed — it is purely static analysis.
  4. Results stream back live — Shell commands, stdout/stderr, and the final scorecard stream to the UI in real time via NDJSON.
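The NDJSON stream in step 4 is a sequence of newline-delimited JSON objects, one event per line. A minimal sketch of consuming such a stream (the event shapes and field names here are illustrative assumptions, not Twill's actual schema):

```python
import json

# Hypothetical event stream: each NDJSON line is one standalone JSON
# object (a shell command, an output chunk, or the final scorecard).
raw_stream = (
    '{"type": "command", "text": "python score_repo.py"}\n'
    '{"type": "stdout", "text": "Scoring 7 metrics..."}\n'
    '{"type": "result", "overall": 18, "grade": "A"}\n'
)

# Parse line by line; skip any blank lines between events.
events = [json.loads(line) for line in raw_stream.splitlines() if line.strip()]

# The final scorecard arrives as one more event in the same stream.
final = next(e for e in events if e["type"] == "result")
```

Because each line is a complete JSON document, the UI can render commands and output incrementally without waiting for the full response.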

Seven metrics

Each metric is scored 0–3 based on the presence and quality of specific signals:
  • Bootstrap self-sufficiency — Can an agent set up the project from a cold clone? Looks for setup scripts, dependency lockfiles, Docker configs, and Makefiles.
  • Task entrypoints — Are there clear starting points for work? Checks for issue templates, TODO files, CONTRIBUTING guides, and task runners.
  • Validation harness — Can an agent verify its own changes? Looks for test suites, CI configs, and test scripts.
  • Lint & format gates — Are style checks automated? Checks for linter configs, pre-commit hooks, and format scripts.
  • Agent repo map — Is there explicit guidance for AI agents? Looks for AGENTS.md, CLAUDE.md, Cursor rules, and similar files.
  • Structured docs — Is the project well-documented? Checks for READMEs, architecture docs, API docs, and changelogs.
  • Decision records — Are past decisions recorded? Looks for ADRs, RFCs, and design documents.
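Since the scorer is purely static, each metric check reduces to walking the repo tree and matching file names. A sketch of one such check, using the "Agent repo map" metric as an example (the signal file names come from the list above, but the signals-to-score mapping is an assumption, not taken from score_repo.py):

```python
from pathlib import Path

# Signal files for the "Agent repo map" metric. The threshold logic
# below (cap at 3) is a guess at how a 0-3 score could be derived.
AGENT_MAP_SIGNALS = {"AGENTS.md", "CLAUDE.md", ".cursorrules"}

def score_agent_repo_map(repo_root: str) -> int:
    root = Path(repo_root)
    # Static analysis only: look at file names, never execute repo code.
    hits = {p.name for p in root.rglob("*")
            if p.is_file() and p.name in AGENT_MAP_SIGNALS}
    # 0 signals -> 0, 1 -> 1, 2 -> 2, 3 or more -> 3.
    return min(len(hits), 3)
```

Each of the seven metrics would follow the same pattern with its own signal set.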

Scoring

  • Each metric produces a score from 0 (no signals) to 3 (strong signals).
  • The overall score is the sum across all seven metrics (max 21).
  • A letter grade is assigned based on the percentage: A (85%+), B (70%+), C (50%+), D (below 50%).
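The grading logic above is simple enough to sketch directly; the band boundaries are exactly the percentages stated in the bullet list:

```python
def letter_grade(total: int, max_total: int = 21) -> str:
    # Grade bands from the rubric: A at 85%+, B at 70%+, C at 50%+, else D.
    pct = 100 * total / max_total
    if pct >= 85:
        return "A"
    if pct >= 70:
        return "B"
    if pct >= 50:
        return "C"
    return "D"
```

For example, a repo scoring 15 of 21 (about 71%) lands a B, while 11 of 21 (about 52%) is a C.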

Quick wins

The scorer identifies low-effort changes that would most improve your score — for example, adding a CLAUDE.md file or a setup script.
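One plausible way to surface such suggestions is to map each low-scoring metric to a fix. A minimal sketch (the metric keys and suggestion text are illustrative assumptions, not score_repo.py's actual output):

```python
# Hypothetical mapping from metric to a suggested low-effort fix.
SUGGESTED_FIXES = {
    "agent_repo_map": "Add an AGENTS.md or CLAUDE.md describing the repo layout.",
    "bootstrap": "Add a one-command setup script for a cold clone.",
    "decision_records": "Start a docs/adr/ directory for decision records.",
}

def quick_wins(metric_scores: dict[str, int], threshold: int = 2) -> list[str]:
    # Suggest a fix for every metric scoring below the threshold.
    return [fix for metric, fix in SUGGESTED_FIXES.items()
            if metric_scores.get(metric, 0) < threshold]
```

Ranking suggestions by how many points they would recover is a natural refinement.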

Public vs. private repos

  • Public repos can be analyzed by anyone with a Twill account — just paste the GitHub URL.
  • Private repos require connecting your GitHub App installation to your Twill workspace. Twill uses a short-lived token scoped to the repos you grant access to.

Attribution

The scoring rubric and methodology are based on OpenAI’s Practices for Governing Agentic AI Systems and their agentic legibility skill demo. The scoring script runs inside OpenAI’s hosted shell environment via the Responses API.