Skip to content

ADR-14: AI-Based Spec Document Generation Pipeline

Korean Version

DateAuthorRepos
2026-01-18@KubrickCodeworker, web, infra

Context

Problem Statement

The Specvital platform requires automated conversion of test file collections into human-readable specification documents. This involves two distinct cognitive tasks:

  1. Classification: Grouping tests by domain and feature (semantic understanding)
  2. Conversion: Transforming test names into behavior descriptions (language generation)

Requirements

RequirementDescription
ScaleHandle repositories with thousands of tests (up to 500+ per chunk)
CostHigh-volume processing requires cost-efficient model selection
ConsistencyCross-chunk domain assignment must remain coherent
ReliabilityProduction system requires fault tolerance and graceful degradation
LatencyGemini server-side timeout of 5 minutes constrains per-request scope

Constraints

ConstraintImpact
PostgreSQL BackendCaching integrated with existing River queue infrastructure
Content-Hash CachingDocuments indexed by (content_hash, language, model_id)
Worker ArchitectureMust integrate with existing worker service patterns
Quota SystemCache hits must not consume user quota (ADR-13)

Decision

Adopt a two-phase AI pipeline using Google Gemini models with sequential chunking and anchor propagation for large repository handling.

Pipeline Architecture

Phase 1: Classification (gemini-2.5-flash)
├── Input: Test files + domain hints
├── Chunking: Max 500 tests, 50K tokens per chunk
├── Output: Domain grouping, feature assignment, confidence scores
└── Anchor Propagation: Carry forward domain decisions across chunks

Phase 2: Conversion (gemini-2.5-flash-lite)
├── Input: Per-feature batch of tests
├── Concurrency: Max 5 parallel API calls
├── Output: Test name → behavior description
└── Fallback: Original test name with 0.0 confidence on failure

Configuration

ParameterValueRationale
Phase 1 Timeout270sBelow Gemini's 5-minute server limit
Phase 2 Total Timeout7 minutesAllows processing of large feature sets
Phase 2 Feature Timeout90sPer-feature isolation for partial success
Phase 2 Concurrency5Balance API rate limits vs throughput
Inter-chunk Delay5 secondsRespect rate limits between chunks
Max Tests Per Chunk500Reduced from 1000 to avoid 504 errors
Max Tokens Per Chunk50,000Keep response under 60s
Thinking ModeDisabledCost optimization for structured tasks

Reliability Patterns

MechanismPhase 1Phase 2
Circuit Breaker Threshold5 failures3 failures
Retry Attempts32
Backoff StrategyExponentialExponential
Rate LimiterGlobal sharedGlobal shared

Caching Strategy

  • Cache Key: (content_hash, language, model_id)
  • Cache hits return existing document without consuming quota
  • Hierarchical Schema: spec_documents → spec_domains → spec_features → spec_behaviors

Options Considered

A. LLM Provider Selection

OptionVerdict
Gemini 2.5 Flash (Selected)1M token context, $0.30/1M input cost-effective, Flash-Lite for Phase 2
Claude 3.5/Opus 4.5Superior reasoning but $5/1M (17x more expensive), 200K context limit
GPT-4o/GPT-5.xBroad ecosystem but $2-3/1M, 128K context limit

Selection Rationale:

  • Cost efficiency: Flash at $0.30/1M, Flash-Lite at $0.10/1M tokens
  • Context window: 1M token capacity handles large test files without complex chunking
  • Task fit: Classification and conversion are structured tasks that do not require advanced reasoning
  • Two-tier availability: Flash for classification quality, Flash-Lite for high-volume conversion

B. Pipeline Architecture

OptionVerdict
Two-phase (Classification → Conversion) (Selected)Separation of concerns, different cost profiles, independent failure domains
Single-pass PipelineSimpler but harder to debug, all-or-nothing failure
Multi-pass with Semantic AnalysisHighest quality but 3+ API calls, complex orchestration

Selection Rationale:

  • Task specialization: Classification and conversion have different optimal prompting strategies
  • Cost optimization: Phase 2 uses cheaper Flash-Lite model
  • Failure isolation: Phase 2 failures can fallback without losing Phase 1 classification
  • Independent tuning: Each phase can be optimized separately

C. Large Repository Handling

OptionVerdict
Sequential Chunking with Anchor Propagation (Selected)Predictable memory, consistent domain assignment via anchors
Parallel ChunkingMaximum throughput but domain inconsistency across chunks
Stream ProcessingMinimal memory but no cross-document context

Selection Rationale:

  • Consistency: Anchor domains propagate across chunks, preventing "feature drift"
  • Rate limit compliance: Inter-chunk delay (5s) naturally respects API limits
  • Debuggability: Clear chunk boundaries enable reproduction of issues
  • Memory predictability: Fixed chunk size prevents OOM

Implementation Details

Model Configuration

Deterministic Output Settings:

ParameterValueRationale
Temperature0.0Eliminate randomness for reproducible output
Seed42Fixed seed for consistent classification
MaxOutputTokens65,536Gemini maximum to prevent truncation
ResponseMIMETypeapplication/jsonStructured output enforcement
ThinkingBudget0Disable dynamic thinking overhead

Token Usage Tracking

Real-time token tracking per analysis for cost monitoring:

go
type TokenUsage struct {
    CandidatesTokens int32   // Output tokens
    Model            string  // Model identifier
    PromptTokens     int32   // Input tokens
    TotalTokens      int32   // Sum of input + output
}
  • Extract from GenerateContentResponse.UsageMetadata
  • Aggregate across Phase 1/2 calls per analysis_id
  • Structured log output: specview_token_usage

Rationale: Google AI Studio usage statistics are delayed ~1 day, making real-time per-repository cost tracking impossible without custom extraction.

Chunk Size Evolution

Progressive reduction to resolve 504 DEADLINE_EXCEEDED errors:

IterationTests/ChunkResult
Initial10,000JSON truncation
v23,000Still 504 errors on large repos
v31,000Improved but occasional timeout
Final500Stable 15-25s processing time

Prompt Engineering

Phase 1 (Classification):

  • Classification constraints with confidence scoring
  • Language-specific instructions for output format
  • Domain hint integration from analysis results

Phase 2 (Conversion):

  • "Specification Notation" style: completion states, not actions
  • Example: "User successfully authenticated" not "User should be able to authenticate"
  • Confidence scoring for downstream filtering

Error Handling Flow

Phase 1 Failure
├── Retry (up to 3 attempts with exponential backoff)
├── Circuit breaker trips after 5 consecutive failures
└── Job marked as failed, no partial results

Phase 2 Failure (per feature)
├── Retry (up to 2 attempts)
├── Fallback: Original test name with 0.0 confidence
└── Continue processing other features

Retryable Error Patterns:

go
retryablePatterns := []string{
    "rate limit", "quota exceeded", "too many requests",
    "service unavailable", "internal server error", "timeout",
    "connection reset", "connection refused", "temporary failure",
}

JSON Parse Error Handling: Gemini occasionally returns truncated JSON responses. JSON parse errors are wrapped as RetryableError to trigger automatic retry with exponential backoff.

Data Flow

┌─────────────────────────────────────────────────────────────────────┐
│                    SpecView Generation Pipeline                      │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌─────────────┐     ┌─────────────────────┐     ┌──────────────┐  │
│  │ Analysis    │────▶│  Phase 1: Classify  │────▶│ Phase 2:     │  │
│  │ Results     │     │  (gemini-2.5-flash) │     │ Convert      │  │
│  │ + Hints     │     │                     │     │ (flash-lite) │  │
│  └─────────────┘     └─────────────────────┘     └──────┬───────┘  │
│                                                          │          │
│                                                          ▼          │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                     PostgreSQL                               │   │
│  │  spec_documents                                              │   │
│  │    └── spec_domains                                          │   │
│  │          └── spec_features                                   │   │
│  │                └── spec_behaviors                            │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Consequences

Positive

AreaBenefit
Cost EfficiencyFlash-Lite at $0.10/1M tokens reduces Phase 2 cost by 60%+
ReliabilityCircuit breaker and fallback ensure degraded operation over complete failure
ScalabilityChunking handles repositories of any size with predictable resources
Cache IntegrationContent-hash caching prevents redundant processing and quota consumption
ObservabilityTwo-phase separation provides clear metrics boundaries
ConsistencyAnchor propagation maintains domain naming coherence

Negative

AreaTrade-off
LatencySequential chunking adds processing time proportional to repo size
Vendor Lock-inDeep Gemini integration requires migration effort if provider change needed
Deprecation RiskGemini 2.5 Flash scheduled for deprecation June 2026
Quality CeilingGemini classification accuracy below Claude for complex test hierarchies
Fallback QualityPhase 2 fallback (original test name) provides no semantic improvement
Thinking DisabledCost optimization trades potential quality improvement

Technical Implications

AspectImplication
SchemaHierarchical: spec_documents → spec_domains → spec_features → spec_behaviors
Cache Key(content_hash, language, model_id) enables model-version-aware caching
Timeout Design270s Phase 1 stays below Gemini's 5-minute server limit
Concurrency ModelGlobal shared rate limiter coordinates across phases
Error RecoveryConfidence scoring enables downstream filtering of low-quality conversions

References

Open-source test coverage insights