
Core ADR-16: Domain Hints Extraction System


| Date | Author | Repo |
| --- | --- | --- |
| 2026-01-18 | @KubrickCode | core |

Context

The Domain Classification Problem

The AI-based SpecView generation pipeline (ADR-14) requires domain classification of tests to group them into business domains (Authentication, Payment, UserManagement, etc.). Without context about what code is being tested, AI models cannot categorize tests meaningfully.

Challenge: Provide AI with semantic context without:

  • Sending entire source files (excessive token consumption)
  • Including noise that dilutes classification signal
  • Building separate extractors for each of 12+ supported languages

Requirements

| Requirement | Description |
| --- | --- |
| Signal Density | High ratio of meaningful domain indicators to tokens |
| Token Efficiency | Minimize AI input tokens while preserving quality |
| Cross-Language | Unified extraction approach across all languages |
| Noise Immunity | Filter universal and language-specific non-domain patterns |
| AST Accuracy | Distinguish code from comments and strings |

Constraints

| Constraint | Impact |
| --- | --- |
| Tree-sitter Dependency | Must integrate with existing AST infrastructure |
| Parallel Scanning | Extraction must not block worker pool performance |
| Memory Budget | Cannot load full ASTs for large repositories |

Decision

Adopt a dual-field extraction model (Imports + Calls) with aggressive noise filtering and 2-segment call normalization.

```go
type DomainHints struct {
    Imports []string  // Deduplicated import paths
    Calls   []string  // Normalized to 2 segments (a.b.c() → a.b)
}
```

Key Design Choices

  1. Imports: Direct indicators of external dependencies and their domains
  2. Calls: Reveal interaction patterns with domain entities
  3. 2-segment normalization: stripe.customers.create() → stripe.customers
  4. No Variables field: Removed after empirical testing showed no classification improvement
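As a minimal sketch of how the two fields might be populated, the following builds a `DomainHints` value from raw extractor output. The `dedupe` helper and the `NewDomainHints` constructor are hypothetical names introduced for illustration, and deduplicating `Calls` as well as `Imports` is an assumption (the ADR only states it for imports).

```go
package main

import "fmt"

// DomainHints mirrors the struct defined in this ADR.
type DomainHints struct {
	Imports []string // Deduplicated import paths
	Calls   []string // Normalized to 2 segments
}

// dedupe returns values with duplicates removed, preserving first-seen order.
// Hypothetical helper, not taken from the core repository.
func dedupe(values []string) []string {
	seen := make(map[string]struct{}, len(values))
	out := make([]string, 0, len(values))
	for _, v := range values {
		if _, ok := seen[v]; ok {
			continue
		}
		seen[v] = struct{}{}
		out = append(out, v)
	}
	return out
}

// NewDomainHints is a hypothetical constructor. Deduplicating calls is an
// assumption: normalization collapses many calls onto the same 2-segment key.
func NewDomainHints(imports, calls []string) DomainHints {
	return DomainHints{Imports: dedupe(imports), Calls: dedupe(calls)}
}

func main() {
	h := NewDomainHints(
		[]string{"stripe", "stripe", "jsonwebtoken"},
		[]string{"stripe.customers", "stripe.customers", "jwt.verify"},
	)
	fmt.Println(h.Imports) // [stripe jsonwebtoken]
	fmt.Println(h.Calls)   // [stripe.customers jwt.verify]
}
```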

Token Impact

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Token Volume | 600K | 90K | 85% ↓ |
| Classification | Baseline | Equivalent | Maintained |

Architecture

┌───────────────────────────────────────────────────────────────────────┐
│                        Domain Hints Extraction                        │
├───────────────────────────────────────────────────────────────────────┤
│                                                                       │
│  Test File → Tree-sitter Parse → Language Extractor                   │
│                                       │                               │
│              ┌────────────────────────┼────────────────────────┐      │
│              │                        │                        │      │
│              ▼                        ▼                        ▼      │
│         Import Extraction       Call Extraction          Noise Filter │
│         - ES6 import            - Method calls           - Universal  │
│         - CommonJS require      - 2-segment norm         - Per-lang   │
│         - Go import             - Chain flatten                       │
│         - Python import                                               │
│              │                        │                               │
│              └────────────┬───────────┘                               │
│                           ▼                                           │
│                    DomainHints{Imports, Calls}                        │
│                                                                       │
└───────────────────────────────────────────────────────────────────────┘

Language-Specific Extractors

| Language | Import Patterns | Call Extraction |
| --- | --- | --- |
| Go | `import "pkg"`, `import ("pkg")` | Function calls, methods |
| JavaScript/TS | `import x from`, `require()` | Method chains |
| Python | `import x`, `from x import y` | Function/method calls |
| Java/Kotlin | `import package.Class` | Static/instance methods |
| C# | `using Namespace;` | Static/instance methods |
| Ruby | `require`, `require_relative` | Method calls |
| PHP | `use Namespace\Class` | Function/method calls |
| Rust | `use crate::module` | Function/method calls |
| Swift | `import Module` | Function/method calls |
| C++ | `#include <header>` | Function/method calls |

Call Normalization

Input:  stripe.customers.subscriptions.create()
Output: stripe.customers

Input:  authService.validateToken()
Output: authService.validateToken

Input:  db.query()
Output: db.query

Rationale: 2 segments preserve domain entity relationships while preventing token explosion.
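The rule can be sketched as a small string function. This is an illustrative approximation under the assumption that the call arrives as flat text; the actual extractor presumably works on tree-sitter nodes, and the name `normalizeCall` is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeCall drops the argument list and keeps at most the first two
// dot-separated segments of a call expression.
func normalizeCall(call string) string {
	// Cut everything from the first "(" onward: "db.query(sql)" -> "db.query".
	if i := strings.Index(call, "("); i >= 0 {
		call = call[:i]
	}
	segments := strings.Split(call, ".")
	if len(segments) > 2 {
		segments = segments[:2]
	}
	return strings.Join(segments, ".")
}

func main() {
	for _, call := range []string{
		"stripe.customers.subscriptions.create()",
		"authService.validateToken()",
		"db.query()",
	} {
		fmt.Printf("%s -> %s\n", call, normalizeCall(call))
	}
	// stripe.customers.subscriptions.create() -> stripe.customers
	// authService.validateToken() -> authService.validateToken
	// db.query() -> db.query
}
```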

Noise Filtering

Universal Filters:

| Pattern | Example | Reason |
| --- | --- | --- |
| Empty strings | `""` | No signal |
| Leading brackets | `[item` | Spread array artifacts |
| URLs | `http://...` | Test fixtures |
| Inline comments | `// comment` | Parser leakage |
| Short identifiers | `a`, `fn`, `x` | Generic, no domain signal |

Language-Specific Filters:

| Language | Filtered Patterns |
| --- | --- |
| Go | `fmt`, `os`, `io`, `context`, `make`, `append` |
| Rust | `Ok`, `Err`, `Some`, `None`, `unwrap` |
| Java | `toString`, `equals`, `hashCode`, `getClass` |
| Kotlin | `listOf`, `mapOf`, `emptyList`, `setOf` |
| C# | `System.*`, `nameof` |
| JavaScript | `console.*`, `JSON.*` |
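The two filter layers above could be combined roughly as follows. This is a sketch under stated assumptions: the minimum identifier length of 3 is a guess based on the `a`, `fn`, `x` examples, and the matching rules (a `.*` pattern matches a name and everything under it; a bare pattern matches any dot-separated segment) are my reading of the tables. All function names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// languageNoise lists the per-language patterns from the table above.
var languageNoise = map[string][]string{
	"go":         {"fmt", "os", "io", "context", "make", "append"},
	"rust":       {"Ok", "Err", "Some", "None", "unwrap"},
	"java":       {"toString", "equals", "hashCode", "getClass"},
	"kotlin":     {"listOf", "mapOf", "emptyList", "setOf"},
	"csharp":     {"System.*", "nameof"},
	"javascript": {"console.*", "JSON.*"},
}

// minIdentifierLen is an assumed threshold; the ADR only gives examples.
const minIdentifierLen = 3

// isUniversalNoise applies the language-independent filters.
func isUniversalNoise(hint string) bool {
	switch {
	case hint == "":
		return true // empty string
	case strings.HasPrefix(hint, "["):
		return true // spread array artifact
	case strings.HasPrefix(hint, "http://"), strings.HasPrefix(hint, "https://"):
		return true // URL from a test fixture
	case strings.HasPrefix(hint, "//"):
		return true // leaked inline comment
	case len(hint) < minIdentifierLen:
		return true // short identifier
	}
	return false
}

// isLanguageNoise applies the per-language patterns (assumed matching rules).
func isLanguageNoise(lang, hint string) bool {
	for _, pattern := range languageNoise[lang] {
		if strings.HasSuffix(pattern, ".*") {
			prefix := strings.TrimSuffix(pattern, ".*")
			if hint == prefix || strings.HasPrefix(hint, prefix+".") {
				return true
			}
			continue
		}
		for _, segment := range strings.Split(hint, ".") {
			if segment == pattern {
				return true
			}
		}
	}
	return false
}

// filterHints keeps only the hints that survive both filter layers.
func filterHints(lang string, hints []string) []string {
	kept := make([]string, 0, len(hints))
	for _, h := range hints {
		if isUniversalNoise(h) || isLanguageNoise(lang, h) {
			continue
		}
		kept = append(kept, h)
	}
	return kept
}

func main() {
	calls := []string{"fmt.Println", "stripe.customers", "", "[item",
		"http://localhost:8080", "x", "append"}
	fmt.Println(filterHints("go", calls)) // [stripe.customers]
}
```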

Options Considered

Option A: Dual-Field Extraction (Selected)

Extract import statements and function calls only, with aggressive filtering.

| Aspect | Assessment |
| --- | --- |
| Signal Density | Highest - imports and calls are strongest indicators |
| Token Cost | 85% reduction vs full extraction |
| Implementation | 12 language extractors with shared filter |
| Trade-off | Some context loss from normalization |

Option B: Full AST Extraction

Extract all identifiers, variables, string literals, and comments.

| Aspect | Assessment |
| --- | --- |
| Context | Maximum information captured |
| Token Cost | 6-7x more tokens than Option A |
| Signal Quality | Degraded by noise (loop variables, generics) |
| Rejection | Variables field test showed no classification gain |

Option C: Regex-Based Pattern Matching

Use regular expressions to extract import/require statements.

| Aspect | Assessment |
| --- | --- |
| Simplicity | No tree-sitter dependency |
| Accuracy | False positives from comments/strings |
| Maintenance | Regex per language per pattern |
| Rejection | Contradicts unified AST architecture (ADR-03) |
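The accuracy concern with Option C can be reproduced with a deliberately naive pattern. The regular expression and the sample JavaScript source below are illustrative assumptions, not patterns that were evaluated for this ADR.

```go
package main

import (
	"fmt"
	"regexp"
)

// requirePattern is a naive CommonJS matcher of the kind Option C would need.
var requirePattern = regexp.MustCompile(`require\(\s*['"]([^'"]+)['"]\s*\)`)

// regexImports returns every module name the pattern finds in src,
// regardless of whether the match sits in code, a comment, or a string.
func regexImports(src string) []string {
	var found []string
	for _, m := range requirePattern.FindAllStringSubmatch(src, -1) {
		found = append(found, m[1])
	}
	return found
}

func main() {
	src := `// const legacy = require('old-billing-sdk')
const hint = "call require('not-a-module') first"
const stripe = require('stripe')
`
	// Only the last line is a real import, yet all three are reported.
	fmt.Println(regexImports(src)) // [old-billing-sdk not-a-module stripe]
}
```

A regex has no notion of comment or string boundaries, so two of the three results here are false positives that an AST-based extractor would skip.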

Consequences

Positive

| Area | Benefit |
| --- | --- |
| Token Efficiency | 85% reduction enables cost-effective AI at scale |
| Classification | High signal-to-noise ratio improves AI accuracy |
| Cross-Language | Unified pattern across 12 languages |
| Integration | Clean interface with tree-sitter infrastructure |
| Evolution | Variables removal demonstrates data-driven iteration |

Negative

| Area | Trade-off |
| --- | --- |
| Context Loss | String literals and comments not captured |
| Normalization | Deep call chains lose specificity |
| Maintenance | 12 extractors need grammar version updates |
| Tree-sitter | No lightweight extraction fallback |

Technical Implications

| Aspect | Implication |
| --- | --- |
| API Surface | DomainHints is public API; changes affect Worker |
| Testing | Golden snapshots per language for extraction behavior |
| Performance | ~1-2ms per file; negligible in parallel context |
| Future | Additional fields can be added if AI requires them |
