ADR-05: Worker-Scheduler Process Separation
| Date | Author | Repos |
|---|---|---|
| 2024-12-18 | @KubrickCode | worker |
Status
⚠️ Partially Superseded by ADR-22: Scheduler Removal and Railway Cron Migration
The Scheduler service described in this ADR has been removed. However, the binary separation pattern remains valid and is now applied to all worker binaries (analyzer, spec-generator, retention-cleanup). See ADR-22 for details.
Context
Asymmetric Scaling Requirements
Background job processing systems typically consist of two distinct components with fundamentally different scaling characteristics:
Scheduler:
- Triggers periodic jobs (cron-based task enqueuing)
- Must run as a single instance to prevent duplicate executions
- Lightweight resource footprint (CPU <5%, memory <256 MB)
- Rarely changes (cron expression updates, new job types)
Worker:
- Processes queued tasks (analysis, file operations, API calls)
- Scales horizontally based on queue depth
- Heavy resource consumption (CPU-intensive, memory for large payloads)
- Frequent updates (business logic changes, algorithm improvements)
Problems with Combined Process
Running scheduler and worker in a single process creates several issues:
| Issue | Impact |
|---|---|
| Resource Waste | Scheduler code runs in all worker instances (unused) |
| Scaling Inefficiency | Can't scale workers without duplicating schedulers |
| Failure Coupling | Worker OOM crashes the scheduler |
| Deployment Lock | Scheduler changes require full redeployment |
| Distributed Lock Overhead | Every instance attempts lock acquisition, but only one succeeds |
Different Dependency Requirements
Each process type has distinct dependency needs:
- Worker: Requires encryption keys (OAuth token decryption for private repositories)
- Scheduler: Requires distributed lock (single-instance guarantee), no encryption needed
Combining these creates unnecessary security exposure and configuration complexity.
Decision
Separate Worker and Scheduler into independent processes with dedicated entry points and DI containers.
Architecture
┌──────────────┐ ┌───────────┐ ┌──────────────┐
│ Scheduler │─────>│PostgreSQL │<─────│ Workers │
│ (1 instance) │ │River Queue│ │ (0-N scaled) │
└──────────────┘ └───────────┘ └──────────────┘
│ │ │
└────────────────────┴────────────────────┘
│
┌──────────────┐
│ PostgreSQL │
│ (Data Store) │
└──────────────┘
Entry Point Separation
cmd/
├── worker/main.go # Queue task processing
└── scheduler/main.go # Periodic job scheduling
Each entry point:
- Validates only its required configuration
- Initializes only its required dependencies
- Has dedicated graceful shutdown handling
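A minimal sketch of what one such entry point might look like, in the spirit of cmd/scheduler/main.go; the container type and its methods are hypothetical stand-ins for the real DI wiring:

```go
// Illustrative sketch of cmd/scheduler/main.go. The container type and its
// methods are hypothetical stand-ins for the real DI container.
package main

import (
	"context"
	"log"
	"os"
	"os/signal"
	"syscall"
)

// schedulerContainer is a placeholder for the scheduler's DI container.
type schedulerContainer struct{ dbURL string }

func newSchedulerContainer(dbURL string) (*schedulerContainer, error) {
	// Real code would open the database pool and set up the distributed lock here.
	return &schedulerContainer{dbURL: dbURL}, nil
}

// Run blocks until the context is cancelled; real code would run the cron loop.
func (c *schedulerContainer) Run(ctx context.Context) error {
	<-ctx.Done()
	return nil
}

func (c *schedulerContainer) Close() {}

func main() {
	// Validate only this process's required configuration and fail fast.
	dbURL := os.Getenv("DATABASE_URL")
	if dbURL == "" {
		log.Fatal("DATABASE_URL is required")
	}

	// Dedicated graceful-shutdown handling: cancel the context on SIGINT/SIGTERM.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	c, err := newSchedulerContainer(dbURL)
	if err != nil {
		log.Fatalf("init: %v", err)
	}
	defer c.Close()

	if err := c.Run(ctx); err != nil {
		log.Fatalf("scheduler stopped: %v", err)
	}
}
```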
DI Container Separation
WorkerContainer:
├── Encryption adapter (OAuth token decryption)
├── Analysis handler (queue task processor)
├── Queue client (task consumption)
└── Shared: Database pool, PostgreSQL connection
SchedulerContainer:
├── Distributed lock (single-instance guarantee)
├── Scheduler handler (periodic job executor)
├── Queue client (task enqueuing)
└── Shared: Database pool, PostgreSQL connection
Key Principle: The worker container never initializes the lock; the scheduler container never initializes encryption.
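A sketch of how the two containers might be declared; the field and interface names are invented for illustration, but the structural point holds: neither container can even reference the other's unique dependency.

```go
// Hypothetical container declarations; field and interface types are
// simplified stand-ins for the real adapters.
package container

import "github.com/jackc/pgx/v5/pgxpool"

// TokenEncryptor is a placeholder for the OAuth token decryption adapter.
type TokenEncryptor interface {
	Decrypt(ciphertext []byte) ([]byte, error)
}

// DistributedLock is a placeholder for the single-instance lock adapter.
type DistributedLock interface {
	TryAcquire() (bool, error)
}

// QueueClient is a placeholder for the River queue client.
type QueueClient interface{}

// WorkerContainer holds only what queue task processing needs.
type WorkerContainer struct {
	Pool      *pgxpool.Pool  // shared: database pool
	Encryptor TokenEncryptor // unique to worker: token decryption
	Queue     QueueClient    // task consumption
}

// SchedulerContainer holds only what periodic scheduling needs.
type SchedulerContainer struct {
	Pool  *pgxpool.Pool   // shared: database pool
	Lock  DistributedLock // unique to scheduler: single-instance guarantee
	Queue QueueClient     // task enqueuing
}
```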
Options Considered
Option A: Process Separation (Selected)
Description:
Separate binaries with dedicated entry points and DI containers.
Pros:
- Independent scaling (workers: 0-N, scheduler: exactly 1)
- Failure isolation (worker crash doesn't affect scheduling)
- Resource optimization (scheduler: minimal, workers: heavy)
- Clear security boundaries (encryption key only in workers)
- Build-time optimization (smaller scheduler binary)
Cons:
- Two deployment pipelines to maintain
- Configuration synchronization required
- More complex monitoring setup
Option B: Single Binary with Runtime Mode
Description:
Single binary that switches behavior based on environment variable or flag.
./worker --mode=worker
./worker --mode=scheduler
Pros:
- Single build artifact
- Shared codebase (less duplication)
- Simpler CI/CD pipeline
Cons:
- Binary includes unused code (scheduler loads worker deps in memory)
- Runtime misconfiguration risk (wrong mode deployed)
- Unclear from code inspection which role the service performs
- Still requires separate deployment configurations
Option C: Combined Process
Description:
Single process runs both scheduler and worker in different goroutines.
Pros:
- Simplest deployment (one service)
- No inter-process communication overhead
- Single configuration file
Cons:
- Cannot scale components independently
- Resource waste (scheduler in every worker instance)
- Failure coupling (worker panic kills scheduler)
- Must provision for max(scheduler, worker) resources
Implementation Principles
Configuration Validation
Each process validates only its requirements:
Worker startup:
├── Check DATABASE_URL (required)
├── Check ENCRYPTION_KEY (required) ← Unique to worker
└── Fail fast if missing
Scheduler startup:
├── Check DATABASE_URL (required)
├── Initialize distributed lock ← Unique to scheduler
└── Fail fast if connection fails
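For example, the worker side of this fail-fast check might look like the following sketch; the config struct and helper name are assumptions for illustration:

```go
// Hypothetical worker config loader illustrating fail-fast validation.
package main

import (
	"log"
	"os"
)

type workerConfig struct {
	DatabaseURL   string
	EncryptionKey string // required by the worker only
}

func mustLoadWorkerConfig() workerConfig {
	cfg := workerConfig{
		DatabaseURL:   os.Getenv("DATABASE_URL"),
		EncryptionKey: os.Getenv("ENCRYPTION_KEY"),
	}
	if cfg.DatabaseURL == "" {
		log.Fatal("DATABASE_URL is required")
	}
	// The scheduler binary never performs this check: it has no encryption dependency.
	if cfg.EncryptionKey == "" {
		log.Fatal("ENCRYPTION_KEY is required")
	}
	return cfg
}

func main() {
	cfg := mustLoadWorkerConfig()
	log.Printf("worker configured (db url set: %v)", cfg.DatabaseURL != "")
}
```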
Distributed Lock Strategy
The scheduler uses a PostgreSQL-based distributed lock to ensure single-instance execution:
Instance A: Acquires lock → Executes scheduled jobs
Instance B: Lock acquisition fails → Remains standby
Instance C: Lock acquisition fails → Remains standby
Benefits:
- High availability during blue-green deployments
- Automatic failover if active instance crashes
- Lock heartbeat extends TTL for long-running jobs
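One common way to implement such a lock is a PostgreSQL session-level advisory lock. The sketch below assumes pgx and an arbitrary lock key; it is not the project's actual implementation, and it shows acquisition only, not the heartbeat/TTL extension mentioned above:

```go
// Sketch of single-instance election with a PostgreSQL advisory lock.
// The lock key, connection string, and pgx usage are assumptions, not the
// project's actual implementation.
package main

import (
	"context"
	"log"

	"github.com/jackc/pgx/v5"
)

const schedulerLockKey = 42 // arbitrary application-defined lock ID

func main() {
	ctx := context.Background()

	// Use a dedicated connection: a session-level advisory lock is held for
	// as long as this connection stays open.
	conn, err := pgx.Connect(ctx, "postgres://localhost:5432/app")
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer conn.Close(ctx)

	var acquired bool
	err = conn.QueryRow(ctx, "SELECT pg_try_advisory_lock($1)", schedulerLockKey).Scan(&acquired)
	if err != nil {
		log.Fatalf("lock query: %v", err)
	}

	if !acquired {
		// Another instance holds the lock: remain on standby (or retry with backoff).
		log.Println("standby: another scheduler instance is active")
		return
	}

	log.Println("active: running scheduled jobs")
	// ... run the cron loop; the lock is released when the connection closes.
}
```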
Queue-Based Communication
Scheduler and workers communicate exclusively through the message queue:
Scheduler ──[Enqueue Task]──> River Queue (PostgreSQL) ──[Dequeue Task]──> Worker
Decoupling Benefits:
- Scheduler doesn't wait for worker completion
- Workers process at their own pace
- Natural backpressure through queue depth
- No direct scheduler-worker network communication
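Sketched in Go against River's documented API, with a job type invented for illustration, the contract looks like this:

```go
// Sketch of the enqueue/dequeue contract using River's API; the job type and
// its fields are invented for illustration.
package jobs

import (
	"context"
	"log"

	"github.com/jackc/pgx/v5"
	"github.com/riverqueue/river"
)

// AnalysisArgs is a hypothetical task payload shared by scheduler and worker.
type AnalysisArgs struct {
	RepoID int64 `json:"repo_id"`
}

func (AnalysisArgs) Kind() string { return "analysis" }

// Scheduler side: enqueue and return immediately; no waiting on the worker.
func Enqueue(ctx context.Context, client *river.Client[pgx.Tx], repoID int64) error {
	_, err := client.Insert(ctx, AnalysisArgs{RepoID: repoID}, nil)
	return err
}

// Worker side: the task is dequeued and processed at the worker's own pace.
type AnalysisWorker struct {
	river.WorkerDefaults[AnalysisArgs]
}

func (w *AnalysisWorker) Work(ctx context.Context, job *river.Job[AnalysisArgs]) error {
	log.Printf("analyzing repo %d", job.Args.RepoID)
	return nil
}
```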
Graceful Shutdown Handling
Each process has tailored shutdown behavior:
Worker Shutdown:
- Stop accepting new tasks from queue
- Wait for in-flight tasks (with configurable timeout)
- Close database/PostgreSQL connections
- Exit
Scheduler Shutdown:
- Stop cron scheduler (prevent new job triggers)
- Wait for current job completion (with timeout)
- Release distributed lock
- Close database/PostgreSQL connections
- Exit
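A sketch of the worker's shutdown sequence under these rules; the drain helper and the 30-second timeout are hypothetical placeholders:

```go
// Hypothetical worker shutdown sequence; the drain helper and the 30-second
// timeout are illustrative placeholders.
package main

import (
	"context"
	"log"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	// ... start queue consumption here ...

	<-ctx.Done() // SIGTERM/SIGINT received
	log.Println("shutdown: no longer accepting new tasks")

	// Bound the drain phase with a configurable timeout.
	drainCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	if err := drainInFlightTasks(drainCtx); err != nil {
		log.Printf("drain incomplete: %v", err)
	}

	closeConnections() // database pool, queue client
}

// drainInFlightTasks waits for running jobs to finish or the timeout to expire.
func drainInFlightTasks(ctx context.Context) error {
	done := make(chan struct{})
	go func() {
		// Placeholder: real code would wait on the queue client's stop signal.
		time.Sleep(100 * time.Millisecond)
		close(done)
	}()
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-done:
		return nil
	}
}

func closeConnections() { log.Println("connections closed") }
```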
Consequences
Positive
Independent Scaling:
- Scale workers based on queue depth without touching scheduler
- Scheduler stays minimal (single instance, low resources)
- PaaS auto-scaling applies only to workers
Cost Optimization:
- Scheduler: Fixed minimal resources (~0.25 vCPU, 256 MB)
- Workers: Scale 0-N based on demand
- Significant savings vs. running N combined instances
Failure Isolation:
- Worker memory exhaustion doesn't affect scheduling
- Scheduler issues don't prevent worker task processing
- Partial degradation instead of total outage
Deployment Independence:
- Update worker logic without scheduler downtime
- Change cron schedules without worker redeployment
- Different release cadences per component
Security Boundaries:
- Encryption keys confined to worker processes
- Scheduler operates with reduced privileges
- Clear audit trail per process type
Negative
Operational Complexity:
- Two services to monitor, deploy, and maintain
- Multiple CI/CD pipelines
- Log aggregation across services
Configuration Management:
- Shared task schemas require coordination
- Environment variable synchronization
- Queue naming consistency
Debugging Overhead:
- Cross-service request tracing
- Distributed log correlation
- Multiple dashboards to monitor
Technical Implications
| Aspect | Implication |
|---|---|
| Infrastructure | Separate PaaS services, shared PostgreSQL |
| Deployment | Independent release cycles, coordinated for contracts |
| Scaling | Workers: auto-scale, Scheduler: fixed single instance |
| Monitoring | Per-service metrics, unified queue depth monitoring |
| Blue-Green Deployments | Workers: overlapping, Scheduler: lock-based handoff |
References
- ADR-02: Clean Architecture Layers (Container separation)
- ADR-03: Graceful Shutdown (Lifecycle management)
- ADR-03: API and Worker Service Separation (Cross-cutting)
- ADR-04: Queue-Based Asynchronous Processing
