For how to apply pprof, refer to Tuning GC with pprof.
Background
Our service servers issued one HTTP request per event, triggered frequently and handled asynchronously. However, latency grew roughly linearly with the number of requests we sent.
Problems
- Excessive network requests on every event
- Increased latency due to queuing
- Lack of scalability — implementation differences between services caused repeated maintenance issues
Architecture Improvement
Solutions
- Queue events locally and send batch requests when either the buffer-size threshold or the flush interval is exceeded.
- Reduce latency via batching.
- Build a common library with loose coupling between service servers and the batch loader.
Design Decisions
We decided to build a shared module called Batch Processor to manage event queues.
Requirements
- Optimize for IO-bound tasks
- Manage goroutine lifecycle cleanly
Option 1: Worker Pool (Sync IO)
Pros
- Avoids deep copy overhead
- Tunable performance via pool size
Cons
- Potentially multiple HTTP requests per interval
- Hard to tune optimal pool size for every service
Option 2: Single Worker + Async HTTP
Pros
- Only one HTTP request per interval
- Simpler integration without tuning
Cons
- Minor CPU/memory overhead from deep copy
- GC pressure may increase due to heap allocations
Given the trade-offs, we opted for Option 2, but needed to validate that deep copy overhead wouldn’t impact performance.
Profiling Goals
- Measure throughput at 2000 RPS
- Quantify memory impact of deepCopy()
- Analyze GC overhead from buffer copies
// deepCopy returns a new slice backed by its own array, so the
// caller can keep mutating src while the copy is being sent.
func deepCopy[T any](src []T) []T {
	if src == nil {
		return nil
	}
	dst := make([]T, len(src))
	copy(dst, src)
	return dst
}
Methodology
Compare 3 implementations:
- 10 Workers + Sync IO
- 100 Workers + Sync IO
- 1 Worker + Async IO
Track:
- Throughput via logs
- Heap profile with:
curl {endpoint}/debug/pprof/heap?seconds=30 --output {output_path}
Log parsing example:
2024-10-14T05:11:06Z INF count=1020
2024-10-14T05:11:07Z INF count=1000
2024-10-14T05:11:07Z INF stopping BatchProcessor...
CPU Profiling
100 Workers + Sync IO
- 85% of time spent in sellock and acquireSudog (runtime channel locking)
- High contention on channel access
10 Workers + Sync IO
- Lock contention reduced to 66%
1 Worker + Async IO
- Deep copy overhead ~10ms
- 50% time split between API calls and deep copy
Heap Profiling
Worker Pool
- ~8.2MB total, ~7.7MB from HTTP
- No deep copy impact
1 Worker + Async IO
- ~12.2MB total, ~11.4MB from HTTP
- Deep copy impact: 150kB (~1.22%) — negligible
Results
| Setup | Throughput/min | CPU Overhead | Memory Overhead |
|---|---|---|---|
| 10 Workers | 83,663 | 66% | 0% |
| 100 Workers | 84,042 | 85% | 0% |
| 1 Worker + Async | 119,720 | 50% | 1.22% |
Conclusion
- Worker pools introduce significant concurrency overhead
- Increasing worker count doesn’t scale linearly
- Async execution outperforms worker pools for IO-bound tasks
- Worker pools remain ideal for CPU-bound tasks
- When order matters, prefer worker pool even for IO-bound jobs