For details on how to apply pprof, see Tuning GC with pprof

Background

Our service server was sending one HTTP request per event; events fired frequently and were handled asynchronously. As a result, latency grew roughly linearly with the number of requests.

Problems

  1. Excessive network requests: one per event
  2. Increased latency due to queuing
  3. Poor scalability: implementation differences between services led to repeated maintenance work

Architecture Improvement

Solutions

  1. Queue events locally and send them as a batch whenever either the buffer size or the flush interval threshold is exceeded.
  2. Reduce latency through batching.
  3. Build a common library that keeps service servers loosely coupled from the batch loader.

Design Decisions

We decided to build a shared module called Batch Processor to manage event queues.
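
As a first cut, the public surface might look like the sketch below. The type, field, and method names here are illustrative assumptions, not the actual library API:

import "time"

// Config controls when the processor flushes its queue.
type Config struct {
    BufferSize    int           // flush once this many events are buffered
    FlushInterval time.Duration // flush at least this often
}

// BatchProcessor queues events from a service server and sends them
// to the batch loader in groups.
type BatchProcessor[T any] struct {
    cfg    Config
    events chan T
    done   chan struct{}
}

// Add enqueues a single event; a background worker batches and sends it.
func (p *BatchProcessor[T]) Add(ev T) {
    p.events <- ev
}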

Requirements

  • Optimize for IO-bound tasks
  • Manage goroutine lifecycle cleanly

Option 1: Worker Pool (Sync IO)

Pros

  • Avoids deep copy overhead
  • Tunable performance via pool size

Cons

  • Potentially multiple HTTP requests per interval
  • Hard to tune optimal pool size for every service
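
For concreteness, a condensed sketch of the worker-pool variant; sendBatch is a hypothetical stand-in for the synchronous HTTP call, not a function from our codebase:

import "sync"

// startWorkers launches n workers that each pull batches off a shared
// channel and send them with a blocking HTTP call.
func startWorkers[T any](n int, batches <-chan []T, sendBatch func([]T) error) *sync.WaitGroup {
    var wg sync.WaitGroup
    for i := 0; i < n; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for batch := range batches {
                // Each worker blocks on IO here, so several HTTP
                // requests may be in flight within one interval.
                _ = sendBatch(batch)
            }
        }()
    }
    return &wg // caller closes batches, then waits
}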

Option 2: Single Worker + Async HTTP

Pros

  • Only one HTTP request per interval
  • Simpler integration without tuning

Cons

  • Minor CPU/memory overhead from deep copy
  • GC pressure may increase due to heap allocations
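
And a sketch of the single-worker variant, reusing the hypothetical types above: the worker owns the buffer exclusively, and every flush deep-copies the batch (via the deepCopy helper shown later) before handing it to a goroutine, so the HTTP call never blocks the queue:

import "time"

// run is the single worker loop; only this goroutine touches buf, so
// the buffer itself needs no locking.
func (p *BatchProcessor[T]) run(sendBatch func([]T) error) {
    buf := make([]T, 0, p.cfg.BufferSize)
    ticker := time.NewTicker(p.cfg.FlushInterval)
    defer ticker.Stop()

    flush := func() {
        if len(buf) == 0 {
            return
        }
        batch := deepCopy(buf)               // the sender gets its own copy...
        buf = buf[:0]                        // ...so the worker can reuse buf
        go func() { _ = sendBatch(batch) }() // async: one request per flush
    }

    for {
        select {
        case ev := <-p.events:
            buf = append(buf, ev)
            if len(buf) >= p.cfg.BufferSize {
                flush()
            }
        case <-ticker.C:
            flush()
        case <-p.done:
            flush() // send whatever is left before stopping
            return
        }
    }
}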

Given the trade-offs, we opted for Option 2, but needed to validate that deep copy overhead wouldn’t impact performance.

Profiling Goals

  • Measure throughput at 2000 RPS
  • Quantify memory impact of deepCopy()
  • Analyze GC overhead from buffer copies

The deepCopy helper in question:

// deepCopy returns a new slice with the same length and contents as
// src. Despite the name, elements are copied shallowly: if T contains
// pointers, both slices still reference the same underlying data.
func deepCopy[T any](src []T) []T {
    if src == nil {
        return nil
    }
    dst := make([]T, len(src))
    copy(dst, src)
    return dst
}
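
Before profiling the full service, a quick micro-benchmark gives a rough upper bound on the copy cost. A sketch, where the element type and slice length are arbitrary assumptions:

import "testing"

func BenchmarkDeepCopy(b *testing.B) {
    src := make([]int64, 1000) // ~8 kB copied per iteration
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        _ = deepCopy(src)
    }
}

Running it with go test -bench=DeepCopy -benchmem reports both ns/op and allocations per copy.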

Methodology

We compared three implementations:

  • 10 Workers + Sync IO
  • 100 Workers + Sync IO
  • 1 Worker + Async IO

We tracked:

  • Throughput, via application logs
  • Heap profiles, captured with:
curl {endpoint}/debug/pprof/heap?seconds=30 --output {output_path}
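
The saved profile can then be inspected with the standard pprof tooling, for example:

go tool pprof -top {output_path}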

Sample log output to parse:

2024-10-14T05:11:06Z INF count=1020
2024-10-14T05:11:07Z INF count=1000
2024-10-14T05:11:07Z INF stopping BatchProcessor...
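
A minimal parser for turning such logs into per-minute throughput might look like this (the timestamp and count=N format assumptions match the sample above):

package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

func main() {
    perMinute := map[string]int{} // "2024-10-14T05:11" -> events that minute
    sc := bufio.NewScanner(os.Stdin)
    for sc.Scan() {
        line := sc.Text()
        i := strings.Index(line, "count=")
        if i < 0 || len(line) < 16 {
            continue // skip lines without a batch count
        }
        var n int
        fmt.Sscanf(line[i:], "count=%d", &n)
        perMinute[line[:16]] += n // key on the minute prefix of the timestamp
    }
    for minute, total := range perMinute {
        fmt.Println(minute, total)
    }
}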

CPU Profiling
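
CPU profiles can be captured the same way through the net/http/pprof endpoint; the 30-second window below mirrors the heap command and is an assumption, not necessarily the duration we used:

curl {endpoint}/debug/pprof/profile?seconds=30 --output {output_path}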

100 Workers + Sync IO

  • ~85% of CPU time spent in sellock and acquireSudog, the runtime functions goroutines go through when they contend for and block on channels
  • This points to heavy contention on channel access

10 Workers + Sync IO

  • Lock contention reduced to ~66% of CPU time

1 Worker + Async IO

  • Deep copy overhead of roughly 10ms
  • ~50% of CPU time split between API calls and deep copy

Heap Profiling

Worker Pool

  • ~8.2MB total, ~7.7MB from HTTP
  • No deep copy impact

1 Worker + Async IO

  • ~12.2MB total, ~11.4MB from HTTP
  • Deep copy impact: 150kB (~1.22%), negligible

Results

Setup              Throughput/min   CPU Overhead   Memory Overhead
10 Workers         83,663           66%            0%
100 Workers        84,042           85%            0%
1 Worker + Async   119,720          50%            1.22%

Conclusion

  • Worker pools introduce significant concurrency overhead for this kind of workload
  • Increasing the worker count does not scale throughput linearly (100 workers barely outperformed 10)
  • Async execution outperforms worker pools for IO-bound tasks
  • Worker pools remain a good fit for CPU-bound tasks
  • When event ordering matters, prefer a worker pool even for IO-bound jobs