Applying pprof
import "github.com/gin-contrib/pprof"
// initialize server
var (
router *gin.Engine
)
{
router = gin.New()
// register pprof routes
pprof.Register(router, APIPrefix+"/debug/pprof")
}
Script
echo "User API v1.2.3"
# start profiling
curl "https://{endpoint}/debug/pprof/trace?seconds=10" --output v1.2.3-trace.out &
curl "https://{endpoint}/debug/pprof/heap?seconds=10" --output v1.2.3-heap.prof &
curl "https://{endpoint}/debug/pprof/profile?seconds=10" --output v1.2.3-cpu.prof &
# Run load test while profiling
bombardier -t 1s -l -c 30 -d 10s \
-H "Content-Type: application/json" \
-m GET https://{endpoint}/v1/users/51387659
wait
# view pprof output locally
go tool trace -http 127.0.0.1:9094 v1.2.3-trace.out &
go tool pprof -http 127.0.0.1:9095 v1.2.3-heap.prof &
go tool pprof -http 127.0.0.1:9096 v1.2.3-cpu.prof &
Note: Bash &
runs commands asynchronously.
pprof Usage Steps
- Start the application with pprof routes registered.
- If the app isn’t under load, run a load test on a target API.
- Use curl to trigger profiling.
If you’re behind a Load Balancer, make sure the profiling duration (`seconds=`) completes within the LB timeout, or the connection will be dropped mid-profile.
Heap Profiling Metrics
- `inuse_objects`: number of objects currently in use
- `inuse_space`: amount of memory currently in use
- `alloc_objects`: total number of objects ever allocated
- `alloc_space`: total memory ever allocated
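In `go tool pprof`, these appear as sample types you can switch between with `-sample_index`, e.g. `go tool pprof -sample_index=alloc_space -http 127.0.0.1:9095 v1.2.3-heap.prof` to inspect total allocations instead of live memory.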
Notes on Inuse vs Alloc
- Inuse ≈ Alloc (e.g. 1GB/1GB): the memory was allocated once and reused; little to no GC triggered, CPU efficient.
- Alloc much larger than Inuse (e.g. 1TB/1GB): high allocation churn and frequent GC, which can cause Stop-The-World (STW) pauses.
Tuning Opportunities
Flame Graph
Flame graphs help distinguish user-defined logic from framework/internal code like `ServeHTTP`.
The wider the bar, the more allocations are made.
Contrary to expectations, Redis accounted for only 7.9% of allocations.
Setting context values consumed 13.9% of heap allocations.
Each `Context.Set` lazily allocates the underlying map the first time it is used on a request. Since no capacity is preallocated, adding entries can trigger further map-growth allocations.
If the overhead for grouping logs by request ID is too high, consider removing it.
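For context, the middleware that creates this overhead looks roughly like the sketch below (the key name, the uuid dependency, and the function name are assumptions, not the profiled service's code):

```go
package middleware

import (
    "github.com/gin-gonic/gin"
    "github.com/google/uuid"
)

const RequestIDContextKey = "request_id" // assumed key name

// RequestID stores a per-request ID on the gin context so log lines can
// be grouped by request. gin's Context.Set lazily creates the Keys map
// with no preallocated capacity, so every request pays for the map and
// its growth.
func RequestID() gin.HandlerFunc {
    return func(c *gin.Context) {
        c.Set(RequestIDContextKey, uuid.NewString())
        c.Next()
    }
}
```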
Allocation Graph
- Larger rectangles = higher memory usage.
- Prioritize tuning large allocations with high % usage.
Zerolog Overhead
Despite using a shared `Logger` instance, allocations still occurred due to the use of `With().Str().Logger()` on every request.
var logEntry zerolog.Logger = h.logger.With().
    Str("trace_id", ctx.GetString(RequestIDContextKey)).
    Logger()
Although cleaner for readability, this approach allocates new memory on each invocation.
Solution: accept the allocation cost as the price of readability, or switch logging frameworks.
However, other loggers may allocate even more due to interface marshalling. Always benchmark alternatives.
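If the per-request child logger is the main cost, one alternative worth benchmarking (a sketch, not the fix applied here) is to attach the field per event; zerolog pools events, whereas `With().Logger()` copies the logger's context buffer on every call:

```go
package handler

import (
    "github.com/gin-gonic/gin"
    "github.com/rs/zerolog"
)

const RequestIDContextKey = "request_id" // assumed key name

type handler struct {
    logger zerolog.Logger
}

// getUser attaches trace_id per log event instead of building a child
// logger via With().Str().Logger() on every request.
func (h *handler) getUser(ctx *gin.Context) {
    h.logger.Info().
        Str("trace_id", ctx.GetString(RequestIDContextKey)).
        Msg("get user")
}
```

Whether this actually wins should be confirmed with benchmarks like the ones below.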
GC Tuning
Goal: minimize STW (Stop-The-World) latency caused by excessive GC.
Using GOMEMLIMIT
`GOGC` uses the heap growth rate to trigger GC.
- Default `GOGC=100` → GC triggers when the heap doubles relative to the live heap left after the previous cycle.
Tuning Based on Program Behavior
- Low GOGC → GC runs too often, especially early on when heap is small.
- High GOGC → risk of OOM, since the trigger scales with the live heap. E.g., with a 40GB live heap, even GOGC=50 lets the heap reach 60GB before GC runs.
GOMEMLIMIT puts a soft cap on the runtime's total memory: GC runs more aggressively as usage approaches the limit. Combined with a relaxed (or disabled) GOGC, this reduces GC frequency while still bounding memory.
GOMEMLIMIT is a soft limit; Go may allocate slightly more than the configured value.
Suggested Workflow:
- Start with default GOGC.
- Measure memory usage.
- Set GOMEMLIMIT to ~80% of peak.
- Monitor GC behavior.
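Both knobs can be set as environment variables (`GOGC=200 GOMEMLIMIT=4GiB ./server`) or from code. A minimal sketch, where 200 and 4GiB are placeholder values rather than measured ones:

```go
package main

import "runtime/debug"

func init() {
    // Equivalent to GOGC=200: let the heap grow to ~3x the live set
    // between collections, reducing GC frequency (placeholder value).
    debug.SetGCPercent(200)
    // Equivalent to GOMEMLIMIT=4GiB: a soft cap on total runtime memory
    // so the relaxed GOGC cannot grow into an OOM (placeholder value).
    debug.SetMemoryLimit(4 << 30)
}
```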
Benchmarking Production Code
Use `go test -benchmem` to evaluate memory allocations.
go test -bench=. -benchmem
Example benchmark of `StructuredLogger`:
BenchmarkStructuredZeroLoggerMessage-12 596910 1724 ns/op 1000 B/op 30 allocs/op
BenchmarkStructuredZeroLoggerMessageWithFields-12 285715 3926 ns/op 1633 B/op 60 allocs/op
BenchmarkStructuredZeroLoggerMessageWithFieldsWithContext-12 222223 5346 ns/op 3418 B/op 68 allocs/op
BenchmarkStandardZeroLoggerMessage-12 11927823 90.17 ns/op 0 B/op 0 allocs/op
BenchmarkStandardZeroLoggerMessageWithDeterminedFields-12 5649648 217.6 ns/op 0 B/op 0 allocs/op
BenchmarkStandardZeroLoggerWithFields-12 300001 3894 ns/op 1553 B/op 59 allocs/op
Even with zerolog, repeated use of `WithFields` leads to significant allocations.
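The benchmark bodies aren't shown above, but the `WithFields` case can be reproduced with something like this sketch (the field values and the `io.Discard` sink are assumptions):

```go
package logging

import (
    "io"
    "testing"

    "github.com/rs/zerolog"
)

// Approximates the WithFields case: a child logger is built on every
// iteration, so each log line pays for the With().Str().Logger() copy.
func BenchmarkStructuredZeroLoggerMessageWithFields(b *testing.B) {
    logger := zerolog.New(io.Discard)
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        entry := logger.With().
            Str("trace_id", "abc-123").
            Str("user_id", "51387659").
            Logger()
        entry.Info().Msg("get user")
    }
}
```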
Use Case Driven Pointer Usage
While pointers avoid copying values, the pointed-to data often escapes to the heap, whereas value types can frequently stay on the stack; see the sketch after the list below.
When to Use Pointers
- Large structs: avoid copying
- Mutability: need to modify original value
- Consistency: stick to pointer receivers for uniformity
- Nullable fields: to distinguish between zero value and nil
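A quick way to see the difference is the compiler's escape analysis (`go build -gcflags=-m`). The functions below are illustrative, not from the profiled service:

```go
package main

import "fmt"

type User struct {
    ID   int64
    Name string
}

// Returned by value: a small struct like this is typically copied on the
// caller's stack.
func newUserValue(id int64) User {
    return User{ID: id}
}

// Returned by pointer: the value escapes, and -gcflags=-m reports
// "&User{...} escapes to heap".
func newUserPtr(id int64) *User {
    return &User{ID: id}
}

func main() {
    u := newUserValue(1)
    p := newUserPtr(2)
    fmt.Println(u.ID, p.ID)
}
```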
Preallocate Slice Capacity
Without preallocation, `append()` reallocates and copies the backing array each time capacity is exceeded.
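A minimal sketch of the fix when the final length is known up front (`User` and `collectUsers` are illustrative names):

```go
package main

type User struct{ ID int64 }

// collectUsers preallocates capacity: one allocation up front instead of
// repeated grow-and-copy cycles as append exceeds capacity.
func collectUsers(ids []int64) []User {
    users := make([]User, 0, len(ids)) // capacity known in advance
    for _, id := range ids {
        users = append(users, User{ID: id})
    }
    return users
}
```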
Go can optimize stack allocation for small-sized slices:
- `MaxStackVarSize`: up to 10MB for explicitly declared variables
- `MaxImplicitStackVarSize`: up to 64KB for implicitly allocated variables
- `MaxSmallArraySize`: arrays up to 256 bytes are stack-allocated