From 16 Seconds to 1.3: A Journey Through Performance Optimization - Part 1

Recently, I decided to learn Rust - not just for career pivot reasons, but to understand if modern language design could actually deliver on the “zero-cost abstractions” promise.

My test? Build a market data aggregator that processes 10 million trades and computes OHLCV (Open/High/Low/Close/Volume) bars. First in C++, optimizing ruthlessly. Then in Rust, comparing performance.
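
To make the later snippets concrete, here is a minimal sketch of the two record types involved. The original code's definitions aren't shown in this post, so the exact field names and types are my assumptions (the trade fields match what the snippets below access):

#include <cstdint>

// One trade as read from the CSV (hypothetical layout).
struct Trade {
    uint64_t timestamp_ms;  // Event time in milliseconds
    double   price;
    double   volume;
};

// One aggregated bar per time window (hypothetical layout).
struct OHLCV {
    uint64_t window;  // Window start timestamp (ms)
    double   open;
    double   high;
    double   low;
    double   close;
    double   volume;
};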

What I discovered surprised me. But let’s start at the beginning.


The C++ Optimization Journey

Phase 1: The Naive Implementation (16.5 seconds on my Linux machine)

I started with straightforward, functional-style C++ code:

// Read CSV line by line
std::ifstream file(filename);
std::string line;
std::map<uint64_t, std::vector<Trade>> windows;

while (std::getline(file, line)) {
    auto trade = parse_trade(line);  // Using std::stod, std::stoull
    auto window = to_window(trade.timestamp_ms);
    windows[window].push_back(trade);
}

// Aggregate using functional patterns
std::vector<OHLCV> bars;
std::transform(windows.begin(), windows.end(),
               std::back_inserter(bars),
               [](const auto& pair) {
                   return aggregate_trades(pair.first, pair.second);
               });
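
parse_trade itself isn't shown in the post; a plausible std::stod/std::stoull-based version (the timestamp,price,volume column order is my assumption) looks roughly like this:

// Naive field-by-field parser: substrings plus the standard conversion functions.
Trade parse_trade(const std::string& line) {
    size_t c1 = line.find(',');
    size_t c2 = line.find(',', c1 + 1);

    Trade t;
    t.timestamp_ms = std::stoull(line.substr(0, c1));
    t.price        = std::stod(line.substr(c1 + 1, c2 - c1 - 1));
    t.volume       = std::stod(line.substr(c2 + 1));
    return t;
}

Every call allocates temporary substrings and funnels them through general-purpose conversion routines - keep that in mind for Phase 3.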

Result: 16,529ms to process 10M trades

This is slow. But why? A flamegraph shows where the time goes, but on its own it doesn't tell you which direction to optimize in. In the flamegraph below you can see that a lot of time is spent in std::vector::push_back - but is that time well spent?

[Phase 1 flamegraph - interactive SVG in the original post]

Understanding the Bottleneck: The Three Types of “Bound”

Before optimizing, I needed to understand what was actually slow. There are three main bottleneck categories:

1. I/O Bound - Waiting for disk/network

  • Symptoms: Low CPU usage, high context switches
  • Common in: File readers, network services
  • Solution: Async I/O, buffering, caching

2. Memory Bound - Waiting for RAM

  • Symptoms: High cache miss rate, normal CPU usage
  • Common in: Random access patterns, pointer chasing
  • Solution: Better data structures, prefetching, cache-friendly layouts

3. CPU Bound - Waiting for computation

  • Symptoms: 100% CPU, low cache misses
  • Common in: Heavy computation, string parsing
  • Solution: Better algorithms, SIMD, parallel processing

Investigation: What’s Actually Slow?

My initial hypothesis: “It’s reading a file, so it must be I/O bound.”

Let’s test with perf stat:

perf stat ./trade_aggregator trades.csv
    16,665.35 msec task-clock          #    1.000 CPUs utilized
           643     context-switches    #   38.583 /sec
       132,344     page-faults         #    7.941 K/sec

Surprise #1: It’s NOT I/O bound!

  • 100% CPU utilization (not waiting on I/O)
  • Low context switches (not blocking)
  • High page faults (lots of allocations)

Next hypothesis: “Is it memory-bound or compute-bound?” Let’s run perf again, this time collecting the counters that matter when investigating memory-bound processes.

perf stat -e cycles,instructions,cache-misses,cache-references ./trade_aggregator trades.csv
cache-misses: 29,597,938
cache-references: 108,614,903
→ 27.25% cache miss rate ← MEMORY BOUND!

The smoking gun: 27% of cache accesses miss!

Cache miss interpretation:

  • < 5% = CPU doing work (compute-bound)
  • 10-20% = Moderate memory pressure
  • 27% = Significantly memory-bound ← We’re here
  • > 30% = Severe bottleneck

But which functions cause the misses?

perf record -e cache-misses -g ./trade_aggregator trades.csv
perf report
26% - std::vector::push_back
31% - std::map operations (tree traversal)

Root cause identified:

  • std::map<uint64_t, vector<Trade>> = scattered memory
  • Each window’s vector allocated separately
  • Tree rebalancing causes cache evictions
  • Random access pattern = cache thrashing

Phase 2: Memory Optimization (6.5 seconds - 2.5x faster!)

Armed with knowledge, I made targeted changes:

// 1. Hash table instead of tree (O(1) vs O(log n))
std::unordered_map<uint64_t, std::vector<Trade>> windows;

// 2. Pre-allocate to avoid rehashing
windows.reserve(200);

// 3. Reserve vector capacity
auto& window_trades = windows[window];
if (window_trades.empty()) {
    window_trades.reserve(100000);  // Avoid push_back reallocations
}

// 4. Single-pass aggregation (instead of multiple passes)
for (const auto& trade : trades) {
    if (trade.price < min_price) min_price = trade.price;
    if (trade.price > max_price) max_price = trade.price;
    total_volume += trade.volume;
}
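
Put together, the single-pass aggregation ends up as something like the sketch below - again using the hypothetical Trade/OHLCV fields from earlier, so treat it as an illustration rather than the post's exact function:

// Build one OHLCV bar for a window in a single pass over its trades.
OHLCV aggregate_trades(uint64_t window, const std::vector<Trade>& trades) {
    OHLCV bar{};
    bar.window = window;
    bar.open   = trades.front().price;
    bar.close  = trades.back().price;
    bar.high   = trades.front().price;
    bar.low    = trades.front().price;

    for (const auto& trade : trades) {
        if (trade.price < bar.low)  bar.low  = trade.price;
        if (trade.price > bar.high) bar.high = trade.price;
        bar.volume += trade.volume;
    }
    return bar;
}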

Results:

Time: 16,529ms → 6,508ms (2.5x faster)
Instructions: 139B → 58B (2.4x fewer!)
Cache misses: 38.9M → 18.6M (2x fewer)

Key lesson: Data structure choice matters more than micro-optimizations.

Phase 3: Eliminate Parsing Overhead (1.8 seconds - 3.7x faster!)

The next bottleneck? String parsing with std::stod and std::stoull.

Two optimizations:

1. Memory-mapped I/O (instead of getline())

// Map entire file into memory
char* file_data = static_cast<char*>(
    mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE, fd, 0));
madvise(file_data, file_size, MADV_SEQUENTIAL);

// Parse directly from the mapped buffer (zero-copy)
const char* ptr = file_data;
const char* end = file_data + file_size;
Trade trade;
while (ptr < end) {
    parse_trade_mmap(ptr, trade);
}
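
For completeness, here is roughly how the file descriptor and size feed into that mmap call. The helper name and error handling are my own sketch, not the original code:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: open a file and map it read-only into memory.
char* map_file(const char* path, size_t& file_size) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;

    struct stat st{};
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
    file_size = static_cast<size_t>(st.st_size);

    void* data = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // The mapping stays valid after the descriptor is closed
    if (data == MAP_FAILED) return nullptr;

    madvise(data, file_size, MADV_SEQUENTIAL);  // Hint: we read front to back
    return static_cast<char*>(data);
}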

2. Custom parsing (5x faster than std::stod)

inline double parse_double(const char*& ptr) {
    double result = 0.0;

    // Parse integer part
    while (*ptr >= '0' && *ptr <= '9') {
        result = result * 10.0 + (*ptr - '0');
        ++ptr;
    }

    // Parse fractional part
    if (*ptr == '.') {
        ++ptr;
        double fraction = 0.1;
        while (*ptr >= '0' && *ptr <= '9') {
            result += (*ptr - '0') * fraction;
            fraction *= 0.1;
            ++ptr;
        }
    }
    return result;
}
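
The timestamp gets the same treatment. A matching integer parser - my sketch of what replaces std::stoull, not necessarily the post's exact code - is even simpler:

#include <cstdint>

// Parse an unsigned integer (e.g. a millisecond timestamp), advancing the
// pointer past the digits - same pattern as parse_double above.
inline uint64_t parse_uint64(const char*& ptr) {
    uint64_t result = 0;
    while (*ptr >= '0' && *ptr <= '9') {
        result = result * 10 + static_cast<uint64_t>(*ptr - '0');
        ++ptr;
    }
    return result;
}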

Results:

Time: 6,508ms → 1,777ms (3.7x faster!)
Instructions: 58B → 16.5B (3.5x fewer!)
IPC: 1.97 → 2.03 (better CPU utilization)

Total improvement so far: 16.5s → 1.8s (9.3x speedup!)

Phase 4: SIMD Vectorization (1.7 seconds - 1.05x faster)

Final optimization: use AVX2 to process 4 doubles at once.

#include <immintrin.h>

OHLCV aggregate_trades_simd(uint64_t window, const std::vector<Trade>& trades) {
    const size_t n = trades.size();
    __m256d vec_min = _mm256_set1_pd(trades[0].price);
    __m256d vec_max = _mm256_set1_pd(trades[0].price);
    __m256d vec_sum = _mm256_setzero_pd();

    // Process 4 trades at once
    for (size_t i = 0; i + 3 < n; i += 4) {
        __m256d prices = _mm256_set_pd(
            trades[i+3].price, trades[i+2].price,
            trades[i+1].price, trades[i].price);
        vec_min = _mm256_min_pd(vec_min, prices);
        vec_max = _mm256_max_pd(vec_max, prices);
        // ... accumulate volumes
    }
    // Horizontal reduction + handle remaining elements
}
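
The reduction step elided above can be done by spilling the vector lanes to a small array and folding them scalar-wise. This is one way to finish the function - a sketch, simpler than shuffle-based reductions, and cheap since it runs once per window:

// Fold the four lanes of the running min/max vectors into scalars.
// (Needs <algorithm> for the initializer-list overloads of std::min/std::max.)
alignas(32) double mins[4], maxs[4];
_mm256_store_pd(mins, vec_min);
_mm256_store_pd(maxs, vec_max);

double min_price = std::min({mins[0], mins[1], mins[2], mins[3]});
double max_price = std::max({maxs[0], maxs[1], maxs[2], maxs[3]});

// Handle the tail that didn't fill a complete 4-wide batch.
for (size_t i = n - (n % 4); i < n; ++i) {
    min_price = std::min(min_price, trades[i].price);
    max_price = std::max(max_price, trades[i].price);
}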

Results:

Time: 1,777ms → 1,691ms (5% improvement)
IPC: 2.03 → 2.10 (better instruction throughput)

Why only 5%?

Amdahl’s Law: Aggregation is only ~10% of total time, so even a 4x SIMD speedup on that slice yields at most ~1.08x overall.
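
Plugging the numbers into Amdahl’s Law, with accelerated fraction p ≈ 0.10 and local speedup s = 4:

Speedup = 1 / ((1 − p) + p / s) = 1 / (0.90 + 0.10 / 4) ≈ 1.08

So even a perfect 4x on the aggregation step caps the end-to-end gain at about 8% - consistent with the ~5% actually measured once SIMD setup and the scalar tail are accounted for.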

Most time is still in parsing (>90%), which is harder to vectorize.

The lesson: Optimize the bottleneck first. SIMD is powerful but can’t overcome architectural issues.

The Cache Miss Paradox

One counterintuitive finding throughout this journey:

Phase      Cache Miss Rate    Time
Phase 1    29%                16.5s
Phase 2    39%                6.5s
Phase 4    63%                1.7s (fastest!)

How can worse cache miss rate = better performance?

Because we’re doing fundamentally less work:

  • Phase 1: 139B instructions
  • Phase 4: 16.1B instructions (8.6x fewer!)

The insight: Absolute time spent on cache misses matters, not the percentage.

Even with higher miss rates, Phase 4 accesses memory 3.6x less often:

  • Phase 1: 110M cache references
  • Phase 4: 31M cache references

Better to miss 63% of 31M accesses (≈19.5M actual misses) than 29% of 110M accesses (≈32M misses).

Final C++ Stats

Phase 1 → Phase 4: 16,529ms → 1,691ms (9.8x speedup)
Breakdown:
├─ Phase 2 (Memory): 2.5x (unordered_map, reserves)
├─ Phase 3 (Parsing): 3.7x (mmap, custom parsing)
└─ Phase 4 (SIMD): 1.05x (vectorization)
Performance counters (Phase 4):
- Cycles: 7.9B
- Instructions: 16.1B
- IPC: 2.05
- Cache miss rate: 63%

I was satisfied. 10M trades in 1.7 seconds felt good.

Then I ported it to Rust.

