
From 5 Seconds to 480ms: A Performance Optimization Journey

Act 1: The C++ Journey - Learning Through Measurement

The goal was simple: read a CSV file containing 10 million trades, aggregate them into OHLCV (Open/High/Low/Close/Volume) bars by time window, and do it as fast as possible.

This is the story of how I optimized the same problem in both C++ and Rust. It’s a journey of failures and successes that demonstrates why the process of discovery often matters more than the final destination.

Phase 1: The Naive Implementation (4975ms) - “This Should Be Fast!”

I started with what felt natural: clean, functional-style C++ using standard library components. After all, C++ is supposed to be fast, right?

// Phase 1: Functional style with iostreams
std::optional<Trade> parse_trade(const std::string& line) {
    std::istringstream iss(line);
    std::string token;
    std::vector<std::string> tokens;
    while (std::getline(iss, token, ',')) {
        tokens.push_back(token);
    }
    if (tokens.size() != 3) return std::nullopt;
    try {
        return Trade{
            std::stoull(tokens[0]),
            std::stod(tokens[1]),
            std::stod(tokens[2])
        };
    } catch (...) {
        return std::nullopt;
    }
}

// Aggregate using ranges
OHLCV aggregate_trades(uint64_t window, const std::vector<Trade>& trades) {
    auto prices = trades
        | std::views::transform([](const Trade& t) { return t.price; });
    auto total_volume = std::accumulate(
        trades.begin(), trades.end(), 0.0,
        [](double sum, const Trade& t) { return sum + t.volume; }
    );
    return OHLCV{
        window,
        trades.front().price,
        *std::ranges::max_element(prices),
        *std::ranges::min_element(prices),
        trades.back().price,
        total_volume
    };
}

Result: 4975ms

Not terrible, but far from great. It was time to profile.

I ran perf stat and the results didn’t surprise me:

Terminal window
$ perf stat ./trade_aggregator trades.csv
Processed 101 bars in 4835ms
Performance counter stats:
4,849.75 msec task-clock # 1.000 CPUs utilized
37 context-switches # Very low, indicating it's CPU-bound
49,556,918,037 instructions # 2.26 insn per cycle

Phase 2: The Obvious Fix That Wasn’t (5025ms) - “Hash Tables Are Faster!”

Everyone knows hash tables offer O(1) average-case insertion versus O(log n) for tree-based maps. I confidently swapped std::map for std::unordered_map:

// Surely this will be faster!
std::unordered_map<uint64_t, std::vector<Trade>> windows;

Result: 5025ms

Worse. Actually, slightly worse. Why did this fail?

With only 101 distinct time windows:

  • std::map: O(log 101) ≈ 7 comparisons per insertion.
  • std::unordered_map: Involves hash computation, potential collisions, and occasional rehashing.

The overhead of the hashing function, combined with the fact that the output still needed to be sorted by timestamp, cost more than the simple tree operations.

💡 Lesson 1: Context matters. With a small number of keys, “obvious” algorithmic optimizations can backfire.

It’s a great example of asymptotic complexity versus real-world performance: the constant factors in hashing and memory access for the unordered_map outweighed the logarithmic cost of std::map at this dataset size.

I should have trusted the profiler from the start:

parse_trade: 80% of runtime
├─ strtod: 27.8%
├─ getline (internal): 8.78%
├─ istringstream::init: 6.56% ← Creating stream objects is expensive!
├─ operator new: 6.13% ← Constant memory allocations
└─ strtoull: 3.75%

Following an itch instead of data is a classic performance-tuning mistake. The profiler is your most honest friend.

Phase 3: The Real Win (2586ms) - “Attack the Bottleneck”

The flamegraph was clear: 80% of the runtime was spent parsing, with istringstream creation alone consuming 6.56%. It was time to eliminate that overhead.

// Custom parser - no istringstream, direct pointer arithmetic
std::optional<Trade> parse_trade(const std::string& line) {
    const char* ptr = line.c_str();
    char* end;

    // Parse timestamp
    uint64_t timestamp = std::strtoull(ptr, &end, 10);
    if (*end != ',') return std::nullopt;

    // Parse price
    ptr = end + 1;
    double price = std::strtod(ptr, &end);
    if (*end != ',') return std::nullopt;

    // Parse volume
    ptr = end + 1;
    double volume = std::strtod(ptr, &end);
    return Trade{timestamp, price, volume};
}

Result: 2586ms. A 1.92x speedup!

Terminal window
$ perf stat ./trade_aggregator_phase3 trades.csv
Processed 101 bars in 2586ms
Performance counter stats:
11,733,144,303 cycles # Half the cycles!
24,917,028,826 instructions # Half the instructions!
2.12 insn per cycle

The new flamegraph confirmed the win:

main: 93.7%
├─ strtod: 53.39% ← Now the dominant cost
├─ getline: 9.05%
└─ other: ~31%

By removing the istringstream overhead, we halved the cycle and instruction counts. Now, strtod (the standard C function for parsing doubles) was the clear bottleneck.

💡 Lesson 2: Profile, identify the true bottleneck, and attack it relentlessly.

Phase 4: The SIMD Temptation (2614ms) - “But AVX2 Should Help!”

Looking at aggregate_trades, I saw an opportunity: finding min, max, and sum over thousands of prices per window. This seemed like a perfect use case for SIMD (Single Instruction, Multiple Data).

OHLCV aggregate_trades_simd(uint64_t window, const std::vector<Trade>& trades) {
    // ... setup code: copy prices and volumes into contiguous arrays ...
    __m256d vec_min = _mm256_set1_pd(trades[0].price);
    __m256d vec_max = _mm256_set1_pd(trades[0].price);
    __m256d vec_sum = _mm256_setzero_pd();

    // Process 4 doubles at a time
    for (; i + 4 <= n; i += 4) {
        __m256d vec_prices = _mm256_loadu_pd(&price_data[i]);
        __m256d vec_vols = _mm256_loadu_pd(&volumes[i]);
        vec_min = _mm256_min_pd(vec_min, vec_prices);
        vec_max = _mm256_max_pd(vec_max, vec_prices);
        vec_sum = _mm256_add_pd(vec_sum, vec_vols);
    }
    // Horizontal reduction to get final values...
}

Result: 2614ms

No improvement. In fact, it was slightly worse in some runs.

Terminal window
$ perf stat ./trade_aggregator_phase4 trades.csv
Processed 101 bars in 2614ms
Performance counter stats:
11,738,495,430 cycles # Same as Phase 3
24,991,458,484 instructions # Slightly MORE instructions

Why did SIMD fail?

The profiler told the same story: aggregation was never the bottleneck. Parsing still dominated the runtime at over 50%. The marginal gains from faster aggregation were completely lost in the noise. Worse, the overhead of copying data into contiguous arrays to prepare for SIMD operations cost more than what was saved.

💡 Lesson 3: Optimize the bottleneck, not just what seems “optimizable.” Clever optimizations in the wrong place are a waste of effort.

At this point, the C++ Phase 3 result of 2586ms felt like a practical limit. I had achieved a nearly 2x improvement through systematic, measurement-driven optimization. It was time to try a different approach.

Act 2: Enter Rust - The Surprise

Phase 1: Straightforward, Idiomatic Rust

I spent some time writing a direct, idiomatic Rust equivalent. No unsafe code, no clever tricks—just clean, standard Rust.

fn parse_trade(line: &str) -> Option<Trade> {
    let parts: Vec<&str> = line.split(',').collect();
    if parts.len() != 3 {
        return None;
    }
    let timestamp_ms = parts[0].parse::<u64>().ok()?;
    let price = parts[1].parse::<f64>().ok()?;
    let volume = parts[2].parse::<f64>().ok()?;
    Some(Trade {
        timestamp_ms,
        price,
        volume,
    })
}

fn aggregate_trades(window: u64, trades: &[Trade]) -> OHLCV {
    let open = trades.first().unwrap().price;
    let close = trades.last().unwrap().price;
    let mut high = f64::NEG_INFINITY;
    let mut low = f64::INFINITY;
    let mut total_volume = 0.0;
    for trade in trades {
        high = high.max(trade.price);
        low = low.min(trade.price);
        total_volume += trade.volume;
    }
    OHLCV { window_start: window, open, high, low, close, volume: total_volume }
}

Note: Using f64::NEG_INFINITY and f64::INFINITY is slightly more robust than seeding with the first element, as it handles empty slices gracefully (though unwrap() would panic here anyway).
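The surrounding driver isn’t shown in the post. For context, here is a minimal sketch of how the pieces might be wired together, reusing the Trade, OHLCV, parse_trade, and aggregate_trades definitions above and assuming 1-second windows and the trades.csv path from the benchmarks:

use std::collections::BTreeMap;
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() -> std::io::Result<()> {
    let file = File::open("trades.csv")?;
    let reader = BufReader::new(file);

    // Group trades into 1-second windows, keyed by window start (ms).
    let mut windows: BTreeMap<u64, Vec<Trade>> = BTreeMap::new();
    for line in reader.lines() {
        if let Some(trade) = parse_trade(&line?) {
            let window = trade.timestamp_ms - (trade.timestamp_ms % 1000);
            windows.entry(window).or_default().push(trade);
        }
    }

    // BTreeMap iterates in key order, so the bars come out sorted by time.
    let bars: Vec<OHLCV> = windows
        .iter()
        .map(|(window, trades)| aggregate_trades(*window, trades))
        .collect();
    println!("Processed {} bars", bars.len());
    Ok(())
}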

Result: 1445ms

This was 1.79x faster than my optimized C++.

Terminal window
$ perf stat ./trade_aggregator_rust trades.csv
Processed 101 bars in 1445ms
Performance counter stats:
6,598,721,683 cycles # 44% fewer than C++!
17,171,967,642 instructions # 31% fewer than C++!
2.60 insn per cycle # Better IPC (Instructions Per Cycle)

The Investigation: Why Was Rust So Much Faster Out of the Box?

The flamegraph revealed the secret:

main: 84%
├─ core::num::dec2flt::from_str: 14.93% ← Float parsing!
├─ malloc: 4.79%
└─ other: ~64%

Compare this to the C++ version:

  • C++ Phase 3: 53% of runtime in strtod.
  • Rust Phase 1: 15% of runtime in its float parser.

Rust’s standard library float parser was significantly more efficient than the glibc strtod implementation used by my C++ compiler. Same fundamental operation, but a vastly different level of performance in their standard library implementations.
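If you want to sanity-check that claim on your own machine, a rough micro-benchmark harness comparing glibc’s strtod (called through the libc crate) with Rust’s standard parser might look like the sketch below. The sample strings and iteration count are made up for illustration, and timings will vary by machine.

// Sketch: time glibc's strtod against Rust's std parser on the same strings.
// Requires the libc crate as a dependency.
use std::ffi::CString;
use std::time::Instant;

fn main() {
    // Synthetic price-like strings; real lines from trades.csv would be better.
    let samples: Vec<String> = (0..1_000_000)
        .map(|i| format!("{}.{:04}", 100 + i % 50, i % 10_000))
        .collect();
    let c_samples: Vec<CString> = samples
        .iter()
        .map(|s| CString::new(s.as_str()).unwrap())
        .collect();

    // glibc strtod, called through the libc crate
    let t = Instant::now();
    let mut sum = 0.0;
    for s in &c_samples {
        sum += unsafe { libc::strtod(s.as_ptr(), std::ptr::null_mut()) };
    }
    println!("strtod:    {:?} (checksum {sum})", t.elapsed());

    // Rust's standard library parser (core::num::dec2flt)
    let t = Instant::now();
    let mut sum = 0.0;
    for s in &samples {
        sum += s.parse::<f64>().unwrap();
    }
    println!("std parse: {:?} (checksum {sum})", t.elapsed());
}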

💡 Lesson 4: The quality of a language’s standard library matters. Sometimes, a better “default” beats manual optimization effort.

Phases 2-4: Minor Rust Optimizations (1253ms)

I applied a few more improvements to the Rust version:

Phase 2: Better Allocation Strategy (1383ms)

By pre-allocating memory for vectors and using a buffered reader, we reduce the number of system calls and reallocations.

// Reserve capacity, use a buffered reader
let mut windows: BTreeMap<u64, Vec<Trade>> = BTreeMap::new();
let reader = BufReader::with_capacity(64 * 1024, file);

windows.entry(window)
    .or_insert_with(|| Vec::with_capacity(100_000))
    .push(trade);

Phase 3: The fast-float Crate (1253ms)

Swapping the standard parser for a specialized, highly optimized library yielded another significant gain.

use fast_float::parse;
let price: f64 = parse(parts.next()?).ok()?;
let volume: f64 = parse(parts.next()?).ok()?;

Phase 4: The lexical Crate (1327ms)

Interestingly, another popular parsing crate was slightly slower in this specific benchmark.

The best configuration (Phase 3) achieved 1253ms, making it 2.1x faster than the optimized C++.

Terminal window
$ perf stat ./phase3 trades.csv
Processed 101 bars in 1253ms
Performance counter stats:
5,684,892,977 cycles
16,596,081,812 instructions
2.92 insn per cycle # Excellent IPC!

At this point, I thought the story was over. “Rust wins because of better, safer defaults.” A nice, clean narrative. But I was wrong.

Act 3: Plot Twist - C++ Strikes Back

Phase 5: The Nuclear Option (490ms) - Going All In

What if I stopped being polite and optimized the C++ version without constraints?

The approach:

  1. Memory-mapped I/O (mmap): Eliminate file I/O overhead by mapping the file directly into memory.
  2. Custom Integer/Float Parsers: Write parsers that handle only the expected format, stripping all error handling, locale support, and edge cases.
  3. Aggressive Pre-allocation: Use an unordered_map again, but this time with enough reserved capacity to prevent rehashing.

// Custom ultra-fast parsers - no error handling, assumes perfect format
inline uint64_t parse_uint64(const char*& ptr) { /* ... */ }
inline double parse_double(const char*& ptr) { /* ... */ }

// Memory map the entire file
int fd = open(argv[1], O_RDONLY);
struct stat sb;
fstat(fd, &sb);
char* file_data = static_cast<char*>(
    mmap(nullptr, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0)
);
// Tell the kernel we will read this sequentially
madvise(file_data, sb.st_size, MADV_SEQUENTIAL);

// Aggressive pre-allocation
std::unordered_map<uint64_t, std::vector<Trade>> windows;
windows.reserve(500); // Prevent rehashing

// Parse directly from the memory map
const char* ptr = file_data;
const char* end_ptr = file_data + sb.st_size;
Trade trade;
while (ptr < end_ptr) {
    if (parse_trade_mmap(ptr, trade)) {
        auto& window_trades = windows[window]; // window derived from trade's timestamp
        if (window_trades.empty()) {
            window_trades.reserve(150000); // Avoid reallocations
        }
        window_trades.push_back(trade);
    }
}

Result: 490ms

This was 10.2x faster than the original C++ code and 5.4x faster than the previously optimized version.

Terminal window
$ hyperfine --warmup 3 --runs 10 './trade_aggregator_phase5 trades.csv'
Time (mean ± σ): 490.2 ms ± 1.6 ms
Range (min … max): 488.6 ms … 493.6 ms

The C++ comeback was complete. By stripping away every abstraction, safety check, and convenience, we achieved blistering performance. But at what cost? Look at that code. It’s brittle, platform-specific, and assumes a perfect input format. This is not maintainable code.

💡 Lesson 5: Extreme performance often requires extreme trade-offs in safety, portability, and maintainability.

Act 4: The Final Plot Twist - Rust’s Answer

Phase 5: Same Weapons, Better Results (480ms)

If C++ could go nuclear, so could Rust. I ported the exact same low-level approach.

use memmap2::Mmap;
use std::collections::HashMap;

// Custom parsers matching C++ logic, operating on byte slices
#[inline(always)]
fn parse_u64(bytes: &[u8], start: &mut usize) -> Option<u64> { /* ... */ }
#[inline(always)]
fn parse_f64(bytes: &[u8], start: &mut usize) -> Option<f64> { /* ... */ }

// Memory-map and parse
let file = File::open(&args[1]).expect("Failed to open file");
let mmap = unsafe { Mmap::map(&file).expect("Failed to mmap file") };

// Use libc crate to call madvise
#[cfg(unix)]
unsafe {
    libc::madvise(
        mmap.as_ptr() as *mut libc::c_void,
        mmap.len(),
        libc::MADV_SEQUENTIAL,
    );
}

let mut windows: HashMap<u64, Vec<Trade>> = HashMap::with_capacity(500);
let bytes = &mmap[..];
let mut pos = 0;
while pos < bytes.len() {
    if let Some(trade) = parse_trade_mmap(bytes, &mut pos) {
        let window = to_window(trade.timestamp_ms, 1000);
        windows.entry(window)
            .or_insert_with(|| Vec::with_capacity(150_000))
            .push(trade);
    }
}

Result: 480ms

Terminal window
$ hyperfine --warmup 3 --runs 10 './phase5 trades.csv'
Time (mean ± σ): 479.8 ms ± 3.7 ms
Range (min … max): 475.6 ms … 487.7 ms

Rust wins by a margin of 2.1%.

When applying the same aggressive, low-level techniques, both languages perform almost identically, with Rust having a slight edge.
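For reference, the helpers elided above (parse_u64, parse_f64, to_window) are only a few lines each. The following is an illustrative sketch that assumes strictly formatted input (no sign, no exponent, '\n' line endings); it is not necessarily the exact code behind the 480ms number:

#[inline(always)]
fn parse_u64(bytes: &[u8], pos: &mut usize) -> Option<u64> {
    let start = *pos;
    let mut value: u64 = 0;
    while *pos < bytes.len() && bytes[*pos].is_ascii_digit() {
        value = value * 10 + (bytes[*pos] - b'0') as u64;
        *pos += 1;
    }
    if *pos == start { None } else { Some(value) }
}

#[inline(always)]
fn parse_f64(bytes: &[u8], pos: &mut usize) -> Option<f64> {
    // Integer part, then an optional ".fraction" - nothing else is accepted.
    let mut value = parse_u64(bytes, pos)? as f64;
    if *pos < bytes.len() && bytes[*pos] == b'.' {
        *pos += 1;
        let mut scale = 0.1;
        while *pos < bytes.len() && bytes[*pos].is_ascii_digit() {
            value += (bytes[*pos] - b'0') as f64 * scale;
            scale *= 0.1;
            *pos += 1;
        }
    }
    Some(value)
}

// Round a millisecond timestamp down to the start of its aggregation window.
#[inline(always)]
fn to_window(timestamp_ms: u64, window_ms: u64) -> u64 {
    timestamp_ms - (timestamp_ms % window_ms)
}

// parse_trade_mmap chains these three field parsers, stepping over the ','
// and '\n' separators between fields.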

Act 5: The Real Lesson - Engineering Judgment

The Complete Journey

| Phase | Language | Time | Speedup (vs. C++ P1) | Code Quality |
|---|---|---|---|---|
| C++ Journey | | | | |
| Phase 1: iostreams | C++ | 4975ms | 1.0x | ⭐⭐⭐⭐⭐ Maintainable |
| Phase 2: unordered_map | C++ | 5025ms | 0.99x | ⭐⭐⭐⭐⭐ Maintainable |
| Phase 3: custom parser | C++ | 2637ms | 1.89x | ⭐⭐⭐⭐ Good |
| Phase 4: SIMD | C++ | 2626ms | 1.90x | ⭐⭐⭐ Fair |
| Phase 5: mmap+custom | C++ | 490ms | 10.2x | ⭐ Fragile |
| Rust Journey | | | | |
| Phase 1: idiomatic | Rust | 1445ms | 3.44x | ⭐⭐⭐⭐⭐ Excellent |
| Phase 2: optimized alloc | Rust | 1383ms | 3.60x | ⭐⭐⭐⭐⭐ Excellent |
| Phase 3: fast-float | Rust | 1253ms | 3.97x | ⭐⭐⭐⭐⭐ Excellent |
| Phase 4: lexical | Rust | 1327ms | 3.75x | ⭐⭐⭐⭐⭐ Excellent |
| Phase 5: mmap+custom | Rust | 480ms | 10.4x | ⭐⭐ Fragile (but safer) |

The code quality rating for Rust’s Phase 5 is slightly higher because even with unsafe, the blast radius is more contained, and the rest of the language’s safety features still apply.

The Sweet Spot: Rust Phase 3 (1253ms)

Here’s the uncomfortable truth: the fastest code (Phase 5) is rarely the best code.

Why Phase 5 is problematic for most real-world applications:

  1. Limited Correctness: The custom parsers are extremely brittle. They don’t support scientific notation, proper infinity/NaN handling, or different locales, and would break on trivial format variations.
  2. Platform-Specific: mmap and madvise behave differently across operating systems.
  3. Maintenance Nightmare: Manual pointer manipulation (in C++) and unsafe blocks (in Rust) are hard to reason about, easy to get wrong, and create security risks.
  4. Marginal Real-World Benefit: The 770ms saved between Rust Phase 3 (1253ms) and Rust Phase 5 (480ms) would be completely dwarfed by network latency (1-50ms) or database queries (10-100ms) in a real system.

The Engineering Decision: Choose Rust Phase 3

// Uses the battle-tested and correct fast-float crate
use fast_float::parse;

fn parse_trade(line: &str) -> Option<Trade> {
    let mut parts = line.split(',');
    let timestamp_ms = parts.next()?.parse::<u64>().ok()?;
    let price: f64 = parse(parts.next()?).ok()?;
    let volume: f64 = parse(parts.next()?).ok()?;
    Some(Trade { timestamp_ms, price, volume })
}

Why this is the right choice for 99% of use cases:

  • Fast Enough (1253ms): Still nearly 4x faster than the original C++ and 2x faster than the reasonably optimized C++.
  • Production-Ready: It correctly handles edge cases and is cross-platform (see the short tests after this list).
  • Maintainable: The code is clear, concise, and relies on a well-tested library.
  • Safe: It avoids unsafe blocks and manual memory management.
  • Extendable: It’s easy to modify and build upon.
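To back the "production-ready" point with something concrete, the behaviour on malformed input is easy to pin down with a couple of tests. These are illustrative only; the input values are made up:

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn parses_a_well_formed_line() {
        let trade = parse_trade("1700000000000,101.25,3.5").unwrap();
        assert_eq!(trade.timestamp_ms, 1_700_000_000_000);
        assert_eq!(trade.price, 101.25);
        assert_eq!(trade.volume, 3.5);
    }

    #[test]
    fn rejects_malformed_lines() {
        assert!(parse_trade("not,a,trade").is_none());           // non-numeric field
        assert!(parse_trade("1700000000000,101.25").is_none());  // missing volume
        assert!(parse_trade("").is_none());                      // empty line
    }
}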

Conclusion

This was never truly a “Rust vs. C++” story. It’s a story about how systematic, measurement-driven engineering beats language dogma every time.

The Real Takeaways

  1. Measure, Don’t Assume: My initial assumptions about bottlenecks (std::map, I/O) were all wrong. The profiler was the only source of truth.
  2. “Best Practices” Are Context-Dependent: A tree-based map beat a hash table. SIMD was useless. The “best” tool always depends on the specific constraints of the problem.
  3. The Approach Matters More Than the Language: Both languages saw a ~10x speedup when the same aggressive, low-level techniques were applied. The final performance difference was negligible.
  4. Know When to Stop: The point of diminishing returns is real. The Phase 5 code offers ultimate performance but is fragile and hard to maintain. The “sweet spot” (Phase 3) provides excellent performance with production-ready code.

The Numbers That Tell the Story

C++ Journey:
  From: 4975ms (naive)
  To: 490ms (extreme)
  Sweet Spot: 2637ms (practical)

Rust Journey:
  From: 1445ms (idiomatic)
  To: 480ms (extreme)
  Sweet Spot: 1253ms (practical)

Final Recommendation: 1253ms (Rust Phase 3)
  - 2x faster than practical C++
  - Production-ready and safe
  - Highly maintainable
  - The right engineering trade-off

Epilogue: The Process

Profile, measure, identify the bottleneck, understand the trade-offs, and make an informed decision. That is the path to truly performant software.
