This is part of a series of posts, starting with Advent of Code 2019 in 130 ms: introduction.

I realize that throughout the previous posts I promised to get back to performance optimization for the Intcode engine, but never did. This is mostly because these optimizations are very language- and implementation-specific, which didn't fit well with the algorithmic focus of the rest of the series. But in the interest of completeness, I'll go through them here.

For this runtime benchmark, I'll be using the solutions for days 19, 21, 23 and 25, as those are quite heavy on Intcode, along with a few additional pure Intcode test cases. The format here will be closer to a lightly edited Git log than the other posts: I'll go through the change log for my Intcode module, pick out the changes that made noteworthy performance improvements, and present each change along with its associated benchmarks. Because I was also improving the solutions themselves in parallel, I'll present the benchmarks from both before and after each change.

Change 1: Use Vec::extend instead of append

My Intcode machine uses a Vec for the machine memory, and dynamically expands it whenever the program writes to an address outside the current range of the vector. Before this change, this expansion was done using this code snippet:

if size >= prog.len() {
    prog.append(&mut (0..=0).cycle().take(size - prog.len() + 1).collect());
}

This uses .cycle() to generate an infinite sequence of zeroes, .take() to limit that sequence to the number of additional elements we need, .collect() to construct a temporary Vec of zeroes, and finally .append() to move all items from that temporary vector into the Intcode memory vector.

The update changes this to:

if size >= prog.len() {
    prog.extend((0..=0).cycle().take(size - prog.len() + 1));
}

which simply skips constructing the intermediate Vec, since Vec::extend accepts any iterator directly. This saves a small amount of work and makes a small but measurable runtime improvement.
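
For reference, the same zeroes iterator can also be written more directly with std::iter::repeat. Here's a minimal standalone sketch of the growth logic; the grow function is just an illustrative stand-in, not the actual module code:

use std::iter;

fn grow(prog: &mut Vec<i64>, size: usize) {
    if size >= prog.len() {
        // iter::repeat(0) is equivalent to (0..=0).cycle()
        prog.extend(iter::repeat(0).take(size - prog.len() + 1));
    }
}

fn main() {
    let mut prog = vec![1, 2, 3];
    grow(&mut prog, 5);
    // The memory now covers index 5, padded with zeroes:
    assert_eq!(prog, vec![1, 2, 3, 0, 0, 0]);
}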

Benchmark before:

HEAD is now at 1960dae Eliminate unnecessary packet_queues variable
day19                         ...  38,981,295 ns (+/- 436,571)
day21                         ...  28,319,936 ns (+/- 1,245,536)
day23                         ... 180,519,980 ns (+/- 3,075,031)
day25                         ... 114,856,086 ns (+/- 1,681,529)
intcode::day13_emlun          ...  25,581,262 ns (+/- 506,558)
intcode::day9_example_1_clone ...       3,096 ns (+/- 54)
intcode::day9_example_1_new   ...       3,015 ns (+/- 126)
intcode::day9_example_2_clone ...          97 ns (+/- 4)
intcode::day9_example_2_new   ...         108 ns (+/- 1)
intcode::day9_example_3_clone ...          61 ns (+/- 0)
intcode::day9_example_3_new   ...          62 ns (+/- 1)

Benchmark after:

HEAD is now at 97c4808 Use Vec::extend instead of append
day19                         ...  33,946,146 ns (+/- 325,246)
day21                         ...  26,049,742 ns (+/- 564,147)
day23                         ... 175,287,999 ns (+/- 2,670,819)
day25                         ... 101,093,552 ns (+/- 1,836,310)
intcode::day13_emlun          ...  23,006,113 ns (+/- 270,564)
intcode::day9_example_1_clone ...       3,085 ns (+/- 60)
intcode::day9_example_1_new   ...       3,103 ns (+/- 108)
intcode::day9_example_2_clone ...         108 ns (+/- 4)
intcode::day9_example_2_new   ...         115 ns (+/- 2)
intcode::day9_example_3_clone ...          64 ns (+/- 1)
intcode::day9_example_3_new   ...          72 ns (+/- 0)

Change 2: Use Vec::resize instead of extend

This change directly follows the previous, so the benchmark before this change is the same as the benchmark after the previous. The change is another small improvement to the same function, updating the same code snippet to:

if size >= prog.len() {
    prog.resize(size + 1, 0);
}

This instead uses the method Vec::resize(new_len, value), which is meant for this exact use case: it grows (or shrinks) the vector to the given length, filling any new slots with value. This skips constructing the iterator of zeroes entirely, and again makes a small but measurable improvement.
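
As a small standalone illustration of those semantics (not code from the module), resizing to index + 1 makes that index addressable:

fn main() {
    let mut prog = vec![1, 2, 3];
    // Make index 5 addressable by growing to length 6, filled with zeroes:
    prog.resize(5 + 1, 0);
    assert_eq!(prog, vec![1, 2, 3, 0, 0, 0]);
    assert_eq!(prog[5], 0);
}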

Benchmark after:

HEAD is now at c233afc Use Vec::resize instead of extend
day19                         ...  33,644,412 ns (+/- 452,858)
day21                         ...  25,756,035 ns (+/- 345,314)
day23                         ... 169,920,762 ns (+/- 2,603,704)
day25                         ... 104,185,852 ns (+/- 972,813)
intcode::day13_emlun          ...  23,379,818 ns (+/- 299,100)
intcode::day9_example_1_clone ...       2,828 ns (+/- 123)
intcode::day9_example_1_new   ...       2,855 ns (+/- 156)
intcode::day9_example_2_clone ...         105 ns (+/- 4)
intcode::day9_example_2_new   ...         110 ns (+/- 5)
intcode::day9_example_3_clone ...          66 ns (+/- 4)
intcode::day9_example_3_new   ...          71 ns (+/- 3)

Change 3: Use match expression instead of i64::pow

Intcode instructions include a parameter mode for each parameter, stored in the 100s decimal digit for the first parameter, the 1000s digit for the second parameter, and so on. I initially used exponentiation to compute the divisor for the parameter mode; this change instead replaces that with a match expression with explicit mappings for the first three parameters:

- let parmode_pow = 10_i64.pow((offset + 1) as u32);
+ let parmode_pow = match offset {
+     1 => 100,
+     2 => 1000,
+     3 => 10000,
+     _ => unreachable!(),
+ };

This saves a bit of time, since exponentiation is a relatively costly operation. The difference is most noticeable for days 23 and 25:
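
To make the digit positions concrete, here's a standalone sketch of extracting a mode digit with this lookup. The function parameter_mode and the extraction expression are illustrative assumptions, not necessarily the exact code in the module:

// Hypothetical sketch: extract the parameter mode digit for the
// 1-based parameter index `offset` from a full instruction value.
fn parameter_mode(instruction: i64, offset: u32) -> i64 {
    let parmode_pow = match offset {
        1 => 100,
        2 => 1000,
        3 => 10000,
        _ => unreachable!(),
    };
    (instruction / parmode_pow) % 10
}

fn main() {
    // Instruction 1002: opcode 02, first parameter in position mode (0),
    // second parameter in immediate mode (1), third defaulting to 0.
    assert_eq!(parameter_mode(1002, 1), 0);
    assert_eq!(parameter_mode(1002, 2), 1);
    assert_eq!(parameter_mode(1002, 3), 0);
}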

Benchmark before:

HEAD is now at ba889a9 Use lossy conversions instead of try_into()
day19                         ...  33,994,137 ns (+/- 769,221)
day21                         ...  24,832,666 ns (+/- 808,367)
day23                         ... 180,145,616 ns (+/- 1,220,217)
day25                         ... 101,699,713 ns (+/- 1,994,839)
intcode::day13_emlun          ...  22,439,123 ns (+/- 1,059,112)
intcode::day9_example_1_clone ...       2,910 ns (+/- 94)
intcode::day9_example_1_new   ...       2,891 ns (+/- 104)
intcode::day9_example_2_clone ...         102 ns (+/- 4)
intcode::day9_example_2_new   ...         110 ns (+/- 5)
intcode::day9_example_3_clone ...          69 ns (+/- 3)
intcode::day9_example_3_new   ...          75 ns (+/- 3)

Benchmark after:

HEAD is now at d089d6e Use match expression instead of i64::pow
day19                         ...  31,870,599 ns (+/- 212,298)
day21                         ...  23,383,442 ns (+/- 1,349,296)
day23                         ... 167,453,722 ns (+/- 2,373,542)
day25                         ...  94,718,539 ns (+/- 1,589,987)
intcode::day13_emlun          ...  20,048,269 ns (+/- 685,314)
intcode::day9_example_1_clone ...       2,633 ns (+/- 57)
intcode::day9_example_1_new   ...       2,598 ns (+/- 83)
intcode::day9_example_2_clone ...          98 ns (+/- 7)
intcode::day9_example_2_new   ...          97 ns (+/- 5)
intcode::day9_example_3_clone ...          58 ns (+/- 2)
intcode::day9_example_3_new   ...          59 ns (+/- 1)

Change 4: Replace get_args with get_arg

This was one of the really significant changes. Before the change, I had an internal function get_args(num), which parsed out the values of num parameters for the current instruction and returned the result as a Vec. The change replaces this with a function get_arg(arg_num), which parses out only the one parameter identified by arg_num. Humble as it may seem, this saves a lot of time (about 60%), because it eliminates an intermediate Vec for every executed instruction, a Vec which never held more than 2 elements.
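
Here's a minimal self-contained sketch of the difference. decode_arg is a hypothetical stand-in for the real parameter-resolution logic, simplified to handle only position mode:

// Hypothetical stand-in: resolve one parameter of the instruction at
// `pointer`, treating the parameter as an address (position mode only).
fn decode_arg(memory: &[i64], pointer: usize, arg_num: usize) -> i64 {
    let address = memory[pointer + arg_num] as usize;
    memory[address]
}

// Before: one heap-allocated Vec per executed instruction.
fn get_args(memory: &[i64], pointer: usize, num: usize) -> Vec<i64> {
    (1..=num).map(|i| decode_arg(memory, pointer, i)).collect()
}

fn main() {
    // Opcode 1 at position 0, with parameters pointing at addresses 5 and 6:
    let memory = vec![1, 5, 6, 0, 99, 30, 40];
    let args = get_args(&memory, 0, 2);
    assert_eq!(args, vec![30, 40]);
    // After: each parameter is read directly, with no allocation.
    assert_eq!(decode_arg(&memory, 0, 1), 30);
    assert_eq!(decode_arg(&memory, 0, 2), 40);
}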

Benchmark before:

HEAD is now at d6fe561 Eliminate method IntcodeComputer::is_halted
day19                         ...  31,773,265 ns (+/- 429,850)
day21                         ...  22,885,367 ns (+/- 1,535,141)
day23                         ... 166,465,638 ns (+/- 1,867,714)
day25                         ...  95,685,174 ns (+/- 1,363,361)
intcode::day13_emlun          ...  20,137,841 ns (+/- 346,307)
intcode::day9_example_1_clone ...       2,618 ns (+/- 84)
intcode::day9_example_1_new   ...       2,607 ns (+/- 34)
intcode::day9_example_2_clone ...          96 ns (+/- 3)
intcode::day9_example_2_new   ...          97 ns (+/- 1)
intcode::day9_example_3_clone ...          59 ns (+/- 0)
intcode::day9_example_3_new   ...          61 ns (+/- 1)

Benchmark after:

HEAD is now at 38366c8 Replace get_args with get_arg
day19                         ...  13,743,595 ns (+/- 208,007)
day21                         ...  10,213,446 ns (+/- 434,783)
day23                         ...  69,750,626 ns (+/- 515,868)
day25                         ...  37,981,223 ns (+/- 1,256,944)
intcode::day13_emlun          ...   8,250,930 ns (+/- 229,193)
intcode::day9_example_1_clone ...       1,130 ns (+/- 21)
intcode::day9_example_1_new   ...       1,133 ns (+/- 105)
intcode::day9_example_2_clone ...          61 ns (+/- 0)
intcode::day9_example_2_new   ...          67 ns (+/- 1)
intcode::day9_example_3_clone ...          45 ns (+/- 2)
intcode::day9_example_3_new   ...          51 ns (+/- 1)

Change 5: Integrate input/output buffers into IntcodeComputer

This change is really two commits, but they're tightly connected: the first makes the second possible, and while the first on its own makes little difference, the second makes a huge difference for day 23.

Initially, my Intcode machine didn't have integrated buffers for input and output. Instead, the step method took the input for the current step as a parameter and returned the output as an Option value; the input would still only be consumed if the instruction needed it. Apart from this, there was a run method that accepted an input sequence and ran the program to completion, with no option to modify the input based on output feedback. For that purpose, there were a few variations of a run_with function, which takes an initial state and a reducer function and runs the program to completion with that reducer: after each step, the reducer is called with the current state and any Intcode output, and returns a new state and any Intcode input for the next step. This all worked well enough, but it was a pretty roundabout way to do things, just because I wanted to try doing it using functional programming techniques.
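
To illustrate the shape of that reducer API, here's a toy stand-in; ToyMachine and its trivial step rule are purely illustrative, not the real IntcodeComputer:

struct ToyMachine {
    counter: i64,
}

impl ToyMachine {
    // One step: consume optional input, produce optional output.
    fn step(&mut self, input: Option<i64>) -> Option<i64> {
        self.counter += input.unwrap_or(1);
        Some(self.counter)
    }

    // Run a fixed number of steps, threading state and input through the
    // reducer after each step, in the style described above.
    fn run_with<S, F>(&mut self, steps: usize, initial: S, mut reducer: F) -> S
    where
        F: FnMut(S, Option<i64>) -> (S, Option<i64>),
    {
        let mut state = initial;
        let mut input = None;
        for _ in 0..steps {
            let output = self.step(input);
            let (new_state, new_input) = reducer(state, output);
            state = new_state;
            input = new_input;
        }
        state
    }
}

fn main() {
    let mut machine = ToyMachine { counter: 0 };
    // Feed each output back as the next input, summing outputs as the state:
    let total = machine.run_with(4, 0, |sum, output| {
        let out = output.unwrap_or(0);
        (sum + out, Some(out))
    });
    assert_eq!(total, 15); // outputs were 1, 2, 4, 8
}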

The first of these two commits adds integrated input and output buffers, and the second adds a run_mut method which simply runs the program until it halts or until more input is needed. Before this, my solution for day 23 ran all the Intcode computers one step at a time, did all the logic for the challenge, then looped back for one more step. With the new run_mut method, it instead keeps running each computer until that computer needs more input. This alone turns out to reduce the day 23 runtime from ~70 ms to ~2.9 ms.
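
Here's a toy model of the run_mut idea; ToyComputer and its made-up instruction set are purely illustrative, but they show the pause-on-input-starvation behavior that makes the day 23 restructuring possible:

use std::collections::VecDeque;

// Toy "instruction set": positive value = emit that value as output,
// 0 = consume one input (echoing it), -1 = halt.
struct ToyComputer {
    program: Vec<i64>,
    pc: usize,
    input: VecDeque<i64>,
    output: Vec<i64>,
}

enum Step {
    Continue,
    NeedsInput,
    Halted,
}

impl ToyComputer {
    fn step(&mut self) -> Step {
        match self.program[self.pc] {
            -1 => Step::Halted,
            0 => match self.input.pop_front() {
                Some(value) => {
                    self.output.push(value);
                    self.pc += 1;
                    Step::Continue
                }
                None => Step::NeedsInput,
            },
            value => {
                self.output.push(value);
                self.pc += 1;
                Step::Continue
            }
        }
    }

    // Run until halted or until more input is needed.
    fn run_mut(&mut self) {
        while let Step::Continue = self.step() {}
    }
}

fn main() {
    let mut c = ToyComputer {
        program: vec![7, 8, 0, -1],
        pc: 0,
        input: VecDeque::new(),
        output: Vec::new(),
    };
    c.run_mut(); // pauses at the input instruction, which has no input yet
    assert_eq!(c.output, vec![7, 8]);
    c.input.push_back(42);
    c.run_mut(); // resumes, consumes the input, then halts
    assert_eq!(c.output, vec![7, 8, 42]);
}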

Benchmark before the first commit:

HEAD is now at 33fadc5 Use .iter().enumerate() instead of index range
day19                                   ... 13,661,870 ns (+/- 358,065)
day21                                   ... 10,072,786 ns (+/- 516,722)
day23                                   ... 69,739,826 ns (+/- 1,497,667)
day25                                   ... 37,456,243 ns (+/- 914,412)
intcode::day13_emlun                    ...  8,204,654 ns (+/- 84,022)
intcode::day9_example_1_clone           ...      1,152 ns (+/- 18)
intcode::day9_example_1_new             ...      1,104 ns (+/- 19)
intcode::day9_example_2_clone           ...         52 ns (+/- 0)
intcode::day9_example_2_new             ...         54 ns (+/- 0)
intcode::day9_example_3_clone           ...         39 ns (+/- 1)
intcode::day9_example_3_new             ...         39 ns (+/- 1)
intcode_iagueqnar::ackermann_3_6        ... 15,092,942 ns (+/- 56,588)
intcode_iagueqnar::factor_19338240      ...  2,983,820 ns (+/- 32,823)
intcode_iagueqnar::factor_2147483647    ... 67,799,823 ns (+/- 762,711)
intcode_iagueqnar::sum_of_primes_100000 ... 25,967,959 ns (+/- 200,632)

Benchmark after the first commit:

HEAD is now at af12da8 Integrate input/output buffers into IntcodeComputer
day19                                   ... 13,590,871 ns (+/- 250,507)
day21                                   ...  8,675,619 ns (+/- 258,938)
day23                                   ... 72,767,182 ns (+/- 1,057,741)
day25                                   ... 35,118,589 ns (+/- 371,986)
intcode::day13_emlun                    ...  7,275,632 ns (+/- 52,053)
intcode::day9_example_1_clone           ...      1,268 ns (+/- 20)
intcode::day9_example_1_new             ...      1,190 ns (+/- 16)
intcode::day9_example_2_clone           ...         70 ns (+/- 0)
intcode::day9_example_2_new             ...         67 ns (+/- 1)
intcode::day9_example_3_clone           ...         55 ns (+/- 1)
intcode::day9_example_3_new             ...         53 ns (+/- 4)
intcode_iagueqnar::ackermann_3_6        ... 19,055,981 ns (+/- 265,802)
intcode_iagueqnar::factor_19338240      ...  3,093,071 ns (+/- 24,871)
intcode_iagueqnar::factor_2147483647    ... 74,157,281 ns (+/- 440,360)
intcode_iagueqnar::sum_of_primes_100000 ... 27,930,015 ns (+/- 209,639)

Benchmark after the second commit:

HEAD is now at 5b8bb16 Add method IntcodeComputer::run_mut(input)
day19                                   ... 13,810,763 ns (+/- 107,909)
day21                                   ...  8,913,724 ns (+/- 349,600)
day23                                   ...  2,893,669 ns (+/- 36,232)
day25                                   ... 35,575,441 ns (+/- 1,181,292)
intcode::day13_emlun                    ...  7,491,240 ns (+/- 84,875)
intcode::day9_example_1_clone           ...      1,271 ns (+/- 24)
intcode::day9_example_1_new             ...      1,184 ns (+/- 17)
intcode::day9_example_2_clone           ...         69 ns (+/- 1)
intcode::day9_example_2_new             ...         64 ns (+/- 0)
intcode::day9_example_3_clone           ...         54 ns (+/- 0)
intcode::day9_example_3_new             ...         48 ns (+/- 0)
intcode_iagueqnar::ackermann_3_6        ... 17,004,031 ns (+/- 258,163)
intcode_iagueqnar::factor_19338240      ...  3,067,312 ns (+/- 26,751)
intcode_iagueqnar::factor_2147483647    ... 73,583,269 ns (+/- 1,742,256)
intcode_iagueqnar::sum_of_primes_100000 ... 27,856,705 ns (+/- 221,690)

The last few benchmarks here are from an Intcode benchmarking suite made by Reddit user iagueqnar. We can see that these benchmarks suffer a bit from these changes: the programs use very little input or output, so the integrated input/output buffers add a bit of allocation overhead without paying for themselves.