emlun.se
My name is Emil Lundberg. I sometimes write here about things I think about or interesting problems and the solutions I find.
https://emlun.se/
Sun, 23 May 2021 19:32:14 +0200
Jekyll v4.1.1
Advent of Code 2019 in 110 ms: optimizing Intcode
<p>I realize that throughout the previous posts I promised
to get back to performance optimization for the Intcode engine,
but never did.
This is mostly because these optimizations are very language-
and implementation-specific, which didn’t fit well with the algorithmic
focus of the rest of the series.
But in the interest of completeness, I’ll go through that here.</p>
<p>For this runtime benchmark, I’ll be using the solutions
for days 19, 21, 23 and 25 as those are quite heavy on Intcode,
as well as a few additional pure Intcode test cases.
The format here will be closer to a lightly edited Git log
than the other posts:
I’ll go through the change log for my Intcode module,
pick out the changes that made noteworthy performance improvements,
and present those changes along with the associated benchmark.
Because I was also improving the solutions on their own,
I’ll be presenting the benchmarks for both before and after each change.</p>
<h2 id="change-1-use-vecextend-instead-of-append">Change 1: Use Vec::extend instead of append</h2>
<p>My Intcode machine uses a <code class="language-plaintext highlighter-rouge">Vec</code> for the machine memory,
and dynamically expands it whenever the program writes to an address
outside the current range of the vector.
Before this change, this expansion was done using this code snippet:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if size >= prog.len() {
    prog.append(&mut (0..=0).cycle().take(size - prog.len() + 1).collect());
}
</code></pre></div></div>
<p>This uses <code class="language-plaintext highlighter-rouge">.cycle()</code> to generate an infinite series of zeroes,
then <code class="language-plaintext highlighter-rouge">.take()</code> to limit the series to the number of additional elements we need,
uses this sequence to construct a <code class="language-plaintext highlighter-rouge">Vec</code> of zeroes,
and finally appends all items from that vector to the Intcode memory vector.</p>
<p>The update changes this to:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if size >= prog.len() {
    prog.extend((0..=0).cycle().take(size - prog.len() + 1));
}
</code></pre></div></div>
<p>which simply skips constructing the intermediate <code class="language-plaintext highlighter-rouge">Vec</code>.
This saves a small amount of work and makes a small but measurable runtime improvement.</p>
<p>Benchmark before:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>HEAD is now at 1960dae Eliminate unnecessary packet_queues variable
day19 ... 38,981,295 ns (+/- 436,571)
day21 ... 28,319,936 ns (+/- 1,245,536)
day23 ... 180,519,980 ns (+/- 3,075,031)
day25 ... 114,856,086 ns (+/- 1,681,529)
intcode::day13_emlun ... 25,581,262 ns (+/- 506,558)
intcode::day9_example_1_clone ... 3,096 ns (+/- 54)
intcode::day9_example_1_new ... 3,015 ns (+/- 126)
intcode::day9_example_2_clone ... 97 ns (+/- 4)
intcode::day9_example_2_new ... 108 ns (+/- 1)
intcode::day9_example_3_clone ... 61 ns (+/- 0)
intcode::day9_example_3_new ... 62 ns (+/- 1)
</code></pre></div></div>
<p>Benchmark after:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>HEAD is now at 97c4808 Use Vec::extend instead of append
day19 ... 33,946,146 ns (+/- 325,246)
day21 ... 26,049,742 ns (+/- 564,147)
day23 ... 175,287,999 ns (+/- 2,670,819)
day25 ... 101,093,552 ns (+/- 1,836,310)
intcode::day13_emlun ... 23,006,113 ns (+/- 270,564)
intcode::day9_example_1_clone ... 3,085 ns (+/- 60)
intcode::day9_example_1_new ... 3,103 ns (+/- 108)
intcode::day9_example_2_clone ... 108 ns (+/- 4)
intcode::day9_example_2_new ... 115 ns (+/- 2)
intcode::day9_example_3_clone ... 64 ns (+/- 1)
intcode::day9_example_3_new ... 72 ns (+/- 0)
</code></pre></div></div>
<h2 id="change-2-use-vecresize-instead-of-extend">Change 2: Use Vec::resize instead of extend</h2>
<p>This change directly follows the previous,
so the benchmark before this change is the same as the benchmark after the previous.
The change is another small improvement to the same function,
updating the same code snippet to:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if size >= prog.len() {
    prog.resize(size, 0);
}
</code></pre></div></div>
<p>This instead uses the method <code class="language-plaintext highlighter-rouge">Vec::resize(size, value)</code>
which is meant for this exact use case.
This skips constructing the iterator of zeroes,
and again makes a small but measurable improvement.</p>
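<p>To make the three variants concrete, here is a minimal side-by-side sketch.
The function names and the <code class="language-plaintext highlighter-rouge">i64</code> element type are assumptions for illustration.
So is the <code class="language-plaintext highlighter-rouge">size + 1</code> argument to <code class="language-plaintext highlighter-rouge">resize</code>,
which is chosen here so that all three variants leave the vector with the same length
and index <code class="language-plaintext highlighter-rouge">size</code> becomes valid.</p>

```rust
use std::iter;

/// Variant before change 1: collect the zeroes into a temporary Vec, then append it.
fn grow_append(prog: &mut Vec<i64>, size: usize) {
    if size >= prog.len() {
        let mut zeroes: Vec<i64> = iter::repeat(0_i64).take(size - prog.len() + 1).collect();
        prog.append(&mut zeroes);
    }
}

/// Variant after change 1: feed the iterator straight into the vector,
/// skipping the temporary Vec.
fn grow_extend(prog: &mut Vec<i64>, size: usize) {
    if size >= prog.len() {
        prog.extend(iter::repeat(0_i64).take(size - prog.len() + 1));
    }
}

/// Variant after change 2: let Vec::resize fill in the zeroes, skipping the iterator.
/// Assumption: the new length is `size + 1`, so that index `size` is in bounds.
fn grow_resize(prog: &mut Vec<i64>, size: usize) {
    if size >= prog.len() {
        prog.resize(size + 1, 0);
    }
}
```

<p>Called with the same arguments, all three functions should leave the vector in the same state;
they differ only in how much intermediate work they do.</p>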
<p>Benchmark after:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>HEAD is now at c233afc Use Vec::resize instead of extend
day19 ... 33,644,412 ns (+/- 452,858)
day21 ... 25,756,035 ns (+/- 345,314)
day23 ... 169,920,762 ns (+/- 2,603,704)
day25 ... 104,185,852 ns (+/- 972,813)
intcode::day13_emlun ... 23,379,818 ns (+/- 299,100)
intcode::day9_example_1_clone ... 2,828 ns (+/- 123)
intcode::day9_example_1_new ... 2,855 ns (+/- 156)
intcode::day9_example_2_clone ... 105 ns (+/- 4)
intcode::day9_example_2_new ... 110 ns (+/- 5)
intcode::day9_example_3_clone ... 66 ns (+/- 4)
intcode::day9_example_3_new ... 71 ns (+/- 3)
</code></pre></div></div>
<h2 id="change-3-use-match-expression-instead-of-i64pow">Change 3: Use match expression instead of i64::pow</h2>
<p>Intcode instructions include <em>parameter modes</em> for each parameter,
which are stored in the 100s decimal digit for the first parameter,
the 1000s digit for the second parameter, and so on.
I initially used exponentiation to compute the mask for
the parameter mode;
this change instead replaces that with a <code class="language-plaintext highlighter-rouge">match</code> expression
with explicit mappings for the first 3 parameters:</p>
<figure class="highlight"><pre><code class="language-diff" data-lang="diff"><span class="gd">- let parmode_pow = 10_i64.pow((offset + 1) as u32);
</span><span class="gi">+ let parmode_pow = match offset {
+ 1 => 100,
+ 2 => 1000,
+ 3 => 10000,
+ _ => unreachable!(),
+ };</span></code></pre></figure>
<p>This saves a bit of time since exponentiation is a relatively costly operation.
The difference is most noticeable for day 23 and 25:</p>
<p>Benchmark before:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>HEAD is now at ba889a9 Use lossy conversions instead of try_into()
day19 ... 33,994,137 ns (+/- 769,221)
day21 ... 24,832,666 ns (+/- 808,367)
day23 ... 180,145,616 ns (+/- 1,220,217)
day25 ... 101,699,713 ns (+/- 1,994,839)
intcode::day13_emlun ... 22,439,123 ns (+/- 1,059,112)
intcode::day9_example_1_clone ... 2,910 ns (+/- 94)
intcode::day9_example_1_new ... 2,891 ns (+/- 104)
intcode::day9_example_2_clone ... 102 ns (+/- 4)
intcode::day9_example_2_new ... 110 ns (+/- 5)
intcode::day9_example_3_clone ... 69 ns (+/- 3)
intcode::day9_example_3_new ... 75 ns (+/- 3)
</code></pre></div></div>
<p>Benchmark after:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>HEAD is now at d089d6e Use match expression instead of i64::pow
day19 ... 31,870,599 ns (+/- 212,298)
day21 ... 23,383,442 ns (+/- 1,349,296)
day23 ... 167,453,722 ns (+/- 2,373,542)
day25 ... 94,718,539 ns (+/- 1,589,987)
intcode::day13_emlun ... 20,048,269 ns (+/- 685,314)
intcode::day9_example_1_clone ... 2,633 ns (+/- 57)
intcode::day9_example_1_new ... 2,598 ns (+/- 83)
intcode::day9_example_2_clone ... 98 ns (+/- 7)
intcode::day9_example_2_new ... 97 ns (+/- 5)
intcode::day9_example_3_clone ... 58 ns (+/- 2)
intcode::day9_example_3_new ... 59 ns (+/- 1)
</code></pre></div></div>
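<p>As a rough illustration of how the mask from change 3 is used,
the sketch below extracts the mode digit for one parameter.
The function names are hypothetical; only the <code class="language-plaintext highlighter-rouge">match</code> mapping is taken from the diff above.</p>

```rust
/// Divisor that moves the mode digit of parameter `offset` (1-based)
/// into the ones place. Same mapping as in the diff above.
fn parmode_pow(offset: usize) -> i64 {
    match offset {
        1 => 100,
        2 => 1000,
        3 => 10000,
        _ => unreachable!(),
    }
}

/// Extract the parameter mode digit for parameter `offset` of `instruction`.
/// (Hypothetical helper, not part of the original module.)
fn parameter_mode(instruction: i64, offset: usize) -> i64 {
    (instruction / parmode_pow(offset)) % 10
}
```

<p>For the instruction <code class="language-plaintext highlighter-rouge">1002</code>, this gives mode 0 for the first parameter,
mode 1 for the second and mode 0 for the third.</p>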
<h2 id="change-4-replace-get_args-with-get_arg">Change 4: Replace get_args with get_arg</h2>
<p>This was one of the really significant changes.
Before the change, I had an internal function <code class="language-plaintext highlighter-rouge">get_args(num)</code>,
which parses out the values of <code class="language-plaintext highlighter-rouge">num</code> parameters
for the current instruction and returns the result as a <code class="language-plaintext highlighter-rouge">Vec</code>.
The change replaces this with the function <code class="language-plaintext highlighter-rouge">get_arg(arg_num)</code>
which only parses out the one parameter identified by <code class="language-plaintext highlighter-rouge">arg_num</code>.
Humble as it may seem, this saves <em>a lot</em> of time - about 60% -
because it eliminates an intermediate <code class="language-plaintext highlighter-rouge">Vec</code> for <em>every instruction</em>,
a <code class="language-plaintext highlighter-rouge">Vec</code> which never had more than 2 elements.</p>
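<p>The difference between the two shapes can be sketched as follows.
This is not the actual module: the <code class="language-plaintext highlighter-rouge">Machine</code> struct and its field names are assumptions,
and only position and immediate mode are covered.</p>

```rust
/// Minimal stand-in for the machine state; the struct and field names are hypothetical.
struct Machine {
    prog: Vec<i64>,
    eip: usize,
}

impl Machine {
    /// Resolve only parameter `arg_num` (1-based) of the current instruction.
    /// Nothing is allocated on the heap here.
    fn get_arg(&self, arg_num: usize) -> i64 {
        let parmode_pow = match arg_num {
            1 => 100,
            2 => 1000,
            3 => 10000,
            _ => unreachable!(),
        };
        let raw = self.prog[self.eip + arg_num];
        match (self.prog[self.eip] / parmode_pow) % 10 {
            0 => self.prog[raw as usize], // position mode
            1 => raw,                     // immediate mode
            mode => panic!("mode {} is not covered by this sketch", mode),
        }
    }

    /// The older shape: resolve `num` parameters at once into a fresh Vec,
    /// which costs one heap allocation per executed instruction.
    fn get_args(&self, num: usize) -> Vec<i64> {
        (1..=num).map(|i| self.get_arg(i)).collect()
    }
}
```

<p>Both shapes compute the same values; the second one just wraps them in a short-lived <code class="language-plaintext highlighter-rouge">Vec</code>.</p>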
<p>Benchmark before:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>HEAD is now at d6fe561 Eliminate method IntcodeComputer::is_halted
day19 ... 31,773,265 ns (+/- 429,850)
day21 ... 22,885,367 ns (+/- 1,535,141)
day23 ... 166,465,638 ns (+/- 1,867,714)
day25 ... 95,685,174 ns (+/- 1,363,361)
intcode::day13_emlun ... 20,137,841 ns (+/- 346,307)
intcode::day9_example_1_clone ... 2,618 ns (+/- 84)
intcode::day9_example_1_new ... 2,607 ns (+/- 34)
intcode::day9_example_2_clone ... 96 ns (+/- 3)
intcode::day9_example_2_new ... 97 ns (+/- 1)
intcode::day9_example_3_clone ... 59 ns (+/- 0)
intcode::day9_example_3_new ... 61 ns (+/- 1)
</code></pre></div></div>
<p>Benchmark after:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>HEAD is now at 38366c8 Replace get_args with get_arg
day19 ... 13,743,595 ns (+/- 208,007)
day21 ... 10,213,446 ns (+/- 434,783)
day23 ... 69,750,626 ns (+/- 515,868)
day25 ... 37,981,223 ns (+/- 1,256,944)
intcode::day13_emlun ... 8,250,930 ns (+/- 229,193)
intcode::day9_example_1_clone ... 1,130 ns (+/- 21)
intcode::day9_example_1_new ... 1,133 ns (+/- 105)
intcode::day9_example_2_clone ... 61 ns (+/- 0)
intcode::day9_example_2_new ... 67 ns (+/- 1)
intcode::day9_example_3_clone ... 45 ns (+/- 2)
intcode::day9_example_3_new ... 51 ns (+/- 1)
</code></pre></div></div>
<h2 id="change-5-integrate-inputoutput-buffers-into-intcodecomputer">Change 5: Integrate input/output buffers into IntcodeComputer</h2>
<p>This change is really two commits, but they’re tightly connected
as the first makes the second possible,
and while the first on its own makes little difference,
the second makes a huge difference for day 23.</p>
<p>Initially, my Intcode machine didn’t have integrated buffers for input and output.
Instead, the <code class="language-plaintext highlighter-rouge">step</code> method took the input for the current step as a parameter,
and returned the output as an <code class="language-plaintext highlighter-rouge">Option</code> value.
The input would still only be consumed if the instruction needed it.
Apart from this, there was a <code class="language-plaintext highlighter-rouge">run</code> method that accepted an input sequence
and ran the program to completion,
with no option to modify input based on output feedback.
For that purpose, there were a few variations of a <code class="language-plaintext highlighter-rouge">run_with</code> function,
which takes an initial <em>state</em> and a <em>reducer function</em>,
and runs the program to completion with that reducer.
After each step, the reducer is called with the current state and any Intcode output,
and returns a new state and any Intcode input for the next step.
This all works well enough, but it is a pretty roundabout way to do things
just because I wanted to try doing it using functional programming techniques.</p>
<p>The first of these two commits adds integrated input and output buffers,
and the second adds a <code class="language-plaintext highlighter-rouge">run_mut</code> method
which just runs the program until it halts or until more input is needed.
Before this, my solution for day 23 ran all the Intcode computers one step at a time,
then did all the logic for the challenge, then looped back for one more step.
With the new <code class="language-plaintext highlighter-rouge">run_mut</code> method, it instead keeps running each computer
until it needs more input.
This alone turns out to reduce runtime from ~70 ms to ~2.9 ms.</p>
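<p>The idea of running until the program blocks on input can be sketched like this.
It is a simplified illustration rather than the real implementation:
the field names, the <code class="language-plaintext highlighter-rouge">step</code> return value and the exact signature of <code class="language-plaintext highlighter-rouge">run_mut</code> are assumptions,
and only opcodes 1, 2, 3, 4 and 99 are covered.</p>

```rust
use std::collections::VecDeque;

/// Minimal stand-in machine with integrated input and output buffers.
struct IntcodeComputer {
    prog: Vec<i64>,
    eip: usize,
    input: VecDeque<i64>,
    output: VecDeque<i64>,
    halted: bool,
}

impl IntcodeComputer {
    fn new(prog: Vec<i64>) -> Self {
        IntcodeComputer {
            prog,
            eip: 0,
            input: VecDeque::new(),
            output: VecDeque::new(),
            halted: false,
        }
    }

    /// Resolve parameter `arg_num` in position or immediate mode.
    fn get_arg(&self, arg_num: usize) -> i64 {
        let parmode_pow = match arg_num {
            1 => 100,
            2 => 1000,
            _ => unreachable!(),
        };
        let raw = self.prog[self.eip + arg_num];
        if (self.prog[self.eip] / parmode_pow) % 10 == 1 {
            raw
        } else {
            self.prog[raw as usize]
        }
    }

    /// Execute one instruction. Returns false if the machine halted
    /// or is blocked waiting for input, true otherwise.
    fn step(&mut self) -> bool {
        let opcode = self.prog[self.eip] % 100;
        match opcode {
            1 | 2 => {
                let (a, b) = (self.get_arg(1), self.get_arg(2));
                let dst = self.prog[self.eip + 3] as usize;
                self.prog[dst] = if opcode == 1 { a + b } else { a * b };
                self.eip += 4;
            }
            3 => match self.input.pop_front() {
                Some(value) => {
                    let dst = self.prog[self.eip + 1] as usize;
                    self.prog[dst] = value;
                    self.eip += 2;
                }
                // Blocked: stay on the input instruction so it is retried later.
                None => return false,
            },
            4 => {
                let value = self.get_arg(1);
                self.output.push_back(value);
                self.eip += 2;
            }
            99 => {
                self.halted = true;
                return false;
            }
            other => panic!("opcode {} is not covered by this sketch", other),
        }
        true
    }

    /// Queue `input`, then run until the program halts or needs more input.
    fn run_mut(&mut self, input: impl IntoIterator<Item = i64>) {
        self.input.extend(input);
        while self.step() {}
    }
}
```

<p>A caller like the day 23 network loop can then hand each computer its pending packets,
call <code class="language-plaintext highlighter-rouge">run_mut</code> once, and drain the output buffer,
instead of interleaving the challenge logic with every single instruction.</p>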
<p>Benchmark before the first commit:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>HEAD is now at 33fadc5 Use .iter().enumerate() instead of index range
day19 ... 13,661,870 ns (+/- 358,065)
day21 ... 10,072,786 ns (+/- 516,722)
day23 ... 69,739,826 ns (+/- 1,497,667)
day25 ... 37,456,243 ns (+/- 914,412)
intcode::day13_emlun ... 8,204,654 ns (+/- 84,022)
intcode::day9_example_1_clone ... 1,152 ns (+/- 18)
intcode::day9_example_1_new ... 1,104 ns (+/- 19)
intcode::day9_example_2_clone ... 52 ns (+/- 0)
intcode::day9_example_2_new ... 54 ns (+/- 0)
intcode::day9_example_3_clone ... 39 ns (+/- 1)
intcode::day9_example_3_new ... 39 ns (+/- 1)
intcode_iagueqnar::ackermann_3_6 ... 15,092,942 ns (+/- 56,588)
intcode_iagueqnar::factor_19338240 ... 2,983,820 ns (+/- 32,823)
intcode_iagueqnar::factor_2147483647 ... 67,799,823 ns (+/- 762,711)
intcode_iagueqnar::sum_of_primes_100000 ... 25,967,959 ns (+/- 200,632)
</code></pre></div></div>
<p>Benchmark after the first commit:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>HEAD is now at af12da8 Integrate input/output buffers into IntcodeComputer
day19 ... 13,590,871 ns (+/- 250,507)
day21 ... 8,675,619 ns (+/- 258,938)
day23 ... 72,767,182 ns (+/- 1,057,741)
day25 ... 35,118,589 ns (+/- 371,986)
intcode::day13_emlun ... 7,275,632 ns (+/- 52,053)
intcode::day9_example_1_clone ... 1,268 ns (+/- 20)
intcode::day9_example_1_new ... 1,190 ns (+/- 16)
intcode::day9_example_2_clone ... 70 ns (+/- 0)
intcode::day9_example_2_new ... 67 ns (+/- 1)
intcode::day9_example_3_clone ... 55 ns (+/- 1)
intcode::day9_example_3_new ... 53 ns (+/- 4)
intcode_iagueqnar::ackermann_3_6 ... 19,055,981 ns (+/- 265,802)
intcode_iagueqnar::factor_19338240 ... 3,093,071 ns (+/- 24,871)
intcode_iagueqnar::factor_2147483647 ... 74,157,281 ns (+/- 440,360)
intcode_iagueqnar::sum_of_primes_100000 ... 27,930,015 ns (+/- 209,639)
</code></pre></div></div>
<p>Benchmark after the second commit:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>HEAD is now at 5b8bb16 Add method IntcodeComputer::run_mut(input)
day19 ... 13,810,763 ns (+/- 107,909)
day21 ... 8,913,724 ns (+/- 349,600)
day23 ... 2,893,669 ns (+/- 36,232)
day25 ... 35,575,441 ns (+/- 1,181,292)
intcode::day13_emlun ... 7,491,240 ns (+/- 84,875)
intcode::day9_example_1_clone ... 1,271 ns (+/- 24)
intcode::day9_example_1_new ... 1,184 ns (+/- 17)
intcode::day9_example_2_clone ... 69 ns (+/- 1)
intcode::day9_example_2_new ... 64 ns (+/- 0)
intcode::day9_example_3_clone ... 54 ns (+/- 0)
intcode::day9_example_3_new ... 48 ns (+/- 0)
intcode_iagueqnar::ackermann_3_6 ... 17,004,031 ns (+/- 258,163)
intcode_iagueqnar::factor_19338240 ... 3,067,312 ns (+/- 26,751)
intcode_iagueqnar::factor_2147483647 ... 73,583,269 ns (+/- 1,742,256)
intcode_iagueqnar::sum_of_primes_100000 ... 27,856,705 ns (+/- 221,690)
</code></pre></div></div>
<p>The last few benchmarks here are from an <a href="https://www.reddit.com/r/adventofcode/comments/egq9xn/2019_day_9_intcode_benchmarking_suite/">Intcode benchmarking suite</a>
made by Reddit user <code class="language-plaintext highlighter-rouge">iagueqnar</code>.
We can see here that these benchmarks suffer a bit from these changes
as they don’t use much input or output,
and the integrated input/output buffers do take a bit of extra time to allocate.</p>
Thu, 27 Aug 2020 00:00:00 +0200
https://emlun.se/advent-of-code-2019/2020/08/27/followup-intcode.html
advent-of-code, algorithms, data-structures, english, programming, rust, performance, advent-of-code-2019
Advent of Code 2019 in 110 ms: day 3 revisited
<p>After I posted this series <a href="https://www.reddit.com/r/adventofcode/comments/igwi11/2019_optimized_solutions_in_rust_130_ms_total">to the Advent of Code subreddit</a>,
Reddit user askalski - who has previously posted optimized solutions in C and C++,
and was one of my inspirations for doing the same -
<a href="https://www.reddit.com/r/adventofcode/comments/igwi11/2019_optimized_solutions_in_rust_130_ms_total/g2xq4gp">suggested</a> a better solution for <a href="/advent-of-code-2019/2020/08/26/day-01-05.html">day 3</a>:
processing the wires as line segments instead of point sets.
This reduces runtime by 99.9%, down to 28 μs.</p>
<p>This idea had crossed my mind, but I had rejected it
thinking it wouldn’t improve performance enough to offset the increased complexity per operation -
the algorithm I had was linear in complexity, after all.
Let’s just say I was hilariously wrong.</p>
<p>The new algorithm is in a way simpler than the hash set method:
parse each wire into a vector of line segments,
then iterate through each pair of line segments and check if they intersect.
This can be done quite quickly since we know all line segments are straight lines
parallel with one of the coordinate axes.</p>
<figure id="fig01">
<div class="images">
<a href="/advent-of-code-2019/line-intersection.svg">
<img src="/advent-of-code-2019/line-intersection.svg" />
</a>
</div>
<figcaption>
Figure 1: Finding the intersection of two axis-parallel lines
</figcaption>
</figure>
<p>It is easy to see from <a href="#fig01">figure 1</a> that an intersection exists
if and only if:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(x10 <= x00 <= x11 AND
y00 <= y10 <= y01)
OR
(x00 <= x10 <= x01 AND
y10 <= y00 <= y11)
</code></pre></div></div>
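<p>Translated into Rust, the condition above might look like the sketch below.
The <code class="language-plaintext highlighter-rouge">Segment</code> type and its field names are assumptions;
line 0 in the condition corresponds to <code class="language-plaintext highlighter-rouge">a</code> and line 1 to <code class="language-plaintext highlighter-rouge">b</code>.
The sketch assumes normalized endpoints and perpendicular segments,
and does not handle overlapping parallel segments.</p>

```rust
/// Axis-parallel line segment with normalized endpoints:
/// x0 <= x1 and y0 <= y1, and either x0 == x1 or y0 == y1.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Segment {
    x0: i32,
    y0: i32,
    x1: i32,
    y1: i32,
}

/// Intersection point of two perpendicular axis-parallel segments, if any.
fn intersection(a: &Segment, b: &Segment) -> Option<(i32, i32)> {
    if b.x0 <= a.x0 && a.x0 <= b.x1 && a.y0 <= b.y0 && b.y0 <= a.y1 {
        // `a` is vertical and `b` is horizontal.
        Some((a.x0, b.y0))
    } else if a.x0 <= b.x0 && b.x0 <= a.x1 && b.y0 <= a.y0 && a.y0 <= b.y1 {
        // `a` is horizontal and `b` is vertical.
        Some((b.x0, a.y0))
    } else {
        None
    }
}
```

<p>The check is only a handful of integer comparisons per pair, with no hashing and no allocation.</p>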
<p>The drawback of this approach is that overlapping lines become more cumbersome to support.
Detecting the overlap is easy enough, but for part 2 we also need to track the path length
along each line.
I didn’t want to deal with this, so I decided to simply crash on overlapping lines.</p>
<p>Using the above rules, we can simply iterate over all pairs of line segments
to find the set of intersections.
Although this has time complexity O(N<sup>2</sup>) and the hash set method has complexity O(N),
my puzzle input has 301 + 301 line segments spanning 155,678 + 149,517 points.
It turns out 301 * 301 = 90,601, which is significantly smaller than both 155,678 and 149,517.
Not only does this mean we need to make fewer tests,
it also means we don’t need to hash each point several times for the hash set operations.</p>
<p>For part 2, we can extend our line segment representation
to not only store the two end points,
but also the first point along the path from the origin
as well as the total path length.
When we compute an intersection, we can use these values to compute
the total path length to that intersection.
With that, we simply need to visit each intersection once
to compute both solutions we need.
With these optimizations, both parts take ~130 μs to solve.</p>
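<p>One way to picture the extended representation is the sketch below.
The type, field and function names are hypothetical,
and "total path length" is interpreted here as the number of steps walked
before reaching the start of the segment.</p>

```rust
/// One straight piece of a wire. `start` is the end the wire reaches first,
/// and `steps_before` is the path length from the origin up to `start`.
struct PathSegment {
    start: (i32, i32),
    end: (i32, i32),
    steps_before: i32,
}

/// Turn a list of moves such as ('R', 8) into path segments.
fn trace(moves: &[(char, i32)]) -> Vec<PathSegment> {
    let mut segments = Vec::with_capacity(moves.len());
    let (mut pos, mut steps) = ((0, 0), 0);
    for &(dir, len) in moves {
        let (dx, dy) = match dir {
            'R' => (1, 0),
            'L' => (-1, 0),
            'U' => (0, 1),
            'D' => (0, -1),
            _ => panic!("unknown direction {}", dir),
        };
        let end = (pos.0 + dx * len, pos.1 + dy * len);
        segments.push(PathSegment { start: pos, end, steps_before: steps });
        pos = end;
        steps += len;
    }
    segments
}

/// Path length along the wire from the origin to `point`,
/// which is assumed to lie on `seg`.
fn steps_to(seg: &PathSegment, point: (i32, i32)) -> i32 {
    seg.steps_before + (point.0 - seg.start.0).abs() + (point.1 - seg.start.1).abs()
}
```

<p>For the first example in the puzzle description
(<code class="language-plaintext highlighter-rouge">R8,U5,L5,D3</code> and <code class="language-plaintext highlighter-rouge">U7,R6,D4,L4</code>),
adding up <code class="language-plaintext highlighter-rouge">steps_to</code> for both wires gives 30 at the intersection (6, 5) and 40 at (3, 3).</p>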
<p>There’s one more trick we can use to go even faster, though:
we don’t need to test every pair of line segments.
Instead, we can split the first wire into the line segments parallel with the x axis,
and the ones parallel with the y axis.
We can then sort each vector by y and x coordinate, respectively.
This allows us to binary search for just the range of segments
that have a chance to intersect the other wire,
further reducing runtime for this problem to 28 μs
and the total runtime to 110 ms.</p>
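<p>The range lookup could be sketched with the standard library’s <code class="language-plaintext highlighter-rouge">partition_point</code>,
as below for the segments parallel with the x axis.
The <code class="language-plaintext highlighter-rouge">Horizontal</code> type and the function names are assumptions,
and the actual solution may organize the search differently.</p>

```rust
/// A segment parallel with the x axis, spanning x0..=x1 at height y.
#[derive(Debug, PartialEq)]
struct Horizontal {
    y: i32,
    x0: i32,
    x1: i32,
}

/// Given `horizontals` sorted by `y`, binary search for the sub-slice
/// whose `y` lies within y_min..=y_max. Assumes y_min <= y_max.
fn candidates(horizontals: &[Horizontal], y_min: i32, y_max: i32) -> &[Horizontal] {
    let lo = horizontals.partition_point(|h| h.y < y_min);
    let hi = horizontals.partition_point(|h| h.y <= y_max);
    &horizontals[lo..hi]
}

/// Intersections between the sorted horizontal segments and one vertical
/// segment at `x` spanning y_min..=y_max.
fn intersections(horizontals: &[Horizontal], x: i32, y_min: i32, y_max: i32) -> Vec<(i32, i32)> {
    candidates(horizontals, y_min, y_max)
        .iter()
        .filter(|h| h.x0 <= x && x <= h.x1)
        .map(|h| (x, h.y))
        .collect()
}
```

<p>Only the segments inside the matching y range are tested at all;
everything outside it is skipped by the two binary searches.</p>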
Wed, 26 Aug 2020 00:00:00 +0200
https://emlun.se/advent-of-code-2019/2020/08/26/followup-day03-revisited.html
advent-of-code, algorithms, data-structures, english, programming, rust, performance, advent-of-code-2019
Advent of Code 2019 in 130 ms: closing thoughts
<p>And so this series of posts comes to a close.
I hope I was able to help you learn something -
I certainly did while developing these solutions.
Even while writing these posts, as I went through the solutions again,
I found an additional 34 ms of time savings,
even after I’d decided that 164 ms was a good enough time to settle on.</p>
<p>My main reflection after all this is that if you want to write fast programs,
the first and most important thing to do is to make it easy to measure performance.
Once that is set up and you can test ideas against hard, objective observations,
you can start poking around and see what happens.
It actually turns out to be surprisingly easy, at least in Rust.
One of the most important factors is simply being familiar
with some standard data structures and their strengths and weaknesses.
Algorithms also make a big difference, of course,
but it can be quite impressive to see a program run some 6 times faster
just by changing a collection from a hash map to a vector.</p>
<p>Another reflection is, of course, that Rust is indeed blazingly fast.
It’s one thing to read about zero-cost abstractions as an idea,
but another to see firsthand that iterators and other high-level concepts
are indeed just as fast as, or even faster than, a traditional for loop.
Still, you do need to mind your data -
subtle things like <code class="language-plaintext highlighter-rouge">.collect()</code>ing a <code class="language-plaintext highlighter-rouge">Vec</code> you don’t really need
can have a <a href="/advent-of-code-2019/2020/08/27/followup-intcode.html">surprisingly large impact</a> on performance.
The good thing, though, is that it’s often easy to know when you’re being inefficient,
as there’s often a <code class="language-plaintext highlighter-rouge">.clone()</code> or <code class="language-plaintext highlighter-rouge">.collect()</code> around.</p>
<p>I’d like to thank <a href="https://twitter.com/ericwastl">Eric Wastl</a> for making Advent of Code these last few years.
They’ve been a lot of fun to solve, and a great way to practice a new programming language.
I’m looking forward to this year’s edition.
And of course: thank <em>you</em> for reading, and good luck with your current and future projects!</p>
Wed, 26 Aug 2020 00:00:00 +0200
https://emlun.se/advent-of-code-2019/2020/08/26/day-99-outro.html
advent-of-code, algorithms, data-structures, english, programming, rust, performance, advent-of-code-2019
Advent of Code 2019 in 130 ms: day 25
<h2 id="day-25-cryostasis">Day 25: Cryostasis</h2>
<p><a href="https://adventofcode.com/2019/day/25">The final challenge</a> is a text-based adventure game in Intcode,
and, much like previous Intcode challenges, it benefits a lot
from minimizing the amount of Intcode you need to run.
<a href="https://github.com/emlun/adventofcode-2019/blob/master/src/days/day25.rs">My solution</a> runs in 5.7 ms,
using an initial depth-first search phase
and then making use of <a href="https://en.wikipedia.org/wiki/Gray_code">Gray code</a> to minimize
the number of inventory changes while cracking the combination for the lock.</p>
<!--more-->
<p>The program runs in three phases: collect all the items,
navigate to the security room, and then crack the item combination.</p>
<p>The collection phase is done using a depth-first search,
keeping track of unexplored rooms in a stack.
Whenever we find an item, we pick it up
unless it’s on a hardcoded list of “bad items”.
I haven’t thought of a good way to detect the bad items automatically,
since they have several different effects -
it seems like you’d need special-purpose code for some of them anyway,
so it’s easier to just hardcode them.
Anyway, when we encounter the security room,
we start tracking the path to get back there
using a second stack of moves.
Once there are no more unexplored rooms, we move to the navigation state.</p>
<p>In the navigation state, we simply follow the security room move stack
to get back to the security room in the minimum number of moves.
Once there, we move to the unlock state.</p>
<p>In the unlock state, we try item combinations until we win the game.
We start by holding all items,
and then drop and pick up items in different combinations.
Since each attempt requires one call to the Intcode program,
we can save a lot of time with a good strategy.
One way we can do this is by minimizing the number of times we need
to drop or pick up an item.
We can encode the items we’re holding as a binary number,
where each digit represents whether we’re holding one item.
We can then repeatedly increment this number
to iterate through all combinations of items.</p>
<p>To reduce the number of attempts we need,
we can make use of the fact that the game tells us
if we’re holding too much or too little weight.
We don’t know how much each item weighs,
but we can still keep track of the combinations we’ve tried.
If the next combination contains all items of a combination we know was too heavy,
we can just skip it because we know it will definitely also be too heavy.
Likewise, if the next combination is a subset of a combination we know was too light,
we can skip that too.
This saves about 75% runtime for my puzzle input.</p>
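<p>With the held items encoded as bits, the two skip rules reduce to a pair of mask tests.
The sketch below is one possible way to write them;
the <code class="language-plaintext highlighter-rouge">Pruner</code> type and its fields are hypothetical names.</p>

```rust
/// Remembers the combinations that were rejected so far.
/// Each bit of a combination says whether the corresponding item is held.
struct Pruner {
    too_heavy: Vec<u32>,
    too_light: Vec<u32>,
}

impl Pruner {
    fn new() -> Self {
        Pruner { too_heavy: Vec::new(), too_light: Vec::new() }
    }

    /// True if `combo` is guaranteed to fail and does not need to be tried.
    fn can_skip(&self, combo: u32) -> bool {
        // Contains every item of a known too-heavy combination: still too heavy.
        self.too_heavy.iter().any(|&heavy| (combo & heavy) == heavy)
            // Is a subset of a known too-light combination: still too light.
            || self.too_light.iter().any(|&light| (combo & light) == combo)
    }
}
```

<p>After each failed attempt, the tried combination is pushed to the matching list,
so the lists only ever hold combinations that were actually run through the Intcode program.</p>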
<p>Lastly, we can be a bit smarter with how we iterate through attempts.
To minimize the number of items we need to drop or pick up,
we want to change as few bits of our combination number as possible between attempts.
To do that, we can represent the held items as <a href="https://en.wikipedia.org/wiki/Gray_code">Gray code</a>.
This is a binary code where successive numbers differ only by one bit,
which is exactly what we want.
This benefit is diminished a bit by skipping redundant attempts,
but still helps.
If we also start by holding half the items instead of all of them -
so we can make better use of both the “too light” and “too heavy” information -
this saves an additional 44% runtime.
All in all, these optimizations bring us down to 5.7 ms total.</p>
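<p>The standard binary-reflected Gray code is cheap to compute from a plain counter.
The sketch below shows that conversion and how to find the single item to drop or pick up between two attempts;
the function names are hypothetical, and the real solution may track this differently.</p>

```rust
/// The n-th binary-reflected Gray code.
fn gray(n: u32) -> u32 {
    n ^ (n >> 1)
}

/// Index of the one bit (item) that differs between attempt `n` and attempt `n + 1`.
fn changed_item(n: u32) -> u32 {
    (gray(n) ^ gray(n + 1)).trailing_zeros()
}
```

<p>Counting 0, 1, 2, 3 yields the codes <code class="language-plaintext highlighter-rouge">00</code>, <code class="language-plaintext highlighter-rouge">01</code>, <code class="language-plaintext highlighter-rouge">11</code>, <code class="language-plaintext highlighter-rouge">10</code>,
so each new attempt needs exactly one drop or pick-up command.</p>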
Wed, 26 Aug 2020 00:00:00 +0200
https://emlun.se/advent-of-code-2019/2020/08/26/day-25.html
advent-of-code, algorithms, data-structures, english, programming, rust, performance, advent-of-code-2019
Advent of Code 2019 in 130 ms: day 24
<h2 id="day-24-planet-of-discord">Day 24: Planet of Discord</h2>
<p><a href="https://adventofcode.com/2019/day/24">Day 24</a> is a variant of Conway’s game of life.
<a href="https://github.com/emlun/adventofcode-2019/blob/master/src/days/day24.rs">My solution</a> runs in 2.6 ms and gains its performance
from efficient data structures,
including a custom bit field representation of Boolean matrices
similar to the key sets in <a href="/advent-of-code-2019/2020/08/26/day-18.html">day 18</a>.</p>
<!--more-->
<p>The basic principle for this challenge is simple:
propagate the world state forward in time,
and detect when any state occurs a second time.
Part 2 introduces new connectivity rules,
but otherwise remains the same and asks us to simulate 200 time steps.</p>
<p>A lot of the work in this challenge consists of looking up and counting neighbors.
Each cell of the game can only be in one of two states,
so an easy solution is to represent the state
as a 2-dimensional vector of Boolean values,
but a more efficient representation is to use a bit field.</p>
<p>The state of each level is 5x5 cells, so we could fit all 25 cells
into a 32-bit integer value.
But it’s going to be useful to have one cell of padding on each side,
meaning our states will be 7x7 cells, so we’ll need to use 64-bit integers.
We slice the integer into 7-bit rows, in order from least to most significant:
the first and last groups of 7 bits are always zero,
the first and last bit of each group of 7 are always zero,
and the middle 5 bits of the middle 5 groups are the cells of each row,
as shown in <a href="#fig14">figure 14</a>.</p>
<figure id="fig14">
<pre>
Original:    Padded:

             0000000     0 -  6
....#        0000010     7 - 13
#..#.        0100100    14 - 20
#..##   =>   0100110    21 - 27
..#..        0001000    28 - 34
#....        0100000    35 - 41
             0000000    42 - 48

As integer:

4        4          3           2          1          0
8765432 1098765 4321098 7654321 0987654 3210987 6543210
-------------------------------------------------------
0000000 0000010 0001000 0110010 0010010 0100000 0000000
</pre>
<figcaption>
Figure 14: Example state encoded as a 49-bit integer
</figcaption>
</figure>
<p>With this representation, we can easily count all the neighbors of a given cell
using the number <code class="language-plaintext highlighter-rouge">33410</code>, or <code class="language-plaintext highlighter-rouge">0b000001000001010000010</code>, as a bit mask:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0100000
1010000
0100000
0000000
0000000
0000000
0000000
</code></pre></div></div>
<p>This is why the padding is useful -
we don’t need to spend cycles on worrying about the edges.
All we need to do is bit-shift this pattern
by the index of the cell we want the neighbors for,
then AND the shifted pattern with the bit field,
and count the number of ones in the result.
Incidentally, Rust integers have a method <code class="language-plaintext highlighter-rouge">.count_ones()</code> which does exactly that.
The index for the cell at coordinates <code class="language-plaintext highlighter-rouge">(x, y)</code> in the original 5x5 state
is simply <code class="language-plaintext highlighter-rouge">7 * (y + 1) + (x + 1)</code>.</p>
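<p>Put together, the padded encoding and the neighbor count might look like the following sketch.
The function names are hypothetical.
Note that the mask as drawn above surrounds bit 8, which is the cell <code class="language-plaintext highlighter-rouge">(0, 0)</code>,
so this sketch shifts it by the cell index minus 8.</p>

```rust
/// Neighbor mask from the text (33410), centered on bit 8, i.e. the cell (0, 0).
const NEIGHBOR_MASK: u64 = 0b000001000001010000010;

/// Bit index of the cell at `(x, y)` in the original 5x5 grid.
fn cell_index(x: u32, y: u32) -> u32 {
    7 * (y + 1) + (x + 1)
}

/// Encode five rows of '#' and '.' into the padded 7x7 bit field.
fn encode(rows: &[&str]) -> u64 {
    let mut state = 0_u64;
    for (y, row) in rows.iter().enumerate() {
        for (x, cell) in row.chars().enumerate() {
            if cell == '#' {
                state |= 1_u64 << cell_index(x as u32, y as u32);
            }
        }
    }
    state
}

/// Number of live neighbors of the cell at `(x, y)`.
fn count_neighbors(state: u64, x: u32, y: u32) -> u32 {
    // Move the mask so that it is centered on the requested cell,
    // then count the live cells it selects.
    (state & (NEIGHBOR_MASK << (cell_index(x, y) - 8))).count_ones()
}
```

<p>For the example state in figure 14, this counts 1 neighbor for the top left cell
and 3 neighbors for the rightmost cell of the second row.</p>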
<p>This compact representation is also suitable for fast lookups in a hash set
for the state repetition check in part 1.
Unfortunately it isn’t quite as helpful for part 2 due to the inter-level connections,
but it does still approximately halve runtime
compared to storing the state as a <code class="language-plaintext highlighter-rouge">Vec<Vec<bool>></code>.</p>
<p>For part 2, we also need a way to store the levels.
Since levels extend in two directions, a natural numbering scheme is
to start at zero and count into both positive and negative numbers,
but this means you can’t simply use the level as the index of a vector.
You can use a hash map instead, but that’s slow,
and there is a simple trick we can use to store items at “negative” indices in a vector.
The trick is to interleave the positive and negative indices:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Level: 0 -1 1 -2 2 -3 3 -4 4 -5 ...
Index: 0 1 2 3 4 5 6 7 8 9 ...
</code></pre></div></div>
<p>That is, level <code class="language-plaintext highlighter-rouge">n</code> is stored at index <code class="language-plaintext highlighter-rouge">n * 2</code> if <code class="language-plaintext highlighter-rouge">n</code> is nonnegative,
and <code class="language-plaintext highlighter-rouge">|n| * 2 - 1</code> if <code class="language-plaintext highlighter-rouge">n</code> is negative.
This turns out to be <em>much</em> more efficient than a hash map,
cutting runtime by no less than 75%.
Also making the effort to only create new levels
when one is actually going to be populated
(which is at most every second time step, due to how the problem is set up)
saves an additional 30%.</p>
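<p>The index mapping is small enough to show in full.
In this sketch, the function names and the use of a <code class="language-plaintext highlighter-rouge">u64</code> bit field per level are assumptions.</p>

```rust
/// Vector index for level `n`: 0, -1, 1, -2, 2, ... map to 0, 1, 2, 3, 4, ...
fn level_index(level: i32) -> usize {
    if level >= 0 {
        (level as usize) * 2
    } else {
        (level.unsigned_abs() as usize) * 2 - 1
    }
}

/// Mutable access to the state of `level`, growing the vector with
/// empty levels on demand.
fn level_mut(levels: &mut Vec<u64>, level: i32) -> &mut u64 {
    let index = level_index(level);
    if index >= levels.len() {
        levels.resize(index + 1, 0);
    }
    &mut levels[index]
}
```

<p>Because the levels grow outward from zero in both directions at the same rate,
the interleaved layout keeps the vector dense and leaves few unused slots.</p>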
<h3 id="going-even-further-beyond">Going even further beyond</h3>
<p>I wrote above that part 2 spoils the fun of the bit field a bit,
but we <em>can</em> in fact make it work for part 2 also.
We can do this by embedding the relevant parts of the neighbor levels,
and using special neighbor masks for the middle 4 cells.
We can encode the state as shown in <a href="#fig15">figure 15</a>:</p>
<figure id="fig15">
<pre>
Level X: Level Y: Level Z:
WWWWW AAAAA YYYYY
W ..... W D ..... B Y aaaaa Y
W ..A.. W D ..... B Y d...b Y
W .DYB. W D ..Z.. B Y d...b Y
W ..C.. W D ..... B Y d...b Y
W ..... W D ..... B Y ccccc Y
WWWWW CCCCC YYYYY
Level Y encoded:
.AAAAA. 0 - 6 Y: The cells of level Y
DYYYYYB 7 - 13
DYYYYYB 14 - 20 A, B, C, D: The respective
DYY.YYB 21 - 27 cell in level X
DYYYYYB 28 - 34
DYYYYYB 35 - 41 a: The top row of
dCCCCCb 42 - 48 cells in level Z
daaaaab 49 - 55
dcccccb 56 - 62 b: The middle 3 cells on
. 63 the right of level Z
c: The bottom row of
cells in level Z
.: Unused bit d: The middle 3 cells on
the left of level Z
</pre>
<figcaption>
Figure 15: Example state encoded in the low 63 bits of a 64-bit integer
</figcaption>
</figure>
<p>This encoding is chosen to minimize the number of operations needed
to embed the neighbor states:
Each edge of cells from the outer neighbor can be masked in
with just one bitwise OR operation,
and the cells from the inner neighbor can also be extracted
and moved into place with just 3 bitwise logic operations per set.
Also important is that all the masks can be precomputed, i.e. hard-coded.</p>
<p>This way, we get the neighbors from the outer level for free
with the basic neighbor mask,
and the neighbors from the inner level can be accounted for
by using separate neighbor masks specifically for the middle 4 cells.
This again cuts the runtime in half,
bringing the total runtime down to 2.6 ms.</p>
Wed, 26 Aug 2020 00:00:00 +0200
https://emlun.se/advent-of-code-2019/2020/08/26/day-24.html
Advent of Code 2019 in 130 ms: days 20 - 23
<h2 id="day-20-donut-maze">Day 20: Donut Maze</h2>
<p><a href="https://adventofcode.com/2019/day/20">Day 20</a> is a maze problem similar to <a href="/advent-of-code-2019/2020/08/26/day-18.html">day 18</a>.
<a href="https://github.com/emlun/adventofcode-2019/blob/master/src/days/day20.rs">My solution</a> runs in 11 ms and uses the same performance tricks.
The only real difference is in the rules for generating new states,
and that this solution doesn’t use custom duplication keys
since the keys for the hash set of visited locations are already quite small.</p>
<!--more-->
<h2 id="day-21-springdroid-adventure">Day 21: Springdroid Adventure</h2>
<p><a href="https://adventofcode.com/2019/day/21">Day 21</a> is a pure logic puzzle,
and all you can really do for performance optimization
is optimize the Intcode engine
and minimize the springscript program.
<a href="https://github.com/emlun/adventofcode-2019/blob/master/src/days/day21.rs">My solution</a> runs in 8.8 ms.</p>
<p>For part 1 I use the program <code class="language-plaintext highlighter-rouge">(!A || !B || !C) && D</code> -
think of it as “I must jump soon, and I can land if I jump now”.
Using Boolean algebra, we can express this in fewer operations as
<code class="language-plaintext highlighter-rouge">!(A && B && C) && D</code>, which works out to 5 springscript instructions.</p>
<p>For part 2 I use the program <code class="language-plaintext highlighter-rouge">(!A || !B || !C) && D && (E || H)</code>,
with the rationale</p>
<ul>
<li>I must jump soon: <code class="language-plaintext highlighter-rouge">!A || !B || !C</code></li>
<li>I can land if I jump now: <code class="language-plaintext highlighter-rouge">D</code></li>
<li>After landing, I can take one step forward or immediately jump again: <code class="language-plaintext highlighter-rouge">E || H</code></li>
</ul>
<p>Again using a little Boolean algebra, we can express this as
<code class="language-plaintext highlighter-rouge">!(A && B && C) && D && (E || H)</code>, which works out to 8 springscript instructions.</p>
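<p>To double-check the instruction counts, here is a small Rust sketch that simulates springscript and compares the result against the Boolean expressions above. The instruction sequences are one possible encoding and an assumption on my part, not necessarily the programs in the linked solution; they rely on the <code class="language-plaintext highlighter-rouge">T</code> and <code class="language-plaintext highlighter-rouge">J</code> registers starting out false.</p>

```rust
/// One springscript instruction: (operation, first register, second register).
/// The result is always written to the second register, which must be T or J.
type Instruction = (&'static str, char, char);

/// One possible 5-instruction program for !(A && B && C) && D
/// (an assumption, not necessarily the program in the actual solution).
const PART1: [Instruction; 5] = [
    ("OR", 'A', 'T'),  // T = A (T starts out false)
    ("AND", 'B', 'T'), // T = A && B
    ("AND", 'C', 'T'), // T = A && B && C
    ("NOT", 'T', 'J'), // J = !(A && B && C)
    ("AND", 'D', 'J'), // J = !(A && B && C) && D
];

/// Part 2 appends three instructions. T still holds A && B && C here,
/// which is harmless: whenever it is true, J is already false.
const PART2_TAIL: [Instruction; 3] = [
    ("OR", 'E', 'T'),  // T = T || E
    ("OR", 'H', 'T'),  // T = T || E || H
    ("AND", 'T', 'J'), // J = J && T
];

/// Run a program against sensor readings A..=I and return the final J register.
fn run(program: &[Instruction], sensors: [bool; 9]) -> bool {
    let (mut t, mut j) = (false, false);
    for &(op, x, y) in program {
        let x_value = match x {
            'T' => t,
            'J' => j,
            sensor => sensors[(sensor as u8 - b'A') as usize],
        };
        let y_value = if y == 'T' { t } else { j };
        let result = match op {
            "AND" => x_value && y_value,
            "OR" => x_value || y_value,
            "NOT" => !x_value,
            other => panic!("unknown instruction: {}", other),
        };
        if y == 'T' {
            t = result;
        } else {
            j = result;
        }
    }
    j
}
```

<p>Running <code class="language-plaintext highlighter-rouge">PART1</code>, and <code class="language-plaintext highlighter-rouge">PART1</code> followed by <code class="language-plaintext highlighter-rouge">PART2_TAIL</code>, over all 512 sensor combinations is a cheap way to confirm that the 5 and 8 instructions compute the intended expressions.</p>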
<h2 id="day-22-slam-shuffle">Day 22: Slam Shuffle</h2>
<p><a href="https://adventofcode.com/2019/day/22">Challenge 22</a> is a feast of modular arithmetic,
and is fun because it has some of the season’s largest input numbers
but shortest runtimes.
<a href="https://github.com/emlun/adventofcode-2019/blob/master/src/days/day22.rs">My solution</a> expresses the shuffle as a polynomial function,
and runs in 450 μs.
I believe most of these tricks, in some form, are necessary to solve part 2 at all.</p>
<p>Both parts are variants of the same challenge, but with different inputs:
we’re deterministically shuffling a deck of cards, and need to identify
which card ends up in a given position after the shuffle.
We have three shuffle operations: <code class="language-plaintext highlighter-rouge">stack()</code>, <code class="language-plaintext highlighter-rouge">cut(n)</code> and <code class="language-plaintext highlighter-rouge">deal(n)</code>.
<code class="language-plaintext highlighter-rouge">stack()</code> reverses the order; <code class="language-plaintext highlighter-rouge">cut(n)</code> moves the first <code class="language-plaintext highlighter-rouge">n</code> to the back,
or the last <code class="language-plaintext highlighter-rouge">|n|</code> to the front if <code class="language-plaintext highlighter-rouge">n</code> is negative;
and <code class="language-plaintext highlighter-rouge">deal(n)</code> moves each card from index <code class="language-plaintext highlighter-rouge">i</code> to index <code class="language-plaintext highlighter-rouge">i * n</code>.
<a href="#fig13">Figure 13</a> shows an example.</p>
<figure id="fig13">
<pre>
id(): 0 1 2 3 4 5 6 7 8 9 10 11 12
stack(): 12 11 10 9 8 7 6 5 4 3 2 1 0
cut(3): 3 4 5 6 7 8 9 10 11 12 0 1 2
deal(5): 0 8 3 11 6 1 9 4 12 7 2 10 5
</pre>
<figcaption>
Figure 13: The basic shuffle operations operating on the sequence 0, 1, ..., 12
</figcaption>
</figure>
<p>I’ve also added <code class="language-plaintext highlighter-rouge">id()</code> as the “do nothing” shuffle,
which leaves all cards where they are.
This will become useful later.</p>
<p>Instead of thinking of where each card <em>goes</em> after the shuffle,
it will be more useful to think of where each card <em>came from</em> before the shuffle.
This way, we’ll represent any shuffle as a function <code class="language-plaintext highlighter-rouge">f(x)</code> which tells us
for each index <code class="language-plaintext highlighter-rouge">x</code> which index <code class="language-plaintext highlighter-rouge">f(x)</code> of the original sequence
we should place at index <code class="language-plaintext highlighter-rouge">x</code> in the new sequence.
Represented this way, the elementary operations have the functions:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>id(x) = x mod N = 1 x + 0 mod N
stack(x) = N - x - 1 mod N = (-1) x + (N - 1) mod N
cut(x) = x + n mod N = 1 x + n mod N
deal(x) = x * inv(n) mod N = inv(n) x + 0 mod N
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">inv(n)</code> is the modular multiplicative inverse of <code class="language-plaintext highlighter-rouge">n</code> modulo <code class="language-plaintext highlighter-rouge">N</code>.
Everything will be mod <code class="language-plaintext highlighter-rouge">N</code>,
so I’ll let that be implicit from here on and usually won’t specify <code class="language-plaintext highlighter-rouge">mod N</code> explicitly.</p>
<p>It’s easy to see why <code class="language-plaintext highlighter-rouge">stack</code> and <code class="language-plaintext highlighter-rouge">cut</code> have the functions they do,
but <code class="language-plaintext highlighter-rouge">deal</code> is a little less obvious.
But if we think of it as <code class="language-plaintext highlighter-rouge">i * n</code>, as in the original description of <code class="language-plaintext highlighter-rouge">deal</code>,
we see that we can get <code class="language-plaintext highlighter-rouge">i</code> back by dividing by <code class="language-plaintext highlighter-rouge">n</code>.
In modular integer arithmetic, it’s not always possible to divide
like you can with real or rational numbers,
but if the modulus <code class="language-plaintext highlighter-rouge">N</code> is prime, you can always find a number <code class="language-plaintext highlighter-rouge">inv(n)</code>
such that <code class="language-plaintext highlighter-rouge">n * inv(n) = 1 (mod N)</code>.
So multiplying by <code class="language-plaintext highlighter-rouge">inv(n)</code> gets us back to the index we came from
before a <code class="language-plaintext highlighter-rouge">deal(n)</code> shuffle.</p>
<p>The important feature of expressing the shuffle as these functions
is that functions can be composed.
In particular, you’ll notice that all four functions are
first-degree polynomials.
For these, we can work out compositions analytically:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>f(x) = ax + b
g(x) = cx + d
(f ∘ g)(x) = f(g(x)) = a(cx + d) + b = (ca)x + (ad + b)
(g ∘ f)(x) = g(f(x)) = c(ax + b) + d = (ca)x + (cb + d)
</code></pre></div></div>
<p>So the composition of two first-order polynomials
is always also a first-order polynomial.
This means that since we can express the basic shuffle operations as first-order polynomials,
we can also express our whole shuffle routine as a single first-order polynomial.
This composed polynomial <code class="language-plaintext highlighter-rouge">f(x)</code> will tell us for each index <code class="language-plaintext highlighter-rouge">x</code> after the shuffle,
which index <code class="language-plaintext highlighter-rouge">f(x)</code> from before the shuffle was moved to <code class="language-plaintext highlighter-rouge">x</code>.
All we need to do is start with the <code class="language-plaintext highlighter-rouge">id()</code> polynomial, also known as the identity function,
and successively compose the polynomials for each shuffle operation.</p>
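<p>As a rough sketch of how this could look in Rust: the <code class="language-plaintext highlighter-rouge">Shuffle</code> type and its method names below are hypothetical, chosen for illustration rather than copied from the linked solution. It uses <code class="language-plaintext highlighter-rouge">i128</code> so that products of two values below <code class="language-plaintext highlighter-rouge">N</code> cannot overflow, and leaves out <code class="language-plaintext highlighter-rouge">deal(n)</code> since that needs the modular inverse.</p>

```rust
/// A shuffle represented as f(x) = a*x + b (mod n), where f(x) is the index
/// in the old sequence of the card that ends up at index x in the new one.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Shuffle {
    a: i128,
    b: i128,
    n: i128,
}

impl Shuffle {
    fn new(a: i128, b: i128, n: i128) -> Self {
        // rem_euclid keeps both coefficients in 0..n, even for negative input.
        Shuffle { a: a.rem_euclid(n), b: b.rem_euclid(n), n }
    }

    fn id(n: i128) -> Self {
        Self::new(1, 0, n)
    }

    fn stack(n: i128) -> Self {
        Self::new(-1, n - 1, n)
    }

    fn cut(k: i128, n: i128) -> Self {
        Self::new(1, k, n)
    }

    fn apply(&self, x: i128) -> i128 {
        (self.a * x + self.b).rem_euclid(self.n)
    }

    /// (self ∘ g)(x) = self(g(x)) = (a*c)x + (a*d + b)
    fn compose(&self, g: &Shuffle) -> Shuffle {
        Shuffle::new(self.a * g.a, self.a * g.b + self.b, self.n)
    }
}
```

<p>Since each polynomial maps new indices to old ones, a routine that performs <code class="language-plaintext highlighter-rouge">stack()</code> and then <code class="language-plaintext highlighter-rouge">cut(3)</code> is built as <code class="language-plaintext highlighter-rouge">Shuffle::id(n).compose(&Shuffle::stack(n)).compose(&Shuffle::cut(3, n))</code>: each later operation is composed on the right.</p>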
<p>In order to do that, though, there are still a couple of missing pieces.
First, to create the polynomial for a <code class="language-plaintext highlighter-rouge">deal(n)</code> operation
we need to compute the modular multiplicative inverse <code class="language-plaintext highlighter-rouge">inv(n)</code>.
Fortunately the deck size, which is our modulus <code class="language-plaintext highlighter-rouge">N</code>, is not part of the puzzle input
but defined as a specific number for both parts, and both numbers are prime.
This means we can use <a href="https://en.wikipedia.org/wiki/Modular_multiplicative_inverse#Using_Euler's_theorem">Euler’s theorem</a> (one of the many)
to compute the inverse via modular exponentiation:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>inv(n) = pow(n, N - 2) (mod N)
</code></pre></div></div>
<p>For modular exponentiation we can also <a href="https://en.wikipedia.org/wiki/Modular_exponentiation#Right-to-left_binary_method">turn to Wikipedia</a>
to acquire ready-to-go pseudocode.</p>
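<p>A sketch of both pieces in Rust, following the right-to-left binary method. The names <code class="language-plaintext highlighter-rouge">mod_pow</code> and <code class="language-plaintext highlighter-rouge">mod_inv</code> are my own, and the code assumes the modulus fits in 64 bits so that intermediate products fit in a <code class="language-plaintext highlighter-rouge">u128</code>:</p>

```rust
/// Right-to-left binary modular exponentiation: base^exp mod modulus.
/// Assumes modulus < 2^64 so that products of reduced values fit in a u128.
fn mod_pow(mut base: u128, mut exp: u128, modulus: u128) -> u128 {
    let mut result = 1 % modulus;
    base %= modulus;
    while exp > 0 {
        if exp & 1 == 1 {
            result = result * base % modulus;
        }
        base = base * base % modulus;
        exp >>= 1;
    }
    result
}

/// Multiplicative inverse of n modulo a prime p, computed as n^(p - 2) mod p.
/// Only valid when p is prime and n is not a multiple of p.
fn mod_inv(n: u128, p: u128) -> u128 {
    mod_pow(n, p - 2, p)
}
```

<p>For the 13-card example, <code class="language-plaintext highlighter-rouge">mod_inv(5, 13)</code> is 8, so the <code class="language-plaintext highlighter-rouge">deal(5)</code> polynomial is <code class="language-plaintext highlighter-rouge">8x</code>, which reproduces the <code class="language-plaintext highlighter-rouge">deal(5)</code> row of figure 13.</p>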
<p>Next, part 2 asks us to repeat our shuffle
a gigantic number of times: 101,741,582,076,661.
Even with all the above shortcuts,
we can’t just naïvely compose our polynomial with itself that many times.
Instead, we can use <a href="https://en.wikipedia.org/wiki/Exponentiation_by_squaring">exponentiation by squaring</a>:
because function composition is associative,
repeatedly composing a function with itself behaves much like exponentials:
<code class="language-plaintext highlighter-rouge">f ∘ (f ∘ (f ∘ f)) = ((f ∘ f) ∘ f) ∘ f = (f ∘ f) ∘ (f ∘ f)</code>.
This means that we can create “powers of two” of our function
by first composing it with itself, then composing the composition with itself,
then composing that composition with itself, etc.
This way we can simplify repeated self-composition
by breaking the exponent into a sum of powers of two,
and composing the corresponding “powers of two” of our function.
This means we only need to compute approximately <code class="language-plaintext highlighter-rouge">log2(101741582076661)</code> compositions.</p>
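<p>The same square-and-multiply loop works for composition. Here is a hedged sketch with a polynomial stored as a plain <code class="language-plaintext highlighter-rouge">(a, b)</code> tuple; <code class="language-plaintext highlighter-rouge">Poly</code>, <code class="language-plaintext highlighter-rouge">compose</code> and <code class="language-plaintext highlighter-rouge">compose_pow</code> are illustrative names, and the coefficients are assumed to be already reduced modulo <code class="language-plaintext highlighter-rouge">n</code>:</p>

```rust
/// A first-degree polynomial a*x + b stored as (a, b), with 0 <= a, b < n.
type Poly = (u128, u128);

/// (f ∘ g)(x) = f(g(x)) = (a*c)x + (a*d + b), all modulo n.
/// Assumes n < 2^63 so the intermediate values fit in a u128.
fn compose(f: Poly, g: Poly, n: u128) -> Poly {
    let ((a, b), (c, d)) = (f, g);
    (a * c % n, (a * d + b) % n)
}

/// Compose f with itself `times` times using exponentiation by squaring.
fn compose_pow(mut f: Poly, mut times: u128, n: u128) -> Poly {
    let mut result: Poly = (1 % n, 0); // the identity polynomial
    while times > 0 {
        if times & 1 == 1 {
            // This "power of two" of f is part of the exponent: include it.
            result = compose(result, f, n);
        }
        f = compose(f, f, n); // next "power of two" of f
        times >>= 1;
    }
    result
}
```

<p>The order in which the powers are multiplied in does not matter here, because all of them are powers of the same function and therefore commute with each other.</p>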
<p>Finally, for part 1 we need to find a position after the shuffle
given a position before the shuffle - the opposite of what our polynomials compute.
However, we can easily find this by inverting the polynomial:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>f(x) = ax + b
inv(f)(x) = cx + d
x = (inv(f))(f(x)) = f(inv(f)(x)) = (ca) x + (cb + d)
c = inv(a)
d = -b * inv(a)
(inv(f))(x) = inv(a) * x + ((-b) * inv(a))
</code></pre></div></div>
<p>This lets us easily go the other direction and compute,
for an index <code class="language-plaintext highlighter-rouge">x</code> before the shuffle, the index <code class="language-plaintext highlighter-rouge">inv(f)(x)</code> where it ends up after the shuffle.</p>
<p>All together, these tricks allow us to solve both parts in 450 μs.</p>
<h2 id="day-23-category-six">Day 23: Category Six</h2>
<p><a href="https://adventofcode.com/2019/day/23">Day 23</a> is again mostly about running an Intcode program in a funny configuration.
<a href="https://github.com/emlun/adventofcode-2019/blob/master/src/days/day23.rs">My solution</a> runs in 2.8 ms
and does nothing special for performance optimization.</p>
<p><em>Update 2020-08-27: See <a href="/advent-of-code-2019/2020/08/27/followup-intcode.html">this post</a> for a discussion
of some Intcode optimizations which were very important for this solution.</em></p>
Wed, 26 Aug 2020 00:00:00 +0200
https://emlun.se/advent-of-code-2019/2020/08/26/day-20-23.html
Advent of Code 2019 in 130 ms: day 19
<h2 id="day-19-tractor-beam">Day 19: Tractor Beam</h2>
<p><a href="https://adventofcode.com/2019/day/19">Day 19</a> is another Intcode challenge,
and in terms of performance optimization it is much like
<a href="/advent-of-code-2019/2020/08/26/day-11-15.html">day 15</a>
in that the focus is on minimizing the number of calls to the Intcode program.
<a href="https://github.com/emlun/adventofcode-2019/blob/master/src/days/day19.rs">My solution</a> uses linear approximation followed by binary search,
and runs in 1.3 ms.</p>
<!--more-->
<p>Part 1 establishes the basic tools we need:
a naïve solution could simply iterate through the whole space and check each point,
but we can work much faster by only exploring the edges of the beam.
Once we know the width of the beam at each y coordinate,
we can easily compute its area.</p>
<figure id="fig12">
<pre>
0 x ->
0 #.............................
..............................
y ..............................
..#...........................
| ...#..........................
v ....#.........................
....##........................
.....#........................
......#.......................
......##......................
.......##.....................
........##....................
........###...................
.........##...................
..........##..................
..........###.................
...........###................
............###...............
............####..............
.............###..............
..............###.............
..............####............
...............####...........
................####..........
................#####.........
.................####.........
..................####........
..................#####.......
...................#####......
....................#####.....
</pre>
<figcaption>
Figure 12: The tractor beam generated by my puzzle input.
</figcaption>
</figure>
<p>As we can see in <a href="#fig12">figure 12</a>, the beam is roughly cone-shaped,
which I assume to be true for any puzzle input.
At the very beginning there’s a sneaky gap in the beam
which we’ll have to account for,
but we can devise a simple algorithm for following the edges.
To find the right edge of the beam going downward (increasing y),
start at the x coordinate of the edge at the previous row,
and walk to the right while the current coordinate is within the beam.
To find the left edge, similarly start at the previous coordinate
and walk right until the current coordinate is within the beam.
This algorithm is illustrated in <a href="#fig13">figure 13</a>;
note that <code class="language-plaintext highlighter-rouge">x_min</code> is the first coordinate <em>inside</em> the beam
and <code class="language-plaintext highlighter-rouge">x_max</code> is the first coordinate <em>outside</em> the beam.
This makes it easy to compute the beam width.</p>
<figure id="fig13">
<div class="images">
<a href="/advent-of-code-2019/beam-edge.png">
<img src="/advent-of-code-2019/beam-edge.png" />
</a>
</div>
<figcaption>
Figure 13: Tracking the edges of the tractor beam
</figcaption>
</figure>
<p>There’s one edge case where this fails: that sneaky gap at the beginning.
Because of that, if our max-edge algorithm starts on a non-beam coordinate
it will instead walk left until it finds a beam coordinate.
The min-edge algorithm can be simpler:
if we don’t find the beam within, say, 10 steps of the starting coordinate,
just give up and return zero.
With that, for part 1 we just need to find the <code class="language-plaintext highlighter-rouge">x_min</code> and <code class="language-plaintext highlighter-rouge">x_max</code>
for each <code class="language-plaintext highlighter-rouge">y = 0, 1, ..., 49</code>, and compute the sum of <code class="language-plaintext highlighter-rouge">x_max - x_min</code> for each row.</p>
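<p>The edge tracking can be sketched roughly as follows. The closure <code class="language-plaintext highlighter-rouge">in_beam</code> is a stand-in for a call to the Intcode program, and the function names are hypothetical. The sketch also assumes the beam stays within the scanned width, so it does not clamp <code class="language-plaintext highlighter-rouge">x</code> to the 50x50 area.</p>

```rust
/// Returns x_max for row y: the first x outside the beam, right of the beam.
/// Walks right if `start` is inside the beam, otherwise walks left.
/// Assumes the beam on row y is non-empty and not entirely right of `start`.
fn find_x_max(in_beam: &impl Fn(u32, u32) -> bool, y: u32, start: u32) -> u32 {
    let mut x = start;
    if in_beam(x, y) {
        while in_beam(x, y) {
            x += 1;
        }
    } else {
        while x > 0 && !in_beam(x - 1, y) {
            x -= 1;
        }
    }
    x
}

/// Count the beam tiles in rows 0..rows by tracking only the beam edges.
fn count_beam(in_beam: impl Fn(u32, u32) -> bool, rows: u32) -> u32 {
    let (mut x_min, mut x_max) = (0, 0);
    let mut total = 0;
    for y in 0..rows {
        // Min edge: give up after 10 steps, which skips rows inside the gap.
        if let Some(x) = (x_min..=x_min + 10).find(|&x| in_beam(x, y)) {
            x_min = x;
            // Start from the previous x_max, but never left of the new x_min.
            x_max = find_x_max(&in_beam, y, x_max.max(x_min));
            total += x_max - x_min;
        }
    }
    total
}
```

<p>Each row costs only a handful of beam queries instead of one per column, which is where the savings come from.</p>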
<p>Part 2 gets a little more interesting.
Now we need to find the point <code class="language-plaintext highlighter-rouge">(x0, y0)</code> closest to the origin
such that the square between <code class="language-plaintext highlighter-rouge">(x0, y0)</code> and <code class="language-plaintext highlighter-rouge">(x0 + 100, y0 + 100)</code> fits into the beam.
With a bit of geometric imagination, shown in <a href="#fig14">figure 14</a>,
we can find that any such square must and will satisfy the condition
that both <code class="language-plaintext highlighter-rouge">(x0 + L - 1, y0)</code> and <code class="language-plaintext highlighter-rouge">(x0, y0 + L - 1)</code> are within the beam,
where <code class="language-plaintext highlighter-rouge">L = 100</code> is the desired width of the square.</p>
<figure id="fig14">
<div class="images">
<a href="/advent-of-code-2019/beam-square.png">
<img src="/advent-of-code-2019/beam-square.png" />
</a>
</div>
<figcaption>
Figure 14: Finding a condition for a square to fit in the beam
</figcaption>
</figure>
<p>This means that given a point on the right edge of the beam,
we only need to check a single other point to know whether a square
with its top-right corner at the first point will fit in the beam.
The problem now becomes finding the smallest <code class="language-plaintext highlighter-rouge">y0</code> for which <code class="language-plaintext highlighter-rouge">(x_max - 1, y0)</code>
is such a point.
We can do this with a simple linear search; this took ~14 ms for my program to compute.
But if we model the beam as roughly cone-shaped, we can make an initial guess
to skip ahead to a region where the solution is more likely to be.</p>
<p>Part 1 already has us compute <code class="language-plaintext highlighter-rouge">x_min</code> and <code class="language-plaintext highlighter-rouge">x_max</code> for the first 50 rows.
We can use these values to estimate the slope of the beam edges:
<code class="language-plaintext highlighter-rouge">k1 = x_max / 49</code> for the right edge and <code class="language-plaintext highlighter-rouge">k2 = x_min / 49</code> for the left edge.
This gives us two linear functions:
<code class="language-plaintext highlighter-rouge">x_min(y) = y * k2</code> and <code class="language-plaintext highlighter-rouge">x_max(y) = y * k1</code>.
If we combine these with the square criterion, we get</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>x_max(y0) - x_min(y0 + L - 1) >= L
y0 * k1 - (y0 + L - 1) * k2 >= L
y0 * k1 - y0 * k2 - (L - 1) * k2 >= L
y0 * (k1 - k2) >= L + (L - 1) * k2
y0 * (k1 - k2) >= L + L * k2 - k2
y0 * (k1 - k2) >= L * (1 + k2) - k2
y0 >= (L * (1 + k2) - k2) / (k1 - k2)
</code></pre></div></div>
<p>This gives us a fairly good first approximation of <code class="language-plaintext highlighter-rouge">y0</code>, which we can now refine.
We can just linear search from this guess forward or backward,
depending on whether our square fits at the guessed <code class="language-plaintext highlighter-rouge">y0</code>;
this takes my program about 2.5 ms.
Even faster, though, is a binary search.
We can start with 50 and our guess times 2 as the lower and upper limits.
We check the middle point and set either the upper or the lower limit
equal to the middle point, depending on if our guess was too high or too low.
We repeat this until the upper and lower limits meet,
which takes <code class="language-plaintext highlighter-rouge">log2(max - min)</code> steps.
This brings us down to 1.3 ms total for both parts.</p>
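<p>A small sketch of the two steps, the initial estimate and the binary search. Both function names are mine, and <code class="language-plaintext highlighter-rouge">fits</code> is a placeholder for the check that the square fits at row <code class="language-plaintext highlighter-rouge">y0</code>. The search only returns the true minimum if <code class="language-plaintext highlighter-rouge">fits</code> is monotone, which is an assumption about the beam shape rather than a guarantee.</p>

```rust
/// First-order estimate of y0 from the edge slopes k1 (right) and k2 (left),
/// using the last line of the derivation above.
fn estimate_y0(k1: f64, k2: f64, l: f64) -> f64 {
    (l * (1.0 + k2) - k2) / (k1 - k2)
}

/// Binary search for the smallest y in (lo, hi] where `fits(y)` holds.
/// Requires !fits(lo) and fits(hi); the result r always satisfies
/// fits(r) && !fits(r - 1), and is the global minimum if `fits` is monotone.
fn first_fit(fits: impl Fn(u32) -> bool, mut lo: u32, mut hi: u32) -> u32 {
    while hi - lo > 1 {
        let mid = lo + (hi - lo) / 2;
        if fits(mid) {
            hi = mid; // the guess was too high (or exactly right)
        } else {
            lo = mid; // the guess was too low
        }
    }
    hi
}
```

<p>With the limits from the text, the call would look something like <code class="language-plaintext highlighter-rouge">first_fit(fits, 50, 2 * guess)</code>, where <code class="language-plaintext highlighter-rouge">guess</code> is the rounded result of <code class="language-plaintext highlighter-rouge">estimate_y0</code>.</p>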
Wed, 26 Aug 2020 00:00:00 +0200
https://emlun.se/advent-of-code-2019/2020/08/26/day-19.html
Advent of Code 2019 in 130 ms: day 18
<h2 id="day-18-many-worlds-interpretation">Day 18: Many-Worlds Interpretation</h2>
<p><a href="https://adventofcode.com/2019/day/18">Challenge 18</a> also took quite a bit of effort to optimize.
Basically, it’s a maze optimization problem,
but with the added twist that you need to collect keys to open doors,
and in part 2 you’re exploring 4 partitions of the map in parallel.</p>
<!--more-->
<p><a href="https://github.com/emlun/adventofcode-2019/blob/master/src/days/day18.rs">My solution</a> runs in 11 ms,
and gets its performance from a lazily constructed abstracted world map,
a custom collection type storing keys in a bit field,
and a compact representation for detecting duplicate states.</p>
<h3 id="the-abstract-graph">The abstract graph</h3>
<p>The abstracted world map is a graph of shortest paths between keys,
as shown in <a href="#fig08">figure 8</a>.
This greatly reduces the size of the search space
compared to the basic world map,
as one “step” in the graph can represent hundreds of steps on the actual map.</p>
<figure id="fig08">
<div class="images">
<a href="/advent-of-code-2019/abstract-map.png">
<img src="/advent-of-code-2019/abstract-map.png" />
</a>
</div>
<figcaption>
Figure 8: Part of the abstracted graph of distances between keys.
The full graph contains connections between all pairs of keys.
<code>@</code> marks the starting position, not a key,
so no key has an entry navigating to <code>@</code>,
but <code>@</code> does have an entry navigating to each key.
</figcaption>
</figure>
<p>This abstract map assumes all doors are open,
but each connection records which doors and keys it passes.
This way we can reuse the map regardless of what keys we have currently collected.
In code, the graph is represented as a hash map,
mapping points to vectors of <em>route</em> objects.
A route object contains a length, the end point,
and the keys and doors along the path.
For example, the map for <a href="#fig08">figure 8</a> has an entry
mapping <code class="language-plaintext highlighter-rouge">(6, 3)</code> to a vector containing the route</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
to: (1, 7),
length: 13,
keys: ['l'],
doors: ['F']
}
</code></pre></div></div>
<p>For maximum performance, this map is built lazily on demand
the first time we need to navigate from each point.
This is done by a simple exhaustive breadth-first search (BFS) through the basic map.</p>
<p>Using this higher-level map,
the real work is done using <a href="https://en.wikipedia.org/wiki/Dijkstra's_algorithm">Dijkstra’s algorithm</a>
to find the shortest path that visits all keys.
This is similar to a basic BFS,
but where steps can have different lengths.
Like BFS, it uses a queue of locations to explore,
but unlike basic BFS, it is a <a href="https://en.wikipedia.org/wiki/Priority_queue">priority queue</a>
which sorts the locations by the total number of steps taken.
This ensures that the currently shortest path is always processed first,
even if it has more (but individually shorter) steps than other paths.</p>
<p>The states for this Dijkstra search have a current position,
a set of collected keys, and a total number of steps taken.
We initialize the queue with a state at the starting position with no keys collected.
For each state in the queue,
we use the abstract map to get the list of routes from the current position,
and generate a new state for each route which contains keys we have not already collected,
and does not pass any doors for which the current state hasn’t yet collected the keys.
<a href="#fig09">Figure 9</a> illustrates how new states are generated
by filtering the routes in the abstract map.</p>
<figure id="fig09">
<div class="images">
<a href="/advent-of-code-2019/dijkstra-new-states.png">
<img src="/advent-of-code-2019/dijkstra-new-states.png" />
</a>
</div>
<figcaption>
Figure 9: Filtering the abstract map for routes
which the current state has enough keys for,
and which lead to keys not collected by the current state.
</figcaption>
</figure>
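<p>The search described above could be sketched like this. It is a simplified illustration, not the linked solution: nodes are plain <code class="language-plaintext highlighter-rouge">usize</code> IDs instead of points, key and door sets are <code class="language-plaintext highlighter-rouge">u32</code> bit fields, the graph is assumed to be fully built up front rather than lazily, and all names are hypothetical.</p>

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashMap};

/// A precomputed shortest path between two nodes of the abstract graph.
struct Route {
    to: usize,   // destination node
    length: u32, // number of steps on the real map
    keys: u32,   // bit field of keys picked up along the route
    doors: u32,  // bit field of doors passed along the route
}

/// Dijkstra search for the fewest steps needed to collect `all_keys`.
fn shortest_path_collecting(
    graph: &HashMap<usize, Vec<Route>>,
    start: usize,
    all_keys: u32,
) -> Option<u32> {
    // Shortest known distance to each (position, collected keys) pair.
    let mut best: HashMap<(usize, u32), u32> = HashMap::new();
    // Reverse turns the max-heap into a min-heap ordered by steps taken.
    let mut queue = BinaryHeap::new();
    queue.push(Reverse((0u32, start, 0u32)));

    while let Some(Reverse((steps, pos, keys))) = queue.pop() {
        if keys == all_keys {
            return Some(steps);
        }
        if best.get(&(pos, keys)).map_or(false, |&b| b < steps) {
            continue; // a shorter path to this state was found in the meantime
        }
        for route in graph.get(&pos).into_iter().flatten() {
            let leads_to_new_keys = route.keys & !keys != 0;
            let doors_are_open = route.doors & !keys == 0;
            if leads_to_new_keys && doors_are_open {
                let next = (steps + route.length, route.to, keys | route.keys);
                let known = best.entry((next.1, next.2)).or_insert(u32::MAX);
                if next.0 < *known {
                    *known = next.0;
                    queue.push(Reverse(next));
                }
            }
        }
    }
    None
}
```

<p>The two filter conditions in the inner loop correspond directly to the two rules in the text: the route must contain at least one key we lack, and every door on it must already have its key collected.</p>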
<p>An important note here is that the abstract map includes routes
that pass more than one key.
For a long time, my solution only allowed one key per route,
my thinking being that this reduces the branching factor;
when I removed this restriction, my runtime was reduced by 80%.</p>
<h3 id="key-sets-as-bit-fields">Key sets as bit fields</h3>
<p>The abstract map is the main algorithmic trick, but for it to run quickly
there are a fair number of tricks to improve implementation efficiency.
The algorithm involves a lot of comparing sets of keys
to make sure we don’t explore paths we don’t have the keys for,
as well as copying sets of keys to new states.
This can be made much more efficient by representing key sets as bit fields
instead of hash sets.
This turns most set operations - union, intersection, subset check, etc. -
into a single bitwise AND or OR operation,
and copying a set is practically free since it’s stored in a single integer value.</p>
<p>My solution therefore converts the key characters in the input map
into key IDs being powers of two,
and represents key sets as bitwise OR combinations of key IDs,
as shown in <a href="#fig10">figure 10</a>.
This optimization alone reduces runtime by about 75%.</p>
<figure id="fig10">
<pre>
z...gfedcba z...hgfedcba
----------- ------------
a, A => 0...0000001 {a} => 0...00000001
b, B => 0...0000010 {a, b} => 0...00000011
c, C => 0...0000100 {c, e} => 0...00010100
d, D => 0...0001000
e, E => 0...0010000 {a, c, e} ∪ {b, c, g} => 0...01010111
f, F => 0...0100000 {a, c, e} ∩ {b, c, g} => 0...00000100
...
</pre>
<figcaption>
Figure 10: Representing key and door IDs as singleton bit fields,
and sets of keys as bitwise OR combinations of key IDs.
</figcaption>
</figure>
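<p>In Rust, this representation might look something like the following sketch. The helper names are illustrative assumptions, not the API of the custom collection type in the linked solution.</p>

```rust
/// Convert a key ('a'..='z') or door ('A'..='Z') character to its
/// single-bit ID; a door gets the same ID as the key that opens it.
fn key_id(c: char) -> u32 {
    1 << (c.to_ascii_lowercase() as u8 - b'a')
}

/// Build a key set as the bitwise OR of the IDs of all keys in `keys`.
fn key_set(keys: &str) -> u32 {
    keys.chars().map(key_id).fold(0, |set, id| set | id)
}

fn union(a: u32, b: u32) -> u32 {
    a | b
}

fn intersection(a: u32, b: u32) -> u32 {
    a & b
}

/// True if every key in `needed` is also in `have`,
/// for example when checking the doors along a route.
fn is_subset(needed: u32, have: u32) -> bool {
    needed & !have == 0
}
```

<p>Since a set is just a <code class="language-plaintext highlighter-rouge">u32</code>, copying it into a new search state is a plain integer copy with no allocation.</p>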
<h3 id="compact-duplication-keys">Compact duplication keys</h3>
<p>A Dijkstra search needs to keep track of
which locations it has already found a shorter path for,
so that it can discard states from the queue
if a shorter path was found after the state was added to the queue.
In this case we need to take keys into account for this,
so a simple solution is a hash map mapping pairs of (keys collected, position)
to the length of shortest path to get there.
Again, though, this involves a lot of manipulation of large values.
To speed this up,
my solution instead uses a hash map of <em>duplication keys</em> to path lengths.
The duplication key for a state is the set of collected keys and the position -
or positions, for part 2 -
packed into a single <code class="language-plaintext highlighter-rouge">u128</code> value as shown in <a href="#fig11">figure 11</a>.
This is easy to compute since key sets are already represented as <code class="language-plaintext highlighter-rouge">u32</code> bit fields.
This saves about 15% of run time,
but does assume that the map is no larger than 4096x4096 tiles.</p>
<figure id="fig11">
<pre>
Keys: [a d n s uv]
Positions: [(13, 39), (5, 45), (53, 43), (41, 39)]
Duplication key:
zyxwvutsrqponmlkjihgfedcba
------------------------------
0...00001101000010000000001001
y0 = 39 x0 = 13 y1 = 45 x1 = 5
|----------| |----------| |----------| |----------|
000000100111 000000001101 000000101101 000000000101
y2 = 43 x2 = 53 y3 = 39 x3 = 41
|----------| |----------| |----------| |----------|
000000101011 000000110101 000000100111 000000101001
</pre>
<figcaption>
Figure 11: Example key set and 4 points
encoded as a single 128-bit duplication key value
</figcaption>
</figure>
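<p>A sketch of how such a key could be packed. The exact bit layout here (key set in the top 32 bits, followed by 12 bits each for y and x of every position) is my reading of figure 11 and should be taken as an assumption, as should the function name.</p>

```rust
/// Pack a 32-bit key set and up to four (x, y) positions into one u128.
/// Each coordinate gets 12 bits, so it must be below 4096.
fn duplication_key(keys: u32, positions: &[(u16, u16)]) -> u128 {
    assert!(positions.len() <= 4, "at most 4 positions fit in 128 bits");
    let mut packed = keys as u128;
    for &(x, y) in positions {
        assert!(x < 4096 && y < 4096, "coordinates must fit in 12 bits");
        packed = (packed << 12) | y as u128;
        packed = (packed << 12) | x as u128;
    }
    packed
}
```

<p>As long as every state in one search has the same number of positions, two states get the same duplication key exactly when their key sets and positions are equal.</p>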
<h3 id="part-2">Part 2</h3>
<p>For part 2, we need to explore 4 partitions of the maze in parallel,
while keys may unlock doors in other partitions.
This turns out to be very easy to incorporate -
the basic breadth-first search to build the abstract map doesn’t change at all,
and the states of the higher-level Dijkstra search
simply have four positions instead of just one,
and generate new states from each position.
The duplication keys still fit comfortably in a <code class="language-plaintext highlighter-rouge">u128</code>
since we’re guaranteed to not have to deal with more than 4 positions -
again under the assumption that the maze is at most 4096x4096 tiles.
In fact, having a <code class="language-plaintext highlighter-rouge">Vec<Point></code> instead of just a <code class="language-plaintext highlighter-rouge">Point</code>
in the key for the map of shortest lengths
was my motivation for trying the duplication key method.</p>
Wed, 26 Aug 2020 00:00:00 +0200
https://emlun.se/advent-of-code-2019/2020/08/26/day-18.html
Advent of Code 2019 in 130 ms: day 17
<h2 id="day-17-set-and-forget">Day 17: Set and Forget</h2>
<p><a href="https://adventofcode.com/2019/day/17">Day 17</a> is an interesting exercise in sequence compression,
but the runtime optimization of <a href="https://github.com/emlun/adventofcode-2019/blob/master/src/days/day17.rs">my solution</a>,
running in 1.9 ms, is more about implementation efficiency
and a fast Intcode engine than about algorithmic sophistication.</p>
<!--more-->
<p>Part 1 is about finding intersections on an ASCII map of tiles.
My solution is rather simple: it runs the Intcode program
and reads the map into a hash set of the scaffold tiles,
then simply iterates through all the tiles to find the ones
with more than 2 neighbors.
Simple linear time complexity with nothing fancy going on.</p>
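<p>A minimal sketch of the intersection search, working on the ASCII map as a string rather than on live Intcode output. The function names are hypothetical, and every non-<code class="language-plaintext highlighter-rouge">.</code> character (including the robot) is treated as scaffold, which is a simplifying assumption.</p>

```rust
use std::collections::HashSet;

/// Read an ASCII map into the set of (x, y) coordinates of scaffold tiles.
/// Anything other than '.' counts as scaffold, including the robot.
fn parse_scaffold(map: &str) -> HashSet<(i32, i32)> {
    map.lines()
        .enumerate()
        .flat_map(|(y, line)| {
            line.chars()
                .enumerate()
                .filter(|&(_, c)| c != '.')
                .map(move |(x, _)| (x as i32, y as i32))
        })
        .collect()
}

/// Find all scaffold tiles with more than 2 scaffold neighbors, sorted.
fn intersections(scaffold: &HashSet<(i32, i32)>) -> Vec<(i32, i32)> {
    let mut found: Vec<(i32, i32)> = scaffold
        .iter()
        .copied()
        .filter(|&(x, y)| {
            [(1, 0), (-1, 0), (0, 1), (0, -1)]
                .iter()
                .filter(|&&(dx, dy)| scaffold.contains(&(x + dx, y + dy)))
                .count()
                > 2
        })
        .collect();
    found.sort();
    found
}
```

<p>Each tile needs only four constant-time hash set lookups, so the whole pass is linear in the number of scaffold tiles.</p>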
<p>Part 2 gets quite a bit more complicated.
We need to find a path that visits every scaffold tile at least once,
and then compress the path so that the main routine and each movement function fit in at most 20 ASCII characters.
The <em>movement functions</em> act as the compression code book:
a list of building blocks we can use to assemble the
<em>main movement routine</em> by referencing the movement functions by index.</p>
<p>My solution begins by finding the simplest eligible path.
By inspecting the map generated by my puzzle input,
shown in <a href="#fig07">figure 7</a>,
I noticed that I can simply move forward until I encounter a gap,
then turn whichever of left and right is available -
there are no “T” intersections on this map,
so this will visit every tile on the map.</p>
<figure id="fig07">
<pre>
000000000011111111112222222222333333333344444444445
012345678901234567890123456789012345678901234567890
0 ......................#######......................
1 ......................#.....#......................
2 ..................####O##...#......................
3 ..................#...#.#...#......................
4 #########.........#...#.#...#......................
5 #.......#.........#...#.#...#......................
6 #.......#.########O####.#...#......................
7 #.......#.#.......#.....#...#......................
8 #.......#.#.......#...##O####......................
9 #.......#.#.......#.....#..........................
10 #######.##O######.######O##........................
11 ......#...#.....#.......#.#........................
12 ......#...#.....#.......#.#........................
13 ......#...#.....#.......#.#........................
14 ......#...#.####O########.#.........#######........
15 ......#...#.#...#.........#.........#.....#........
16 ......#...#.#...#.........#########.#.....#........
17 ......#...#.#...#.................#.#.....#........
18 ......#...##O####.................#.#.....#........
19 ......#.....#.....................#.#.....#........
20 ......#.....#.....................#.#.....#........
21 ......#.....#.....................#.#.....#........
22 ......#######...........R#########O##.####O####....
23 ..................................#...#...#...#....
24 ..................................#...#...#...#....
25 ..................................#...#...#...#....
26 ..................................#...#...####O####
27 ..................................#...#.......#...#
28 ..................................####O########...#
29 ......................................#...........#
30 ................................#######...........#
31 ................................#.................#
32 ................................#.....#############
33 ................................#.....#............
34 ................................#.....#............
35 ................................#.....#............
36 ................................#.....#............
37 ................................#.....#............
38 ................................#######............
</pre>
<figcaption>
Figure 7: The map of the scaffolds for my puzzle input.
<code>#</code> are scaffold tiles, <code>.</code> are empty,
<code>O</code> are intersections
and <code>R</code> is the robot's starting location.
</figcaption>
</figure>
<p>This results in the following path:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>R(1), F(1), F(1), F(1), F(1), F(1), F(1), F(1), F(1), F(1), F(1), F(1),
L(1), F(1), F(1), F(1), F(1), F(1), F(1), F(1), R(1), F(1), F(1), F(1),
F(1), F(1), R(1), F(1), F(1), F(1), F(1), F(1), F(1), F(1), F(1), F(1),
F(1), F(1), L(1), F(1), F(1), F(1), F(1), F(1), F(1), F(1), R(1), F(1),
F(1), F(1), F(1), F(1), R(1), F(1), F(1), F(1), F(1), F(1), F(1), F(1),
F(1), F(1), F(1), F(1), L(1), F(1), F(1), F(1), F(1), F(1), R(1), F(1),
F(1), F(1), F(1), F(1), R(1), F(1), F(1), F(1), F(1), F(1), F(1), F(1),
R(1), F(1), F(1), F(1), F(1), F(1), L(1), F(1), F(1), F(1), F(1), F(1),
F(1), F(1), R(1), F(1), F(1), F(1), F(1), F(1), F(1), F(1), R(1), F(1),
F(1), F(1), F(1), F(1), R(1), F(1), F(1), F(1), F(1), F(1), F(1), F(1),
F(1), F(1), F(1), F(1), R(1), F(1), F(1), F(1), F(1), F(1), F(1), F(1),
F(1), F(1), F(1), F(1), L(1), F(1), F(1), F(1), F(1), F(1), F(1), F(1),
R(1), F(1), F(1), F(1), F(1), F(1), L(1), F(1), F(1), F(1), F(1), F(1),
F(1), F(1), R(1), F(1), F(1), F(1), F(1), F(1), F(1), F(1), R(1), F(1),
F(1), F(1), F(1), F(1), R(1), F(1), F(1), F(1), F(1), F(1), F(1), F(1),
F(1), F(1), F(1), F(1), R(1), F(1), F(1), F(1), F(1), F(1), F(1), F(1),
F(1), F(1), F(1), F(1), L(1), F(1), F(1), F(1), F(1), F(1), F(1), F(1),
R(1), F(1), F(1), F(1), F(1), F(1), R(1), F(1), F(1), F(1), F(1), F(1),
F(1), F(1), F(1), F(1), F(1), F(1), L(1), F(1), F(1), F(1), F(1), F(1),
R(1), F(1), F(1), F(1), F(1), F(1), R(1), F(1), F(1), F(1), F(1), F(1),
F(1), F(1), R(1), F(1), F(1), F(1), F(1), F(1), L(1), F(1), F(1), F(1),
F(1), F(1), F(1), F(1), R(1), F(1), F(1), F(1), F(1), F(1), F(1), F(1),
R(1), F(1), F(1), F(1), F(1), F(1), R(1), F(1), F(1), F(1), F(1), F(1),
F(1), F(1), F(1), F(1), F(1), F(1), R(1), F(1), F(1), F(1), F(1), F(1),
F(1), F(1), F(1), F(1), F(1), F(1), L(1), F(1), F(1), F(1), F(1), F(1),
R(1), F(1), F(1), F(1), F(1), F(1), R(1), F(1), F(1), F(1), F(1), F(1),
F(1), F(1), R(1), F(1), F(1), F(1), F(1), F(1)
</code></pre></div></div>
<p>Here <code class="language-plaintext highlighter-rouge">R(n)</code>, <code class="language-plaintext highlighter-rouge">L(n)</code> and <code class="language-plaintext highlighter-rouge">F(n)</code> respectively mean turn right, turn left,
or don’t turn, and then move <code class="language-plaintext highlighter-rouge">n</code> steps forward.
The next step is to compress this to a more compact representation
by simply merging the <code class="language-plaintext highlighter-rouge">F(n)</code> steps into the preceding non-<code class="language-plaintext highlighter-rouge">F</code> step:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>R(12), L(8), R(6), R(12), L(8), R(6), R(12), L(6), R(6), R(8), R(6),
L(8), R(8), R(6), R(12), R(12), L(8), R(6), L(8), R(8), R(6), R(12),
R(12), L(8), R(6), R(12), L(6), R(6), R(8), R(6), L(8), R(8), R(6),
R(12), R(12), L(6), R(6), R(8), R(6)
</code></pre></div></div>
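<p>A minimal Rust sketch of this merging step, assuming a hypothetical <code class="language-plaintext highlighter-rouge">Step</code> enum
that mirrors the notation above; the names are made up for illustration.</p>

```rust
/// One path step: turn left, turn right or don't turn, then move forward.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Step {
    L(usize),
    R(usize),
    F(usize),
}

/// Merge each run of `F` steps into the step that precedes it.
fn merge_forward(path: &[Step]) -> Vec<Step> {
    let mut merged: Vec<Step> = Vec::new();
    for &step in path {
        if let Step::F(m) = step {
            // Add the forward distance to whatever step came last, if any.
            if let Some(Step::L(n)) | Some(Step::R(n)) | Some(Step::F(n)) = merged.last_mut() {
                *n += m;
                continue;
            }
        }
        merged.push(step);
    }
    merged
}
```

<p>For example, <code class="language-plaintext highlighter-rouge">R(1), F(1), F(1), L(1), F(1)</code> becomes <code class="language-plaintext highlighter-rouge">R(3), L(2)</code>.</p>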
<p>After that, the code gets a bit convoluted, but it goes something like this:</p>
<p>First, find the longest prefix that appears again later in the sequence.
In the above sequence, this works out to <code class="language-plaintext highlighter-rouge">[R(12), L(8), R(6), R(12)]</code>.
Then remove that prefix from the front and do the same thing two more times
so we have three subsequences, one for each movement function.
This gives us the subsequences
<code class="language-plaintext highlighter-rouge">[L(8), R(6), R(12), L(6), R(6), R(8), R(6), L(8), R(8), R(6), R(12), R(12)]</code>
and <code class="language-plaintext highlighter-rouge">[L(8), R(6)]</code>.</p>
<p>Next we try to find a “covering” of the original sequence using
the three subsequences.
We do this by checking if any of the subsequences is a prefix of the original sequence,
and if it is,
remove it from the front and recursively try to find a covering
for the rest of the sequence.
If we manage to exhaust the sequence by these recursive calls,
that means we covered the whole sequence,
and we collect a list of the subsequence indices we used to build the covering.
For example, with the above sequence and subsequences, it would proceed like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>subsequences = [
[R(12), L(8), R(6), R(12)],
[L(8), R(6), R(12), L(6), R(6), R(8), R(6), L(8), R(8), R(6), R(12), R(12)],
[L(8), R(6)],
]
sequence = [
R(12), L(8), R(6), R(12), L(8), R(6), R(12), L(6), R(6), R(8), R(6),
L(8), R(8), R(6), R(12), R(12), L(8), R(6), L(8), R(8), R(6), R(12),
R(12), L(8), R(6), R(12), L(6), R(6), R(8), R(6), L(8), R(8), R(6),
R(12), R(12), L(6), R(6), R(8), R(6)
]
covering = []
Try subsequence 0: success!
covering = [0]
sequence = [
L(8), R(6), R(12), L(6), R(6), R(8), R(6),
L(8), R(8), R(6), R(12), R(12), L(8), R(6), L(8), R(8), R(6), R(12),
R(12), L(8), R(6), R(12), L(6), R(6), R(8), R(6), L(8), R(8), R(6),
R(12), R(12), L(6), R(6), R(8), R(6)
]
Try subsequence 0: failure
Try subsequence 1: success!
covering = [0, 1]
sequence = [
L(8), R(6), L(8), R(8), R(6), R(12),
R(12), L(8), R(6), R(12), L(6), R(6), R(8), R(6), L(8), R(8), R(6),
R(12), R(12), L(6), R(6), R(8), R(6)
]
Try subsequence 0: failure
Try subsequence 1: failure
Try subsequence 2: success!
covering = [0, 1, 2]
sequence = [
L(8), R(8), R(6), R(12),
R(12), L(8), R(6), R(12), L(6), R(6), R(8), R(6), L(8), R(8), R(6),
R(12), R(12), L(6), R(6), R(8), R(6)
]
Try subsequence 0: failure
Try subsequence 1: failure
Try subsequence 2: failure
</code></pre></div></div>
<p>So these three subsequences were not able to cover the original sequence.
We now proceed by removing one element from the back of the last subsequence,
changing <code class="language-plaintext highlighter-rouge">[L(8), R(6)]</code> to <code class="language-plaintext highlighter-rouge">[L(8)]</code>, and try finding a covering again.
This also fails, so we again remove one element from the back.
The third subsequence is now empty, so we remove it
and remove one element from the new last subsequence,
changing <code class="language-plaintext highlighter-rouge">[L(8), R(6), R(12), L(6), R(6), R(8), R(6), L(8), R(8), R(6), R(12), R(12)]</code>
to <code class="language-plaintext highlighter-rouge">[L(8), R(6), R(12), L(6), R(6), R(8), R(6), L(8), R(8), R(6), R(12)]</code>.
Now we go back to the start and find a new longest repeated prefix,
and add that as the third subsequence.
We now have the subsequences:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[R(12), L(8), R(6), R(12)]
[L(8), R(6), R(12), L(6), R(6), R(8), R(6), L(8), R(8), R(6), R(12)]
[R(12), L(8), R(6)]
</code></pre></div></div>
<p>and we continue trying to find a covering until we’ve also reduced
the second subsequence to just <code class="language-plaintext highlighter-rouge">[L(8)]</code>.
After that fails, we similarly remove the last element from the first subsequence,
replenish the last two subsequences with new longest prefixes, and continue the search.
Eventually we end up with the subsequences</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[R(12), L(8), R(6)]
[R(12), L(6), R(6), R(8), R(6)]
[L(8), R(8), R(6), R(12)]
</code></pre></div></div>
<p>with which we do manage to find a covering: <code class="language-plaintext highlighter-rouge">[0, 0, 1, 2, 0, 2, 0, 1, 2, 1]</code>,
meaning subsequence 0 twice, subsequence 1 and 2 once each, and so on.</p>
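<p>The recursive covering search described earlier can be sketched roughly as follows.
This is a minimal sketch with hypothetical names, and it only covers the search for a covering,
not the outer loop that shrinks and replenishes the subsequences.</p>

```rust
/// One compressed path step: turn left or right, then move forward.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Step {
    L(usize),
    R(usize),
}

/// Try to express `sequence` as a concatenation of the given subsequences.
/// Returns the indices of the subsequences used, in order, if a covering exists.
fn find_covering(sequence: &[Step], subsequences: &[Vec<Step>]) -> Option<Vec<usize>> {
    if sequence.is_empty() {
        // Nothing left to cover: the covering is complete.
        return Some(Vec::new());
    }
    for (index, sub) in subsequences.iter().enumerate() {
        if !sub.is_empty() && sequence.starts_with(sub) {
            // Remove the prefix and recursively cover the rest of the sequence.
            if let Some(mut rest) = find_covering(&sequence[sub.len()..], subsequences) {
                rest.insert(0, index);
                return Some(rest);
            }
        }
    }
    None
}
```

<p>Unlike the walkthrough above, this sketch also backtracks and tries the remaining subsequences
when a later recursive call fails, which makes no difference for the sequences shown here.</p>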
<p>And, well, that turned out to work just fine for my puzzle input.
I don’t know if other people’s puzzle inputs require you
to break the sequence into smaller segments,
but by the “you’re probably not special” principle, I’m guessing not.
Either way, my solution also works for the uncompressed path
but takes significantly more attempts to find a set of covering subsequences -
1990 attempts on my puzzle input, compared to 19 attempts with the compressed path,
although it only takes about 10 times longer to run.</p>
<p>I don’t know if there’s a better way to do this,
but since my runtime is already less than 2 ms,
I haven’t really tried to find one
since there was more room for improvement in other solutions.</p>
Wed, 26 Aug 2020 00:00:00 +0200
https://emlun.se/advent-of-code-2019/2020/08/26/day-17.html
https://emlun.se/advent-of-code-2019/2020/08/26/day-17.htmladvent-of-codealgorithmsdata-structuresenglishprogrammingrustperformanceadvent-of-code-2019Advent of Code 2019 in 130 ms: day 16<h2 id="day-16-flawed-frequency-transmission">Day 16: Flawed Frequency Transmission</h2>
<p><a href="https://adventofcode.com/2019/day/16">Challenge 16</a> is probably the one I’ve spent the most time on optimizing,
but <a href="https://github.com/emlun/adventofcode-2019/blob/master/src/days/day16.rs">my solution</a> still takes 8.6 ms to run.
There’s quite a lot to unpack here, so this one gets a whole post of its own!</p>
<!--more-->
<p>Let <code class="language-plaintext highlighter-rouge">d(p, i)</code> be the <code class="language-plaintext highlighter-rouge">i</code>th signal digit after <code class="language-plaintext highlighter-rouge">p</code> phases,
where <code class="language-plaintext highlighter-rouge">p = 0, 1, ..., 100</code> and <code class="language-plaintext highlighter-rouge">i = 0, 1, ..., (L - 1)</code> with <code class="language-plaintext highlighter-rouge">L</code> as the number of digits.
<code class="language-plaintext highlighter-rouge">d(0, ...)</code> is the puzzle input, and <code class="language-plaintext highlighter-rouge">d(p + 1, ...)</code> is computed from <code class="language-plaintext highlighter-rouge">d(p, ...)</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>d(p + 1, i) = abs(sum(d(p, j) * k(i, j) for j = 0, 1, ..., (L - 1))) mod 10
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">k(i, j)</code> is the repeating pattern of multipliers.
To generate <code class="language-plaintext highlighter-rouge">k(i, ...)</code>, we start with the base sequence <code class="language-plaintext highlighter-rouge">0, 1, 0, -1</code> repeating infinitely.
<code class="language-plaintext highlighter-rouge">k(0, ...)</code> takes this sequence unmodified,
<code class="language-plaintext highlighter-rouge">k(1, ...)</code> stretches it so each element appears twice in a row,
<code class="language-plaintext highlighter-rouge">k(2, ...)</code> stretches to three times in a row, etc.,
and finally each <code class="language-plaintext highlighter-rouge">k(i, ...)</code> ignores the very first element.
We can visualize this as a matrix as shown in <a href="#fig04">figure 4</a>.</p>
<figure id="fig04">
<pre>
j | - | 0 1 2 3 4 5 6 7 8 9 10 ...
i k(i, j) | |
-----------+---+-------------------------------------------
0 | 0 | 1 0 -1 0 1 0 -1 0 1 0 -1
1 | 0 | 0 1 1 0 0 -1 -1 0 0 1 1
2 | 0 | 0 0 1 1 1 0 0 0 -1 -1 -1
3 | 0 | 0 0 0 1 1 1 1 0 0 0 0
4 | 0 | 0 0 0 0 1 1 1 1 1 0 0
5 | 0 | 0 0 0 0 0 1 1 1 1 1 1
6 | 0 | 0 0 0 0 0 0 1 1 1 1 1
7 | 0 | 0 0 0 0 0 0 0 1 1 1 1
8 | 0 | 0 0 0 0 0 0 0 0 1 1 1
9 | 0 | 0 0 0 0 0 0 0 0 0 1 1
10 | 0 | 0 0 0 0 0 0 0 0 0 0 1
...
</pre>
<figcaption>
Figure 4: The multiplier pattern
</figcaption>
</figure>
<p>For part 1, we need to compute <code class="language-plaintext highlighter-rouge">d(100, 0, 1, ..., 7)</code>.
My solution doesn’t do anything particularly fancy,
but it uses a few tricks to avoid computing multiplications.
First, it computes the contributions from positive and negative terms
separately and completely skips the zero terms.
By inspecting <a href="#fig04">figure 4</a> we can work out that the pattern is
that positive terms appear in runs of <code class="language-plaintext highlighter-rouge">i + 1</code> elements
every <code class="language-plaintext highlighter-rouge">(i + 1) * 4</code> terms, starting with the <code class="language-plaintext highlighter-rouge">i</code>th,
and negative terms appear in runs of <code class="language-plaintext highlighter-rouge">i + 1</code> elements
every <code class="language-plaintext highlighter-rouge">(i + 1) * 4</code> terms, starting with the <code class="language-plaintext highlighter-rouge">i + (i + 1) * 2</code>th.
In Rust, this can be expressed fairly easily using the <code class="language-plaintext highlighter-rouge">step_by</code>,
<code class="language-plaintext highlighter-rouge">flat_map</code> and <code class="language-plaintext highlighter-rouge">sum</code> methods that the <code class="language-plaintext highlighter-rouge">Range</code> type provides through the <code class="language-plaintext highlighter-rouge">Iterator</code> trait.
This optimization approximately halves runtime
compared to computing the multiplication for each element.</p>
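<p>A minimal sketch of one phase computed this way, assuming the digits are stored as a slice of <code class="language-plaintext highlighter-rouge">i32</code>.
The function name is hypothetical and the code is an illustration of the idea, not the actual solution.</p>

```rust
/// Compute one phase of the transform, skipping the zero multipliers.
fn next_phase(digits: &[i32]) -> Vec<i32> {
    let len = digits.len();
    (0..len)
        .map(|i| {
            let run = i + 1;
            // Sum the runs of `run` digits that start at `first`
            // and repeat every `run * 4` positions.
            let sum_runs = |first: usize| -> i32 {
                (first..len)
                    .step_by(run * 4)
                    .flat_map(|start| digits[start..len.min(start + run)].iter())
                    .sum()
            };
            let positive = sum_runs(i);
            let negative = sum_runs(i + run * 2);
            // Only the ones digit of the result is kept.
            (positive - negative).abs() % 10
        })
        .collect()
}
```

<p>With the input <code class="language-plaintext highlighter-rouge">12345678</code>, one phase gives <code class="language-plaintext highlighter-rouge">48226158</code> and a second phase gives <code class="language-plaintext highlighter-rouge">34040438</code>.</p>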
<p>We can go further, though:
we can see in <a href="#fig04">figure 4</a> that if <code class="language-plaintext highlighter-rouge">i</code> is past <code class="language-plaintext highlighter-rouge">L / 2</code>,
then the multipliers will be 0 for <code class="language-plaintext highlighter-rouge">j < i</code> and 1 for <code class="language-plaintext highlighter-rouge">j >= i</code>,
so we can simplify the formula to</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>d(p + 1, i) = sum(d(p, j) for j = i, i + 1, ..., (L - 1)) mod 10
where i >= floor(L / 2)
</code></pre></div></div>
<p>Similarly, if <code class="language-plaintext highlighter-rouge">i</code> is past <code class="language-plaintext highlighter-rouge">L / 3</code>,
then the multipliers are just a stretch of ones on the middle third:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>d(p + 1, i) = sum(d(p, j) for j = i, i + 1, ..., (i + i)) mod 10
where i >= floor(L / 3) and i < floor(L / 2)
</code></pre></div></div>
<p>Finally, if <code class="language-plaintext highlighter-rouge">i</code> is past <code class="language-plaintext highlighter-rouge">L / 4</code>,
then the multipliers are one stretch of ones and one stretch of negative ones:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>d(p + 1, i) = abs( sum(d(p, j) for j = i, i + 1, ..., (i + i))
- sum(d(p, j) for j = (3i + 2), (3i + 3), ..., min(3i + 2 + i, L - 1))
) mod 10
where i >= floor(L / 4) and i < floor(L / 3)
</code></pre></div></div>
<p>These three additional optimizations approximately halve the runtime again,
bringing us down to about one fourth the runtime
compared to the multiplication method.</p>
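<p>As an illustration of the first of these simplifications, here is a minimal sketch for the back half of the signal.
It assumes a running sum from the end of the signal, which is one possible way to evaluate the formula for all <code class="language-plaintext highlighter-rouge">i</code> at once;
the function name is hypothetical.</p>

```rust
/// Compute the next phase for the digits from index `from` onwards,
/// assuming `from >= digits.len() / 2` so that every multiplier is 0 or 1.
fn next_phase_back_half(digits: &[i32], from: usize) -> Vec<i32> {
    let mut next = vec![0; digits.len() - from];
    let mut sum = 0;
    // Walk backwards so each digit is the previous suffix sum plus one term.
    for i in (from..digits.len()).rev() {
        sum = (sum + digits[i]) % 10;
        next[i - from] = sum;
    }
    next
}
```

<p>For the input <code class="language-plaintext highlighter-rouge">12345678</code> and <code class="language-plaintext highlighter-rouge">from = 4</code> this gives <code class="language-plaintext highlighter-rouge">6158</code>, the last four digits after one phase.</p>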
<p>Also interesting is that although we’re working with single-digit numbers,
it turns out to be faster to store the digits as <code class="language-plaintext highlighter-rouge">i32</code> rather than <code class="language-plaintext highlighter-rouge">i16</code> or <code class="language-plaintext highlighter-rouge">i8</code>.
With <code class="language-plaintext highlighter-rouge">i8</code> you need to compute a modulo operation after each addition,
while with <code class="language-plaintext highlighter-rouge">i16</code> and <code class="language-plaintext highlighter-rouge">i32</code> you can compute the sum over all the digits
and the modulo afterwards.
I’m guessing the reason <code class="language-plaintext highlighter-rouge">i32</code> is slightly faster than <code class="language-plaintext highlighter-rouge">i16</code> is that
my processor is more optimized for 32-bit than 16-bit arithmetic.
Furthermore, Rust’s <code class="language-plaintext highlighter-rouge">Iterator::sum</code> turns out to be much faster
than summing with <code class="language-plaintext highlighter-rouge">Iterator::fold</code>.</p>
<h3 id="part-2">Part 2</h3>
<p>For part 2, it gets more complicated.
The digit sequence is now repeated 10,000 times,
which means our part 1 solution would take 100,000,000 times longer to run
since every digit of every phase depends on every other digit in the previous phase.
We need to be a lot smarter about this one.</p>
<p>Fortunately, this time we’re not computing the first 8 digits of the 100th phase,
but the first 8 digits starting at an offset defined by the puzzle input.
The message offset is the first 7 digits of the puzzle input,
which in my case is 5,975,093.
My puzzle input is 650 digits long, so the digit sequence is 6,500,000 digits long.
A crucial observation here is that the offset is past half the digit sequence.
If we assume this will always be the case, we can take some huge shortcuts.</p>
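<p>A minimal sketch of reading the offset and checking that assumption, with hypothetical function names.
It assumes the input is a string of ASCII digits at least 7 characters long.</p>

```rust
/// Read the message offset from the first 7 digits of the input.
fn message_offset(input: &str) -> usize {
    input.trim()[..7]
        .parse()
        .expect("input should start with 7 digits")
}

/// Check the assumption that the offset lies in the back half of the
/// signal obtained by repeating the input `repetitions` times.
fn offset_in_back_half(input: &str, repetitions: usize) -> bool {
    let total_len = input.trim().len() * repetitions;
    message_offset(input) >= total_len / 2
}
```

<p>With the numbers above, the check compares the offset 5,975,093 against half of 6,500,000.</p>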
<p>Recall that in <a href="#fig04">figure 4</a>, the lower half of the matrix
is all ones on and above the diagonal, and all zeroes below it -
what’s known as an <em>upper triangular matrix</em>.
If we expand the formulae for each digit in the next phase,
starting from the end, we get this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>d(p + 1, 10) = d(p, 10)
d(p + 1, 9) = d(p, 9) + d(p, 10)
d(p + 1, 8) = d(p, 8) + d(p, 9) + d(p, 10)
d(p + 1, 7) = d(p, 7) + d(p, 8) + d(p, 9) + d(p, 10)
</code></pre></div></div>
<p>If we proceed a few more phases, we get <a href="#fig05">figure 5</a>:</p>
<figure id="fig05">
<pre>
d(p + 2, 10) = d(p+1, 10)
= 1 d(p, 10)
d(p + 2, 9) = d(p+1, 9) + d(p+1, 10)
= 1 d(p, 9) + 2 d(p, 10)
d(p + 2, 8) = d(p+1, 8) + d(p+1, 9) + d(p+1, 10)
= 1 d(p, 8) + 2 d(p, 9) + 3 d(p, 10)
d(p + 2, 7) = d(p+1, 7) + d(p+1, 8) + d(p+1, 9) + d(p+1, 10)
= 1 d(p, 7) + 2 d(p, 8) + 3 d(p, 9) + 4 d(p, 10)
d(p + 3, 10) = d(p+2, 10)
= 1 d(p, 10)
d(p + 3, 9) = d(p+2, 9) + d(p+2, 10)
= 1 d(p, 9) + 3 d(p, 10)
d(p + 3, 8) = d(p+2, 8) + d(p+2, 9) + d(p+2, 10)
= 1 d(p, 8) + 3 d(p, 9) + 6 d(p, 10)
d(p + 3, 7) = d(p+2, 7) + d(p+2, 8) + d(p+2, 9) + d(p+2, 10)
= 1 d(p, 7) + 3 d(p, 8) + 6 d(p, 9) + 10 d(p, 10)
d(p + 4, 10) = d(p+3, 10)
= 1 d(p, 10)
d(p + 4, 9) = d(p+3, 9) + d(p+3, 10)
= 1 d(p, 9) + 4 d(p, 10)
d(p + 4, 8) = d(p+3, 8) + d(p+3, 9) + d(p+3, 10)
= 1 d(p, 8) + 4 d(p, 9) + 10 d(p, 10)
d(p + 4, 7) = d(p+3, 7) + d(p+3, 8) + d(p+3, 9) + d(p+3, 10)
= 1 d(p, 7) + 4 d(p, 8) + 10 d(p, 9) + 20 d(p, 10)
</pre>
<figcaption>
Figure 5: Expansion for a few phases of the last few digits in a sequence of 11 digits
</figcaption>
</figure>
<p>Here we see a couple of sequences appear: <code class="language-plaintext highlighter-rouge">1, 2, 3, 4</code>, <code class="language-plaintext highlighter-rouge">1, 3, 6, 10</code>, <code class="language-plaintext highlighter-rouge">1, 4, 10, 20</code>.
If those seem familiar,
it’s because those are exactly the diagonals of <a href="https://en.wikipedia.org/wiki/Pascal%27s_triangle">Pascal’s triangle</a>.
This means that if we denote as <code class="language-plaintext highlighter-rouge">P(p, i)</code>
the <code class="language-plaintext highlighter-rouge">i</code>th element (starting from <code class="language-plaintext highlighter-rouge">P(p, 0) = 1</code>)
of the <code class="language-plaintext highlighter-rouge">p</code>th diagonal (starting from <code class="language-plaintext highlighter-rouge">P(1, ...) = 1, 1, 1, ...</code>,
the coefficients after one phase)
of Pascal’s triangle,
we can compute <code class="language-plaintext highlighter-rouge">d(p, i)</code> as</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>d(p, i) = sum(P(p, j) * d(0, i + j) for j = 0, 1, ..., (L - 1 - i)) mod 10
where i >= floor(L / 2)
</code></pre></div></div>
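<p>A minimal sketch of this formula, building each diagonal from the previous one by running sums.
The function names are hypothetical, and the sketch assumes at least one phase
and a digit index in the back half of the signal.</p>

```rust
/// Coefficients (mod 10) of the input digits after `phases` phases,
/// for digits in the back half. One phase gives 1, 1, 1, ...
/// Assumes `phases >= 1`.
fn phase_coefficients(phases: usize, len: usize) -> Vec<i32> {
    let mut coefficients = vec![1; len];
    for _ in 1..phases {
        // Each diagonal of Pascal's triangle is the running sum of the previous one.
        for j in 1..len {
            coefficients[j] = (coefficients[j] + coefficients[j - 1]) % 10;
        }
    }
    coefficients
}

/// Digit `i` after `phases` phases, assuming `i >= digits.len() / 2`.
fn digit_after_phases(digits: &[i32], phases: usize, i: usize) -> i32 {
    let coefficients = phase_coefficients(phases, digits.len() - i);
    let sum: i32 = coefficients
        .iter()
        .zip(&digits[i..])
        .map(|(c, d)| c * d)
        .sum();
    sum % 10
}
```

<p>For example, four phases give the coefficients <code class="language-plaintext highlighter-rouge">1, 4, 10, 20</code>, which reduce to <code class="language-plaintext highlighter-rouge">1, 4, 0, 0</code> modulo 10.</p>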
<p>Since we’re working modulo 10, we also only need to compute <code class="language-plaintext highlighter-rouge">P(p, j)</code> mod 10,
which means we won’t have any overflow issues despite <code class="language-plaintext highlighter-rouge">P(p, j)</code> growing very quickly
for large <code class="language-plaintext highlighter-rouge">p</code> or <code class="language-plaintext highlighter-rouge">j</code>.
This reduces the number of operations from
100 * (650 * 10,000)<sup>2</sup> = 4,225,000,000,000,000
to about 100 * 8 * (6,500,000 - 5,975,093) = 419,925,600,
a difference of about a factor of 10 million.
This is enough to bring runtime down to a quite feasible ~280 ms,
unlike the naïve method which would take many, many hours.
I believe that finding this first trick, or something similar,
is necessary to solve part 2 at all.
But there are a few more tricks we can use to go even faster!</p>
<p>Even with the previous trick, we still need to compute
524,907 elements each of 100 diagonals of Pascal’s triangle.
There are methods to compute a diagonal on its own without needing the previous ones,
but they rely on multiplication to generate one element from the previous -
and repeated multiplication does not play well under modulo like addition does,
so you quickly run into overflow issues.
I was able to work around this by representing numbers by their prime factorization,
which means you can multiply by simply adding the exponents,
but this turned out to be slower than just computing each diagonal from the previous by addition.</p>
<p>So let’s take a look at what <code class="language-plaintext highlighter-rouge">P(100, ...) mod 10</code> looks like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1, 0, 0, 0, 5, 0, 0, 0, 5, 0, 0, 0, 5, 0, 0, 0, 5, 0, 0, 0, 5, 0, 0, 0, 5, 4, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
</code></pre></div></div>
<p>Pretty sparse. Maybe we can find some pattern in there?
If we plot <code class="language-plaintext highlighter-rouge">P(100, 0 ... 100000) mod 10</code> we get <a href="#fig06">figure 6</a>:</p>
<figure id="fig06">
<div class="images">
<a href="/advent-of-code-2019/pascal-d100.png">
<img src="/advent-of-code-2019/pascal-d100.png" />
</a>
</div>
<figcaption>
Figure 6: The first 100,000 elements of the 100th diagonal of Pascal's triangle modulo 10.
</figcaption>
</figure>
<p>The bands at values 1 and 9 suggest that this sequence may be periodic.
So I cobbled together a simple MATLAB/<a href="https://www.gnu.org/software/octave/">Octave</a> script to check:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>function [ P ] = find_period(seq)
P = -1;
offset = 0;
for period = 1:floor((length(seq)+1-offset)/2)
repetitions = floor(length(seq) / period);
if all(seq == [repmat(seq(1:period), 1, repetitions), seq(1:(mod(length(seq), period)))])
P = period;
break
end
end
end
</code></pre></div></div>
<p>and it turns out that <code class="language-plaintext highlighter-rouge">P(100, ...) mod 10</code> is indeed periodic with 16,000 elements!
This means we only need to compute 100 * 16,000 elements instead of 100 * 524,907,
which reduces runtime for part 2 from ~280 ms to ~28 ms.</p>
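<p>For comparison, roughly the same check could be written in Rust like this.
This is a minimal sketch with a hypothetical function name, not a translation of the script above line by line:
it compares each element with the one a full period later instead of building the repeated sequence.</p>

```rust
/// Find the smallest period of `seq`, or `None` if no period fits
/// at least twice within the given elements.
fn find_period(seq: &[i32]) -> Option<usize> {
    (1..=seq.len() / 2).find(|&period| {
        // `seq` has this period if every element equals the one `period` steps later.
        seq.iter().zip(&seq[period..]).all(|(a, b)| a == b)
    })
}
```

<p>Like the script, this only finds a period if the input covers at least two full repetitions.</p>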
<p>The next step is to make even more use of this periodicity:
since our digit sequence is also periodic with 650 elements,
this means the sequences of products are also periodic
with at most <code class="language-plaintext highlighter-rouge">lcm(650, 16000) = 208000</code> elements.
We have 524,907 digits to process, which is about two and a half cycles,
so we only need to compute the first and last cycles.
We can thus reduce our formula to:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>c = 16000
C = lcm(c, 650) = 208000
N = floor((L - message_offset) / C)
d(p, i) = ( N * sum(P(p, j mod c) * d(0, i + j) for j = 0, 1, ..., (C - 1))
+ sum(P(p, j mod c) * d(0, i + j) for j = N * C, N * C + 1, ..., (L - 1 - i))
) mod 10
where i >= floor(L / 2)
</code></pre></div></div>
<p>This saves about an additional third of runtime, bringing part 2 down to ~19 ms.
For my puzzle input it happens that the two full cycles sum to zero,
so it might be possible to eliminate those altogether,
but I haven’t been able to prove this will always be true.
Anyway, some more implementation optimizations further reduce the time to ~11 ms:
storing digits as <code class="language-plaintext highlighter-rouge">i32</code>, eliminating unnecessary intermediate <code class="language-plaintext highlighter-rouge">Vec</code> allocations,
hard-coding the number of phases (100) and the period of Pascal’s triangle (16,000),
and using <code class="language-plaintext highlighter-rouge">Iterator::sum</code> instead of <code class="language-plaintext highlighter-rouge">fold</code>.</p>
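<p>The cycle-based sum can be sketched roughly as follows.
This is a minimal sketch assuming one period of the coefficients and one period of the digits are given as slices;
all names are hypothetical, and the number of full cycles is used as the multiplier for the sum over one cycle.</p>

```rust
fn gcd(a: usize, b: usize) -> usize {
    if b == 0 { a } else { gcd(b, a % b) }
}

fn lcm(a: usize, b: usize) -> usize {
    a / gcd(a, b) * b
}

/// Sum of `coefficients[j mod c] * digits[(offset + j) mod m]` for `j` in `0..len`,
/// modulo 10, where `c` and `m` are the lengths (periods) of the two slices.
fn periodic_sum(coefficients: &[i32], digits: &[i32], offset: usize, len: usize) -> i32 {
    let term = |j: usize| coefficients[j % coefficients.len()] * digits[(offset + j) % digits.len()];
    // The products repeat every `cycle` terms.
    let cycle = lcm(coefficients.len(), digits.len());
    let full_cycles = len / cycle;
    // Sum one full cycle once and scale it by the number of full cycles,
    // then add the partial cycle at the end.
    let cycle_sum = (0..cycle).map(&term).sum::<i32>() % 10;
    let tail_sum = (full_cycles * cycle..len).map(&term).sum::<i32>() % 10;
    ((full_cycles % 10) as i32 * cycle_sum + tail_sum) % 10
}
```

<p>With 16,000 coefficients and 650 digits, <code class="language-plaintext highlighter-rouge">cycle</code> works out to 208,000 as above.</p>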
<p>As a final performance optimization, we can abandon good taste
and hard-code the 100th diagonal of Pascal’s triangle.
We can do this since the number of phases is specifically defined in the problem statement,
because Pascal’s triangle is of course the same for any puzzle input,
and because an array of 16,000 elements is after all quite small.
This brings part 2 runtime down to ~3.2 ms,
and both parts down to 8.6 ms total.</p>
Wed, 26 Aug 2020 00:00:00 +0200
https://emlun.se/advent-of-code-2019/2020/08/26/day-16.html
https://emlun.se/advent-of-code-2019/2020/08/26/day-16.htmladvent-of-codealgorithmsdata-structuresenglishprogrammingrustperformanceadvent-of-code-2019