advent of languages day 1

Joseph Montanaro 2024-12-02 10:34:06 -05:00
parent 72a9a2e1f1
commit dfc09d8861

---
title: 'Advent of Languages 2024, Day 1: C'
date: 2024-12-02
---
<script>import Sidenote from '$lib/Sidenote.svelte';</script>
As time goes on, it's becoming increasingly clear to me that I'm a bit of a programming-language dilettante. I'm always finding weird niche languages like [Pony](https://www.ponylang.io/) or [Roc](https://www.roc-lang.org), going "wow that looks cool," spending a bunch of time reading the documentation, and then never actually using it and forgetting all about it for the next three years.
This year, I've decided I'm going to either buck that trend or double down on it, depending on your point of view. Instead of not engaging _at all_ with whatever random language strikes my fancy, I'm going to engage with it to the absolute minimum degree possible, then move on. Win-win, right? I get to _feel_ like I'm being more than a dilettante, but I don't have to do anything hard like _really_ learn a new language.
I should probably mention here, as a disclaimer, that I've never gotten all the way through an AoC in my life, and there's no way I'm going to do _better_ with _more_ problems to worry about. I'm guessing I'll peter out by day 12 or so, that's about as far as I usually get. Oh, and there's no way I'm going to stick to the one-day cadence either. It'll probably be May or so before I decide that enough is enough and I'm going to call it.<Sidenote>It's already December 2nd and I just finished the first half of Day 1, so clearly I'm shooting for more of a "slow and steady wins the race" cadence here.</Sidenote> Also, figuring out a new programming language every day is going to take enough time as it is, so I'm going to do this very stream-of-consciousness style. I apologize in advance for the haphazard organization of this and subsequent posts.
Anyway, I've decided to start with C, mostly because I'm scared of C and Day 1 of AoC is always the easiest, so I won't have to really get into it at all.<Sidenote>Ok, it's _also_ because I know that C doesn't have much in the way of built-in datastructures like hash maps and whatnot, so if you need one you end up having to either figure out how to use third-party C libraries (eugh) or write your own (even worse).</Sidenote>
## [The C Programming Language](https://en.wikipedia.org/wiki/The_C_Programming_Language)
C, of course, needs no introduction. It's known for being small, fast, and the language in which Unix was implemented, not to mention most (all?) other major OS kernels. It's also known for being the Abode of Monsters, i.e. there are few-to-no safeguards, and if you screw up the consequences might range from bad (segfaults in userland), to worse (kernel panics), to catastrophic (your program barfs out millions of users' highly sensitive data to anyone who asks).<Sidenote>To be fair, this last category isn't limited to C. Any language can be insecure if you try hard enough. Yes, even Rust.</Sidenote> I've seen it described as "Like strapping a jet engine to a skateboard - you'll go really fast, but if you screw up you'll end up splattered all over the sidewalk."<Sidenote>Sadly I can no longer locate the original source for this, but I have a vague recollection of it being somewhere on the [varnish](https://varnish-cache.org/) website, or possibly on the blog of someone connected with Varnish.</Sidenote>
All of which explains why I'm just a tad bit apprehensive to dip my toes into C. Thing is, for all its downsides, C is _everywhere_. Not only does it form the base layer of most computing infrastructure like OS kernels and network appliances, it's also far and away the most common language used for all the little computers that form parts of larger systems these days. You know, like cars, industrial controllers, robotics, and so on. So I feel like it would behoove me to at least acquire a passing familiarity with C one of these days, if only to be able to say that I have.
Oh, but to make it _extra_ fun, I've decided to try to get through at least the first part of Day 1 without using _any_ references at all, beyond what's already available on my computer (like manpages and help messages of commands). This is a terrible idea. Don't do things this way. Also, if you're in any way, shape, or form competent in C, please don't read the rest of this post, for your own safety and mine. Thank you.
## Experiments in C
Ok, let's get the basics out of the way first. Given a program, can I actually compile it and make it run? Let's try:
```c
#include "stdio.h" // pretty sure I've seen this a lot, I think it's for stuff like reading from stdin and writing to stdout
int main() { // the `int` means that this function returns an int, I think?
printf("hello, world!");
}
```
Now, I'm not terribly familiar with C toolchains, having mostly used them from several layers of abstraction up, but I'm _pretty_ sure I can't just compile this and run it, right? I think compiling will turn this into "object code", which has all the right bits in it that the computer needs to run it, but in order to put it all in a format that can actually be executed I need to "link" it, right?
Anyway, let's just try it and see.
```
$ cc 01.c
$ ls
>>> 01.c a.out
$ ./a.out
>>> hello, world!
```
Well, what do you know. It actually worked.<Sidenote>Amusingly, I realized later that it was totally by accident that I forgot to put a semicolon after the `#include`, but apparently this is the correct syntax so it just worked.</Sidenote> I guess the linking part is only necessary if you have multiple source files, or something?
## The Puzzle, Part 1
This is pretty encouraging, so let's tackle the actual puzzle for Day 1. [There's a bunch of framing story like there always is](https://adventofcode.com/2024/day/1), but the upshot is that we're given two lists arranged side by side, and asked to match up the smallest number in the first with the smallest number in the second, the second-smallest in the first with the second-smallest in the second, etc. Then we have to find out how far apart each of those pairs is, then add up all of those distances, and the total is our puzzle answer.
This is conceptually very easy, of course (it's only Day 1, after all). Just sort the two lists, iterate over them to grab the pairs, take `abs(a - b)` for each pair, and sum those all up. Piece of cake.
Except of course, that this is C, and I haven't the first idea how to do most of those things in C.<Sidenote>Ok, I'm pretty sure I could handle summing up an array of numbers in C. But how to create the array, or how to populate it with the numbers in question? Not a clue.</Sidenote>
### Loading data
Ok, so first off we'll need to read in the data from a file. That shouldn't be too hard, right? I know `fopen` is a thing, and I am (thankfully) on Linux, so I can just `man fopen` and see what I get, right? _type type_ Aha, yes! half a moment, I'll be back.
Mmmk, so `man fopen` gives me these very helpful snippets:
```
SYNOPSIS
#include <stdio.h>
FILE *fopen(const char *pathname, const char *mode);
(...)
The argument mode points to a string beginning with one of the following sequences (possibly followed by additional characters, as described below):
r Open text file for reading. The stream is positioned at the beginning of the file.
(...)
```
Ok, so let's just try opening the file and then dumping the pointer to console to see what we have.
```c
#include "stdio.h"
int main() {
int f_ptr = fopen("data/01.txt", "r");
printf(f_ptr);
}
```
```
$ cc 01.c
>>> 01.c: In function main:
01.c:4:17: warning: initialization of int from FILE * makes integer from pointer without a cast [-Wint-conversion]
4 | int f_ptr = fopen("data/01.txt", "r");
| ^~~~~
01.c:5:12: warning: passing argument 1 of printf makes pointer from integer without a cast [-Wint-conversion]
5 | printf(f_ptr);
| ^~~~~
| |
| int
In file included from 01.c:1:
/usr/include/stdio.h:356:43: note: expected const char * restrict but argument is of type int
356 | extern int printf (const char *__restrict __format, ...);
| ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~
01.c:5:5: warning: format not a string literal and no format arguments [-Wformat-security]
5 | printf(f_ptr);
```
...oh that's right, this is C. We can't just print an integer; `printf` would interpret that integer as a pointer to a string and probably segfault. In fact...
```
$ ./a.out
>>> Segmentation fault (core dumped)
```
Right. Ok, well, `man` was our friend last time, maybe it can help here too?
`man printf`
Why, yes! Yes it--oh wait, no. No, this isn't right at all.
Oh yeah, `printf` is _also_ a standard Unix shell command, so `man printf` gives you the documentation for _that_. I guess `man fopen` only worked because there's no shell command called `fopen` to shadow the library function (asking for section 3 explicitly, as in `man 3 printf`, would have gotten me the C version). Oh well, let's just see if we can guess the right syntax.
```c
#include "stdio.h"
int main() {
int f_ptr = fopen("data/01.txt", "r");
printf("%i", f_ptr);
}
```
```
$ cc 01.c
$ ./a.out
>>> 832311968
```
Hey, would you look at that! Weirdly enough, so far it's been my Python experience that's helped most, first with the `fopen` flags and now this. I guess Python wears its C heritage with pride.
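For what it's worth, `%i` is not really meant for pointers: on a 64-bit system an `int` likely holds only part of the address. The conversion intended for pointers is `%p`. Here is a minimal sketch; `format_pointer` is a made-up helper name for illustration, not something from the standard library.

```c
#include <stdio.h>

// Hypothetical helper: writes the textual form of a pointer into buf.
// Returns whatever snprintf returns (the number of characters needed).
int format_pointer(char *buf, size_t size, const void *ptr) {
    // %p expects a void pointer; the exact output format is implementation-defined.
    return snprintf(buf, size, "%p", (void *)ptr);
}
```

Calling `format_pointer(out, sizeof out, file)` with the `FILE *` from `fopen` would fill `out` with something like a hexadecimal address, though the exact text depends on the platform.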
I'm cheating a little, by the way. Well, kind of a lot. I switched editors recently and am now using [Zed](https://zed.dev) primarily (for languages it supports, at least), and Zed automatically runs a C language server by default when you're working in C.<Sidenote>Actually, it might be a C++ language server? At least, it keeps suggesting things from `std::` which I think is a C++ thing.</Sidenote> Which is pretty helpful, because now I know the _proper_ (at least, more proper) way to do this is:
```c
FILE *file = fopen("data/01.txt", "r");
```
so now we have a pointer to a `FILE` struct, which we can give to `fread()` I think? `man fread` gives us this:
```
size_t fread(void *ptr, size_t size, size_t nmemb, FILE *stream);
```
Which means, I think, that `fread()` accepts a pointer to a region of memory _into_ which it's reading data, an item size and a number of items,<Sidenote>Which surprised me, I was expecting just a number of bytes to read from the stream. Probably, again, because of my Python experience (e.g. this is how sockets work in Python).</Sidenote> and of course the pointer to the `FILE` struct.
Ok, great. Before we can do that, though, we need to get that first pointer, the one for the destination. `man malloc` is helpful, telling me that I just need to give it a number and it gives me back a `void *` pointer. I think it's `void` because it doesn't really have a type--it's a pointer to uninitialized memory, so you can write to it, but if you try to read from it or otherwise interpret it as being of any particular type it might blow up in your face.
Anyway:
```c
#include "stdio.h"
#include "stdlib.h"
int main() {
FILE *file = fopen("data/01.txt", "r");
void *data = malloc(16384);
size_t data_len = fread(data, 1, 16384, file);
printf("%zu", data_len);
}
```
I happen to know that my personal puzzle input is 14KB, so this will be enough. If the file were bigger, I'd have to either allocate more memory or read it in multiple passes. Oh, the joys of working in a non-memory-managed language.
Running this outputs `14000`, so I think it worked. I'm not sure if there's a performance penalty for using an item size of 1 with `fread`, but I'm guessing not. I highly doubt, for instance, that under the hood this is translating to 14,000 individual syscalls, because that would be a) completely bonkers and b) unnecessary since it already knows ahead of time what the max size of the read operation is going to be.<Sidenote>Reading up on this further suggests that the signature of `fread` is mostly a historical accident, and most people either do `fread(ptr, 1, bufsize, file)` (if reading less than the maximum size is acceptable) or `fread(ptr, bufsize, 1, file)` (if incomplete reads are to be avoided.)</Sidenote>
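One thing the snippet above leaves to luck: `fread` does not add a terminating `'\0'`, and the string functions used later expect one. Freshly allocated memory often happens to be zeroed, but that is not guaranteed. A minimal sketch of a more defensive read, assuming a fixed capacity is acceptable (`read_all` is a hypothetical helper name):

```c
#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper: reads up to `capacity` bytes from `stream` into a
// freshly allocated buffer and NUL-terminates it. Returns NULL if the
// allocation fails. The caller is responsible for calling free().
char *read_all(FILE *stream, size_t capacity, size_t *out_len) {
    char *buf = malloc(capacity + 1);  // one extra byte for the terminator
    if (buf == NULL) {
        return NULL;
    }
    size_t n = fread(buf, 1, capacity, stream);  // n is a byte count because size == 1
    buf[n] = '\0';
    if (out_len != NULL) {
        *out_len = n;
    }
    return buf;
}
```

Since `fopen` returns `NULL` when the file can't be opened, it would be worth checking that before passing the result to a helper like this.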
### Splitting ~~hairs~~ strings
Ok, next up we're going to have to a) somehow split the file into lines, b) split each line on the whitespace that separates the two columns, and c) parse those strings as integers. Some poking at the language server yields references to `strsep`, which appears to do exactly what I'm looking for:
```
char *strsep(char **stringp, const char *delim);
If *stringp is NULL, the strsep() function returns NULL and does nothing else.
Otherwise, this function finds the first token in the string *stringp, that is
delimited by one of the bytes in the string delim. This token is terminated by
overwriting the delimiter with a null byte ('\0'), and *stringp is updated to
point past the token.
```
I'm not quite sure what that `**stringp` business is, though. It wants a pointer to a pointer, I guess?<Sidenote>Coming back to this later, I realized: I think this is just how you do mutable arguments in C? If you just pass in a regular argument it seems to get copied, so changes to it aren't visible to the caller. So instead you pass in a pointer. But in this case, what needs to be mutated is _already a pointer_, so you have to pass a pointer to a pointer.</Sidenote> The language server suggests that `&` is how you create a pointer to something that you already have, so let's try that (by the way, I'm going to stop including all the headers and the `int main()` and all, and just include the relevant bits from now on):
```c
#include "string.h"
char *test = "hello.world";
char* res = strsep(&test, ".");
```
```
$ cc 01.c && ./a.out
>>> Segmentation fault (core dumped)
```
Hmm. That doesn't look too good.
However, further reflection suggests that my issue may just be that I'm using a string literal as `stringp` here, which means (I think) that it's going to be encoded into the data section of my executable, which makes it _not writable_ when the program is running. So you know what, let's just YOLO it. Going back to what we had:
```c
FILE *file = fopen("data/01.txt", "r");
void *data = malloc(16384);
size_t data_len = fread(data, 1, 16384, file);
char* first_line = strsep(&data, "\n");
printf("%s", first_line);
```
Compiling this generates dire warnings about passing an argument of type `void **` to a function expecting `char **`, but this is C, so I can just ignore those and it will operate on the assumption that I know what I'm doing<Sidenote>Oh, you sweet, simple summer child.</Sidenote> and treat that pointer as if it were `char **` anyway. And lo and behold:
```
$ ./a.out
>>> 88450 63363
```
It works!
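To make the pointer-to-a-pointer idea concrete, here is a rough sketch of a function that works the way the manpage describes `strsep`, but for a single delimiter byte only. `take_until` is a made-up name, and this is an illustration of the calling convention, not how any real libc implements `strsep`.

```c
#include <stddef.h>

// Hypothetical helper: returns the text up to the first `stop` byte and
// advances *cursor past it. The caller's pointer is updated through the
// char ** argument, which is why a plain char * would not be enough.
char *take_until(char **cursor, char stop) {
    char *start = *cursor;
    if (start == NULL) {
        return NULL;  // nothing left to split
    }
    char *p = start;
    while (*p != '\0' && *p != stop) {
        p++;
    }
    if (*p == '\0') {
        *cursor = NULL;   // no delimiter found: this was the last token
    } else {
        *p = '\0';        // overwrite the delimiter, so the buffer must be writable
        *cursor = p + 1;
    }
    return start;
}
```

Because the delimiter gets overwritten in place, the input has to live in writable memory. Declaring it as an array, `char text[] = "hello.world";`, makes a writable copy, whereas `char *test = "hello.world";` points at the literal itself.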
Next question: Can I `strsep` on a multi-character delimiter?
```c
char* first_word = strsep(&first_line, "   ");
printf("%s", first_word);
```
```
$ cc 01.c && ./a.out
>>> 88450 6336388450
```
Aw, it didn't w--wait, no it did. It's just still printing `first_line` from above, and not printing a newline after that, so `first_word` gets jammed right up against it. Hooray!
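A caveat on the "multi-character delimiter" question: per the manpage text quoted earlier, `strsep` treats `delim` as a set of bytes, any one of which ends a token, rather than as a sequence that must match in full. The sketch below counts tokens to show the difference. `count_fields` is a hypothetical helper, and the `_DEFAULT_SOURCE` define assumes glibc, since `strsep` is a BSD extension rather than ISO C.

```c
#define _DEFAULT_SOURCE  // glibc feature-test macro; strsep is not part of ISO C
#include <string.h>

// Hypothetical helper: counts the tokens strsep produces, empty ones included.
// Note that `text` is modified in place, so it must be writable.
int count_fields(char *text, const char *delims) {
    int count = 0;
    while (strsep(&text, delims) != NULL) {
        count++;
    }
    return count;
}
```

With the input `"a, b"` and the delimiter string `", "`, this yields three tokens (`"a"`, an empty string, and `"b"`), where a true multi-character split would yield two. The empty tokens are harmless for this puzzle because `strtol` skips leading whitespace anyway.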
### Integer-ation hell
Ok, last piece of the data-loading puzzle is to convert that string to an integer. I'm pretty sure I remember seeing a `strtoi` function in C examples that I've seen before, so let's try that.
Wait, no. There is no `strtoi`, but there _is_ a `strtol` ("string to long integer"), so let's try that instead.
```c
int i = strtol(first_word, NULL, 10);
printf("%i", i);
```
and...
```
$ cc 01.c && ./a.out
>>> 88450
```
Aww yeah. We got integers, baby! (Apparently `int` and `long` are synonymous? At least, they are for me right now on this machine, which is enough to be going on with.)
That second argument to `strtol`, by the way, is apparently `endptr`, about which the manpage has this to say:
```
If endptr is not NULL, strtol() stores the address of the first invalid character
in *endptr. If there were no digits at all, strtol() stores the original value of
nptr in *endptr (and returns 0). In particular, if *nptr is not '\0' but **endptr
is '\0' on return, the entire string is valid.
```
Sounds kind of like we could use that to avoid a second call to `strsep`, but it seems like a six-of-one, half-a-dozen-of-the-other situation, and I'm too lazy to figure it out, so whatever.
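For reference, here is one way the `endptr` approach could look. This is a sketch under the assumption that each line holds exactly two base-10 integers separated by whitespace; `parse_pair` is a made-up name. It also sidesteps the "returns 0 on failure" ambiguity by checking whether `strtol` consumed any characters at all.

```c
#include <errno.h>
#include <stdlib.h>

// Hypothetical helper: parses two whitespace-separated integers from `line`.
// Returns 1 on success and 0 if either number is missing or out of range.
int parse_pair(const char *line, long *left, long *right) {
    char *end = NULL;
    errno = 0;

    long a = strtol(line, &end, 10);
    if (end == line || errno == ERANGE) {
        return 0;  // no digits consumed, or the value overflowed a long
    }

    const char *rest = end;  // strtol skips leading whitespace on the next call
    long b = strtol(rest, &end, 10);
    if (end == rest || errno == ERANGE) {
        return 0;
    }

    *left = a;
    *right = b;
    return 1;
}
```

A line such as `0 0` parses successfully here, whereas a check like `left == 0 && right == 0` would treat it as the end of the input.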
### Who needs arrays, anyway?
Ok, so we have the basic shape of how to parse our input data. Now we just need somewhere to put it, and for that we're obviously going to need an array. Now, as I understand it, C arrays are basically just pointers. The compiler keeps track of the size of the type being pointed to, so when you access an array you're literally just multiplying the index by the item size, adding that to the pointer that marks the start of the array, and praying the whole thing doesn't come crashing down around your ears.
I'm not sure of the appropriate way to create a new array, but I'm pretty sure `malloc` is going to have to be involved somehow, so let's just force the issue. `sizeof` tells me that an `int` (or `long`) has a size of 4 (so it's a 32-bit integer). I don't know exactly how many integers are in my puzzle input, but I know that it's 14,000 bytes long, and each line must consume at least 6 bytes (first number, three spaces, second number, newline), so the absolute upper bound on how many lines I'm dealing with is 2333.333... etc. Since each integer is 4 bytes that means each array will need to be just under 10 KB, but I think it's standard practice to allocate in powers of 2, so whatever, let's just do 16 KiB again.
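The sizing arithmetic above can be written down as a small sketch. The helper names are invented for illustration, and the code uses `sizeof(int)` rather than a hard-coded 4. One caution: `strtol` returns a `long`, and on many 64-bit Linux systems `sizeof(long)` is 8 rather than 4, so it may be worth checking both types with `sizeof` before treating them as interchangeable.

```c
#include <stddef.h>

// Hypothetical helper: upper bound on the number of lines, given that every
// line occupies at least `min_line_bytes` bytes.
size_t max_lines(size_t file_bytes, size_t min_line_bytes) {
    return file_bytes / min_line_bytes;  // integer division rounds down
}

// Hypothetical helper: bytes needed to store one int per line.
size_t column_bytes(size_t file_bytes, size_t min_line_bytes) {
    return max_lines(file_bytes, min_line_bytes) * sizeof(int);
}
```

For a 14,000-byte file with 6-byte minimum lines, `max_lines` gives 2333, and with a 4-byte `int` that comes to 9,332 bytes per column, comfortably inside a 16 KiB allocation.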
Not gonna lie here, this one kind of kicked my butt. I would have expected the syntax for declaring an array to be `int[] arr = ...`, but apparently no, it's actually `int arr[] = ...`. Ok, that's fine, but `int arr[] = malloc(16384)` gets me `error: Invalid initializer`, without telling me what the initializer is.
Okay, fine. I'll look up the proper syntax for Part 2. For now let's just use pointers for everything. Whee! Who now? Safety? Never heard of her. BEHOLD!
```c
void* nums_l = malloc(16384);
void* nums_r = malloc(16384);
int nlines = 0;
while (1) {
char* line = strsep(&data, "\n");
int left = strtol(line, &line, 10);
int right = strtol(line, NULL, 10);
// if `strtol` fails, it apparently just returns 0, how helpful
if (left == 0 && right == 0) {
break;
}
int *addr_l = (int *)(nums_l + nlines * 4);
*addr_l = left;
int *addr_r = (int *)(nums_r + nlines * 4);
*addr_r = right;
nlines++;
}
```
Doesn't that just fill you with warm cozy feelings? No? Huh, must be just me then.
Oh yeah, I did end up figuring out how to do the `endptr` thing with `strtol`, it wasn't too hard.
### Sorting and finishing touches
Ok, next up, we have to sort these arrays. Is there even a sorting algorithm of any kind in the C standard library? I can't find anything promising from the autosuggestions that show up when I type `#include "`, and I don't feel like trying to implement quicksort without even knowing the proper syntax for declaring an array,<Sidenote>Or having a very deep understanding of quicksort, for that matter.</Sidenote> so I guess it's To The Googles We Must Go.
..._Gosh_ darn it, it's literally just called `qsort`. Ok, fine, at least I won't use Google for the _usage_.
You have to pass it a comparison function, which, sure, but that function accepts two arguments of type `const void *`, which makes the compiler scream at me when I attempt to a) pass it a function that takes integer pointers instead, or b) cast the void pointers to integers. Not sure of the proper way to do this so I'm just going to ignore the warnings for now because it _seems_ to work, and...
```c
int cmp(const int *a, const int *b) {
int _a = (int)(*a);
if (*a > *b) {
return 1;
}
else if (*a == *b) {
return 0;
}
else {
return -1;
}
}
// later, in main()
qsort(nums_l, nlines, 4, cmp);
qsort(nums_r, nlines, 4, cmp);
int sum = 0;
for (int i = 0; i < nlines - 1; i++) {
int *left = (int *)(nums_l + i * 4);
int *right = (int *)(nums_r + i * 4);
int diff = *left - *right;
if (diff < 0) {
diff = diff * -1;
}
sum += diff;
}
printf("%i", sum);
```
```
$ cc 01.c && ./a.out
>>> (compiler warnings)
2580759
```
Could it be? Is this it?
...nope. Knew it was too good to be true.
Wait, why am I using `nlines - 1` as my upper bound? I was trying to avoid an off-by-one error, because of course the "array" is "zero-indexed" (or would be if I were using arrays properly) and I didn't want to go past the end. But, of course, I forgot that `i < nlines` will _already_ stop the loop after the iteration where `i = 999`. Duh. That's not even a C thing, I could have made that mistake in JavaScript. Golly. I guess my excuse is that I'm so busy focusing on How To C that I'm forgetting things I already knew?
Anyway, after correcting that error, my answer does in fact validate, so hooray! Part 1 complete!
Ok, before I go on to Part 2, I am _definitely_ looking up how to do arrays.
## Interlude: Arrays in C
Ok, so turns out there are fixed-size arrays (where the size is known at compile time), and there are dynamically-sized arrays, and they work a little differently. Fixed-size arrays can be declared like this:
```c
int arr[] = {1, 2, 3, 4}; // initializer-list syntax; the size (4) is inferred
int buf[100]; // declaring an array of a certain size, but uninitialized
```
Then you have dynamically-sized arrays, where the size of the array might not be known until runtime, so you have to use `malloc` (or `alloca` I guess, but you have to be real careful not to overflow your stack when you do that):
```c
int *arr = malloc(1000 * sizeof(int));
```
That's it, you're done. Apparently the fact that `arr` is declared as what looks like (to me, anyhow) _a pointer to an int_ is enough to tell the compiler, when accessed with square brackets, that this is actually a pointer to an _array_ of ints, and to multiply the index by the appropriate size (so I guess 4 in this case) to get the element at that array index.
Interestingly, with the above snippet, when I started accessing various indexes over 1000 to see what would happen, I got all the way to 32768 before it started to segfault.<Sidenote>The actual boundary is somewhere between 33000 and 34000, not sure where exactly because I got bored trying different indices.</Sidenote> I guess `malloc` doesn't even get out of bed for allocations less than 128 KiB?<Sidenote>Actually, what's probably happening is that `malloc` is requesting a bigger chunk _from the system_ in order to speed up any future allocations I might want to do. So if I were to call `malloc` again, it would just give me another chunk from that same region of memory. But of course C doesn't care what memory I access, it's only the system that enforces memory boundaries, and those are process-wide, so as long as I'm within that same chunk I'm fine. Just a guess, though.</Sidenote>
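The index arithmetic described in this interlude can be spelled out three ways that should all read the same element. This is a sketch with made-up function names; the byte-offset version mirrors the `nums_l + nlines * 4` trick from Part 1, except that it goes through `char *` (arithmetic on `void *` is a compiler extension, not standard C).

```c
#include <stddef.h>

// Ordinary subscript syntax.
int element_by_index(const int *arr, size_t i) {
    return arr[i];
}

// The same thing as pointer arithmetic: adding i advances by i elements.
int element_by_pointer(const int *arr, size_t i) {
    return *(arr + i);
}

// The same thing again with the multiplication done by hand, in bytes.
int element_by_bytes(const int *arr, size_t i) {
    const char *bytes = (const char *)arr;
    return *(const int *)(bytes + i * sizeof(int));
}
```

None of these check bounds, which is consistent with the out-of-range reads above going unnoticed until they crossed into memory the process didn't own.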
Armed with this new superpower, my full solution to Part 1 becomes:<Sidenote>I could (and probably should) also be using fixed-size arrays here, since it doesn't seem like there's any advantage to using `malloc`.</Sidenote>
```c
#include "stdio.h"
#include "stdlib.h"
#include "string.h"
int cmp(const int *a, const int *b) {
if (*a > *b) {
return 1;
}
else if (*a == *b) {
return 0;
}
else {
return -1;
}
}
int main() {
FILE *file = fopen("data/01.txt", "r");
char *data = (char *)malloc(16384);
size_t data_len = fread(data, 1, 16384, file);
int *nums_l = malloc(16384);
int *nums_r = malloc(16384);
int nlines = 0;
while (1) {
char* line = strsep(&data, "\n");
int left = strtol(line, &line, 10);
int right = strtol(line, NULL, 10);
// if `strtol` fails, it apparently just returns 0, how helpful
if (left == 0 && right == 0) {
break;
}
nums_l[nlines] = left;
nums_r[nlines] = right;
nlines++;
}
qsort(nums_l, nlines, 4, cmp);
qsort(nums_r, nlines, 4, cmp);
int sum = 0;
for (int i = 0; i < nlines; i++) {
int diff = nums_l[i] - nums_r[i];
if (diff < 0) {
diff = diff * -1;
}
sum += diff;
}
printf("%i", sum);
}
```
Still getting compiler warnings about `cmp` not matching the required signature, though. Maybe I'll figure that out for Part 2.<Sidenote>Later addendum: I did end up figuring this out. Short version, I was just forgetting that the arguments are pointers. Instead of casting to `int` I needed to cast to `int *`.</Sidenote>
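For comparison, a comparator with the signature `qsort` expects might look like the sketch below, along with a compact version of the Part 1 sum. The names `cmp_int` and `total_distance` are invented here, and the accumulator is a `long` as a precaution rather than a requirement of the puzzle.

```c
#include <stdlib.h>

// Comparator in the shape qsort wants: both arguments arrive as const void *
// and are cast back to the element type before being dereferenced.
static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);  // yields -1, 0, or 1 without overflow risk
}

// Hypothetical helper: sorts both columns in place, then sums the absolute
// differences of the paired elements.
long total_distance(int *left, int *right, size_t n) {
    qsort(left, n, sizeof(int), cmp_int);
    qsort(right, n, sizeof(int), cmp_int);

    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        long diff = (long)left[i] - (long)right[i];
        sum += diff < 0 ? -diff : diff;
    }
    return sum;
}
```

Passing `sizeof(int)` instead of a literal `4` keeps the call correct on platforms where `int` has a different size.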
## Part 2
Part 2 has us counting frequencies instead of doing one-for-one comparisons. For each integer in the left column, we need to multiply it by the number of times it occurs in the right column, then add all those products together.
Obviously the _right_ way to do this would be to count occurrences for every integer in the right column, store those counts in a hash table, and then use those counts as we work through the left column. But C doesn't have a native hash table, and I don't particularly feel like trying to implement one (although I'm sure I would learn a lot more about C that way). But you know what? C is fast, and our arrays of numbers here are only 4 KB. My CPU has **64** KiB of L1 cache, so I'm _pretty sure_ that I can just be super-duper naive about this and iterate over the _entirety_ of the right column for every value in the left column. Sure, it's O(N^2), but N in this case is 1000, and a million operations on data in L1 cache isn't going to take hardly any time at all. So let's give it a shot.
```c
int count_occurrences(int num, int arr[], int len) {
int total = 0;
for (int i = 0; i < len; i++) {
if (arr[i] == num) {
total++;
}
}
return total;
}
int part2(int nums_l[], int nums_r[], int len) {
int score = 0;
for (int i = 0; i < len; i++) {
score += nums_l[i] * count_occurrences(nums_l[i], nums_r, len);
}
return score;
}
int main() {
// ...
int solution_2 = part2(nums_l, nums_r, nlines);
printf("Part 2: %i\n", solution_2);
}
```
And what do you know? It works!
And it takes about 30ms to run. [Schlemiel the Painter](https://www.joelonsoftware.com/2001/12/11/back-to-basics/)? Never heard of him.
I was going to make impressed comments here about how fast C is, but then I decided to try it in Python, and it takes less than a second there too, so... you know. Day 1 is just easy, even for brute-force solutions.
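If N were ever large enough to matter, one alternative that avoids both the hash table and the nested loop is a single merge-style pass. This is a different technique from the brute-force approach above, and it assumes both columns are already sorted ascending, as they are after Part 1. `similarity_sorted` is a made-up name.

```c
#include <stddef.h>

// Hypothetical helper: computes the Part 2 score in one pass over two arrays
// that are both sorted in ascending order.
long similarity_sorted(const int *left, const int *right, size_t n) {
    long score = 0;
    size_t i = 0;  // cursor into left
    size_t j = 0;  // cursor into right

    while (i < n) {
        int v = left[i];

        // Skip right-hand values smaller than v; they can't match v or anything later.
        while (j < n && right[j] < v) {
            j++;
        }

        // Count the run of values equal to v on the right.
        size_t count = 0;
        while (j < n && right[j] == v) {
            count++;
            j++;
        }

        // Apply that count to every copy of v on the left.
        while (i < n && left[i] == v) {
            score += (long)v * (long)count;
            i++;
        }
    }
    return score;
}
```

Each cursor only moves forward, so the pass itself is linear in N once the sorting is done.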
## And That's It
Hey, that wasn't so bad! I'm sure it would have been a lot harder had I waited until one of the later days, but even so, I can kind of see where C lovers are coming from now. It's a little bit freeing to be able to just throw pointers around and cast types to other types because hey, they're all just bytes in the end. I'm sure if I tried to do anything really complex in C, or read someone else's code, it would start to fall apart pretty quickly, but for quick-and-dirty one-off stuff--it's actually pretty good! Plus all the *nix system interfaces are C-native, so next time I'm fiddling with something at the system level I might just whip out the ol' cc and start throwing stuff at the wall to see what sticks.
By the way, if you'd like to hear more of my thoughts on C, I expect to be invited to speak at at least three major C conferences next year<Sidenote>_Are_ there even three major C conferences? I know there are lots for C++, but C++ is a lot more complex than C.</Sidenote> since I am now a Certified Expert Practitioner, and my schedule is filling up fast. Talk to my secretary before it's too late!