Compare commits

...

11 Commits

11 changed files with 1645 additions and 27 deletions


@@ -10,20 +10,35 @@
return null;
}
}
function ext(url) {
}
</script>
<script>
export let href;
export let rel = '';
export let rel = null;
let url = null;
try {
url = new URL(href);
}
catch {}
let isLocal = false;
if (href.startsWith('/') || url?.host === $page.url.host) {
isLocal = true;
}
// if href is not a valid url, assume that it's a relative link
const path = url?.pathname || href;
// set rel="external" on links to static files (i.e. local links with a dot in them)
if (isLocal && path.search(/\.\w+$/) > -1) {
rel = 'external';
}
</script>
{#if href.startsWith('/') || host(href) === $page.host}
<a data-sveltekit-preload-data="hover" {href} {rel}>
<slot></slot>
</a>
{:else}
<a {href}>
<slot></slot>
</a>
{/if}
<a data-sveltekit-preload-data={isLocal ? 'hover' : null} {href} {rel}>
<slot></slot>
</a>


@@ -6,16 +6,12 @@ import fs from 'node:fs';
// build table of contents and inject into frontmatter
export function localRemark() {
return (tree, vfile) => {
if (vfile.data.fm.toc === false) {
return;
}
let toc = [];
let description = null;
visit(tree, ['heading', 'paragraph'], node => {
// build table of contents and inject into frontmatter
if (node.type === 'heading') {
if (node.type === 'heading' && vfile.data.fm.toc !== false) {
toc.push({
text: toString(node),
depth: node.depth,
@@ -28,7 +24,9 @@ export function localRemark() {
}
});
vfile.data.fm.toc = toc;
if (vfile.data.fm.toc !== false) {
vfile.data.fm.toc = toc;
}
vfile.data.fm.description = description;
}
}


@@ -4,7 +4,7 @@ import { postData } from '../_posts/all.js';
export function load({ params }) {
const i = postData.findIndex(p => p.slug === params.slug);
return {
prev: i > 0 ? postData[i - 1].slug : null,
next: i < postData.length - 1 ? postData[i + 1].slug : null,
prev: i < postData.length - 1 ? postData[i + 1].slug : null,
next: i > 0 ? postData[i - 1].slug : null,
};
}


@@ -0,0 +1,503 @@
---
title: 'Advent of Languages 2024, Day 1: C'
date: 2024-12-02
---
<script>import Sidenote from '$lib/Sidenote.svelte';</script>
As time goes on, it's becoming increasingly clear to me that I'm a bit of a programming-language dilettante. I'm always finding weird niche languages like [Pony](https://www.ponylang.io/) or [Roc](https://www.roc-lang.org), going "wow that looks cool," spending a bunch of time reading the documentation, and then never actually using it and forgetting all about it for the next three years.
This year, I've decided I'm going to either buck that trend or double down on it, depending on your point of view. Instead of not engaging _at all_ with whatever random language strikes my fancy, I'm going to engage with it to the absolute minimum degree possible, then move on. Win-win, right? I get to _feel_ like I'm being more than a dilettante, but I don't have to do anything hard like _really_ learn a new language.
I should probably mention here, as a disclaimer, that I've never gotten all the way through an AoC in my life, and there's no way I'm going to do _better_ with _more_ problems to worry about. I'm guessing I'll peter out by day 12 or so, that's about as far as I usually get. Oh, and there's no way I'm going to stick to the one-day cadence either. It'll probably be May or so before I decide that enough is enough and I'm going to call it.<Sidenote>It's already December 2nd and I just finished the first half of Day 1, so clearly I'm shooting for more of a "slow and steady wins the race" cadence here.</Sidenote> Also, figuring out a new programming language every day is going to take enough time as it is, so I'm going to do this very stream-of-consciousness style. I apologize in advance for the haphazard organization of this and subsequent posts.
Anyway, I've decided to start with C, mostly because I'm scared of C and Day 1 of AoC is always the easiest, so I won't have to really get into it at all.<Sidenote>Ok, it's _also_ because I know that C doesn't have much in the way of built-in data structures like hash maps and whatnot, so if you need one you end up having to either figure out how to use third-party C libraries (eugh) or write your own (even worse).</Sidenote>
## [The C Programming Language](https://en.wikipedia.org/wiki/The_C_Programming_Language)
C, of course, needs no introduction. It's known for being small, fast, and the language in which Unix was implemented, not to mention most (all?) other major OS kernels. It's also known for being the Abode of Monsters, i.e. there are few-to-no safeguards, and if you screw up the consequences might range from bad (segfaults in userland), to worse (kernel panics), to catastrophic (your program barfs out millions of users' highly sensitive data to anyone who asks).<Sidenote>To be fair, this last category isn't limited to C. Any language can be insecure if you try hard enough. Yes, even Rust.</Sidenote> I've seen it described as "Like strapping a jet engine to a skateboard - you'll go really fast, but if you screw up you'll end up splattered all over the sidewalk."<Sidenote>Sadly I can no longer locate the original source for this, but I have a vague recollection of it being somewhere on the [varnish](https://varnish-cache.org/) website, or possibly on the blog of someone connected with Varnish.</Sidenote>
All of which explains why I'm just a tad bit apprehensive to dip my toes into C. Thing is, for all its downsides, C is _everywhere_. Not only does it form the base layer of most computing infrastructure like OS kernels and network appliances, it's also far and away the most common language used for all the little computers that form parts of larger systems these days. You know, like cars, industrial controllers, robotics, and so on. So I feel like it would behoove me to at least acquire a passing familiarity with C one of these days, if only to be able to say that I have.
Oh, but to make it _extra_ fun, I've decided to try to get through at least the first part of Day 1 without using _any_ references at all, beyond what's already available on my computer (like manpages and help messages of commands). This is a terrible idea. Don't do things this way. Also, if you're in any way, shape, or form competent in C, please don't read the rest of this post, for your own safety and mine. Thank you.
## Experiments in C
Ok, let's get the basics out of the way first. Given a program, can I actually compile it and make it run? Let's try:
```c
#include "stdio.h" // pretty sure I've seen this a lot, I think it's for stuff like reading from stdin and writing to stdout
int main() { // the `int` means that this function returns an int, I think?
printf("hello, world!");
}
```
Now, I'm not terribly familiar with C toolchains, having mostly used them from several layers of abstraction up, but I'm _pretty_ sure I can't just compile this and run it, right? I think compiling will turn this into "object code", which has all the right bits in it that the computer needs to run it, but in order to put it all in a format that can actually be executed I need to "link" it, right?
Anyway, let's just try it and see.
```
$ cc 01.c
$ ls
>>> 01.c a.out
$ ./a.out
>>> hello, world!
```
Well, what do you know. It actually worked.<Sidenote>Amusingly, I realized later that it was totally by accident that I forgot to put a semicolon after the `#include`, but apparently this is the correct syntax so it just worked.</Sidenote> I guess the linking part is only necessary if you have multiple source files, or something?
## The Puzzle, Part 1
This is pretty encouraging, so let's tackle the actual puzzle for Day 1. [There's a bunch of framing story like there always is](https://adventofcode.com/2024/day/1), but the upshot is that we're given two lists arranged side by side, and asked to match up the smallest number in the first with the smallest number in the second, the second-smallest in the first with the second-smallest in the second, etc. Then we have to find out how far apart each of those pairs is, then add up all of those distances, and the total is our puzzle answer.
This is conceptually very easy, of course (it's only Day 1, after all). Just sort the two lists, iterate over them to grab the pairs, take `abs(a - b)` for each pair, and sum those all up. Piece of cake.
Except of course, that this is C, and I haven't the first idea how to do most of those things in C.<Sidenote>Ok, I'm pretty sure I could handle summing up an array of numbers in C. But how to create the array, or how to populate it with the numbers in question? Not a clue.</Sidenote>
### Loading data
Ok, so first off we'll need to read in the data from a file. That shouldn't be too hard, right? I know `fopen` is a thing, and I am (thankfully) on Linux, so I can just `man fopen` and see what I get, right? _type type_ Aha, yes! Half a moment, I'll be back.
Mmmk, so `man fopen` gives me these very helpful snippets:
```
SYNOPSIS
#include <stdio.h>
FILE *fopen(const char *pathname, const char *mode);
(...)
The argument mode points to a string beginning with one of the following sequences (possibly followed by additional characters, as described below):
r Open text file for reading. The stream is positioned at the beginning of the file.
(...)
```
Ok, so let's just try opening the file and then dumping the pointer to console to see what we have.
```c
#include "stdio.h"
int main() {
int f_ptr = fopen("data/01.txt", "r");
printf(f_ptr);
}
```
```
$ cc 01.c
>>> 01.c: In function main:
01.c:4:17: warning: initialization of int from FILE * makes integer from pointer without a cast [-Wint-conversion]
4 | int f_ptr = fopen("data/01.txt", "r");
| ^~~~~
01.c:5:12: warning: passing argument 1 of printf makes pointer from integer without a cast [-Wint-conversion]
5 | printf(f_ptr);
| ^~~~~
| |
| int
In file included from 01.c:1:
/usr/include/stdio.h:356:43: note: expected const char * restrict but argument is of type int
356 | extern int printf (const char *__restrict __format, ...);
| ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~
01.c:5:5: warning: format not a string literal and no format arguments [-Wformat-security]
5 | printf(f_ptr);
```
...oh that's right, this is C. We can't just hand `printf` an integer; it would interpret that integer as a pointer to a string and probably segfault. In fact...
```
$ ./a.out
>>> Segmentation fault (core dumped)
```
Right. Ok, well, `man` was our friend last time, maybe it can help here too?
`man printf`
Why, yes! Yes it--oh wait, no. No, this isn't right at all.
Oh yeah, `printf` is _also_ a standard Unix shell command, so `man printf` gives you the documentation for _that_. I guess `man fopen` only worked because `fopen` is a syscall, as well as a library function. Oh well, let's just see if we can guess the right syntax.
```c
#include "stdio.h"
int main() {
int f_ptr = fopen("data/01.txt", "r");
printf("%i", f_ptr);
}
```
```
$ cc 01.c
$ ./a.out
>>> 832311968
```
Hey, would you look at that! Weirdly enough, so far it's been my Python experience that's helped most, first with the `fopen` flags and now this. I guess Python wears its C heritage with pride.
I'm cheating a little, by the way. Well, kind of a lot. I switched editors recently and am now using [Zed](https://zed.dev) primarily (for languages it supports, at least), and Zed automatically runs a C language server by default when you're working in C.<Sidenote>Actually, it might be a C++ language server? At least, it keeps suggesting things from `std::` which I think is a C++ thing.</Sidenote> Which is pretty helpful, because now I know the _proper_ (at least, more proper) way to do this is:
```c
FILE *file = fopen("data/01.txt", "r");
```
so now we have a pointer to a `FILE` struct, which we can give to `fread()` I think? `man fread` gives us this:
```
size_t fread(void *ptr, size_t size, size_t nmemb, FILE *stream);
```
Which means, I think, that `fread()` accepts a pointer to a region of memory _into_ which it's reading data, an item size and a number of items,<Sidenote>Which surprised me, I was expecting just a number of bytes to read from the stream. Probably, again, because of my Python experience (e.g. this is how sockets work in Python).</Sidenote> and of course the pointer to the `FILE` struct.
Ok, great. Before we can do that, though, we need to get that first pointer, the one for the destination. `man malloc` is helpful, telling me that I just need to give it a number and it gives me back a `void *` pointer. I think it's `void` because it doesn't really have a type--it's a pointer to uninitialized memory, so you can write to it, but if you try to read from it or otherwise interpret it as being of any particular type it might blow up in your face.
Anyway:
```c
#include "stdio.h"
#include "stdlib.h"
int main() {
FILE *file = fopen("data/01.txt", "r");
void *data = malloc(16384);
size_t data_len = fread(data, 1, 16384, file);
printf("%zu", data_len);
}
```
I happen to know that my personal puzzle input is 14KB, so this will be enough. If the file were bigger, I'd have to either allocate more memory or read it in multiple passes. Oh, the joys of working in a non-memory-managed language.
Running this outputs `14000`, so I think it worked. I'm not sure if there's a performance penalty for using an item size of 1 with `fread`, but I'm guessing not. I highly doubt, for instance, that under the hood this is translating to 14,000 individual syscalls, because that would be a) completely bonkers and b) unnecessary since it already knows ahead of time what the max size of the read operation is going to be.<Sidenote>Reading up on this further suggests that the signature of `fread` is mostly a historical accident, and most people either do `fread(ptr, 1, bufsize, file)` (if reading less than the maximum size is acceptable) or `fread(ptr, bufsize, 1, file)` (if incomplete reads are to be avoided.)</Sidenote>
### Splitting ~~hairs~~ strings
Ok, next up we're going to have to a) somehow split the file into lines, b) split each line on the whitespace that separates the two columns, and c) parse those strings as integers. Some poking at the language server yields references to `strsep`, which appears to do exactly what I'm looking for:
```
char *strsep(char **stringp, const char *delim);
If *stringp is NULL, the strsep() function returns NULL and does nothing else.
Otherwise, this function finds the first token in the string *stringp, that is
delimited by one of the bytes in the string delim. This token is terminated by
overwriting the delimiter with a null byte ('\0'), and *stringp is updated to
point past the token.
```
I'm not quite sure what that `**stringp` business is, though. It wants a pointer to a pointer, I guess?<Sidenote>Coming back to this later, I realized: I think this is just how you do mutable arguments in C? If you just pass in a regular argument it seems to get copied, so changes to it aren't visible to the caller. So instead you pass in a pointer. But in this case, what needs to be mutated is _already a pointer_, so you have to pass a pointer to a pointer.</Sidenote> The language server suggests that `&` is how you create a pointer to something that you already have, so let's try that (by the way, I'm going to stop including all the headers and the `int main()` and all, and just include the relevant bits from now on):
```c
#include "string.h"
char *test = "hello.world";
char* res = strsep(&test, ".");
```
```
$ cc 01.c && ./a.out
>>> Segmentation fault (core dumped)
```
Hmm. That doesn't look too good.
However, further reflection suggests that my issue may just be that I'm using a string literal as `stringp` here, which means (I think) that it's going to be encoded into the data section of my executable, which makes it _not writable_ when the program is running. So you know what, let's just YOLO it. Going back to what we had:
```c
FILE *file = fopen("data/01.txt", "r");
void *data = malloc(16384);
size_t data_len = fread(data, 1, 16384, file);
char* first_line = strsep(&data, "\n");
printf("%s", first_line);
```
Compiling this generates dire warnings about passing an argument of type `void **` to a function expecting `char **`, but this is C, so I can just ignore those and it will operate on the assumption that I know what I'm doing<Sidenote>Oh, you sweet, simple summer child.</Sidenote> and treat that pointer as if it were `char **` anyway. And lo and behold:
```
$ ./a.out
>>> 88450 63363
```
It works!
Next question: Can I `strsep` on a multi-character delimiter?
```c
char* first_word = strsep(&first_line, " ");
printf("%s", first_word);
```
```
$ cc 01.c && ./a.out
>>> 88450 6336388450
```
Aw, it didn't w--wait, no it did. It's just still printing `first_line` from above, and not printing a newline after that, so `first_word` gets jammed right up against it. Hooray!
### Integer-ation hell
Ok, last piece of the data-loading puzzle is to convert that string to an integer. I'm pretty sure I remember seeing a `strtoi` function in C examples that I've seen before, so let's try that.
Wait, no. There is no `strtoi`, but there _is_ a `strtol` ("string to long integer"), so let's try that instead.
```c
int i = strtol(first_word, NULL, 10);
printf("%i", i);
```
and...
```
$ cc 01.c && ./a.out
>>> 88450
```
Aww yeah. We got integers, baby! (Apparently `int` and `long` are synonymous? At least, they are for me right now on this machine, which is enough to be going on with.)
That second argument to `strtol`, by the way, is apparently `endptr`, about which the manpage has this to say:
```
If endptr is not NULL, strtol() stores the address of the first invalid character
in *endptr. If there were no digits at all, strtol() stores the original value of
nptr in *endptr (and returns 0). In particular, if *nptr is not '\0' but **endptr
is '\0' on return, the entire string is valid.
```
Sounds kind of like we could use that to avoid a second call to `strsep`, but it seems like a six-of-one, half-a-dozen-of-the-other situation, and I'm too lazy to figure it out, so whatever.
### Who needs arrays, anyway?
Ok, so we have the basic shape of how to parse our input data. Now we just need somewhere to put it, and for that we're obviously going to need an array. Now, as I understand it, C arrays are basically just pointers. The compiler keeps track of the size of the type being pointed to, so when you access an array you're literally just multiplying the index by the item size, adding that to the pointer that marks the start of the array, and praying the whole thing doesn't come crashing down around your ears.
I'm not sure of the appropriate way to create a new array, but I'm pretty sure `malloc` is going to have to be involved somehow, so let's just force the issue. `sizeof` tells me that an `int` (or `long`) has a size of 4 (so it's a 32-bit integer). I don't know exactly how many integers are in my puzzle input, but I know that it's 14,000 bytes long, and each line must consume at least 6 bytes (first number, three spaces, second number, newline), so the absolute upper bound on how many lines I'm dealing with is 2333.333... etc. Since each integer is 4 bytes that means each array will need to be just under 10 KB, but I think it's standard practice to allocate in powers of 2, so whatever, let's just do 16 KiB again.
Not gonna lie here, this one kind of kicked my butt. I would have expected the syntax for declaring an array to be `int[] arr = ...`, but apparently no, it's actually `int arr[] = ...`. Ok, that's fine, but `int arr[] = malloc(16384)` gets me `error: Invalid initializer`, without telling me what the initializer is.
Okay, fine. I'll look up the proper syntax for Part 2. For now let's just use pointers for everything. Whee! Who now? Safety? Never heard of her. BEHOLD!
```c
void* nums_l = malloc(16384);
void* nums_r = malloc(16384);
int nlines = 0;
while (1) {
char* line = strsep(&data, "\n");
int left = strtol(line, &line, 10);
int right = strtol(line, NULL, 10);
// if `strtol` fails, it apparently just returns 0, how helpful
if (left == 0 && right == 0) {
break;
}
int *addr_l = (int *)(nums_l + nlines * 4);
*addr_l = left;
int *addr_r = (int *)(nums_r + nlines * 4);
*addr_r = right;
nlines++;
}
```
Doesn't that just fill you with warm cozy feelings? No? Huh, must be just me then.
Oh yeah, I did end up figuring out how to do the `endptr` thing with `strtol`, it wasn't too hard.
### Sorting and finishing touches
Ok, next up, we have to sort these arrays. Is there even a sorting algorithm of any kind in the C standard library? I can't find anything promising from the autosuggestions that show up when I type `#include "`, and I don't feel like trying to implement quicksort without even knowing the proper syntax for declaring an array,<Sidenote>Or having a very deep understanding of quicksort, for that matter.</Sidenote> so I guess it's To The Googles We Must Go.
..._Gosh_ darn it, it's literally just called `qsort`. Ok, fine, at least I won't use Google for the _usage_.
You have to pass it a comparison function, which, sure, but that function accepts two arguments of type `const void *`, which makes the compiler scream at me when I attempt to a) pass it a function that takes integer pointers instead, or b) cast the void pointers to integers. Not sure of the proper way to do this so I'm just going to ignore the warnings for now because it _seems_ to work, and...
```c
int cmp(const int *a, const int *b) {
int _a = (int)(*a);
if (*a > *b) {
return 1;
}
else if (*a == *b) {
return 0;
}
else {
return -1;
}
}
// later, in main()
qsort(nums_l, nlines, 4, cmp);
qsort(nums_r, nlines, 4, cmp);
int sum = 0;
for (int i = 0; i < nlines - 1; i++) {
int *left = (int *)(nums_l + i * 4);
int *right = (int *)(nums_r + i * 4);
int diff = *left - *right;
if (diff < 0) {
diff = diff * -1;
}
sum += diff;
}
printf("%i", sum);
```
```
$ cc 01.c && ./a.out
>>> (compiler warnings)
2580759
```
Could it be? Is this it?
...nope. Knew it was too good to be true.
Wait, why am I using `nlines - 1` as my upper bound? I was trying to avoid an off-by-one error, because of course the "array" is "zero-indexed" (or would be if I were using arrays properly) and I didn't want to go past the end. But, of course, I forgot that `i < nlines` will _already_ stop the loop after the iteration where `i = 999`. Duh. That's not even a C thing, I could have made that mistake in Javascript. Golly. I guess my excuse is that I'm so busy focusing on How To C that I'm forgetting things I already knew?
Anyway, after correcting that error, my answer does in fact validate, so hooray! Part 1 complete!
Ok, before I go on to Part 2, I am _definitely_ looking up how to do arrays.
## Interlude: Arrays in C
Ok, so turns out there are fixed-size arrays (where the size is known at compile time), and there are dynamically-sized arrays, and they work a little differently. Fixed-sized arrays can be declared like this:
```c
int arr[] = {1, 2, 3, 4}; // array-literal syntax
int arr[100]; // declaring an array of a certain size, but uninitialized
```
Then you have dynamically-sized arrays, where the size of the array might not be known until runtime, so you have to use `malloc` (or `alloca` I guess, but you have to be real careful not to overflow your stack when you do that):
```c
int *arr = malloc(1000 * sizeof(int));
```
That's it, you're done. Apparently the fact that `arr` is declared as what looks like (to me, anyhow) _a pointer to an int_ is enough to tell the compiler, when accessed with square brackets, that this is actually a pointer to an _array_ of ints, and to multiply the index by the appropriate size (so I guess 4 in this case) to get the element at that array index.
Interestingly, with the above snippet, when I started accessing various indexes over 1000 to see what would happen, I got all the way to 32768 before it started to segfault.<Sidenote>The actual boundary is somewhere between 33000 and 34000, not sure where exactly because I got bored trying different indices.</Sidenote> I guess `malloc` doesn't even get out of bed for allocations less than 128 KiB?<Sidenote>Actually, what's probably happening is that `malloc` is requesting a bigger chunk _from the system_ in order to speed up any future allocations I might want to do. So if I were to call `malloc` again, it would just give me another chunk from that same region of memory. But of course C doesn't care what memory I access, it's only the system that enforces memory boundaries, and those are process-wide, so as long as I'm within that same chunk I'm fine. Just a guess, though.</Sidenote>
Armed with this new superpower, my full solution to Part 1 becomes:<Sidenote>I could (and probably should) also be using fixed-size arrays here, since it doesn't seem like there's any advantage to using `malloc`.</Sidenote>
```c
#include "stdio.h"
#include "stdlib.h"
#include "string.h"
int cmp(const int *a, const int *b) {
if (*a > *b) {
return 1;
}
else if (*a == *b) {
return 0;
}
else {
return -1;
}
}
int main() {
FILE *file = fopen("data/01.txt", "r");
char *data = (char *)malloc(16384);
size_t data_len = fread(data, 1, 16384, file);
int *nums_l = malloc(16384);
int *nums_r = malloc(16384);
int nlines = 0;
while (1) {
char* line = strsep(&data, "\n");
int left = strtol(line, &line, 10);
int right = strtol(line, NULL, 10);
// if `strtol` fails, it apparently just returns 0, how helpful
if (left == 0 && right == 0) {
break;
}
nums_l[nlines] = left;
nums_r[nlines] = right;
nlines++;
}
qsort(nums_l, nlines, 4, cmp);
qsort(nums_r, nlines, 4, cmp);
int sum = 0;
for (int i = 0; i < nlines; i++) {
int diff = nums_l[i] - nums_r[i];
if (diff < 0) {
diff = diff * -1;
}
sum += diff;
}
printf("%i", sum);
}
```
Still getting compiler warnings about `cmp` not matching the required signature, though. Maybe I'll figure that out for Part 2.<Sidenote>Later addendum: I did end up figuring this out. Short version, I was just forgetting that the arguments are pointers. Instead of casting to `int` I needed to cast to `int *`.</Sidenote>
## Part 2
Part 2 has us counting frequencies instead of doing one-for-one comparisons. For each integer in the left column, we need to multiply it by the number of times it occurs in the right column, then add all those products together.
Obviously the _right_ way to do this would be to count occurrences for every integer in the right column, store those counts in a hash table, and then use those counts as we work through the left column. But C doesn't have a native hash table, and I don't particularly feel like trying to implement one (although I'm sure I would learn a lot more about C that way). But you know what? C is fast, and our arrays of numbers here are only 4 KB. My CPU has **64** KiB of L1 cache, so I'm _pretty sure_ that I can just be super-duper naive about this and iterate over the _entirety_ of the right column for every value in the left column. Sure, it's O(N^2), but N in this case is 1000, and a million operations on data in L1 cache isn't going to take hardly any time at all. So let's give it a shot.
```c
int count_occurrences(int num, int arr[], int len) {
int total = 0;
for (int i = 0; i < len; i++) {
if (arr[i] == num) {
total++;
}
}
return total;
}
int part2(int nums_l[], int nums_r[], int len) {
int score = 0;
for (int i = 0; i < len; i++) {
score += nums_l[i] * count_occurrences(nums_l[i], nums_r, len);
}
return score;
}
int main() {
// ...
int solution_2 = part2(nums_l, nums_r, nlines);
printf("Part 2: %i\n", solution_2);
}
```
And what do you know? It works!
And it takes about 30ms to run. [Shlemiel the Painter](https://www.joelonsoftware.com/2001/12/11/back-to-basics/)? Never heard of him.
I was going to make impressed comments here about how fast C is, but then I decided to try it in Python, and it takes less than a second there too, so... you know. Day 1 is just easy, even for brute-force solutions.
## And That's It
Hey, that wasn't so bad! I'm sure it would have been a lot harder had I waited until one of the later days, but even so, I can kind of see where C lovers are coming from now. It's a little bit freeing to be able to just throw pointers around and cast types to other types because hey, they're all just bytes in the end. I'm sure if I tried to do anything really complex in C, or read someone else's code, it would start to fall apart pretty quickly, but for quick-and-dirty one-off stuff--it's actually pretty good! Plus all the *nix system interfaces are C-native, so next time I'm fiddling with something at the system level I might just whip out the ol' cc and start throwing stuff at the wall to see what sticks.
By the way, if you'd like to hear more of my thoughts on C, I expect to be invited to speak at at least three major C conferences next year<Sidenote>_Are_ there even three major C conferences? I know there are lots for C++, but C++ is a lot more complex than C.</Sidenote> since I am now a Certified Expert Practitioner, and my schedule is filling up fast. Talk to my secretary before it's too late!


@@ -0,0 +1,313 @@
---
title: 'Advent of Languages 2024, Day 2: C++'
date: 2024-12-03
---
<script>import Sidenote from '$lib/Sidenote.svelte';</script>
Well, [Day 1](/advent-of-languages-2024-01) went swimmingly, more or less, so let's push on to Day 2: C++! C++, of course, is famous for being what happens when you take C and answer "Yes" to every question that starts "Can I have" and ends with a language feature. Yes, you can have classes and inheritance. Yes, even multiple inheritance. Yes, you can have constructors and destructors. Yes, you can have iterators (sorta).<Sidenote>More on that later.</Sidenote> Yes, you can have metaprogramming. Yes, you can have move semantics. Yes, you can also have raw pointers, why not?
It's ubiquitous in any context requiring a) high performance and b) a large codebase, such as browsers and game engines. It has a reputation for nightmarish complexity matched only by certain legal codes and the Third Edition of Dungeons & Dragons.<Sidenote>I'd be willing to bet you dollars to donuts that a non-trivial fraction of advanced C++ practitioners are also advanced D&D practitioners.</Sidenote> If using C is like firing a gun with a strong tendency to droop toward your feet any time your focus slips, then using C++ is like firing a Rube Goldberg machine composed of a multitude of guns which may or may not be pointed at your feet at any given time, and the only way to know is to pull the trigger.
How better, then, to spend Day 2 of Advent of Code?
## Will It ~~Blend~~ Compile?
I seem to recall hearing somewhere that C++ is a superset of C, so let's just start with the same hello-world as last time:
```c
#include "stdio.h"
int main() {
printf("hello, world!");
}
```
```
$ cpp 02.cpp
>>> # 0 "02.cpp"
# 0 "<built-in>"
# 0 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 0 "<command-line>" 2
# 1 "02.cpp"
# 1 "/usr/include/stdio.h" 1 3 4
# 27 "/usr/include/stdio.h" 3 4
(...much more in this vein)
```
Oh. Oh dear. That's not what I was hoping for at all.
So it seems that `cpp` doesn't produce executable code as its immediate artifact the way `cc` does. Actually, it looks kind of like it just barfs out C (non-++) code, and then you have to compile that with a separate C compiler? Let's try that.
```
$ cpp 02.cpp | cc
>>> cc: error: -E or -x required when input is from standard input
```
Hmm, well, that's progress, I guess? According to `cc --help`, `-E` tells it to "Preprocess only; do not compile, assemble or link", so that's not what I'm looking for. But wait, what's this?
```
-x <language> Specify the language of the following input files.
Permissible languages include: c c++ assembler none
'none' means revert to the default behavior of
guessing the language based on the file's extension.
```
Oho! Wait, does that mean I can just--
```
$ cc 02.cpp && ./a.out
>>> hello, world!
```
Well. That was a lot less complicated than I expected.<Sidenote>You may be thinking, of course it worked, you just fed plain C to a C compiler and it compiled, what's the big deal. I'm _pretty_ sure, though, that the `.cpp` extension does in fact tell the compiler to compile this _as C++_, if the help message is to be believed. The subsequent error when I try to use some actual C++ constructs has to do with whether and how much of the standard library is included by default--apparently there is a way to make plain `cc` work with `std::cout` and so on as well, it's just a little more involved.</Sidenote> I've got to say, I was expecting hours of frustration just getting the basic compiler toolchains to work with these OG languages like C and C++, but so far it's been surprisingly simple. I'm sure all of that goes right out the window the moment you need to make use of third-party code (beyond glibc that is), but for straightforward write-everything-yourself-the-old-fashioned-way work it's refreshingly simple.
Of course, after a little looking around I see that this isn't the idiomatic way of outputting text in C++. That would be something more like this:
```cpp
#include <iostream>

int main() {
    std::cout << "hello, world!";
}
```
```
$ cc 02.cpp && ./a.out
>>> /usr/bin/ld: /tmp/ccZD7l7S.o: warning: relocation against `_ZSt4cout' in read-only section `.text'
/usr/bin/ld: /tmp/ccZD7l7S.o: in function `main':
02.cpp:(.text+0x15): undefined reference to `std::cout'
/usr/bin/ld: 02.cpp:(.text+0x1d): undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*)'
/usr/bin/ld: /tmp/ccZD7l7S.o: in function `__static_initialization_and_destruction_0(int, int)':
02.cpp:(.text+0x54): undefined reference to `std::ios_base::Init::Init()'
/usr/bin/ld: 02.cpp:(.text+0x6f): undefined reference to `std::ios_base::Init::~Init()'
/usr/bin/ld: warning: creating DT_TEXTREL in a PIE
collect2: error: ld returned 1 exit status
```
Oh. Well, that's... informative. Or would be, if I knew what to look at.
I think the money line is this: `undefined reference to std::cout`, but I'm not sure what it means. The language server seemed to think that including `iostream` would make `std::cout` available.
Thankfully the ever-helpful Stack Overflow [came to the rescue](https://stackoverflow.com/a/28236905) and I was able to get it working by using `g++` rather than `cc`. Ok, I take back some of what I said about the simplicity of C-language toolchains.
## Day 2, Part 1
Ok, so let's look at [the actual puzzle](https://adventofcode.com/2024/day/2).
So we've got a file full of lines of space-separated numbers (again), but this time the lines are of variable length. Our job is, for every line, to determine whether or not the numbers as read left to right meet certain criteria. They have to be either all increasing or all decreasing, and they have to change by at least 1 but no more than 3 from one to the next.
Now, I know C++ has a much richer standard library than plain C, starting with `std::string`, so let's see what we can make it do. I'll start by just counting lines, to make sure I've got the whole reading-from-file thing working:
```cpp
#include <fstream>
#include <iostream>
#include <string>

using namespace std;

int main() {
    ifstream file("data/02.txt");
    string line;
    int count = 0;
    while (getline(file, line)) {
        count++;
    }
    cout << count;
}
```
```
$ g++ 02.cpp && ./a.out
>>> 0
```
Oh, uh. Hmm.
Wait, I never actually downloaded my input for Day 2. `data/02.txt` doesn't actually exist. Apparently this isn't a problem? I guess I can see it being ok to construct an `ifstream` that points to a file that doesn't exist (after all, you might be about to _create_ said file) but I'm a little confused that it will happily "read" from a non-existent file like this. If the file were present, but empty, it would presumably do the same thing, so I guess... non-extant and empty are considered equivalent? That's convenient for Redis, but I don't know that I approve of it in a language context.
Anyway, downloading the data and running the program again prints 1000, which seems right, so I think we're cooking with gas now.
### Interlude: Fantastic Files and How to Read Them
(I really need to find another joke, this one's wearing a bit thin.)
If you were wondering, by the way,<Sidenote>I was.</Sidenote> [a reference I found](https://cplusplus.com/reference/fstream/ifstream/) says that "Objects of this class maintain a `filebuf` object as their internal stream buffer, which performs input/output operations on the file they are associated with (if any)." So my guess is that we aren't actually doing 1000 separate reads from disk here, we're probably doing a few more reasonably-sized reads and buffering those in memory.
It does bug me a little bit that I'm copying each line for every iteration, but after some [tentative looking](https://brevzin.github.io/c++/2020/07/06/split-view/) for some equivalent of Rust's "iterate over string as a series of `&str`s" functionality I'm sufficiently cowed<Sidenote>Apparently C++ has a pipe operator? Who knew?</Sidenote> to just stick with the simple, obvious approach.
One thing's for sure in C++ world: Given a cat, there are guaranteed to be quite a few different ways to skin it.
### The rest of the owl
Anyway, let's do this.
```cpp
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

using namespace std;

vector<int> parse_line(string line) {
    int start = 0;
    vector<int> result;
    while (start < line.length()) {
        int end = line.find(" ", start);
        if (end == -1) {
            break;
        }
        string word = line.substr(start, end - start);
        int n = stoi(word);
        result.push_back(n);
        start = end + 1;
    }
    return result;
}

bool is_valid(vector<int> report) {
    int *prev_diff = nullptr;
    for (int i = 1; i < report.size(); i++) {
        int diff = report[i] - report[i - 1];
        if (diff < -3 || diff == 0 || diff > 3) {
            return false;
        }
        if (prev_diff == nullptr) {
            *prev_diff = diff;
            continue;
        }
        if ((diff > 0 && *prev_diff < 0) || (diff < 0 && *prev_diff > 0)) {
            return false;
        }
        *prev_diff = diff;
    }
    return true;
}

int main() {
    ifstream file("data/02.txt");
    string line;
    int count = 0;
    while (getline(file, line)) {
        auto report = parse_line(line);
        if (is_valid(report)) {
            count++;
        }
    }
    cout << count;
}
```
And...<Sidenote>You may notice that my earlier concerns about unnecessary copying have been replaced with a cavalier disregard for memory allocations in every context. I subscribe to the ancient wisdom of "if you can't solve a problem, create another worse problem somewhere else and no one will care any more."</Sidenote>
```
$ g++ 02.cpp && ./a.out
>>> Segmentation fault (core dumped)
```
Oh.
Right, ok. I was trying to be fancy and use a pointer-to-an-int as sort of a poor man's `optional<T>`, mostly because I couldn't figure out how to instantiate an `optional<T>`. But of course, I can't just declare a pointer to an int as a null pointer, then do `*prev_diff = diff`, because that pointer still has to point _somewhere_, after all.
I could declare an int, then a _separate_ pointer which is _initially_ null, but then becomes a pointer to it later, but at this point I realized there's a much simpler solution:
```cpp
bool is_valid(vector<int> report) {
    int prev_diff = 0;
    for (int i = 1; i < report.size(); i++) {
        int diff = report[i] - report[i - 1];
        if (diff < -3 || diff == 0 || diff > 3) {
            return false;
        }
        // on the first iteration, we can't compare to the previous difference
        if (i == 1) {
            prev_diff = diff;
            continue;
        }
        if ((diff > 0 && prev_diff < 0) || (diff < 0 && prev_diff > 0)) {
            return false;
        }
        prev_diff = diff;
    }
    return true;
}
```
This at least doesn't segfault, but it also doesn't give me the right answer.
Some debugging, a little frustration, and a few minutes later, though, it all works,<Sidenote>It was the parse function. I was breaking the loop too soon, so I was failing to parse the last integer from each line.</Sidenote> so it's time to move on to part 2!
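For the record, the `optional<T>` I couldn't figure out how to instantiate turns out to be straightforward in C++17: a default-constructed `std::optional<int>` is simply empty. The version I was originally reaching for might have looked like this (a sketch, not the code I actually ran):

```cpp
#include <optional>
#include <vector>

// the same validity check, with std::optional standing in for "no previous diff yet"
bool is_valid(const std::vector<int>& report) {
    std::optional<int> prev_diff; // starts out empty; no pointer tricks required
    for (size_t i = 1; i < report.size(); i++) {
        int diff = report[i] - report[i - 1];
        if (diff < -3 || diff == 0 || diff > 3) return false;
        // both diffs are nonzero at this point, so differing signs means a direction change
        if (prev_diff && (diff > 0) != (*prev_diff > 0)) return false;
        prev_diff = diff;
    }
    return true;
}
```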
## Part 2
In a pretty typical Advent of Code escalation, we now have to determine whether any of the currently-invalid lines would become valid with the removal of any one number. Now, I'm sure there are more elegant ways to do this, but...
```cpp
    while (getline(file, line)) {
        auto report = parse_line(line);
        if (is_valid(report)) {
            count_part1++;
            count_part2++;
        }
        else {
            for (int i = 0; i < report.size(); i++) {
                int n = report[i];
                report.erase(report.begin() + i);
                if (is_valid(report)) {
                    count_part2++;
                    break;
                }
                report.insert(report.begin() + i, n);
            }
        }
    }
    cout << "Part 1: " << count_part1 << "\n";
    cout << "Part 2: " << count_part2 << "\n";
}
```
The only weird thing here, once again [solved with the help of Stack Overflow](https://stackoverflow.com/questions/875103/how-do-i-erase-an-element-from-stdvector-by-index), was how the `erase` and `insert` methods for a vector expect not plain ol' integers but a `const_iterator`, which apparently is some sort of opaque type representing an index into a container? It's certainly not an "iterator" in the sense I'm familiar with, which is a state machine which successively yields values from some collection (or from some other iterator).
I'm just not sure why it needs to exist. The informational materials I can find [talk about](https://home.csulb.edu/~pnguyen/cecs282/lecnotes/iterators.pdf) how this is much more convenient than using integers, because look at this:
```cpp
for (j = 0; j < 3; ++j) {
    ...
}
```
Gack! Ew! Horrible! Who could possibly countenance such an unmaintainable pile of crap!
On the other hand, with _iterators_:
```cpp
for (i = v.begin(); i != v.end(); ++i) {
    ...
}
```
Joy! Bliss! Worlds of pure contentment and sensible, consistent programming practices!
Based on [further research](https://stackoverflow.com/questions/131241/why-use-iterators-instead-of-array-indices) it seems like iterators are essentially the C++ answer to the standardized iteration interfaces found in languages like Python, and that have since been adopted by virtually every language under the sun because they're hella convenient. In most languages, though, that takes the form of essentially a `foreach` loop, which is far and away (in my opinion) the most sensible way of approaching iteration. C++ just had to be different, I guess.<Sidenote>But never fear, C++ _also_ has a `foreach` loop!</Sidenote>
I should probably hold my criticism, though. After all, I've been using this language for less than 24 hours, whereas the C++ standards committee _presumably_ has a little more experience than that. And I'm sure the C++ standards committee has never made a bad decision, so I must just be failing to appreciate the depth and perspicacity of their design choices.
Anyway this all works now, so I guess that's Day 2 completed. Join us next time when we take on the great-granddad of all systems languages, **assembly**!
Just kidding, I'm not doing assembly. Not yet, anyway. Maybe next year.

---
title: 'Advent of Languages 2024, Day 3: Forth'
date: 2024-12-07
---
<script>import Sidenote from '$lib/Sidenote.svelte';</script>
My original plan was to stick with the "systems language" theme for Day 3 and go with Zig, but the more I thought about it the more I started to think, you know, Zig is nice and clean and modern. It hasn't had time to get all warty and ugly with bolted-on afterthoughts and contentious features that divide the community into warring tribes, and it has things like common datastructures in its standard library. I should probably save it for one of the later days, when I anticipate spending more time fighting _the problem_ and less time fighting _the language_. Also I looked at Day 3 and it (the first part at least) looked very simple, which makes me even less inclined to use a big honkin' heavy-duty language like Zig. Instead, today I'm going to take a look at Forth!<Sidenote>I know, I know, I would have been able to make all kinds of terrible jokes had I just waited for the _forth_ day of the AoC, but hey, we can't all get what we want.</Sidenote>
## May the Forth be with you
Forth is an old language, older even than C (by a few years at least), so you know right away it's going to be lacking a lot of modern conveniences like local variables or even, you know, structs. With named fields and all? Yep, not here.<Sidenote>I later discovered that this is implementation-specific--some Forths _do_ have structs, but others don't.</Sidenote> Forth is a [stack-oriented](https://en.wikipedia.org/wiki/Stack-oriented_programming) language, which I _think_ is a different kind of stack from the "stack/heap" you deal with in systems languages, although I might be wrong about that.
It's also _aggressively_ simple, both syntactically and conceptually. Syntactically, you could fit the grammar on the back of your hand. It's so slim that even _comparison operators_ like `<` and `=` (that's a comparison operator in Forth, like in SQL) are implemented as part of the _standard library_. Conceptually it's (if anything) even simpler.<Sidenote>Note that "simple" is not the same thing as "easy". C's memory model is simple: allocate memory, free memory. Don't free more than once, and don't free if it's still in use. Done! That doesn't stop it from being the root cause of some of the worst [security bugs](https://heartbleed.com/) of all time, or [massive worldwide outages](https://en.wikipedia.org/wiki/2024_CrowdStrike-related_IT_outages) affecting everything from banks to airlines.</Sidenote> There's a stack, and you put things onto it, and you take them off. Done.<Sidenote>Ok, ok, it's not _quite_ that simple. There are branching constructs, for instance (loops and conditionals), so not _everything_ can be modeled as pure stack operations. But it sure is a _lot_ simpler than most languages.</Sidenote>
Forth is--well, I think it's a bit of a stretch to describe it as "common" in any circumstance, but perhaps we can say "best-represented" in embedded systems contexts, where resources are often very heavily constrained and you're typically operating very close to the hardware. In fact it thrives in an environment where there's just a single contiguous block of not-very-much memory, because of the whole stack paradigm.
So of course, I'm going to use it for Advent of Code, where I have 32GB of RAM and a full-fat multi-process OS with extremely sophisticated thread scheduling, memory management, and so on. I wonder about myself sometimes.
## Where get?
Like a lot of older languages, there isn't a single official implementation of Forth which is your one-stop shop for compilers, linters, formatters, editor plugins, language servers, and all the various and sundry paraphernalia that have become de rigueur for a new language these days.<Sidenote>I don't know for sure what drives this difference, but my guess is that the Internet has made it much easier to coordinate something like programming-language development across a widely-separated group of people and organizations, so people naturally end up pooling their resources these days and all contributing to the One True Implementation.</Sidenote> It has a [standard](https://forth-standard.org/), and there are various implementations of that standard (scroll down to the "Systems" section on that page). [Gforth](https://gforth.org/) looks like the easiest to get up and running in, so let's give that a try.
The front page of the Gforth website has instructions for adding the Debian repository, but unfortunately that repository doesn't seem to be online anymore, so let's try building it from source, for which there are also instructions on the front page. The Git repository at least is hosted at a subdomain of `gnu.org`, so it might still be online?
_a brief interlude_
Okay, I'm back, and what do you know? It seems to have worked. There was a warning about `Swig without -forth feature, library interfaces will not be generated.`, but hopefully that's not important.
Okay, so we're good to go, right? At least, I can run `gforth` now and get a REPL, so I think we're good.
Wait a second, I forgot to run `make install`. But it's on my PATH already! What?
```
$ ls -l /usr/bin/gforth
>>> lrwxrwxrwx 1 root root 12 Sep 10 2021 /usr/bin/gforth -> gforth-0.7.3
```
...It was _already installed_?!
Well, paint me purple and call me a grape if that isn't the most embarrassing thing I've done all day.
## The real treasure was the friends we made along the way
I hope you enjoyed this brief exercise in futility. Let's just move on and pretend it never happened, hmmm?
Let's start Forth<Sidenote>Ok I'll stop now, I promise.</Sidenote> the same place we started C and C++, with hello-world. Unfortunately `highlight.js` has no syntax for Forth, so you'll just have to imagine the colors this time.
```fs
." hello world "
```
```
$ gforth 03.fs
>>> hello world Gforth 0.7.3, Copyright (C) 1995-2008 Free Software Foundation, Inc.
Gforth comes with ABSOLUTELY NO WARRANTY; for details type `license'
Type `bye' to exit
```
Uh. Okay, it looks like Forth doesn't exit unless you explicitly end your program with the `bye` command, it just does whatever you specified and then dumps you into a REPL so you can keep going, if you want. Hey, I wonder if giving it a non-interactive stdin would make a difference?
```
$ gforth </dev/null 03.fs
>>> hello world Gforth 0.7.3, Copyright (C) 1995-2008 Free Software Foundation, Inc.
Gforth comes with ABSOLUTELY NO WARRANTY; for details type `license'
Type `bye' to exit
```
Nope. I mean, it doesn't sit there waiting for input, of course. But it does print the start message. No, the only way to keep it from doing that is to put `bye` at the end of the script. Fascinating.
You may be struck by the spaces _inside_ the quote marks here: this is necessary. Without them, for example, `."hello` would be interpreted as a single token, and since there is no such "word" (the Forth term for functions, basically) defined it would error.<Sidenote>Actually, I discovered later that only the space after the _first_ quote mark is necessary. The second one is superfluous, and in fact gets interpreted as part of the string.</Sidenote>
## Day 3
Okay, so today we have a very minimal parsing challenge. We're given a long string of characters (just a single line this time) that contains a bunch of garbage and also some valid instructions, we have to parse out the instructions and execute them. There's only one instruction, `mul(x,y)` where x and y are numbers. Everything else is to be discarded, for Part 1 at least. I have a feeling that for Part 2 we may find out some of that garbage wasn't quite so garbacious after all.
This would be absolutely trivial with even the most basic of regular-expression libraries, but we're in Forth land here, so--wait. Maybe I should check some things before I make confident assertions.
...Yep, Gforth [does in fact](https://gforth.org/manual/Regular-Expressions.html) include regular expressions. Using them looks [pretty wild](https://git.savannah.gnu.org/cgit/gforth.git/tree/test/regexp-test.fs), though. I think I need to go read some more tutorials.
...
I'm back! It's been two days. Forth is confusing!
### Data loadin'
I've figured out how to at least open the file and read the data into memory, though, which is Great Success in my book. Here, let me show you:
```fs
create data 20000 allot
variable fileid
s" data/03.txt" r/o open-file
drop
fileid !
variable data-len
data 20000 fileid @ read-file
drop
data-len !
```
After executing this with `gforth`, I am dumped into an interactive Forth REPL, as before, but now I have a 20K buffer at the address specified by `data` containing the contents of my puzzle input, which is great!
Forth is _weird_ to anybody coming from the more standard type of language that has, you know, syntactic function calls and operators and all of that fancy stuff. None of that here! Remember how earlier I said that comparison operators were part of the standard library? Well yeah, it turns out that Forth doesn't even _have_ operators the same way other languages I've used do.<Sidenote>I'm aware that a lot of languages use operators as syntax sugar for functions defined somewhere else, but syntax sugar is just that: _syntax_. Forth doesn't make even a _syntactic_ distinction between operators and regular function calls. It's the equivalent of calling `add(a, b)` every time you want to add some numbers.</Sidenote> There are symbols like `>` and `=`, but really those are just conveniently-named "words" (again, the Forth name for a function) which operate on values from the stack just like other words with longer names. Crazy, huh?
Anyway, I want to go through this line-by-line because it's so far afield. Like I said, it took me 2 days to get this far. Note that because of the whole stack-oriented thing, Forth is syntactically backwards for most operations - you put the arguments _first_, thus putting them on the stack, then call the operation, which takes them off the stack and leaves its result (if any) on the stack. So to add 6 and 7 you would do `6 7 +`, which results in a stack containing 13. Wonky, I know.
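That postfix style is just reverse Polish notation, and mimicking it in a conventional language makes the mechanics obvious: operands push, operators pop and push. A toy evaluator in C++ (integers and `+`/`*` only, nothing like a real Forth):

```cpp
#include <sstream>
#include <stack>
#include <string>

// evaluate a postfix expression like "6 7 +" using an explicit stack
int eval_postfix(const std::string& expr) {
    std::stack<int> s;
    std::istringstream in(expr);
    std::string tok;
    while (in >> tok) {
        if (tok == "+" || tok == "*") {
            int b = s.top(); s.pop(); // operands come off the stack...
            int a = s.top(); s.pop();
            s.push(tok == "+" ? a + b : a * b); // ...and the result goes back on
        } else {
            s.push(std::stoi(tok)); // numbers just get pushed
        }
    }
    return s.top();
}
```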
* `create data 20000 allot`--Ok, I'm not actually 100% sure what this part is doing, I stole it from the gforth [tutorial](https://gforth.org/manual/Files-Tutorial.html#Open-file-for-input) on files, which is... minimal, shall we say? I think what's happening is that we're allocating memory, but we're "stack-allocating" (In the C sense, not the Forth sense). `create data` is actually the rare Forth construct that _isn't_ backwards, I think because it's a _compile-time_ operation rather than a runtime operation: we are `create`-ing a word, `data`, whose operation is simply to return its own address. Then we are `allot`-ing 20K bytes, starting _immediately after that address_, I think? Anyway, the net result is that we end up with 20K of writable memory at the address returned by `data`, which is good enough for me.
* `variable fileid`--Another non-backwards construct here, `variable`. Again, I think this is because it creates a new _word_, which makes it a compile-time operation. I'm pretty sure `variable` is actually just a layer on top of `create ... allot`, because it does basically the same thing, it just only reserves enough space for a single integer. (Forth's native integer type is a signed 32-bit integer, by the way.)
* `s" data/03.txt" r/o open-file`--Now we're cooking with gas! `s" ..."` creates a string somewhere in memory (I think it actually allocates, under the hood) and puts the address and length of that string on the stack. `r/o` is just a convenience word that puts `0` on the stack; it represents the mode in which we're opening the file. `open-file` opens the file, using three values from the stack: 1) the address of the string containing the name of the file, 2) the length of that string, and 3) the mode. It returns two numbers to the stack, 1) the "file id" and 2) a status code.
* `drop`--I'm going to ignore the status code, so I `drop` it from the stack.
* `fileid !`--At this point we have the file ID on the stack, but we want to save it for future reference, so we store it at the address returned by `fileid`. The `!` word is just a dumb "store" command, it takes a value and an address on the stack and stores the value at that address. The reciprocal operation is `@`, which we'll see in a moment.
* `variable data-len`--Create a variable named `data-len` (again, just a word named `data-len` that returns an address where you can write data)
* `data 20000 fileid @ read-file`--This is the goods! We are reading (up to) 20K bytes from the file whose ID is stored in the variable `fileid` into the region of memory starting at address `data`. Note the `fileid @` bit, I got hung up here for the _longest_ time because I kept trying to just use `fileid` by itself. But that meant that I was giving `read-file` the _address of the variable itself_, not the value contained _at that variable_. In C terms, `@` is like dereferencing a pointer. (In fact, under the hood that's probably exactly what's going on, I think gforth is implemented largely in C.)
* `drop`--That last operation returned the number of bytes read, which we want, and a status code, which we don't. So we drop the status code.
* `data-len !`--Store the number of bytes read in the variable `data-len`.
At this point we have the contents of our file in memory, hooray! Boy, Forth is low-level. In some ways it's even lower-level than C, which is mind-boggling to me.
### Wait, regular expressions solved a problem?
Ok, so now we have our input, we need to process it.
My first attempt at this was an extremely laborious manual scanning of the string, attempting to parse out the digits from valid `mul(X,Y)` sequences, multiply them, keep a running total, etc. I got lost in all the complexity pretty quickly, so eventually I decided to just knuckle down and figure out the gforth regular expression library. And you know what? It wasn't too bad after all! Here's the regex I ended up coming up with:
```fs
require regexp.fs
: mul-instr ( addr u -- flag )
    (( =" mul(" \( {++ \d ++} \) ` , \( {++ \d ++} \) ` ) )) ;
```
`: name ... ;` is the syntax for defining a new word, by the way. In this case, the body of the word _is_ the regex. I'm pretty sure this is required, because it's doing its magic at compile time, again--at least, if I try using a "bare" regex in the REPL, outside of a word, I get nothing. This seems to be the deal for most Forth constructs that have both a start and an end delimiter, like if/then, do/loop, and so on. The exceptions are of course the define-a-word syntax itself, and also string literals for some reason. Not sure why that is, but I'm guessing they're a special case because inline string literals are just _so_ useful.
Anyway, to elaborate a bit on the regex:
* `((` and `))` mark the start and end of the regex.
* `=" mul("` means "match the literal string `mul(`".
* `\(` and `\)` delimit a capturing group.
* `{++` and `++}` delimit a sequence repeated one-or-more times (like `+` in most regex syntaxes).
* A backtick followed by a single character means to match a single occurrence of that character.
So this looks for the literal string `mul(`, followed by 1 or more digits, followed by the character `,` followed by 1 or more digits, followed by the character `)`. Not too bad, once you get your head around it.
Oh, and referencing captured groups is just `\1`, `\2`, `\3` etc. It seems like you can do this at any time after the regex has matched? I guess there's global state going on inside the regex library somewhere.
Anyway, here's my full solution to Day 3 Part 1!
```fs
create data 20000 allot         \ create a 20K buffer
variable fileid                 \ create a variable `fileid`
s" data/03.txt" r/o open-file   \ open data file
drop                            \ drop the top value from the stack, it's just a status code and we aren't going to bother handling errors
fileid !                        \ save the file id to the variable

variable data-len
data 20000 fileid @ read-file   \ read up to 20k bytes from the file
drop                            \ drop the status code, again
data-len !                      \ store the number of bytes read in `data-len`

: data-str ( -- addr u )
    \ convenience function for putting the address and length of data on the stack
    data data-len @ ;

: chop-prefix ( addr u u2 -- addr2 u2 )
    \ chop the first `u2` bytes off the beginning of the string at `addr u`
    tuck    \ duplicate `u2` and store it "under" the length of the string
    -       \ subtract `u2` from the length of the string
    -rot    \ stick the new string length underneath the start pointer
    +       \ increment the start pointer by `u2`
    swap    \ put them back in the right order
;

require regexp.fs
: mul-instr ( addr u -- flag )
    \ match a string of the form `mul(x,y)` where x and y are integers and capture those integers
    (( =" mul(" \( {++ \d ++} \) ` , \( {++ \d ++} \) ` ) )) ;

: get-product ( addr u -- u2 )
    mul-instr               \ match the string from `addr u` against the above regex
    if                      \ if the regex matches, then:
        \1 s>number drop    \ convert the first capture from string to number, drop the status code (we already know it will succeed)
        \2 s>number drop    \ convert the second capture from string to number, drop the status code
        *                   \ multiply, and leave the answer on the stack
    else
        0                   \ otherwise, leave 0 on the stack
    then
;

variable result     \ initialize `result` with 0
0 result !

: sum-mul-instrs ( addr u -- u2 )
    begin                   \ start looping
        s" mul(" search     \ search for the string "mul("
        if                  \ if successful, top 2 values on stack will be start address of "mul(" and remainder of original string
            2dup            \ duplicate address and remaining length of string
            get-product     \ pass those to get-product above
            result @ +      \ load `result` and add to product
            result !        \ store this new value back in `result`
            4 chop-prefix   \ bump the start of the string by 4 characters
        else                \ if not successful, we have finished scanning through the string
            2drop           \ dump the string address and length
            result @ exit   \ put the result on top of the stack and return to caller
        then
    again
;

data-str sum-mul-instrs .
bye
```
Not exactly terse, but honestly it could have been a lot worse. And my extremely heavy use of comments makes it look bigger than it really is.
The `( addr u -- u2 )` bits are comments, by the way. By convention when you define a word that either expects things on the stack or leaves things on the stack, you put in a comment describing the state of the stack in `( before -- after )` format.
## Part 2
Onward and upward! In Part 2 we discover that yes, in fact, not quite all of the "garbage" instructions were truly garbage. Specifically, there are two instructions, `do()` and `don't()` which enable and disable the `mul(x,y)` instruction. So once you hit a `don't()`, you ignore all `mul(x,y)` instructions, no matter how well-formed, until you hit a `do()` again.
Easy enough, but I'm going to have to change some things. Right now I'm using the `search` word to find the start index of every possible `mul(` candidate, then using the regex library to both parse and validate at the same time. Obviously I can't do that any more, since now I have to search for any of three possible constructs rather than just one.
I spent quite a while trying to figure out how to get the regex library to spit out the address of the match it finds, but to no avail. There are some interesting hints about a "loop through all matches and execute arbitrary code on each match" functionality that I could _probably_ have shoehorned into what I needed here, but in the end I decided to just scan through the string the old-fashioned way and test for plain string equality at each position. In the end it came out looking like this:
```fs
...
variable enabled
-1 enabled !    \ idiomatically -1 is "true" (really anything other than 0 is true)

: handle-mul ( addr u -- )
    get-product     \ pass those to get-product above
    result @ +      \ load `result` and add to product
    result !        \ store this new value back in `result`
;

: sum-mul-instrs ( addr u -- u2 )
    \ we want to loop from addr to (addr + u - 8), because 8 is the min length of a valid mul(x,y) instruction
    \ we also want to have addr + u on the top of the stack when we enter the loop,
    \ so that we can use that to compute the remaining length of the string from our current address
    over +      \ copy addr to top of stack and add to length
    dup 8 -     \ duplicate, then subtract 8 from the top value
    rot         \ move original addr to top of stack
    ( stack at this point: [ addr + u, addr + u - 8, addr ] )
    ( i.e. [ end-of-string, loop-limit, loop-start ] )
    do                          \ start looping
        I 4 s" do()" str=       \ compare the length-4 substring starting at I to the string "do()"
        if                      \ if valid do() instruction,
            -1 enabled !        \ set enabled=true
        then
        I 7 s" don't()" str=    \ compare length-7 substring to "don't()"
        if                      \ if valid don't() instruction,
            0 enabled !         \ set enabled=false
        then
        I 4 s" mul(" str=       \ compare length-4 substring to "mul("
        enabled @ and           \ combine with current value of `enabled`
        if                      \ if a candidate for `mul(x,y)` instruction, and enabled=true, then
            dup I -             \ subtract current string pointer from end-of-string pointer to get length of remaining string
            I swap handle-mul   \ put current pointer onto stack again, swap so stack is ( addr len ), and handle
        then
    loop
    drop        \ get rid of end-of-string pointer
    result @    \ return value of result
;

s" data/03.txt" load-data
sum-mul-instrs .
bye
```
Oh yeah, I also decided to extract `load-data` into a word and pass in the filename, to make it easier to switch between test and live data. The whole thing is [here](https://git.jfmonty2.com/jfmonty2/advent/src/branch/master/2024/03.fs) if you're interested.
I'm actually surprised by how satisfied I am with this in the end. It's not exactly what I would call pretty, but it's reasonably comprehensible, and I feel like I'm starting to get the hang of this stack-manipulation business. It could definitely be more efficient - I'm still looping over every index in the string, even when I know I could skip some because, say, they already validated as being the start of a known instruction. Also, I should really avoid testing for subsequent instructions every time one of the prior ones validates. I just couldn't bring myself to do nested if statements because they [look like this](https://www.forth.com/starting-forth/4-conditional-if-then-statements/#h-nested-if-then-statements), which is horrible.
## Nobody asked for my opinion, but here it is anyway
So what do I think of Forth? It's certainly interesting! It's a very different way of approaching programming than I've encountered before. But I don't know that I'd want to use it for a serious project, because it's pretty lacking in the code-organization department.
A lot of older languages have shortcomings with regard to things like namespaces, good facilities for spreading code across multiple files (and complex file hierarchies), and tools for building on _other_ people's code. Forth has those problems too, but they aren't really fundamental. It's easy to imagine a version of Forth with some namespacing, a package manager, etc.<Sidenote>In fact, this is basically what [Factor](https://factorcode.org/) looks to be.</Sidenote> No, what worries me about Forth from a code-organization standpoint is _the stack itself_.
More specifically, it's the fact that there's only _one_ stack, and it's shared between every code-unit in the entire program. This might be ok if it were used exclusively for passing data around between different code units, but it isn't. From my limited experience, I get the impression that the stack is expected to be used for _most_ values. Sure, there are variables, but the amount of ceremony involved in using them makes it feel like Forth doesn't really want you to use them heavily. Plus, they're all global anyway, so they're hardly a help when it comes to code organization.
The problem with the stack is, as I said, that it's shared, and that everybody has it. That means that if you're storing something on the stack, and you invoke a word, _it might just mess with that something on the stack_. Sure, maybe it isn't _supposed_ to, but you know, bugs happen. The history of computer science is arguably a long and mostly-fruitless quest in search of _programming paradigms that result in fewer bugs_. "Just trust me bro" is not a useful approach to code encapsulation in complex projects.
Sure, you may say, but a function in any language can return incorrect values, so what's the big deal? Yes, that's true, but in most languages a function can't return _too many_ values, or too few. If it does, that's either a compile-time error or an _immediate_ runtime error, meaning the error occurs _at the point the misbehavior occurred._ This is critical, because errors get harder and harder to debug as their "skid distance" increases--i.e. the distance between the root cause of an error and the point at which that error actually manifests.
Even more worrisome, as I alluded to previously, is the fact that the stack makes it possible for words to mess with things that are (from the programmer's point of view) _totally unrelated to them_. You could end up with a situation like this:
```fs
put-some-memory-address-on-the-stack
17 do-something-unrelated
\ an error in do-something-unrelated causes it to delete the memory address from the stack
do-more-things
...
attempt-to-access-missing-address \ ERROR! What happened? Why isn't that address where I expected it to be?
```
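The same scenario can be made runnable as a toy model, with a Python list standing in for the Forth data stack. (All the word names here are made up for illustration; this is just the snippet above, executable.)

```python
# Toy model of the Forth data stack. A buggy "word" pops one value too
# many, and the error only surfaces much later, far from the real cause.
stack = []

def put_some_memory_address_on_the_stack():
    stack.append(0xDEADBEEF)   # pretend this is an address we need later

def do_something_unrelated(n):
    stack.append(n)
    stack.pop()
    stack.pop()                # BUG: pops the caller's address too

def attempt_to_access_missing_address():
    return stack.pop()         # fails here, nowhere near the actual bug

put_some_memory_address_on_the_stack()
do_something_unrelated(17)
# ...arbitrarily many other words could run in between...
try:
    attempt_to_access_missing_address()
except IndexError:
    print("the address is gone, and the traceback points nowhere near the bug")
```

The "skid distance" here is everything between `do_something_unrelated` and the failed pop, which is exactly what makes this class of bug so miserable to track down.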
This is the sort of thing that causes people to get woken up at 3AM to deal with production outages. The entire _class_ of purely-functional languages exists on the thesis that data being twiddled with by bits of code that aren't supposed to is _so bad_ that it's better to disallow _all_ data-twiddling _forever_, full stop. I shudder to think what would happen if you dropped a Haskellite into a Forth world. He'd probably just keel over and die on the spot.
## So what is it actually good for?
The short answer is, embedded systems. That has traditionally been the wheelhouse of Forth, and as far as I can tell it continues to be so insofar as Forth is still used at all for real live projects.
This makes a lot of sense, when you think about it. Embedded code is often:
* Very tightly focused, concerned with just doing a few specific things under all circumstances (as contrasted with, say, your typical webapp which might have to add file uploads or start interfacing with some remote API any day of the week)
* Resource-constrained, particularly where memory is concerned (CPU time is usually less of a concern)
* Developed by a small team, often just a single person
* Fire-and-forget: there's no "ongoing maintenance" when you'd have to perform surgery or recall a satellite from orbit to do a firmware upgrade
All of this works to minimize the downsides of Forth's lack of organizational capabilities. Organizing is easier the less there is to organize, of course, and when there's only one or a few people doing the organizing. And when it's all done in one shot--I can't count the number of times I've come back to code _I wrote_ after a year or two or three and spent the day going "What was I _thinking_?" while trying to unravel the tangled web I wove. That sort of thing is much less likely when your code gets deployed as part of Industrial Sewage Pump Controller #BS94A, because who wants to go swimming in sewage if they don't have to?
The other reason I suspect that Forth has been so successful in embedded contexts is that it's _a lot_ easier to implement than most languages. This is true of stack-oriented languages in general, I think--there's a reason that a lot of VMs are stack-based as well--and in an embedded context, if you want a runtime of any kind you often have to build it yourself. I don't know this for sure, but I wouldn't be surprised if Forth got used for a lot of embedded stuff back in the day because it was the most feasible language to implement in raw assembly for the target architecture.<Sidenote>Of course, these days C is the lingua franca of everything, so I suspect this effect is weaker than it once was. Maybe there are some wild esoteric architectures out there that don't even have a C compiler, but I highly doubt there are many.</Sidenote>
Of course, I've constructed this tottering tower of linguistic philosophy on the basis of a few days' playing with Forth when I've had time, so take what I say with a few spoonfuls of salt. There are people out there who will [claim](http://collapseos.org/forth.html) that they used to prefer C to Forth, but after enough time with Forth their eyes were opened and their mind ascended to the heights, and they spoke with the tongues of the cherubim and painted with the colors of the wind--
Sorry, that got a little out of hand. My point is, some people like Forth even for relatively complex tasks like an OS kernel.<Sidenote>Although that particular guy is also prophesying the [end of civilization as we know it](http://www.collapseos.org/), so, I dunno. Maybe he just sees things in a different light.</Sidenote> Ultimately, though, I think the proof is in the pudding. Forth and C hit the scene at around the same time, but one of them went on to underpin the critical infrastructure of basically all computing, while the other continues to languish in relative obscurity, and we all know which is which.

---
title: 'Advent of Languages 2024, Day 4: Fortran'
date: 2024-12-10
---
<script>import Sidenote from '$lib/Sidenote.svelte';</script>
Oh, you thought we were done going back in time? Well I've got news for you, Doc Brown, you'd better not mothball the ol' time machine just yet, because we're going back even further. That's right, for Day 4 I've decided to use Fortran!<Sidenote>Apparently it's officially called `Fortran` now and not `FORTRAN` like it was in days of yore, and has been ever since the 1990s. That's right, when most languages I've used were just getting their start, Fortran was going through its mid-life identity crisis.</Sidenote><Sidenote>When I told my wife that I was going to be using a language that came out in the 1950s, she wanted to know if the next one would be expressed in Egyptian hieroglyphs.</Sidenote>
Really, though, it's because this is day _four_, and I had to replace all those missed Forth jokes with _something_.
## The old that is strong does not wither
Fortran dates back to 1958, making it the oldest programming language still in widespread use.<Sidenote>Says Wikipedia, at least. Not in the article about Fortran, for some reason, but in [the one about Lisp](https://en.wikipedia.org/wiki/Lisp_(programming_language)).</Sidenote> Exactly how widespread is debatable--the [TIOBE index](https://www.tiobe.com/tiobe-index/) puts it at #8, but the TIOBE index also puts Delphi Pascal at #11 and Assembly at #19, so it might have a different idea of what makes a language "popular" than you or I.<Sidenote>For contrast, Stack Overflow puts it at #38, right below Julia and Zig, which sounds a little more realistic to me.</Sidenote> Regardless, it's undeniable that it gets pretty heavy use even today--much more than Forth, I suspect--because of its ubiquity in the scientific and HPC sectors. The website mentions "numerical weather and ocean prediction, computational fluid dynamics, applied math, statistics, and finance" as particularly strong areas. My guess is that this largely comes down to inertia, plus Fortran being "good enough" for the things people wanted to use it for that it was easier to keep updating Fortran than to switch to something else wholesale.<Sidenote>Unlike, say, BASIC, which is so gimped by modern standards that it _doesn't even have a call stack_. That's right, you can't do recursion in BASIC, at least not without managing the stack yourself.</Sidenote>
And update they have! Wikipedia lists 12 major versions of Fortran, with the most recent being Fortran 2023. That's a pretty impressive history for a programming language. It's old enough to retire!
The later versions of Fortran have added all sorts of modern conveniences, like else-if conditionals (77), properly namespaced modules (90), growable arrays (also 90), local variables (2008), and finally, just last year, ternary expressions and the ability to infer the length of a string variable from a string literal! Wow!
I have to say, just reading up on Fortran already feels more modern than it did for Forth, or even C/C++. It's got a [snazzy website](https://fortran-lang.org/)<Sidenote>With a dark/light mode switcher, so you know it's hip.</Sidenote> with obvious links to documentation, sitewide search, and even an online playground. This really isn't doing any favors for my former impression of Fortran as a doddering almost-septuagenarian with one foot in the grave and the other on a banana peel.
## On the four(tran)th day of Advent, my mainframe gave to me
The Fortran getting-started guide [literally gives you](https://fortran-lang.org/learn/quickstart/hello_world/) hello-world, so I won't bore you with that here. Instead I'll just note some interesting aspects of the language that jumped out at me:
* There's no `main()` function like C and a lot of other compiled languages, but there are mandatory `program <name> ... end program` delimiters at the start and end of your outermost layer of execution. Modules are defined outside of the `program ... end program` block. Not sure yet whether you can have multiple `program` blocks, but I'm leaning towards no?
* Variables are declared up-front, and are prefixed with their type name followed by `::`. You can leave out the type qualifier, in which case the type of the variable will be inferred not from the value to which it is first assigned, but from its _first letter_: variables whose names start with `i`, `j`, `k`, `l`, `m`, `n` are integers, everything else is a `real` (floating-point). Really not sure what drove that decision, but it's described as deprecated, legacy behavior anyway, so I plan to ignore it.
* Arrays are 1-indexed. Also, multi-dimensional arrays are a native feature! I'm starting to see that built-for-numerical-workloads heritage.
* It has `break` and `continue`, but they're named `exit` and `cycle`.
* There's a _built-in_ parallel-loop construct,<Sidenote>It uses different syntax to define its index and limit. That's what happens when your language development is spread over the last 65 years, I guess.</Sidenote> which "informs the compiler that it may use parallelization/SIMD to speed up execution". I've only ever seen this done at the library level before. If you're lucky your language has enough of a macro system to make it look semi-natural, otherwise, well, I hope you like map/reduce.
* It has functions, but it _also_ has "subroutines". The difference is that functions return values and are expected not to modify their arguments, and subroutines don't return values but may modify their arguments. I guess you're out of luck if you want to modify an argument _and_ return a value (say, a status code or something).
* Function and subroutine arguments are mentioned in the function signature (which looks like it does in most languages), but you really get down to brass tacks in the function body itself, which is where you specify the type and in-or-out-ness of the parameters. Reminds me of PowerShell, of all things.
* The operator for accessing struct fields is `%`. Where other languages do `sometype.field`, in Fortran you'd do `sometype%field`.
* Hey look, it's OOP! We can have methods! Also inheritance, sure, whatever.
Ok, I'm starting to get stuck in the infinite docs-reading rut for which I criticized myself at the start of this series, so buckle up, we're going in.
## The Puzzle
We're given a two-dimensional array of characters and asked to find the word `XMAS` everywhere it occurs, like those [word search](https://en.wikipedia.org/wiki/Word_search) puzzles you see on the sheets of paper they hand to kids at restaurants in a vain attempt to keep them occupied so their parents can have a chance to enjoy their meal.
Hey, Fortran might actually be pretty good at this! At least, multi-dimensional arrays are built in, so I'm definitely going to use those.
First things first, though, we have to load the data before we can start working on it.<Sidenote>Getting a Fortran compiler turned out to be as simple as `apt install gfortran`.</Sidenote>
My word-search grid appears to be 140 characters by 140, so I'm just going to hard-code that as the dimensions of my array. I'm sure there's a way to size arrays dynamically, but life's too short.
### Loading data is hard this time
Not gonna lie here, this part took me _way_ longer than I expected it to. See, the standard way to read a file in Fortran is with the `read()` statement. (It looks like a function call, but it's not.) You use it something like this:
```fortran
read(file_handle, *) somevar, anothervar, anothervar2
```
Or at least, that's one way of using it. But here's the problem: by default, Fortran expects to read data stored in a "record-based" format. In short, this means that it's expected to consist of lines, and each line will be parsed as a "record". Records consist of some number of elements, separated by whitespace. The "format" of the record, i.e. how the line should be parsed, can either be explicitly specified in a slightly arcane mini-language reminiscent of string-interpolation placeholders (just in reverse), or it can be inferred from the number and types of the variables specified after `read()`.
Initially, I thought I might be able to do this:
```fortran
character, dimension(140, 140) :: grid
! ...later
read(file_handle, *) grid
```
The top line is just declaring `grid` as a 2-dimensional array of characters, 140 rows by 140 columns. Neat, huh?
But sadly, this kept spitting out errors about how it had encountered the end of the file unexpectedly. I think what was happening was that when you give `read()` an array, it expects to populate each element of the array with one record from the file, and remember records are separated by lines, so this was trying to assign one line per array element. My file had 140 lines, but my array had 140 * 140 elements, so this was never going to work.
My next try looked something like this:
```fortran
do row = 1, 140
read(file_handle, *) grid(row, :)
end do
```
But this also resulted in end-of-file errors. Eventually I got smart and tried running this read statement just _once_, and discovered that it was populating the first row of the array with the first letter of _each_ line in the input file. I think what's going on here is that `grid(1, :)` creates a slice of the array that's 1 row by the full width (so 140), and the `read()` statement sees that and assumes that it needs to pull 140 records from the file _each time this statement is executed_. But records are (still) separated by newlines, so the first call to `read()` pulls all 140 rows, dumps everything but the first character from each (because, I think, the type of the array elements is `character`), puts that in and continues on. So after just a single call to `read()` it's read every line but dumped most of the data.
I'm pretty sure the proper way to do this would be to figure out how to set the record separator, but it's tricky because the "records" (if we want each character to be treated as a record) within each line are smashed right up against each other, but have newline characters in between lines. So I'd have to specify that the separator is sometimes nothing, and sometimes `\n`, and I didn't feel like figuring that out because all of the references I could find about Fortran format specifiers were from ancient plain-HTML pages titled things like "FORTRAN 77 INTRINSIC SUBROUTINES REFERENCE" and hosted on sites like `web.math.utk.edu` where they probably _do_ date back to something approaching 1977.
So instead, I decided to just make it dumber.
```fortran
program advent04
implicit none
character, dimension(140, 140) :: grid
integer :: i
grid = load()
do i = 1, 140
print *, grid(i, :)
end do
contains
function load() result(grid)
implicit none
integer :: handle
character, dimension(140, 140) :: grid
character(140) :: line
integer :: row
integer :: col
open(newunit=handle, file="data/04.txt", status="old", action="read")
do row = 1, 140
! `line` is a `character(140)` variable, so Fortran knows to look for 140 characters I guess
read(handle, *) line
do col = 1, 140
! just assign each character of the line to array elements individually
grid(row, col) = line(col:col)
end do
end do
close(handle)
end function load
end program advent04
```
I am more than sure that there are several dozen vastly better ways of accomplishing this, but look, it works and I'm tired of fighting Fortran. I want to go on to the fun part!
### The fun part
The puzzle specifies that occurrences of `XMAS` can be horizontal, vertical, or even diagonal, and can be written either forwards or backwards. The obvious way to do this would be to scan through the array, stop on every `X` character and check for the complete word `XMAS` in each of the eight directions individually, with a bunch of loops. Simple, easy, and probably more than performant enough because this grid is only 140x140, after all.<Sidenote>Although AoC has a way of making the second part of the puzzle punish you if you were lazy and went with the brute-force approach for the first part, so we'll see how this holds up when we get there.</Sidenote>
But! This is Fortran, and Fortran's whole shtick is operations on arrays, especially multidimensional arrays. So I think we can make this a lot more interesting. Let's create a "test grid" that looks like this:
```
S . . S . . S
. A . A . A .
. . M M M . .
S A M X M A S
. . M M M . .
. A . A . A .
S . . S . . S
```
Which has all 8 possible orientations of the word `XMAS` starting from the central X. Then, we can just take a sliding "window" of the same size into our puzzle grid and compare it to the test grid. This is a native operation in Fortran--comparing two arrays of the same size results in a third array whose elements are the result of each individual comparison from the original arrays. Then we can just call `count()` on the resulting array to get the number of true values, and we know how many characters matched up. Subtract 1 for the central X we already knew about, then divide by 3 since there are 3 letters remaining in each occurrence of `XMAS`, and Bob's your uncle, right?
...Wait, no. That won't work because it doesn't account for partial matches. Say we had a "window" that looked like this (I'm only showing the bottom-right quadrant of the window for simplicity):
```
X M X S
S . . .
A . . .
X . . .
```
If we were to apply the process I just described to this piece of the grid, we would come away thinking there was 1 full match of `XMAS`, because there's one each of `X`, `M`, `A`, and `S` in the right positions. Problem is, they aren't all in the right places to be part of the _same_ XMAS, meaning that there isn't actually a match here at all.
To do this properly, we need some way of distinguishing the individual "rays" of the "star", which is how I've started thinking about the test grid up above, so that we know whether _all_ of any given "ray" is present. So what if we do it this way?
1. Apply the mask to the grid as before, but this time, instead of just counting the matches, we're going to convert them all to 1s. Non-matches will be converted to 0.
2. Pick a prime number for each "ray" of the "star". We can just use the first 8 prime numbers (excluding 1, of course). Create a second mask with these values subbed in for each ray, and 1 in the middle. So the ray extending from the central X directly to the right, for instance, would look like this, assuming we start assigning our primes from the top-left ray and move clockwise: `1 7 7 7`
3. Multiply this array by the array that we got from our initial masking operation. Now any matched characters will be represented by a prime number _specific to that ray of the star_.
4. Convert all the remaining 0s in the resulting array to 1s, then take the product of all values in the array.
5. Test whether that product is divisible by the cube of each of the primes used. E.g. if it's divisible by 8, we _know_ that there must have been three 2's in the array, so we _know_ that the top-left ray is entirely present. So we can add 1 to our count of valid `XMAS`es originating at this point.
Will this work? Is it even marginally more efficient than the stupidly obvious way of just using umpty-gazillion nested for loops--excuse me, "do loops"--to test each ray individually? No idea! It sure does sound like a lot more fun, though.
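At miniature scale, at least, the arithmetic holds up. Here's a Python sketch of a single 4-element ray, using the partial-match window from earlier (its top row reads `X M X S`, so `X`, `M`, and `S` match the pattern but `A` doesn't; all values here are illustrative):

```python
from math import prod

# One ray of the "star": 1 where the window character matched the test
# grid, 0 where it didn't. X, M, S matched; the A position held an X.
matches = [1, 1, 0, 1]
ray_mask = [1, 7, 7, 7]       # 1 for the central X, prime 7 for this ray

masked = [m * p for m, p in zip(matches, ray_mask)]
masked = [x or 1 for x in masked]   # turn the 0s into 1s before multiplying
p = prod(masked)                    # 1 * 7 * 1 * 7 = 49
print(p % 7**3 == 0)                # False: fewer than three 7s, no full XMAS
```

A fully matched ray would instead contribute 7 × 7 × 7 = 343, which _is_ divisible by 7³, so the check in step 5 would count it.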
Ok, first things first. Let's adjust the data-loading code to pad the grid with 3 bogus values on each edge, so that we can still generate our window correctly when we're looking at a point near the edge of the grid.
```fortran
grid = '.' ! probably wouldn't matter if we skipped this, uninitialized memory just makes me nervous
open(newunit=handle, file="data/04.txt", status="old", action="read")
do row = 4, 143
read(handle, *) line
do col = 1, 140
grid(row, col + 3) = line(col:col)
end do
end do
```
Turns out assigning a single value to an array of that value's type (like `grid = '.'` above) just sets every array element to that value, which is very convenient.
Now let's work on the whole masking thing.
Uhhhh. Wait. We might have a problem here. When we take the product of all values in the array after the various masking and prime-ization stuff, we could _conceivably_ end up multiplying together the cubes of the first 8 prime numbers. What's the product of the cubes of the first 8 prime numbers?
```
912585499096480209000
```
Hm, ok, and what's the max value of a 64-bit integer?
```
9223372036854775807
```
Oh. Oh, _noooo_.
It's okay, I mean, uh, it's not _that_ much higher. Only two orders of magnitude, and what are the odds of all eight versions of `XMAS` appearing in the same window, anyway? Something like 1/4<sup>25</sup>? Maybe we can still make this work.
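For the record, the worst case really is out of range. Python's integers are arbitrary-precision, so we can check the arithmetic from above without any wraparound (this is just a sanity check, not part of the Fortran solution):

```python
from math import prod

# Worst case: all eight rays match, so the product contains the cube of
# each of the first eight primes.
primes = [2, 3, 5, 7, 11, 13, 17, 19]
worst = prod(p**3 for p in primes)

print(worst)              # 912585499096480209000
print(2**63 - 1)          # 9223372036854775807
print(worst > 2**63 - 1)  # True: too big for a signed 64-bit integer
```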
```fortran
integer function count_xmas(row, col) result(count)
implicit none
integer, intent(in) :: row, col
integer :: i
integer(8) :: prod
integer(8), dimension(8) :: primes
character, dimension(7, 7) :: test_grid, window
integer(8), dimension(7, 7) :: prime_mask, matches, matches_prime
test_grid = reshape( &
[&
'S', '.', '.', 'S', '.', '.', 'S', &
'.', 'A', '.', 'A', '.', 'A', '.', &
'.', '.', 'M', 'M', 'M', '.', '.', &
'S', 'A', 'M', 'X', 'M', 'A', 'S', &
'.', '.', 'M', 'M', 'M', '.', '.', &
'.', 'A', '.', 'A', '.', 'A', '.', &
'S', '.', '.', 'S', '.', '.', 'S' &
], &
shape(test_grid) &
)
primes = [2, 3, 5, 7, 11, 13, 17, 19]
prime_mask = reshape( &
[ &
2, 1, 1, 3, 1, 1, 5, &
1, 2, 1, 3, 1, 5, 1, &
1, 1, 2, 3, 5, 1, 1, &
19, 19, 19, 1, 7, 7, 7, &
1, 1, 17, 13, 11, 1, 1, &
1, 17, 1, 13, 1, 11, 1, &
17, 1, 1, 13, 1, 1, 11 &
], &
shape(prime_mask) &
)
window = grid(row - 3:row + 3, col - 3:col + 3)
matches = logical_to_int64(window == test_grid)
matches_prime = matches * prime_mask
prod = product(zero_to_one(matches_prime))
count = 0
do i = 1, 8
if (mod(prod, primes(i) ** 3) == 0) then
count = count + 1
end if
end do
end function count_xmas
elemental integer(8) function logical_to_int64(b) result(i)
implicit none
logical, intent(in) :: b
if (b) then
i = 1
else
i = 0
end if
end function logical_to_int64
elemental integer(8) function zero_to_one(x) result(y)
implicit none
integer(8), intent(in) :: x
if (x == 0) then
y = 1
else
y = x
end if
end function zero_to_one
```
Those `&`s are line-continuation characters, by the way. Apparently you can't have newlines inside a function call or array literal without them. And the whole `reshape` business is a workaround for the fact that there _isn't_ actually a literal syntax for multi-dimensional arrays, so instead you have to create a 1-dimensional array and "reshape" it into the desired shape.
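One subtlety worth knowing about `reshape`: Fortran arrays are column-major, so the flat list fills the new shape down the first column, then the second, and so on, which means a non-symmetric literal comes out transposed from how it reads in the source. (As it happens, the grids in this post are symmetric enough that this causes no trouble.) A quick Python sketch of the fill order:

```python
# Mimic Fortran's column-major reshape of a flat list into a 2x3 array:
# element (r, c) comes from flat index r + c * rows.
flat = ['a', 'b', 'c', 'd', 'e', 'f']
rows, cols = 2, 3
grid = [[flat[r + c * rows] for c in range(cols)] for r in range(rows)]
print(grid)   # [['a', 'c', 'e'], ['b', 'd', 'f']]
```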
Now we just have to put it all together:
```fortran
total = 0
do col = 4, 143
do row = 4, 143
if (grid(row, col) == 'X') then
total = total + count_xmas(row, col)
end if
end do
end do
print *, total
```
These `elemental` functions, by the way, are functions you can ~~explain to Watson~~ apply to an array element-wise. So `logical_to_int64(array)` returns an array of the same shape with all the "logical" (boolean) values replaced by 1s and 0s.
This actually works! Guess I dodged a bullet with that 64-bit integer thing.<Sidenote>Of course I discovered later, right before posting this article, that Fortran totally has support for 128-bit integers, so I could have just used those and not worried about any of this.</Sidenote>
I _did_ have to go back through and switch out all the `integer` variables in `count_xmas()` with `integer(8)`s (except for the loop counter, of course). This changed my answer significantly. I can only assume that calling `product()` on an array of 32-bit integers, then sticking the result in a 64-bit integer, does the multiplication as 32-bit first and only _then_ converts to 64-bit, after however much rolling-over has happened. Makes sense, I guess.
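That rolling-over behavior is easy to mimic in Python, whose integers never wrap; the `wrap32` helper below is a hypothetical stand-in for 32-bit arithmetic, and the values are just a few ray-cubes picked for illustration:

```python
def wrap32(x):
    """Truncate to a signed 32-bit integer, the way int32 math rolls over."""
    x &= 0xFFFFFFFF
    return x - 2**32 if x >= 2**31 else x

vals = [7**3, 11**3, 13**3, 17**3]   # a few ray-cubes from the prime mask

true_product = 1
wrapped = 1
for v in vals:
    true_product *= v                 # exact, arbitrary precision
    wrapped = wrap32(wrapped * v)     # truncated to 32 bits at every step

print(true_product)   # 4927753743913
print(wrapped)        # 1426255401 -- rolled over somewhere along the way
```

Widening `wrapped` to 64 bits at the end can't recover the lost high bits, which is exactly why the answer changed when the intermediate variables became `integer(8)`.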
Ok, great! On to part 2!
## Part 2
It's not actually too bad! I was really worried that it was going to tell me to discount all the occurrences of `XMAS` that overlapped with another one, and that was going to be a royal pain in the butt with this methodology. But thankfully, all we have to do is change our search to look for _two_ occurrences of the sequence `M-A-S` arranged in an X shape, like this:
```
M . S
. A .
M . S
```
This isn't too difficult with our current approach. Unfortunately it will require four test grids applied in sequence, rather than just one, because again the sequence can be written either forwards or backwards, and we have to try all the permutations. On the plus side, we can skip the whole prime-masking thing, because each test grid is going to be all-or-nothing now. In fact, we can even skip checking any remaining test grids whenever we find a match, because there's no way the same window could match more than one.
Hmm, I wonder if there's a way to take a single starting test grid and manipulate it to reorganize the characters into the other shapes we need?
Turns out, yes! Yes there is. We can use a combination of slicing with a negative step, and transposing, which switches rows with columns, effectively rotating and flipping the array. So setting up our test grids looks like this:
```fortran
character, dimension(3, 3) :: window, t1, t2, t3, t4
t1 = reshape( &
[ &
'M', '.', 'S', &
'.', 'A', '.', &
'M', '.', 'S' &
], &
shape(t1) &
)
t2 = t1(3:1:-1, :) ! flip t1 top-to-bottom
t3 = transpose(t1) ! swap t1 rows for columns
t4 = t3(:, 3:1:-1) ! flip t3 left-to-right
```
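The same flip-and-transpose trick, sketched in Python with nested lists standing in for the Fortran arrays (the starting orientation here puts the M's on top so that all four results come out distinct):

```python
# Generate all four M-A-S cross orientations from one grid.
t1 = [list("M.M"),
      list(".A."),
      list("S.S")]

def flip_v(g):      # flip top-to-bottom, like t1(3:1:-1, :)
    return g[::-1]

def flip_h(g):      # flip left-to-right, like t3(:, 3:1:-1)
    return [row[::-1] for row in g]

def transpose(g):   # swap rows for columns, like transpose(t1)
    return [list(r) for r in zip(*g)]

t2 = flip_v(t1)     # S's on top
t3 = transpose(t1)  # M's on the left
t4 = flip_h(t3)     # M's on the right

for g in (t1, t2, t3, t4):
    print('/'.join(''.join(row) for row in g))
```

Four distinct grids from one literal, with no chance of a transcription error creeping into a hand-written second copy.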
Then we can just compare the window to each test grid:
```fortran
window = grid(row - 1:row + 1, col - 1:col + 1)
if ( &
count_matches(window, t1) == 5 &
.or. count_matches(window, t2) == 5 &
.or. count_matches(window, t3) == 5 &
.or. count_matches(window, t4) == 5 &
) then
count = 1
else
count = 0
end if
```
To my complete and utter astonishment, this actually worked the first time I tried it, once I had figured out all of the array-flipping-and-rotating I needed to create the test grids. It always makes me suspicious when that happens, but Advent of Code confirmed it, so I guess we're good!<Sidenote>Or I just managed to make multiple errors that all cancelled each other out.</Sidenote>
It did expose a surprisingly weird limitation in the Fortran parser, though. Initially I kept trying to write the conditions like this: `if(count(window == t1) == 5)`, and couldn't understand the syntax errors it was throwing. Finally I factored out `count(array1 == array2)` into a separate function, and everything worked beautifully. My best guess is that the presence of two `==` operators inside a single `if` condition, not separated by `.and.` or `.or.`, is just a no-no. The things we learn.
## Lessons ~~and carols~~
(Whoa now, we're not _that_ far into Advent yet.)
Despite being one of the oldest programming languages still in serious use, Fortran manages to feel surprisingly familiar. There are definite archaisms, like having to define the types of all your variables at the start of your program/module/function,<Sidenote>Even throwaway stuff like loop counters and temporary values.</Sidenote> having to declare function/subroutine names at the beginning _and end_, and the use of the word "subroutine". But overall it's kept up surprisingly well, given--and I can't stress this enough--that it's _sixty-six years old_. It isn't even using `CAPITAL LETTERS` for everything any more,<Sidenote>Although the language is pretty much case-insensitive so you can still use CAPITALS if you want.</Sidenote> which puts it ahead of SQL,<Sidenote>Actually, I suspect the reason the CAPITALS have stuck around in SQL is that more than most languages, you frequently find yourself writing SQL _in a string_ from another language. Occasionally editors will be smart enough to syntax-highlight it as SQL for you, but for the times they aren't, using `CAPITALS` for all the `KEYWORDS` serves as a sort of minimal DIY syntax highlighting. That's what I think, at least.</Sidenote> and SQL is 10+ years younger.
It still has _support_ for a lot of really old stuff. For instance, you can label statements with numbers and then `go to` a numbered statement, but there's really no use for that in new code. We have functions, subroutines, loops, if-else-if-else conditionals--basically everything you would (as I understand it) use `goto` for back in the day.
Runs pretty fast, too. I realized after I already had a working solution that I had been compiling without optimizations the whole time, so I decided to try enabling them, only to discover that the actual execution time wasn't appreciably different. I figured the overhead of spawning a process was probably eating the difference, so I tried timing just the execution of the main loop and sure enough, without optimizations it took about 2 milliseconds whereas with optimizations it was 690 microseconds. Whee! Native-compiled languages are so fun. I'm too lazy to try rewriting this in Python just to see how much slower it would be, but I'm _pretty_ sure that this time it would be quite noticeable.
Anyway, that about wraps it up for Fortran. My only remaining question is: What is the appropriate demonym for users of Fortran? Python has Pythonistas, Rust has Rustaceans, and so on. I was going to suggest "trannies" for Fortran users, but everyone kept giving me weird looks for some reason.

View File

@ -8,7 +8,7 @@ draft: true
import Sidenote from '$lib/Sidenote.svelte';
</script>
I use Kubernetes on my personal server, largely because I wanted to get some experience working with it. It's certainly been helpful in that regard, but after a year and a half or so I think I can pretty confidently say that it's not the ideal tool for my use-case. Duh, I guess? But I think it's worth talking about _why_ that's the case, and what exactly _would_ be the ideal tool.
I use Kubernetes on my personal server, largely because I wanted to get some experience working with it. It's certainly been helpful in that regard, but after a couple of years I think I can pretty confidently say that it's not the ideal tool for my use-case. Duh, I guess? But I think it's worth talking about _why_ that's the case, and what exactly _would_ be the ideal tool.
## The Kubernetes Way™
@ -42,7 +42,7 @@ Where Kubernetes is intrusive, we want to be transparent. Where Kubernetes is fl
The basic resources of servering are ~~wheat~~ ~~stone~~ ~~lumber~~ compute, storage, and networking, so let's look at each in detail.
## Compute
### Compute
"Compute" is an amalgamate of CPU and memory, with a side helping of GPU when necessary. Obviously these are all different things, but they tend to work together more directly than either of them does with the other two major resources.
@ -54,7 +54,7 @@ I'm not entirely sure this needs to be the case! Sure, for systems like Kubernet
The obvious counterpoint is that distributing the system isn't just for scale, it's also for resiliency. Which is true, and if you don't care about resiliency at all then you should (again) probably just be using Harbormaster or something. But here's the thing: We care about stuff running _on_ the cluster being resilient, but how much do we care about the _control plane_ being resilient? If there's only a single control node, and it's down for a few hours, can't the workers just continue happily running their little things until told otherwise?
We actually have a large-scale example of something sort of like this in the recent Cloudflare outage.
We actually have a large-scale example of something sort of like this in the Cloudflare outage from a while back: Their control plane was completely unavailable for quite a while (over a day if I recall correctly), but their core CDN and anti-DDoS services seemingly continued to function pretty well.
### Virtualization
@ -76,7 +76,34 @@ So we're going to use Docker _images_ but we aren't going to use Docker to run t
Locked-down by default. You don't trust these apps, so they don't get access to the soft underbelly of your LAN. So it's principle-of-least-privilege all the way. Ideally it should be possible when specifying a new app that it gets network access to an existing app, rather than having to go back and modify the existing one.
## Storage
## Storage is yes
Kubernetes is famous for kinda just punting on storage, at least if you're running it on bare metal. Oh sure, there are lots of [storage-related resources](https://kubernetes.io/docs/reference/kubernetes-api/config-and-storage-resources/persistent-volume-v1/), but if you look closely at those you'll notice they mostly just _describe_ storage stuff, and leave it up to the cluster operator to bring their own _actual implementation_ that provisions, attaches, maintains, and cleans up the ~~castles in the sky~~ PersistentVolumeClaims and StorageClasses and whatnot.
This makes sense for Kubernetes because, although it took me an embarrassingly long time to realize this, Kubernetes has never been about enabling self-hosting. Its primary purpose has always been _cloud mobility_, i.e. enabling you to pick up your cloud-hosted systems and plonk them down over in a completely different cloud. Unfortunately this leaves the self-hosting among us out in the cold: since we don't typically have the luxury of EBS or its equivalent in our dinky little homelabs, we are left to bring our own storage systems, which is something of a [nightmare hellscape of doom](https://kubernetes-csi.github.io/docs/introduction.html).
I want my hypothetical storage system to completely flip this on its head. There _should_ be a built-in storage implementation, and it _should_ be usable when you're self-hosting, with minimal configuration. None of this faffing about with the layers and layers of abstraction hell that Kubernetes forces on you as soon as you give in to the siren song of having a single stateful application in your cluster. If I want to give my application a persistent disk I should jolly well be able to do that with no questions asked.
## Sounds great, but how?
For starters, we're going to give up on synchronous replication. Synchronous replication is one of those things that _sounds_ great, because it makes your distributed storage system theoretically indistinguishable from a purely-local filesystem, but having used a [storage system that prioritizes synchronous replication](https://longhorn.io/) I can pretty confidently say that I would be much happier without it. It absolutely _murders_ performance, causing anywhere from a 3x to a 20x slowdown in my [testing](https://serverfault.com/a/1145529/409057), and the worst part is that I'm pretty sure it's _completely unnecessary_.
Here's the thing: You only _really_ need synchronous replication if you have multiple instances of some application using the same files at the same time. But nobody actually does this! In _any_ clustering setup I've ever encountered, you handle multi-consumer access to persistent state in one of three ways:
1. You delegate your state management to something else that _doesn't_ need to run multiple copies, i.e. the "replicate your web app but run one DB" approach,
2. You shard your application and make each shard the exclusive owner of its slice of state, or
3. You do something really fancy with distributed systems theory and consensus algorithms.
Here's the thing, though: _none of these approaches require synchronous replication._ Really, the _only_ use case I've found so far for _actually_ sharing state between multiple instances of the same application is things like Docker registry layer storage, which is a special case because it's basically a content-addressed filesystem and therefore _can't_ suffer from write contention. Maybe this is my lack of experience showing, but I have a lot of difficulty imagining a use-case for simultaneous multi-writer access to the same files that isn't better served by something else.
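To make the registry-layer case concrete, here's a toy sketch (Python, with names I made up) of a content-addressed store. Because every key is derived from the content itself, two writers storing the same bytes always agree, and writing different bytes produces different keys--there's simply nothing to contend over:

```python
import hashlib
from pathlib import Path

class ContentAddressedStore:
    """Toy content-addressed blob store: the key is the SHA-256 of the data,
    so concurrent writers can never disagree about what a key maps to."""

    def __init__(self, root: Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        path = self.root / digest
        # Idempotent: re-writing identical content is a no-op, so there is
        # no such thing as a conflicting write to an existing key.
        if not path.exists():
            path.write_bytes(data)
        return digest

    def get(self, digest: str) -> bytes:
        return (self.root / digest).read_bytes()
```

This is, of course, a sketch and not how any real registry is implemented--but it shows why content-addressing sidesteps the multi-writer problem entirely.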
Conceptually, then, our storage system will consist of a set of directories somewhere on disk which we mount into containers, and which are _asynchronously_ replicated to other nodes with a last-writer-wins policy. Actually, we'll probably want to have multiple locations (we can call them "pools", like ZFS does) on disk so that we can expose multiple different types of media to the cluster (e.g. small/fast, large/slow).
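As a rough sketch of what "asynchronously replicated with last-writer-wins" could look like (hypothetical names, Python for brevity), reconciling two replicas' views of a directory might be as simple as keeping, for each path, whichever version carries the newer timestamp:

```python
from dataclasses import dataclass

@dataclass
class FileVersion:
    path: str
    mtime: float   # modification timestamp used to decide the "last" writer
    content: bytes

def lww_merge(local: dict, remote: dict) -> dict:
    """Reconcile two replicas' views of a directory tree: for each path,
    keep whichever version has the newer mtime (last writer wins)."""
    merged = dict(local)
    for path, version in remote.items():
        if path not in merged or version.mtime > merged[path].mtime:
            merged[path] = version
    return merged
```

A real implementation would need to worry about deletions, clock skew, and partial transfers, but the core policy really is this small--which is a big part of the appeal over synchronous replication.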
This is super simple as long as we're willing to store a full copy of all the data on every node. That might be fine! But I lean toward thinking it's not, because it's not all that uncommon in my experience to have a heterogeneous "cluster" where one machine is your Big Storage Monster and other machines are much more bare-bones. There are two basic ways of dealing with this:
1. We can restrict scheduling so that workloads can only be scheduled on nodes that have a copy of their data, or
2. We can make the data accessible over the network, SAN-style.
My inclination is to go with 1) here, because 2) introduces some pretty hefty performance penalties. We could maybe mitigate that with aggressive caching, but now you've got wildly unpredictable performance for your storage based on whether the data is in cache or not. Practically--remember, we're targeting _small_ setups here--I don't think it would be much of a problem to specify a set of nodes when defining a storage pool, or even just make pools a node-local configuration so that each node declares what pools it participates in, and then replicate each pool to every participating node. Again, we're not dealing with Big Data here, we don't need to spread our storage across N machines because it's literally too big to fit on one.
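A minimal sketch of option 1), with invented names: the scheduler just filters out any node that doesn't participate in every storage pool a workload's volumes come from.

```python
def schedulable_nodes(workload_pools, node_pools):
    """Return the nodes eligible to run a workload: a node qualifies only
    if it participates in every pool the workload needs (option 1 above)."""
    needed = set(workload_pools)
    return [node for node, pools in node_pools.items() if needed <= set(pools)]

# A workload needing the "bulk" pool can only land on the storage-heavy node;
# one needing only "fast" can go anywhere that pool is replicated.
nodes = {"big-storage": ["fast", "bulk"], "tiny": ["fast"]}
```

That's the whole scheduling constraint--no CSI drivers, no volume attachment dance, just "does this node hold the data?"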
Kubernetes tends to work best with stateless applications. It's not entirely devoid of [tools](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/) for dealing with state, but state requires persistent storage and persistent storage is hard in clusters.<Sidenote>In fact, I get the sense that for a long time you were almost completely on your own with storage, unless you were using a managed Kubernetes project like GKE where you're just supposed to use whatever the provider offers for storage. More recently things like Longhorn have begun improving the situation, but "storage on bare-metal Kubernetes" still feels decidedly like a second-class citizen to me.</Sidenote>

View File

@ -7,7 +7,7 @@ date: 2024-07-06
import Sidenote from '$lib/Sidenote.svelte';
</script>
Like a lot of people, my main experience with private keys has come from using them for SSH. I'm familiar with the theory, of course - I know generally what asymmetric encryption does,<Sidenote>Although exactly _how_ it does so is still a complete mystery to me. I've looked up descriptions of RSA several times,<Sidenote>Testing nested notes again.</Sidenote> and even tried to work my way through a toy example, but it's never helped. And I couldn't even _begin_ to explain elliptic curve cryptography beyond "black math magic".</Sidenote> and I know that it means a compromised server can't reveal your private key, which is nice although if you only ever use a given private key to SSH into your server and the server is already compromised, is that really so helpful?<Sidenote>Yes, yes, I know that it means you can use the same private key for _multiple_ things without having to worry, but in practice a lot of people seem to use separate private keys for separate things, and even though I'm not entirely sure why I feel uncomfortable doing otherwise.</Sidenote>
Like a lot of people, my main experience with private keys has come from using them for SSH. I'm familiar with the theory, of course - I know generally what asymmetric encryption does,<Sidenote>Although exactly _how_ it does so is still a complete mystery to me. I've looked up descriptions of RSA several times, and even tried to work my way through a toy example, but it's never helped. And I couldn't even _begin_ to explain elliptic curve cryptography beyond "black math magic".</Sidenote> and I know that it means a compromised server can't reveal your private key, which is nice although if you only ever use a given private key to SSH into your server and the server is already compromised, is that really so helpful?<Sidenote>Yes, yes, I know that it means you can use the same private key for _multiple_ things without having to worry, but in practice a lot of people seem to use separate private keys for separate things, and even though I'm not entirely sure why I feel uncomfortable doing otherwise.</Sidenote>
What I was less aware of, however, was the various ways in which private keys can be _stored_, which rather suddenly became a more-than-purely-academic concern to me this past week. I had an old private key lying around which had originally been generated by AWS, and used a rather old format,<Sidenote>The oldest, I believe, that's in widespread use still.</Sidenote> and I needed it to be comprehensible by newer software which loftily refused to have anything to do with such outdated ways of expressing itself.<Sidenote>Who would write such obdurately high-handed software, you ask? Well, uh. Me, as it turns out. In my defense, though, I doubt it would have taken _less_ time to switch to a different SSH-key library than to figure out the particular magic incantation needed to get `ssh-keygen` to do it.</Sidenote> No problem, thought I, I'll just use `ssh-keygen` to convert the old format to a newer format! Unfortunately this was frustratingly<Sidenote>And needlessly, it seems to me?</Sidenote> difficult to figure out, so I'm writing it up here for posterity and so that I never have to look it up again.<Sidenote>You know how it works. Once you've taken the time to really describe a process in detail, you have it locked in and never have to refer back to your notes.</Sidenote>
@ -41,7 +41,7 @@ So, I thought, I can use `ssh-keygen` to convert between these various and sundr
Well, yes. It _can_, but good luck figuring out _how_. For starters, like many older CLI tools, `ssh-keygen` has an awful lot of flags and options, and it's hard to distinguish between which are _modifiers_ - "do the same thing, but differently" - and _modes of operation_ - "do a different thing entirely". The modern way to handle this distinction is with subcommands which take entirely different sets of arguments, but `ssh-keygen` dates back to a time before that was common.
It also dates back to a time when manpages were the primary way of communicated detailed documentation for CLI tools,<Sidenote>These days it seems more common to provide a reasonably-detailed `--help` output and then just link to web-based docs for more details.</Sidenote> which you'd _think_ would make it possible to figure out how to convert from one private key format to another, but oh-ho-ho! Not so fast, my friend. Here, feast your eyes on this:
It also dates back to a time when manpages were the primary way of communicating detailed documentation for CLI tools,<Sidenote>These days it seems more common to provide a reasonably-detailed `--help` output and then just link to web-based docs for more details.</Sidenote> which you'd _think_ would make it possible to figure out how to convert from one private key format to another, but oh-ho-ho! Not so fast, my friend. Here, feast your eyes on this:
```
-i This option will read an unencrypted private (or public) key file in the format specified by the -m option and print an

View File

@ -0,0 +1,72 @@
---
title: Why the Internet is Terrible
date: 2024-11-16
---
<script>import Sidenote from '$lib/Sidenote.svelte';</script>
I just got done deleting ~30 bogus user accounts from my [personal Gitea instance](https://git.jfmonty2.com). They all had reasonable-ish-sounding names, one empty repository, and profiles that looked like [this](/bogus_user_profile.jpg). Note the exceedingly spammy link to a real site (still up as of writing) and the ad-copy bio.
Obviously this is just SEO spam. My Gitea instance got found by some automated system that noticed it had open registration,<Sidenote>The more fool I.</Sidenote> so it registered a bunch of bogus user accounts, added links to whatever sites it was trying to pump, added related text in the bio, and then sat back and waited for search engines to pick up on these new backlinks and improve the reputation of said sites, at least until the search engines catch on and downgrade the reputation of my Gitea instance.
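Just to illustrate the pattern (this is a toy heuristic I made up for this post, not anything Gitea actually provides): what made these accounts so recognizable was the combination of zero real activity with a bio that exists mainly to carry an outbound link.

```python
def looks_like_seo_spam(account: dict) -> bool:
    """Toy heuristic for the pattern described above: a bare account with
    no real activity whose bio is there to carry an outbound link."""
    no_activity = account.get("repos", 0) <= 1 and account.get("commits", 0) == 0
    bio = account.get("bio", "")
    has_link = "http://" in bio or "https://" in bio
    return no_activity and has_link
```

Real spam filtering is obviously harder than this, but the accounts I deleted would all have tripped even this crude check.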
This particular problem was easy enough to deal with: Just remove the offending users, and all their works, and all their empty promises. But it got me thinking about the general online dynamic that _everybody online is out to get you._
## The Internet is terrible, and everyone knows it
This isn't news, of course. People go around [saying things like](https://www.stilldrinking.org/programming-sucks):
>Here are the secret rules of the internet: five minutes after you open a web browser for the first time, a kid in Russia has your social security number. Did you sign up for something? A computer at the NSA now automatically tracks your physical location for the rest of your life. Sent an email? Your email address just went up on a billboard in Nigeria.
and everyone just smiles and nods, because that's what they've experienced. I've encountered people who are highly reluctant to pay for anything online via credit card--they would much rather use the phone and give their credit card number to a real person who is presumably capable of stealing it, should they so desire--because the general terribleness of the internet has become so ingrained into their psyche that this feels like the better option, and you know what? I can't even blame them.
Anyone who works on web applications for a living (or a hobby) is _especially_ aware of this, because odds are that they've been burned by it already or at least are familiar with any number of existing examples. The very existence of sites like [Have I Been Pwned](https://haveibeenpwned.com) is predicated on the inescapable terribleness that permeates every nook and cranny of the Internet.
Of course, people trying to take advantage of the careless and clueless isn't a new phenomenon. The term "snake oil salesman" dates back to the 18th century and refers to people who would go around selling _literal snake oil_<Sidenote>Probably not harvested from actual snakes, but they sure told people it was.</Sidenote> as a miracle cure, hair restorative, and whatever else. I'm fairly confident that as long as money has existed, there have been unscrupulous people making a living off of tricking it out of other people.
But something about the Internet makes it much more _present_, more in your face, than old-timey snake-oil salesmen. I've seen no hard numbers on this, and I don't know how you would even begin to estimate it, but I would guess that the incidence rate of this sort of thing is vastly higher online than it's ever been in meatspace.
So what is it about the Internet that makes deception so much more prevalent? Ultimately, I think it boils down to three things: availability, automation, and anonymity. The Three A's of Awfulness, if you will.
## You're in the bad part of town
Have you ever wondered why physical locks are so easy to pick? It takes some know-how, but from what I can tell, most commonly-sold locks [can be bypassed within a minute](https://www.youtube.com/@lockpickinglawyer/videos). I'm just going to say it right here, and I don't think this is a controversial take: For a web application that would be an unacceptably low level of security. If it took an attacker less than a minute to, say, gain administrative access to a web application, I'd consider it just this side of "completely unsecured".
But! Meatspace is not the internet. The constraints are different. Over the lifetime of a given lock, the number of people who will ever be in a position to attempt to pick it is usually quite low, compared to the number of people who exist in the world. Of course, the circumstances matter a lot too: A lock in a big city is within striking distance of many more potential lock-pickers than the lock on a farm out in corn country somewhere, which is part of why people in cities are frequently much more concerned about keeping their doors locked than people in rural areas. And within a single city, people who live in the bad parts of town tend to worry more than people who don't, etc.
But on the Internet, everyone is in the bad part of town _all the time!_ That's right, there's nothing separating your podunk website from every aspiring journeyman member of Evil Inc. except a few keystrokes and a click or two. It doesn't take Sir Scams-A-Lot any longer to send an email to you than to your less-fortunate neighbors in the housing projects, and so on.<Sidenote>This is also my beef with [this xkcd comic](https://xkcd.com/1958/). The real danger isn't that people will do things to the _physical_ environment to mess with self-driving cars (like repainting lines on the road), but that they'll do something remotely from the other side of the world, and no one will know until their car drives off a bridge or whatever. And sure, most people aren't murderers. But even if there are only a few people in the world who are sufficiently unhinged as to set up fatal traffic accidents between total strangers, _if your self-driving car is Internet-connected then those people might have the opportunity._</Sidenote>
In other words, the size of the "target pool" for someone who has a) an Internet connection and b) no conscience is _literally everyone else with an internet connection._ At last count, that number was in the billions and rising. This alone would make "online scurrilousness" a far more attractive career choice than "cat burglar", but don't worry, it gets even worse!
## Their strength is as the strength of ten
You might be tempted to think something like "Sure, being online gives the seamier sort of people immediate access to basically everyone in the world. But that shouldn't really change the overall incidence of these sorts of things, because after all, there are only so many hours in the day. A hard-working evildoer can still only affect a certain number of people per unit time, right? _right?_" But alas, even this limitation pales before the awesome might of modern communications infrastructure.
In meatspace, you can only be in one place at a time. If you're over on Maple Street burglarizing Mr. and Mrs. Holyoke's home, you can't also be selling fake stock certificates on Jefferson Ave, or running a crooked blackjack game in the abandoned warehouse off Stilton. But we aren't in meatspace any more, ~~Toto~~. We're _online_, where everything is done with computers. You know what computers really love doing? _Endlessly repeating the same boring repetitive task forever._ The Internet is a medium uniquely suited to automated consumption. In fact, approximately 30% of all internet traffic comes from automated systems, [according to Cloudflare](https://radar.cloudflare.com/traffic#bot-vs-human), and they should know.
So what does a clever-but-unscrupulous technologist do? That's right, he goes looking for vulnerabilities in widely-used platforms like Wordpress, finds one, then sets up an automated system to identify and exploit vulnerable Wordpress installs. Or he uses an open-source large language model like [Llama](https://www.llama.com/) to send phishing emails to every email address he can get his hands on, and maybe even correspond with susceptible people across multiple messages,<Sidenote>This is something I'm sure we'll see more and more of as time goes on. I'm sure it's already happening, and it's only going to get worse.</Sidenote> or just tricks people into clicking on a link to a fake Log In With Google page where he snarfs up their usernames and passwords, or _whatever_. There are a million and one ways an unethical person can take advantage of others _without ever having to personally interact with them._ This acts as a force-multiplier for evil people, and I think it's a major contributor to the overwhelming frequency with which you encounter this sort of thing online.<Sidenote>Astute readers may realize that while you can't automate meatspace in exactly the same way as you can automate computers, you can still do the next-best thing: _get other people to do it for you._ This is the fundamental insight of the Mafia don, and organized crime more generally. Thing is, though, all of these subsidiary evildoers have to be just as willing to break the law as the kingpin string-puller, so it doesn't quite act as a force-multiplier for evil in the same way.</Sidenote>
Interestingly, the automate-ability of anything that happens over the Internet seems to have leaked back into the phone system as well. I don't think anybody would disagree that scam phone calls are far more common than they used to be.<Sidenote>Unless "Dealer Services" has developed a truly pathological level of concern for the vehicle warranty I didn't even know I had.</Sidenote> I suspect, although I don't have any hard evidence to back it up, that this is largely due to the ease with which you can automate phone calls these days via internet-to-phone bridge services like [Twilio](https://twilio.com). The hit rate for this sort of thing has to be incredibly low--especially as people start to catch on and stop answering calls from numbers they don't know--so it only makes sense for the scammer if it costs them _virtually nothing_ to attempt.
Auto-dialing phone systems certainly predate the widespread use of the Internet,<Sidenote> The [Telephone Consumer Protection Act](https://en.wikipedia.org/wiki/Telephone_Consumer_Protection_Act_of_1991) attempted to regulate them as far back as 1991!</Sidenote> so one might ask why this didn't happen back then. I suspect that again, this comes down to ease of automation. In the 90s, you needed expensive dedicated equipment to set up a robocalling operation, but today you can just do it from your laptop.
## The scammer with no name
There's a third contrast with meatspace that makes life easier for people whose moral compass has been replaced by, say, an avocado: _Nobody knows who you are online._ In real life, being physically present at the scene of a crime exposes you to some degree of risk. There might be witnesses or security cameras, your coat might snag on a door and leave some fibers behind for the forensic team to examine, you might drop some sweat somewhere and leave DNA lying around, and of course there are always good ol' fingerprints.<Sidenote>Once again, the Mafia model demonstrates how you might insulate yourself from some of these risks, but again, it's not quite as complete because _somebody_ has to be there, and that somebody might talk. And yes, the Mafia [took steps](https://en.wikipedia.org/wiki/Omert%C3%A0) to remedy that problem as well, but that's why Witness Protection was invented.</Sidenote>
All of this is much less of an issue online. In fact, one of the loudest and most attention-seeking hacking groups literally just called themselves [Anonymous](https://en.wikipedia.org/wiki/Anonymous_(hacker_group)). Of course, [then a bunch of them got arrested](https://www.bbc.com/news/world-latin-america-17195893), so maybe they weren't _quite_ as anonymous as they seemed to think they were. Still, I think it's safe to say that it's a lot easier to stay anonymous when you're committing crimes online vs. in person. Or from another angle, it takes (on average) significantly more law-enforcement effort to de-anonymize a criminal online than in person.<Sidenote>I can't seem to find it any more, but I'm pretty sure I remember reading an article a while back that talked about how the NSA/FBI/etc. managed to identify people like [Silk Road](https://en.wikipedia.org/wiki/Silk_Road_(marketplace)) higher-ups. From what I recall, it was pretty resource-intensive and not really realistic except for high-priority targets.</Sidenote>
I'm pointing out the downsides here, of course, but it's worth noting that online anonymity is a coin with two faces. It's fundamental to the question of privacy, especially from governments who would love nothing better than to know every sordid detail of their citizens' lives forever.<Sidenote>Don't believe me? Just look at how hard any number of major governments have been trying to effectively outlaw things like end-to-end encrypted chat apps. Here's the [UK](https://www.wired.com/story/britain-admits-defeat-online-safety-bill-encryption/), [US](https://www.eff.org/deeplinks/2020/06/senates-new-anti-encryption-bill-even-worse-earn-it-and-thats-saying-something), [Australia](https://www.schneier.com/blog/archives/2024/09/australia-threatens-to-force-companies-to-break-encryption.html), etc. They don't give a crap about "safety" or "exploitative content". This is about surveillance. </Sidenote> In general, anything that improves privacy (such as end-to-end encryption, VPNs, proxies, etc.) also makes anonymity easier for people whose motives are less laudable than "I don't think the government should know everything about me."
## The economics of evil
In the end, you can think of this all as a question of economics.<Sidenote>Seems like you can think of anything as a question of economics, if you try hard enough. [Even theology](https://en.wikipedia.org/wiki/Economy_of_Salvation).</Sidenote> The Internet is rife with scams, thievery, and general [scum and villainy](https://www.youtube.com/watch?v=Xcb4_QwP6fE) because it brings down the cost of doing such things to the point that it becomes worth it. There's no need to spend time or money moving from place to place, because you can do it all from the comfort of your own home. Instead of spending time on each individual operation you can put in the effort to automate it up-front and then sit back and reap the benefits (or keep finding more things to automate). The risk of doing all of this (which is a form of cost) is significantly lower than it would be to do something equivalent in real life. And all of this you get for the low, low price of your immortal soul! What's not to like?
## Will it ever change?
The Internet has often reminded me, alternately, of a) the Industrial Revolution and b) the Wild West. It reminds me of the Industrial Revolution because there are great examples of unscrupulous people taking advantage of a new set of economic realities to make tons of money at the expense of poor everyday folk who are just trying to live their lives. And not just straight-up criminals like we've been discussing, but also exploitative businesses and corporations (adtech, anybody?) that hearken back to the days of e.g. factory owners profiting from the slow destruction of their workers' lives. But the Internet also calls to mind the Wild West of the mid-to-late 1800s. Like the Wild West, it's a huge new swathe of unexplored territory rich with opportunity, if a little uncivilized.
But eventually, both the Industrial Revolution and the Wild West settled down and got a little more civilized. Eventually people developed things like labor unions and OSHA regulations,<Sidenote>Which I never thought I'd be holding up as a _good_ thing, because in my personal experience they've mostly been a source of frustration. But something tells me that if I were a worker in a 19th-century textile factory, I would have been very glad for some basic safety requirements.</Sidenote> and the world of heavy industry got a little more equitable. And eventually, the Wild West became civilized enough that you couldn't just walk into a saloon and shoot someone just because you felt like it.<Sidenote>Please note, I have no idea if this was ever _really_ possible, I'm basing it mostly on spaghetti Westerns and the like.</Sidenote>
Will the same thing happen to the Internet? I don't know. It might! Already you can start to see a sort of social "immune system" developing with regard to things like phishing emails and calls. For instance, I know plenty of people who have a policy of never answering their phone at all if the call is from a number they don't recognize.<Sidenote>Consumer Reports [claims](https://www.consumerreports.org/robocalls/mad-about-robocalls/) that this is actually 70% of US adults, which is a staggering number. Heaven help us if the scammers figure out how to reliably spoof numbers from people you know.</Sidenote> Unfortunately it's harder to make this work for something like poorly-secured web services, because it isn't easy to tell before you sign up for a service whether it's likely to get breached and leak your personal info in six months.
Ultimately the only workable solutions will have to a) increase the cost of carrying out these attacks, or b) reduce (on average) the reward. In the end it probably won't be _solved_ completely, much like crime isn't _solved_ today. But I'm hopeful that, much like today's Texans don't have to worry much about their stagecoach being waylaid by bandits, we'll see less and less of it as time goes on.

Binary file not shown.

After

Width:  |  Height:  |  Size: 119 KiB