diff --git a/src/routes/_posts/advent-of-languages-2024-03.svx b/src/routes/_posts/advent-of-languages-2024-03.svx new file mode 100644 index 0000000..7eeaf79 --- /dev/null +++ b/src/routes/_posts/advent-of-languages-2024-03.svx @@ -0,0 +1,325 @@ +--- +title: 'Advent of Languages 2024, Day 3: Forth' +date: 2024-12-07 +--- + + + +My original plan was to stick with the "systems language" them for Day 3 and go with Zig, but the more I thought about it the more I started to think I should save a nice, clean, modern language like Zig, that hasn't had time to get all warty yet and at has common data structures in its standard library, for one of the later days. Also I looked at Day 3 and it (the first part at least) looks very simple, which makes me even less inclined to use a big honkin' heavy-duty language like Zig. Instead, today I'm going to take a look at Forth!I know, I know, I would have been able to make all kinds of terrible jokes had I just waited for the _forth_ day of the AoC, but hey, we can't all get what we want. + +## Go Forth And C No More + +Forth is an old language, older even than C (by a few years at least), so you know right away it's going to be lacking a lot of modern conveniences like local variables or even, you know, structs. With named fields and all? Yep, not here.I latere discovered that this is implementation-specific--some Forths _do_ have structs, but others don't. Forth is a [stack-oriented](https://en.wikipedia.org/wiki/Stack-oriented_programming) language, which I _think_ is a different kind of stack from the "stack/heap" you deal with in systems languages, although I might be wrong about that. It's also _aggresively_ simple, both syntactically and conceptually. Syntactially, you could fit the grammar on the back of your hand. It's so slim that even _comparison operators_ like `<` and `=` (that's a comparison operator in Forth, like in SQL) are implemented as part of the _standard library_. Conceptually it's (if anything) even simpler.Note that "simple" is not the same thing as "easy". C's memory model is simple: allocate memory, free memory. Don't free more than once, and don't free if it's still in use. Done! That doesn't stop it from being the root cause of some of the worst [security bugs](https://heartbleed.com/) of all time, or [massive worldwide outages](https://en.wikipedia.org/wiki/2024_CrowdStrike-related_IT_outages) affecting everything from banks to airlines. There's a stack, and you put things onto it, and you take them off. Done.Ok, ok, it's not _quite_ that simple. There are branching constructs, for instance (loops and conditionals), so not _everything_ can be modeled as pure stack operations. But it sure is a _lot_ simpler than most languages. + +Forth is--well, I think it's a bit of a stretch to describe it as "common" in any circumstance, but perhaps we can say "best-represented" in embedded systems contexts, where resources are often very heavily constrained and you're typically operating very close to the hardware. In fact it thrives in an environment where there's just a single contiguous block of not-very-much memory, because of the whole stack paradigm. + +So of course, I'm going to use it for Advent of Code, where I have 32GB of RAM and a full-fat multi-process OS with extremely sophisticated thread scheduling, memory management, and so on. I wonder about myself sometimes. + +## Where Get? + +Like a lot of older languages, there isn't a single official implementation of Forth which is your one-stop shop for compilers, linters, formatters, editor plugins, language servers, and all the various and sundry paraphernelia that have become de rigeur for a new language these days.I don't know for sure what drives this difference, but my guess is that the Internet has made it much easier to coordinate something like programming-language development across a widely-separated group of people and organizations, so people naturally end up pooling their resources these days and all contributing to the One True Implementation. It has a [standard](https://forth-standard.org/), and there are various implementations of that standard (scroll down to the "Systems" section on that page). [Gforth](https://gforth.org/) looks like the easiest to get up and running in, so let's give that a try. + +The front page of the Gforth website has instructions for adding the Debian repository, but unfortunately that repository doesn't seem to be online anymore, so let's try building it from source, for which there are also instructions on the front page. The Git repository at least is hosted at a subdomain of `gnu.org`, so it might still be online? + +_a brief interlude_ + +Okay, I'm back, and what do you know? It seems to have worked. There was a warning about `Swig without -forth feature, library interfaces will not be generated.`, but hopefully that's not important. + +Okay, so we're good to go, right? At least, I can run `gforth` now and get a REPL, so I think we're good. + +Wait a second, I forgot to run `make install`. But it's on my PATH already! What? + +``` +$ ls -l /usr/bin/gforth + +>>> lrwxrwxrwx 1 root root 12 Sep 10 2021 /usr/bin/gforth -> gforth-0.7.3 +``` + +...It was _already installed_?! + +Well, paint me purple and call me a grape if that isn't the most embarrassing thing I've done all day. + +## But The Real Treasure Was The Friends We Made Along The Way + +I hope you enjoyed this brief exercise in futility. Let's just move on and pretend it never happened, hmmm? + +Let's start ForthOk I'll stop now, I promise. the same place we started C and C++, with hello-world. Unfortunately `highlight.js` has no syntax for Forth, so you'll just have to imagine the colors this time. + +```fs +." hello world " +``` + +``` +$ gforth 03.fs + +>>> hello world Gforth 0.7.3, Copyright (C) 1995-2008 Free Software Foundation, Inc. +Gforth comes with ABSOLUTELY NO WARRANTY; for details type `license' +Type `bye' to exit +``` + +Uh. Okay, it looks like Forth doesn't exit unless you explicitly end your program with the `bye` command, it just does whatever you specified and then dumps you into a REPL so you can keep going, if you want. Hey, I wonder if giving it a non-interactive stdin would make a difference? + +``` +$ gforth >> hello world Gforth 0.7.3, Copyright (C) 1995-2008 Free Software Foundation, Inc. +Gforth comes with ABSOLUTELY NO WARRANTY; for details type `license' +Type `bye' to exit +``` + +Nope. I mean, it doesn't sit there waiting for input, of course. But it does print the start message. No, the only way to keep it from doing that is to put `bye` at the end of the script. Fascinating. + +You may be struck by the spaces _inside_ the quote marks here: this is necessary. Without them, for example, `."hello` would be interpreted as a single token, and since there is no such "word" (the Forth term for functions, basically) defined it would error.Actually, I discovered later that only the space after the _first_ quote mark is necessary. The second one is superfluous, and in fact gets interpreted as part of the string. + +## Day 3 + +Okay, so today we have a very minimal parsing challenge. We're given a long string of characters (just a single line this time) that contains a bunch of garbage and also some valid instructions, we have to parse out the instructions and execute them. There's only one instruction, `mul(x,y)` where x and y are numbers. Everything else is to be discarded, for Part 1 at least. I have a feeling that for Part 2 we may find out some of that garbage wasn't quite so garbacious after all. + +This would be absolutely trivial with even the most basic of regular-expression libraries, but we're in Forth land here, so--wait. Maybe I should check some things before I make confident assertions. + +...Yep, Gforth [does in fact](https://gforth.org/manual/Regular-Expressions.html) include regular expressions. Using them looks [pretty wild](https://git.savannah.gnu.org/cgit/gforth.git/tree/test/regexp-test.fs), though. I think I need to go read some more tutorials. + +... + +I'm back! It's been two days. Forth is confusing! + +### Data Loadin' + +I've figured out how to at least open the file and read the data into memory, though, which is Great Success in my book. Here, let me show you: + +```fs +create data 20000 allot +variable fileid +s" data/03.txt" r/o open-file +drop +fileid ! + +variable data-len +data 20000 fileid @ read-file +drop +data-len ! +``` + +After executing this with `gforth`, I am dumped into an interactive Forth REPL, as before, but now I have a 20K buffer at the address specified by `data` containing the contents of my puzzle input, which is great! + +Forth is _weird_ to anybody coming from the more standard type of language that has, you know, syntactic function calls and operators and all of that fancy stuff. None of that here! Remember how earlier I said that comparison operators were part of the standard library? Well yeah, it turns out that Forth doesn't even _have_ operators the same way other languages I've used do.I'm aware that a lot of languages use operators as syntax sugar for functions defined somewhere else, but syntax sugar is just that: _syntax_. Forth doesn't make even a _syntactic_ distinction between operators and regular function calls. It's the equivalent of calling `add(a, b)` every time you want to add some numbers. There are symbols like `>` and `=`, but really those are just conveniently-named "words" (again, the Forth name for a function) which operate on values from the stack just like other words with longer names. Crazy, huh? + +Anyway, I want to go through this line-by-line because it's so far afield. Like I said, it took me 2 days to get this far. Noted that because of the whole stack-oriented thing, Forth is syntactically backwards for most operations - you put the arguments _first_, thus putting them on the stack, then call the operation, which takes them off the stack and leaves its result (if any) on the stack. So to add 6 and 7 you would do `6 7 +`, which results in a stack containing 13. Wonky, I know. + +* `create data 20000 allot`--Ok, I'm not actually 100% sure what this part is doing, I stole it from the gforth [tutorial](https://gforth.org/manual/Files-Tutorial.html#Open-file-for-input) on files, which is... minimal, shall we say? I think what's happening is that we're allocating memory, but we're "stack-allocating" (In the C sense, not the Forth sense). `create data` is actually the rare Forth construct that _isn't_ backwards, I think because it's a _compile-time_ operation rather than a runtime operation: we are `create`-ing a word, `data`, whose operation is simply to retun its own address. Then we are `allot`-ing 20K bytes, starting _immediately after that address_, I think? Anyway, the net result is that we end up with 20K of writable memory at the address returned by `data`, which is good enough for me. +* `variable fileid`--Another non-backwards construct here, `variable`. Again, I think this is because it creates a new _word_, which makes it a compile-time operation. I'm pretty sure `variable` is actually just a layer on top of `create ... allot`, because it does basically the same thing, it just only reserves enough space for a single integer. (Forth's native integer type is a signed 32-bit integer, by the way.) +* `s" data/03.txt" r/o open-file`--Now we're cooking with gas! `s" ..."` creates a string somewhere in memory (I think it actually allocates, under the hood) and puts the address and length of that string on the stack. `r/o` is just a convenience word that puts `0` on the stack; it represents the mode in which we're opening the file. `open-file` opens the file, using three values from the stack: 1) the address of the string containing the name of the file, 2) the length of that string, and 3) the mode. It returns two numbers to the stack, 1) the "file id" and 2) a status code. +* `drop`--I'm going to ignore the status code, so I `drop` it from the stack. +* `fileid !`--At this point we have the file ID on the stack, but we want to save it for future reference, so we store it at the address returned by `fileid`. The `!` word is just a dumb "store" command, it takes a value and an address on the stack and stores the value at that address. The reciprocal operation is `@`, which we'll see in a moment. +* `variable data-len`--Create a variable named `data-len` (again, just a word named `data-len` that returns an address where you can write data) +* `data 20000 fileid @ read-file`--This is the goods! We are reading (up to) 20K bytes from the file whose ID is stored in the variable `fileid` into the region of memory starting at address `data`. Note the `fileid @` bit, I got hung up here for the _longest_ time because I kept trying to just use `fileid` by itself. But that meant that I was giving `read-file` the _address of the variable itself_, not the value contained _at that varible_. In C terms, `@` is like dereferencing a pointer. (In fact, under the hood that's probably exactly what's going on, I think gforth is implemented largely in C.) +* `drop`--That last operation returned the number of bytes read, which we want, and a status code, which we don't. So we drop the status code. +* `data-len !`--Store the number of bytes read in the variable `data-len`. + +At this point we have the contents of our file in memory, hooray! Boy, Forth is low-level. In some ways it's even lower-level than C, which is mind-boggling to me. + +### Wait, regular expressions solved a problem? + +Ok, so now we have our input, we need to process it. + +My first attempt at this was an extremely laborious manual scanning of the string, attempting to parse out the digits from valid `mul(X,Y)` sequences, multiply them, keep a running total, etc. I got lost in all the complexity pretty quickly, so eventually I decided to just knuckle down and figure out the gforth regular expression library. And you know what? It wasn't too bad after all! Here's the regex I ended up coming up with: + +```fs +require regexp.fs + +: mul-instr ( addr u -- flag ) + (( =" mul(" \\( {++ \\d ++} \\) ` , \\( {++ \\d ++} \\) ` ) )) ; +``` + +`: name ... ;` is the syntax for defining a new word, by the way. In this case, the body of the word _is_ the regex. I'm pretty sure this is required, because it's doing its magic at compile time, again--at least, if I try using a "bare" regex in the REPL, outside of a word, I get nothing. This seems to be the deal for most Forth constructs that have both a start and an end delimiter, like if/then, do/loop, and so on. The exceptions are of course the define-a-word syntax itself, and also string literals for some reason. Not sure why that is, but I'm guessing they're just a special case because inline string literals are just _so_ useful. + +Anyway, to elaborate a bit on the regex: + +* `((` and `))` mark the start and end of the regex. +* `=" mul("` means "match the literal string `mul(`. +* `\(` and `\)` delimit a capturing group. +* `{++` and `++}` delimit a sequence repeated one-or-more times (like `+` in most regex syntaxes). +* A backtick followed by a single character means to match a single occurence of that character. + +So this looks for the literal string `mul(`, followed by 1 or more digits, followed by the character `,` followed by 1 or more digits, followed by the character `)`. Not too bad, once you get your head around it. + +Oh, and referencing captured groups is just `\1`, `\2`, `\3` etc. It seems like you can do this at any time after the regex has matched? I guess there's more global state going on inside the regex library somewhere. + +Anyway, here's my full solution to Day 3 Part 1! + +```fs +create data 20000 allot \\ create a 20K buffer +variable fileid \\ create a variable `fileid` +s" data/03.txt" r/o open-file \\ open data file +drop \\ drop the top value from the stack, it's just a status code and we aren't going to bother handling errors +fileid ! \\ save the file id to the variable + +variable data-len +data 20000 fileid @ read-file \\ read up to 20k bytes from the file +drop \\ drop the status code, again +data-len ! \\ store the number of bytes read in `data-len` + +: data-str ( -- addr u ) + \\ convenience function for putting the address and length of data on the stack + data data-len @ ; + +: chop-prefix ( addr u u2 -- addr2 u2 ) + \\ chop the first `u2` bytes off the beginning of the string at `addr u` + tuck \\ duplicate `u2` and store it "under" the length of the string + - \\ subtract `u2` from the length of the string + -rot \\ stick the new string length underneath the start pointer + + \\ increment the start pointer by `u2` + swap \\ put them back in the right order +; + +require regexp.fs + +: mul-instr ( addr u -- flag ) + \\\\ match a string of the form `mul(x,y)` where x and y are integers and capture those integers + (( =" mul(" \\( {++ \\d ++} \\) ` , \\( {++ \\d ++} \\) ` ) )) ; + + +: get-product ( addr u -- u2 ) + mul-instr \\ match the string from `addr u` against the above regex + if \\ if the regex matches, then: + \\1 s>number drop \\ convert the first capture from string to number, drop the status code (we already know it will succeed) + \\2 s>number drop \\ convert the second capture from string to number, drop the status code + * \\ multiply, and leave the answer on the stack + else + 0 \\ otherwise, leave 0 on the stack + then +; + +variable result \\ initialize `result` with 0 +0 result ! + +: sum-mul-instrs ( addr u -- u2 ) + begin \\ start looping + s" mul(" search \\ search for the string "mul(" + if \\ if successful, top 2 values on stack will be start address of "mul(" and remainder of original string + 2dup \\ duplicate address and remaining length of string + get-product \\ pass those to get-product above + result @ + \\ load `result` and add to product + result ! \\ store this new value back in `result` + 4 chop-prefix \\ bump the start of the string by 4 characters + else \\ if not successful, we have finished scanning through the string + 2drop \\ dump the string address and length + result @ exit \\ put the result on top of the stack and return to caller + then + again +; + +data-str sum-mul-instrs . +bye +``` + +Not exactly terse, but honestly it could have been a lot worse. And my extremely heavy use of comments makes it look bigger than it really is. + +The `( addr u -- u2 )` bits are comments, by the way. By convention when you define a word that either expects things on the stack or leaves things on the stack, you put in a comment with the stack of the stack in `( before -- after )` format. + +## Part 2 + +Onward and upward! In Part 2 we discover that yes, in fact not quite all of the "garbage" instructions were truly garbage. Specifically, there are two instructions, `do()` and `don't()` which enable and disable the `mul(x,y)` instruction. So once you hit a `don't()`, you ignore all `mul(x,y)` instructions, no matter how well-formed, until you hit a `do()` again. + +Easy enough, but I'm going to have to change some things. Right now I'm using the `search` word to find the start index of every possible `mul(` candidate, then using the regex library to both parse and validate at the same time. Obviously I can't do that any more, since now I have to search for any of three possible constructs rather than just one. + +I spent quite a while trying to figure out how to get the regex library to spit out the address of the match it finds, but to no avail. There are some interesting hints about a "loop through all matches and execute arbitrary code on each match" functionality that I could _probably_ have shoehorn into what I needed here, but in the end I decided to just scan through the string the old-fashioned way and test for plain string equality at each position. In the end it came out looking like this: + +```fs +... + +variable enabled +-1 enabled ! \\ idiomatically -1 is "true" (really anything other than 0 is true) + +: handle-mul ( addr u -- ) + get-product \\ pass those to get-product above + result @ + \\ load `result` and add to product + result ! \\ store this new value back in `result` +; + +: sum-mul-instrs ( addr u -- u2 ) + \\ we want to loop from addr to (addr + u - 8), because 8 is the min length of a valid mul(x,y) instruction + \\ we also want to have addr + u on the top of the stack when we enter the loop, + \\ so that we can use that to compute the remaining length of the string from our current address + + over + \\ copy addr to top of stack and add to length + dup 8 - \\ duplicate, then subtract 8 from the top value + rot \\ move original addr to top of stack + ( stack at this point: [ addr + u, addr + u - 8, addr ] ) + ( i.e. [ end-of-string, loop-limit, loop-start ] ) + + do \\ start looping + I 4 s" do()" str= \\ compare the length-4 substring starting at I to the string "do()" + if \\ if valid do() instruction, + -1 enabled ! \\ set enabled=true + then + + I 7 s" don't()" str= \\ compare length-7 substring to "don't()" + if \\ if valid don't() instruction, + 0 enabled ! \\ set enabled=false + then + + I 4 s" mul(" str= \\ compare length-4 substring to "mul(" + enabled @ and \\ combine with current value of `enabled` + if \\ if a candidate for `mul(x,y)` instruction, and enabled=true, then + dup I - \\ subtract current string pointer from end-of-string pointer to get length of remaining string + I swap handle-mul \\ put current pointer onto stack again, swap so stack is ( addr len), and handle + then + loop + + drop \\ get rid of end-of-string pointer + result @ \\ return value of result +; + +s" data/03.txt" load-data +sum-mul-instrs . +bye +``` + +Oh yeah, I also decided to extract `load-data` into a word and pass in the filename, to make it easier to switch between test and live data. The whole thing is [here](https://git.jfmonty2.com/jfmonty2/advent/src/branch/master/2024/03.fs) if you're interested. + +I'm actually surprised by how satisfied I am with this in the end. It's not exactly what I would call pretty, but it's reasonably comprehensible, and I feel like I'm starting to get the hang of this stack-manipulation business. It could definitely be more efficient - I'm still looping over every index in the string, even when I know I could skip some because, say, they already validated as being the start of a known instruction. Also, I should really avoid testing for subsequent instructions every time one of the prior ones validates. I just couldn't bring myself to do nested if statements because they [look like this](https://www.forth.com/starting-forth/4-conditional-if-then-statements/#h-nested-if-then-statements), which is horrible. + +## Nobody asked for my opinion, but here it is anyway + +So what do I think of Forth? It's certainly interesting! It's a very different way of approaching programming than I've encountered before. But I don't know that I'd want to use it for a serious project, because it's pretty lacking in the code-organization department. + +A lost of older languages have shortcomings with regard to things like namespaces, good facilities for spreading code across multiple files (and complex file hierarchies), and tools for building on _other_ peoples' code. Forth has those problems too, but they aren't really fundamental. It's easy to imagine a version of Forth with some namespacing, a package manager, etc.In fact, this is basically what [Factor](https://factorcode.org/) looks to be. No, what worries me about Forth from a code-organization standpoint is _the stack itself_. + +More specifically, it's the fact that there's only _one_ stack, and it's shared between every code-unit in the entire program. This might be ok if it were used exclusively for passing data around between different code units, but it isn't. From my limited experience, I get the impression that the stack is expected to be used for _most_ values. Sure, there are variables, but the amount of ceremony involved in using them makes it feel like Forth doesn't really want you to use them heavily. Plus, they're all global anyway, so they're hardly a help when it comes to code organization. + +The problem with the stack is, as I said, that it's shared, and that everybody has it. That means that if you're storing something on the stack, and you invoke a word, _it might just mess with that something on the stack_. Sure, maybe it isn't _supposed_ to, but you know, bugs happen. The history of computer science is arguably a long and mostly-fruitless quest in search of _programming paradigms that result in fewer bugs_. "Just trust me bro" is not a useful approach to code encapsulation in complex projects. + +Sure, you may say, but a function in any language can return incorrect values, so what's the big deal? Yes, that's true, but in most languages a function can't return _too many_ values, or too few. If it does, that's either a compile-time error or an _immediate_ runtime error, meaning the error occurs _at the point the misbehavior occurred._ This is critical, because errors get harder and harder to debug as their "skid distance" increases--i.e. the distance between the root cause of an error and the point at which that error actually manifests. + +Even more worrisome, as I alluded to previously, is the fact that the stack makes it possible for words to mess with things that are (from the programmer's point of view) _totally unrelated to them_. You could end up with a situation like this: + +```fs +put-some-memory-address-on-the-stack +17 do-something-unrelated +\\ an error in do-something-unrelated causes it to delete the memory address from the stck +do-more-things +... +attempt-to-access-missing-address \\ ERROR! What happened? Why isn't that address where I expected it to be? +``` + +This is the sort of thing that causes people to get woken up at 3AM to deal with production outages. The entire _class_ of purely-functional languages exists on the thesis data being twiddled with by bits of code that aren't supposed to is _so bad_ that it's better to disallow _all_ data-twiddling _forever_, full stop. I shudder to think what would happen if you dropped a Haskellite into a Forth world. He'd probably just keel over and die on the spot. + +## So what is it actually good for? + +The short answer is, embedded systems. That has traditionally been the wheelhouse of Forth, and as far as I can tell it continues to be so insofar as Forth is still used at all for real live projects. + +This makes a lot of sense, when you think about it. Embedded code is often: + +* Very tightly focused, concerned with just doing a few specific things under all circumstances (as contrasted with, say, your typical webapp which might have to add file uploads or start interfacing with some remote API any day of the week) +* Resource-constrained, particularly where memory is concerned (CPU time is usually less of a concern) +* Developed by a small team, often just a single person +* Fire-and-forget, there's no "ongoing maintenance" when you would have to perform a surgery or recall a satellite from orbit to do a firmware upgrade + +All of this works to minimize the downsides of Forth's lack of organizational capabilities. Organizing is easier the less there is to organize, of course, and when there's only one or a few people doing the organizing. And when it's all done in one shot--I can't count the number of times I've come back to code _I wrote_ after a year or two or three and spent the day going "What was I _thinking_?" while trying to unravel the tangled web I wove. That sort of thing is much less likely when your code gets deployed as part of Industrial Sewage Pump Controller #BS94A, because who wants to go swimming in sewage if they don't have to? + +The other reason I suspect that Forth has been so successful in embedded contexts is that it's _a lot_ easier to implement than most languages. This is true of stack-oriented languages in general, I think--there's a reason that a lot of VMs are stack-based as well--and in an embedded context, if you want a runtime of any kind you often have to build it yourself. I don't know this for sure, but I wouldn't be surprised if Forth got used for a lot of embedded stuff back in the day because it was the most feasible language to implement in raw assembly for the target architecture.Of course, these days C is the lingua franca of everything, so I suspect this effect is weaker than it once was. Maybe there are some wild esoteric architectures out there that don't even have a C compiler, but I highly doubt there are many. + +Of course, I've constructed this tottering tower of linguistic philosophy on the basis of a few days' playing with Forth when I've had time, so take what I say with a few spoonfuls of salt. There are people out there who will [claim](http://collapseos.org/forth.html) that they used to prefer C to Forth, but after enough time with Forth their eyes were opened and their mind ascended to the heights, and they spoke with the tongues of the cherubim and painted with the colors of the wind-- + +Sorry, that got a little out of hand. My point is, some people like Forth even for relatively complex tasks like an OS kernel.Although that particular guy is also prophesying the [end of civilization as we know it](http://www.collapseos.org/), so, I dunno. Maybe he just sees things in a different light. Ultimately, though, I think the proof is in the pudding. Forth and C hit the scene at around the same time, but one of them went on to underpin the critical infrastructure of basically all computing, while the other continues to languish in relative obscurity, and we all know which is which.