The biggest mistake I made was high uptime. arjie.com was up for 10 years plus o...

gerdesj · 2026-05-21T23:31:04 1779406264

"The biggest mistake I made was high uptime"

Quite. I'm old enough to remember machine uptime being a badge of honour.

However, being older and not really wiser, I look for service uptime these days. Yes we did have similar back in the day, that's why MX and the like DNS records exist.

Old school clusters were pretty esoteric but the lessons were learned (split brain n that) and that's why we still argue the toss with kiddies about why a Proxmox cluster with two nodes is fucked and why we recommend an additional "witness".
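For anyone who wants the arithmetic behind the two-node complaint, here is a minimal sketch of majority voting. The function names are made up for illustration and this is not how Proxmox or corosync implements it; it only assumes one vote per node and a strict-majority rule.

```python
def has_quorum(votes_present: int, total_votes: int) -> bool:
    """A partition may keep running only if it holds a strict majority of votes."""
    return votes_present > total_votes // 2


def tolerated_failures(total_votes: int) -> int:
    """How many voters can be lost while the survivors still hold a majority."""
    return (total_votes - 1) // 2


# Two nodes, the link between them drops: each side sees 1 of 2 votes.
print(has_quorum(1, 2))  # False: neither side may safely continue
# Two nodes plus a witness: the side that still reaches the witness has 2 of 3.
print(has_quorum(2, 3))  # True
print(has_quorum(1, 3))  # False: the isolated node stands down
print([tolerated_failures(n) for n in (1, 2, 3, 4, 5)])  # [0, 0, 1, 1, 2]
```

Under these assumptions a second node adds no failure tolerance over a single node, which is why the third vote matters.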

I don't care that VMware glossed over the whole two node HA cluster thing years ago with a massive bodge. They were wrong then and they are probably still wrong because that nonsense is probably still baked in.

Sorry, slight digression.

High uptime implies no patching. We all love patching.

andai · 2026-05-21T23:47:03 1779407223

https://en.wikipedia.org/wiki/Split-brain_(computing)

The more you know!

>a Proxmox cluster with two nodes is fucked and why we recommend an additional "witness".

Reminds me of the three Magi from Evangelion: https://magi.kinta.ma/

red-iron-pine · 2026-05-22T13:00:48 1779454848

"a man with two watches can never be sure as to the time"

need a third one to confirm which of the 2 is accurate

pjmlp · 2026-05-22T05:47:30 1779428850

There is something like live patching.

One reason mainframes and micros are still around us, is that you can change almost everything between hardware and software without downtime.

It is also available in the surviving commercial UNIXes, and as a paid-for feature in some Linux distros, although not to the extent that those grandparent systems are capable of.

da_chicken · 2026-05-22T08:10:01 1779437401

The problem with live patching is twofold.

First, you might not reload everything in memory, so it will be patched on disk but not in process.

Second, you have not tested that the system can boot to a functional system. Say you have done live patching for 5 years and never rebooted, and then you have a power loss or hardware failure/upgrade that takes the system down. When you try to bring it back up, it doesn't work. Which configuration change in the past 5 years caused that? Which backup do you use?

And, yeah, everything is hot swappable on VAX. Those machines also cost 6+ figures, and often require a service contract that includes a permanent on site tech.

kjs3 · 2026-05-22T15:00:57 1779462057

> And, yeah, everything is hot swappable on VAX.

Only the last generation or 2 of the highest end VAXen had any significant hot swap (VAX 9000/400 and later, which sold very poorly). The vast majority of VAX machines didn't. Even hot-swapping DSSI disks was at best iffy.

When someone who's been there talks about VAX 'high availability', they're usually talking about VAX/VMS clustering. Very cool and generally effective approach to the problem. That was one big issue with the end-game VAXen: clustering a couple of 6-figure mid-range machines was often considered a better solution than going all-in on one 7- to 8-figure VAX 'mainframe'.

> often require a service contract that includes a permanent on site tech.

I don't recall that being common with DEC service contracts. Most of the sites I know of that had dedicated DEC techs were either very large installs or had...other...drivers (e.g. tech had to have a TS clearance to work on the machines).

Squeeeez · 2026-05-22T19:36:15 1779478575

How would you implement no-downtime hot swap with only one item?

kjs3 · 2026-05-22T21:51:57 1779486717

By implementing hot-swap into the one item? Am I missing something in this question?

da_chicken · 2026-05-23T09:15:27 1779527727

Executing hardware hot-swap typically means telling the system that a component is going down. Then the system moves those resources to the other component to gracefully allow you to remove it without a restart.

Like it's not a case where you just yank out a CPU as you like as though it were a spindle in a RAID-6 array. Especially if there's only one CPU. The state machine can't maintain state if the only component that tracks and maintains state goes missing.
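A toy sketch of that drain-then-remove sequence, assuming a made-up `assignments` mapping from component name to the workloads it carries. None of these names come from a real hot-plug API; the point is only that the drain step has nowhere to go when there is a single component.

```python
class HotSwapError(Exception):
    pass


def drain_for_removal(
    assignments: dict[str, list[str]], component: str
) -> dict[str, list[str]]:
    """Move all work off `component` so it can be pulled without a restart.

    Returns a new mapping that no longer contains `component`.
    """
    if component not in assignments:
        raise HotSwapError(f"unknown component: {component}")
    survivors = [name for name in assignments if name != component]
    if not survivors:
        # Nothing left to hold the state: this is the single-CPU case.
        raise HotSwapError("cannot drain the only component")
    result = {name: list(assignments[name]) for name in survivors}
    # Spread the evacuated workloads round-robin over the survivors.
    for i, workload in enumerate(assignments[component]):
        result[survivors[i % len(survivors)]].append(workload)
    return result


print(drain_for_removal({"cpu0": ["a", "b"], "cpu1": ["c"]}, "cpu0"))
# {'cpu1': ['c', 'a', 'b']}
```

Calling it with a single-entry mapping raises `HotSwapError`, which is the "only one item" situation from the question above.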

coldtea · 2026-05-22T09:28:43 1779442123

>First, you might not reload everything in memory, so it will be patched on disk but not in process.

You design for this with generational tagged objects or something similar.

mx7zysuj4xew · 2026-05-22T12:54:25 1779454465

Which is moot, because if the system is important enough you'll have an automatic failover to another system running on standby

All this "we must reboot to test" is bullshit excuses by unqualified workers

z3t4 · 2026-05-22T13:50:50 1779457850

Had an accidental reboot, and it could not boot. Had redundancy, but the other server had failed silently days prior. Solved it with three-way redundancy and extra monitoring. Systems fail in many ways at the same time. If you do not test it, there is a chance it won't work. Controlled failure is preferred over unknowns, like rebooting once in a while just to make sure it works.

close04 · 2026-05-22T17:23:16 1779470596

Ah, spoken with the confidence of a freshly minted qualified worker :). Anything you don’t test is a wish, not a production system. You either know that your systems work end to end because you tested periodically, or you pray they will.

How do you know the automatic failover works? How do you know the standby system works?

I’ve seen many “qualified workers” get sent packing because they never fully tested the prod system, since they just knew everything would work, and never tested the backup systems, since qualified workers do the job right the first time and need no backup.

X0Refraction · 2026-05-22T14:43:27 1779461007

Not sure I'm following honestly. Your primary goes down and it fails over to the secondary (which becomes the primary), but if you can't boot how do you then get another secondary ready to fail over to again when the new primary inevitably fails?

fragmede · 2026-05-22T21:11:23 1779484283

You patch it in memory and on disk. What you put on disk is the patch though, so when you restart, the original unpatched version is booted, and then the same live patch is applied. This is how Ksplice worked. It has the advantage that there isn't a config file in /etc to get changed out from under it, so the second problem did not apply.
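A toy model of that idea, to make the boot path concrete. It is not Ksplice's actual mechanism: the dict of "function name to version" and both helper names are invented for illustration. It only shows that if the base image is never modified and the patches are replayed in order at boot, a reboot lands in the same state as the live-patched system.

```python
def boot(base_image: dict[str, str], patches_on_disk: list[dict[str, str]]) -> dict[str, str]:
    """Start from the unpatched image, then replay every stored patch in order."""
    running = dict(base_image)
    for patch in patches_on_disk:
        running.update(patch)
    return running


def live_patch(
    running: dict[str, str],
    patches_on_disk: list[dict[str, str]],
    patch: dict[str, str],
) -> None:
    """Apply a patch to the running state and persist it for the next boot."""
    running.update(patch)
    patches_on_disk.append(dict(patch))


base = {"tcp_input": "v1", "ext4_write": "v1"}  # hypothetical function names
disk: list[dict[str, str]] = []
running = boot(base, disk)
live_patch(running, disk, {"tcp_input": "v2"})
live_patch(running, disk, {"ext4_write": "v2", "tcp_input": "v3"})
print(boot(base, disk) == running)  # True
```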

da_chicken · 2026-05-23T09:01:01 1779526861

Ksplice can do that because the kernel is only in memory in one place and it never sleeps. It has to orchestrate a process that's always running, which is complex, but it's never more than one.

Now try patching glibc like that. Not only does almost every thread have it in memory, several of them will have it in process, and some of them will have it swapped to disk while the thread sleeps. You're going to quickly decide that you actually just want a little bit of downtime or else you want to stand up a redundant system. There's a reason that some live patching systems explicitly exclude glibc and similar libraries.

fragmede · 2026-05-23T19:43:19 1779565399

The corporately named "Ksplice for userspace" did exactly that: it patched glibc.

https://blogs.oracle.com/linux/new-userspace-patching-with-o...

pjmlp · 2026-05-22T10:25:29 1779445529

Yes, some things actually cost money, especially if they aren't easy to implement.

silvestrov · 2026-05-22T07:44:45 1779435885

A Danish bank found out that this can bite you in the ass.

When you hotpatch the system for years, you have no idea whether it can boot up or will fail somewhere in the boot process.

i.e. you can only trust what you regularly test.

Suzuran · 2026-05-22T12:19:25 1779452365

Mainframes can LPAR dynamically. When you want to test if your production system will IPL cleanly, you clone your production environment to an isolated LPAR and IPL it. No impact to production and you get your test.

dredmorbius · 2026-05-22T18:53:58 1779476038

US telcos as well.

There were several switch failures in the 1980s / 1990s in which systems which had been upgraded in place without a full restart failed. (IIRC, one burnt down, literally.)

Engineers were uncertain as to whether or not a cold-boot restart was even possible.

Account concerning an AT&T system upgrade sourcing Risks Digest (Vol 9, Issue 62, February 26, 1990) by the recently deceased Peter G. Neumann: <https://telephoneworld.org/landline-telephone-history/the-cr...>.

pjmlp · 2026-05-22T08:04:46 1779437086

Interesting, is there any public info on the case?

Not doubting it, only curious about some kind of postmortem.

silvestrov · 2026-05-22T10:39:36 1779446376

In Danish: https://danskebank.com/da/news-og-insights/nyhedsarkiv/press...

or translated: https://danskebank-com.translate.goog/da/news-og-insights/ny...

TLDR: power supply failed completely and DB2 failed running recovery operations due to multiple old/existing software bugs.

pjmlp · 2026-05-22T11:18:59 1779448739

Thanks for hunting it down.

ErroneousBosh · 2026-05-22T08:04:53 1779437093

> One reason mainframes and micros are still around us, is that you can change almost everything between hardware and software without downtime.

We have some Sun V880s at work and I'm fairly sure the only part you cannot change with the power on and system running is the motherboard itself.

And I would not be surprised if some ex-Sun Gandalf Beard "well akshully"s this comment.

linksnapzz · 2026-05-22T12:35:55 1779453355

Hot swapping the failed half of a bonded NIC pair on a v880 was a treat…

ErroneousBosh · 2026-05-26T07:24:03 1779780243

I come from an era when unplugging the RAM pack could blow every chip on the ZX80's board, so hot-swappable PCIe cards are just absolute fucking black magic to me.

linksnapzz · 2026-05-26T21:13:56 1779830036

Yeah, I almost had a heart attack the first time I saw someone do a 'cfgadm unconfigure' && 'cfgadm disconnect'; then pop open the side of a prod box, press a button and pull a card out.

"See, oracle's still running!"

Things like that used to be how one distinguished enterprise hw & sw vs. PCs w/ delusions of grandeur.

mesrik · 2026-05-22T13:04:59 1779455099

You shouldn't need a mainframe for 100% (or five nines if that's fine) service uptime.

You can build that way cheaper with 2-3 proper clustered load balancer units, 2-3 application servers behind those, and persistent storage (databases, LDAP, files) that allows writing to multiple nodes simultaneously.

I used to work at a uni where we ran a few services from 2012 until my retirement in 2025 with zero downtime. One time my manager, who has a tech background, tried to add a PBR in a hurry using the WebUI without understanding the CLI syntax and came close to forcing a reboot, but I was able to fix it from the CLI by rolling back to the previous config and rebooting one unit at a time. Upgrading the software a major version, up to the level each unit supported, wasn't hard: upgrade a node and it rejoins the cluster, upgrade the other node and it rejoins the cluster, all done. A few times I had to manually fix the config for some less important test backend servers that I had forgotten to change before the upgrade. No big deal. No major outages happened during those 13 years. Some of the redirect policy and action syntax, like GeoIP, was hard to understand and learn at first, but I was very surprised how darn reliable and nice they were to use and maintain.

The LBs were (Citrix) Netscalers in clustering mode (all nodes process traffic concurrently), which allowed live updates one node at a time without losing any connectivity through them. That wouldn't have been possible with devices in just HA mode.
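The one-node-at-a-time procedure can be sketched roughly like this. It is a generic illustration, not Netscaler tooling: `upgrade` and `is_healthy` are hypothetical callbacks you would supply, and `min_serving` is an assumed safety margin.

```python
from typing import Callable


def rolling_upgrade(
    nodes: list[str],
    upgrade: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    min_serving: int = 1,
) -> list[str]:
    """Upgrade nodes one at a time, stopping at the first node that fails to rejoin.

    Returns the nodes that were upgraded and rejoined successfully.
    """
    if len(nodes) - 1 < min_serving:
        raise ValueError("not enough nodes to keep serving during the upgrade")
    done: list[str] = []
    for node in nodes:
        upgrade(node)             # node is out of the cluster while this runs
        if not is_healthy(node):  # did it rejoin and pass its checks?
            break                 # stop here; the remaining nodes keep serving
        done.append(node)
    return done


versions = {"lb1": 1, "lb2": 1}  # hypothetical two-unit cluster
upgraded = rolling_upgrade(
    list(versions),
    upgrade=lambda n: versions.__setitem__(n, 2),
    is_healthy=lambda n: versions[n] == 2,
)
print(upgraded)  # ['lb1', 'lb2']
```

Stopping at the first unhealthy node is a deliberate choice here: the nodes that have not been touched yet are still on the old, known-good version and keep carrying traffic.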

We had just 2 beefy units which worked very well for us, but you can have 2-32 of them in a cluster managing thousands of servers behind them if you need that. Netscalers are FreeBSD derived, with quite a bit of the TCP/IP stack rewritten to add support for many features, some quite odd, that standard FreeBSD doesn't have. Much of that is IP/ethernet multicast features, PBRs, Traffic Domains (VRFs), and the many service and monitoring processes which sync the cluster (or HA pair), so that if a node fails another can continue straight from there without any loss of traffic to the clients being proxied.

Though I think most people in this forum are familiar with haproxy, pound and the reverse proxying provided by web-server software.

A car analogy: if the previous ones are your fancy sport sedan, Netscaler and F5 BigIP are Formula 1 class cars, i.e. quite different beasts altogether.

e: And proper LBs are not just for HTTPS etc. but are very nice for proxying many other protocols, be they TCP, UDP or something else. We did VPNs and things like Cisco APs' CAPWAP (DTLS, i.e. SSL over UDP). e: typo.

pjmlp · 2026-05-22T13:10:03 1779455403

> You shouldn't need a mainframe for 100% (or five nines if that's fine) service uptime.

Hence my second paragraph.

Thanks for sharing the story.

Scramblejams · 2026-05-22T06:00:55 1779429655

I’ve long wanted that amazing uptime and virtualization and huge I/O and all that cool stuff mainframes offered, but on the desktop or in the closet, with modern CPUs.

I think I’m gonna hafta keep waiting...

AdamN · 2026-05-22T08:01:35 1779436895

Two is the right minimum number for a high availability data plane, but three is the right minimum number for an HA control plane.

With that said, if high availability is not a concern then 1 can be just fine.

brightball · 2026-05-22T14:27:17 1779460037

In 2012 I took over a Perl project that was running on 25 BSD servers (OpenBSD I think?) that had not been updated / patched since 2000. It was an interesting time.

vablings · 2026-05-22T18:12:34 1779473554

My Raspberry Pi serves only to be the tiebreaker for my possible split-brain 2-node cluster lol. It is literally called tiebreaker

j45 · 2026-05-22T10:55:48 1779447348

It's pretty easy to abstract away a proxmox node into a terraform or other type of code based recipe for easy backup / reconstruction / upgrading.

jeanlucas · 2026-05-22T14:05:00 1779458700

> Yes we did have similar back in the day, that's why MX and the like DNS records exist.

Care to elaborate? I wanna know more.

kjs3 · 2026-05-22T15:08:18 1779462498

MX records publish an SMTP server for a domain and a 'priority'. You can have multiple MX records and (theoretically[1]) you try the one with the lowest priority, and if it doesn't respond, try the next lowest, etc. Or (theoretically[1]) if you have 2 MX records with the same priority, you can load balance between them.

https://www.cloudflare.com/learning/dns/dns-records/dns-mx-r...

[1] yes...I know there's a ton of caveats here...
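Ignoring those caveats, the selection logic can be sketched as follows. This is a simplified illustration, not a real mail client: the hostnames are placeholders, there is no DNS lookup or retry handling, and shuffling equal-preference hosts is just one simple way to spread the load.

```python
import random
from typing import Optional


def mx_try_order(
    records: list[tuple[int, str]], rng: Optional[random.Random] = None
) -> list[str]:
    """Return hosts in the order a sender would try them.

    `records` is a list of (preference, hostname); lower preference goes first.
    Hosts sharing a preference are shuffled, which spreads load across them.
    """
    rng = rng or random.Random()
    by_pref: dict[int, list[str]] = {}
    for preference, host in records:
        by_pref.setdefault(preference, []).append(host)
    order: list[str] = []
    for preference in sorted(by_pref):
        hosts = by_pref[preference]
        rng.shuffle(hosts)
        order.extend(hosts)
    return order


records = [
    (20, "backup.example.com"),  # only tried if both primaries fail
    (10, "mx1.example.com"),
    (10, "mx2.example.com"),
]
print(mx_try_order(records))  # the two preference-10 hosts in random order, backup last
```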

niel · 2026-05-22T04:39:39 1779424779

This reminds me of Ise Shrine in Japan, which is completely dismantled then rebuilt every 20 years.

This is top of mind because I recently read Breakneck by Dan Wang. He makes the case that this practice of rebuilding the shrine preserves knowledge that would otherwise have been lost to time. Wang contrasts Ise Shrine with Notre Dame, where rebuilding the roof is apparently quite difficult, perhaps in part due to the loss of knowledge. I'm not familiar enough with either structure to judge whether this is a fair comparison, but I like the principle.

(Edit to add: This is only a minor analogy from the book, which I highly recommend overall.)

arjie · 2026-05-22T05:03:20 1779426200

Thank you for the recommendation! I love that reference, and particularly because I am fond of the story of the shrine for a different reason https://wiki.roshangeorge.dev/w/Constancy_Preference#Concept...

nine_k · 2026-05-21T22:57:15 1779404235

Indeed, for a VM, high uptime makes little sense, because a reboot takes a few seconds, and an upgrade requires no downtime, just switching the DNS to a new instance.

For a physical machine which you can't easily copy, it's a different story.

mx7zysuj4xew · 2026-05-22T12:56:05 1779454565

We're talking what, 15 minutes to reach post?

bfivyvysj · 2026-05-21T23:10:30 1779405030

I started putting things in a big Ansible playbook repo. It doesn't need to be fully managed by Ansible either: I mostly just have the setup configured there, and I still do lots of management by hand.

arjie · 2026-05-21T23:37:20 1779406640

I have the same. The infra management is in one place, the apps hold their own, and there’s a docs folder on the server where each guy puts his stuff. The install is idempotent deploy scripts. But back then my stuff was more ramshackle.
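For anyone unsure what "idempotent" buys you here, a minimal sketch of one such step. `ensure_line` is a made-up helper, not from Ansible or any other tool, and the config line is just an example: the step describes the desired end state, so running it twice changes nothing the second time.

```python
import tempfile
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file; return True only if it was added."""
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # already in the desired state, do nothing
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True


with tempfile.TemporaryDirectory() as tmp:
    conf = Path(tmp) / "sshd_config"  # throwaway file, not the real one
    print(ensure_line(conf, "PermitRootLogin no"))  # True: first run adds it
    print(ensure_line(conf, "PermitRootLogin no"))  # False: second run is a no-op
```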

culi · 2026-05-22T04:56:08 1779425768

Sometimes I leave Architectural Decision Records for personal projects. It feels silly but it honestly comes in handy more times than expected

gofreddygo · 2026-05-22T05:22:45 1779427365

I keep them embedded in the codebase or an artifact right next to the source.

And the key thing is that I don't need too many details at all. A few cues and it's all back in my head.

walletdrainer · 2026-05-22T05:59:06 1779429546

> The biggest mistake I made was high uptime. arjie.com was up for 10 years plus on a Hetzner VPS so that by the time they wanted to sunset the machine underlying I had no idea what my teenage self had set up. I have the backups but the site hasn’t been up in a decade

LLMs have solved this problem, they’ll happily deal with the software archaeology on your behalf. This is the kind of task they really excel at.

arjie · 2026-05-22T06:13:02 1779430382

You're right, of course. At this point it's inertia. It's been dead a decade.

bradley13 · 2026-05-22T12:05:00 1779451500

I hear you. On the other hand, not having to mess with something is good. I just make extensive notes in a README somewhere - usually in KeePass right next to the system info.

unethical_ban · 2026-05-22T16:07:45 1779466065

I stood up a dokuwiki instance recently and then documented how to stand up dokuwiki, haha.

I disabled revision history viewing and have a public portion and a private portion. I use it to track things I'm learning and document rollout procedures and commands I need for things. So far I have rclone backups into S3 Glacier, Tuwunel(Matrix) server deployment with voice/video support, and various little tutorials on server stuff I'm learning.

TLDR use a wiki!