I started using marimo for the reactive execution, after being spoiled by Observable and Pluto.jl. Being able to plug directly into Altair charts and tables was a huge boon. Then I discovered anywidget, which has been a game changer.
Now I use Claude to generate anywidgets for the controls I need and just focus on the heavy lifting in Python; it's great. Being able to have this all run in one flow with pair should make it 10x smoother.
As an example I get spreadsheets sent by clients that all have different file types, formatting, names, and business rules. I had Claude build me a widget to define a set of data-cleaning steps (merge x+y fields, split with regex, etc.). Now this task that used to take a lot of manual work and iteration is just upload a spreadsheet, preview and select my cleaning steps, run my algorithm and wait for it to come out the other side (with labelled progress bars). When it's done I get a table element and some interactive Altair charts to click on to filter and fine-tune, then I can just export the table and send it.
This task used to be done manually by a team; then I turned it into 1-2 hours with Jupyter. Marimo let me turn it into 5-15 minutes. Visual inspection of the results by a human is a requirement, so it's not completely automatable, but a 15-minute turnaround every few weeks feels good enough.
Anyways, marimo rocks. The _only_ thing missing is the easy deploy for internal-users story as I cannot use molab (yet?).
Hey, thanks and glad to hear the marimo + anywidget combo has been an unlock (I'm also the creator of anywidget). Clearly I'm biased, but custom widgets are a powerful primitive (marrying web & data ecosystems), and it's exciting to see coding tools making it even more accessible to build them out for specific or one-off tasks.
Re: deployment, we hear you & stay tuned. You can provide input here [1].
Side note: if you're curious, I have an RFC out for widget composition (widgets within widgets) [2]. Should be shipping soon.
It's essentially a table layout with a plus button at the bottom. When you click it adds a new step as a row, then you pick the operation, the input columns and output column name.
If you want to add another step you click the plus again and add another row the same way. Each row can access any table field or output field defined above it in the DAG.
Then, in Python, a for loop runs over the steps in order and updates the data frame (well, functionally: each step returns a new one). It uses a dictionary of function mappings and resolves input fields with kwargs.
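A minimal sketch of that dispatch loop, using plain dict rows in place of the data frame (the operation names and signatures here are hypothetical, not the author's actual code):

```python
import re

# Hypothetical operation registry: each function receives a row dict
# plus keyword arguments naming its input and output fields.
def merge_fields(row, inputs, output, sep=" "):
    row[output] = sep.join(str(row[f]) for f in inputs)
    return row

def split_regex(row, inputs, output, pattern):
    m = re.search(pattern, str(row[inputs[0]]))
    row[output] = m.group(1) if m else None
    return row

OPERATIONS = {"merge": merge_fields, "split": split_regex}

def run_steps(rows, steps):
    """Apply each cleaning step in order. Because steps run top to
    bottom, later steps can read fields produced by earlier ones --
    the DAG ordering the widget enforces."""
    for step in steps:
        op = OPERATIONS[step["op"]]
        kwargs = {k: v for k, v in step.items() if k != "op"}
        rows = [op(dict(row), **kwargs) for row in rows]
    return rows
```

For example, a merge step followed by a split step that reads the merged field:

```python
rows = run_steps(
    [{"first": "Ada", "last": "Lovelace"}],
    [
        {"op": "merge", "inputs": ["first", "last"], "output": "name"},
        {"op": "split", "inputs": ["name"], "output": "initial", "pattern": r"^(\w)"},
    ],
)
# rows[0] now has name="Ada Lovelace" and initial="A"
```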
The mental model I am using for online writing is that it is analogous to the spectrum of `pretending <-> acting`. The worst writing (AI or otherwise) looks, sounds, and feels like pretense, like a kid who tucks a towel into his shirt and runs around pretending to be a superhero. Meanwhile, acting, true acting, is invisible; it is a synonym for _being_[1].
That said, a lot of the AI writing feels "procedural", in the sense that most corporate writing (whitepapers, press releases, etc) feel procedural (i.e. the result of a constructed procedure). Before AI, the constructed procedure was basically that a piece of writing passes through a bunch of people (e.g. engineering -> management -> marketing -> website/email), and the output is a bland, forgettable pablum designed to (1) be SEO-friendly, (2) be spam-filter friendly, (3) be easy to ingest, (4) look superficially trustworthy and authoritative (e.g. inflated page count, extra jargon, numbers, plots), (5) look like it belongs to the "scene" or "industry" by imitating all the other corporate writings out there[2].
AI is interesting, in the same way that computers or the internet or an encyclopedia are interesting: how people choose to use it tells you a lot about them. All of those technologies can be used to compensate for a lack of skill (it helps one pretend), or they can be used to forge a skill (it helps one become).
One has to pretend before they can act (I guess? Feels intuitively correct to me). So perhaps AI (and the web, and the computer, and the encyclopedia) is only harmful to the extent that it does not nudge a person towards becoming[3]? And if so, that's a _cultural_ limitation, not a technological one.
[1]: I am not an actor, and so I might be wrong, but that is the impression I get from just watching and analyzing the acting in various films.
[2]: this becomes frustrating when you get criticized for producing something that "reads like $famousSomething", and then you get criticized again for producing something that "does not read like $typeOfFamousSomething".
[3]: No clue how you (plural -- let's bring back "yous") will convince your boss that you did not take the shortcut, because you were trying to "become more".
You can't expect a company to support you in directly hurting them. This reminds me of a guy I know that sued his bank regarding a loan, but he used a terrible lawyer. The bank pretty quickly took their own action, including stopping all payments from his account (as they claimed that the money was theirs).
It's not just any company, it's Meta, and the channels they administrate come with a set of responsibilities and principles, one of which is not to break them by arbitrary, willful removal of totally legal ads.
Isn’t that their defense against responsibility for their customers’ content? Having some broad filtering for legal requirements or scams is one thing but if they’re doing this it seems like support for cases alleging that they have editorial control and therefore responsibility.
Legally, they don't need to choose. Section 230 limits provider liability for moderating user content and also limits provider liability for not moderating user content. I think the intent of Section 230 was to apply liability to the users making the content, not the service provider transmitting it; however, IIRC, current jurisprudence makes it very hard to compel service providers to identify users in civil cases, so civil liability is hard to pursue unless the user identifies themself in their content.
It's not a question of whether they're a common carrier or not; they don't need to be, and typically, they don't try to be.
That's true. I haven't been keeping up with FB lawsuits but from what I gather of HN sentiment, FB is not open and never has been. Any FB exec claiming to be open is probably just doing exactly what you said, and they'll probably find a way to spin it to include this exclusion as part of their "openness."
I reviewed 118 conversations with Claude since March 6, all on real work projects.
Each conversation was processed to assess the level of frustration and its source, then evaluated with Gemma 4 and Claude Opus for spot checking. I have a tool I use to manage my work trees, so most work is done on branches prefixed with ad-hoc/feature/explore or similar, and the data was tagged with branch names.
43% of my Claude Code sessions (Opus 4.6, high reasoning) ended with signals of frustration. 73% of total chat time (by total messages) was spent in conversations which were eventually ranked as frustrating.
Median time to frustration was 25 messages, and on average, each message from Claude has about a 5% baseline chance of being frustrating. Frustration by chat length actually matches this 5% baseline of IID Bernoulli trials -- which is surprising and interesting, as this should not be IID at all.
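For the curious, that IID comparison is just the geometric CDF; a quick sketch (p = 0.05 taken from the baseline estimate above):

```python
# Under an IID Bernoulli model with per-message frustration probability p,
# the chance an n-message chat contains at least one frustrating reply
# is the geometric CDF: 1 - (1 - p) ** n.
def p_frustrated_by(n: int, p: float = 0.05) -> float:
    return 1 - (1 - p) ** n

# e.g. a 25-message chat under this model:
print(round(p_frustrated_by(25), 2))  # about 0.72
```

Comparing this curve against the observed frustration rate binned by chat length is how you'd check the IID fit.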
Frustration types:
- Wrong answers – 14% of sessions, 31% of frustration
- Instruction Following – 11% of sessions, 25% of frustration
- Overcomplication – 8% of sessions, 18% of frustration
- Destructive Actions (e.g. requesting to delete something or commit a change to prod) – 3% of sessions, 8% of frustration
- Non-responsive (service outages leading to non-response) 2% of sessions
- Miscommunication 2% of sessions
- Failed execution 2% of sessions
Half of frustrations happened in the first or last 20% of a chat by length. I interpret early frustrations to be recoverable, late frustrations to be terminal.
Early frustrations (sessions averaged 45 turns):
- 30% overcomplicating the problem
- 30% instruction following issues
- 30% wrong answers
- 10% destructive actions
Late frustrations (sessions averaged 12 turns -- i.e. terminal context early):
- 36% Wrong answers, with repetition
- 21% instruction following, with repeated correction from user (me)
- 14% Service interruptions/outages
- 7% failed execution
- 7% communication - Claude is unable to articulate some result, or understand the problem correctly.
Late frustrations led to the highest levels of frustration, 29% of the time.
I'm a data scientist -- my most frustrating work with Claude was data cleaning/repair issues (a complex backfill), with 75% of those sessions marked frustrating due to overcomplicating, instruction following, or destructive actions.
The best (least frustrating) workflows for DS were code-review, scoped feature work (with tickets), data validation, and config/setup tasks and automation.
Ad-hoc query work ended up in between -- ad-hoc requests were generally bootstrapping queries or doing rough analysis on good data.
Side note: all of my interactions with the /buddy feature were flagged as high frustration ("furious"). That was a false positive over mock arguing with it, but did provide a neat calibration signal. Those sessions were removed entirely from the analysis after classification.
I lived in apartments for a long time, then moved into a house. I thought my cat, who had never seen stairs, would take some adjusting. Nope, he looked up them, wiggled his butt, then ran full tilt to the top. He ran full tilt down them, too.
One of our cats has arthritis and before we got her treatment she didn’t like them, but she’s perfectly happy now.
It’s always amazed me how much capability baby animals have right when they’re born, when they have near zero experience with their muscles and balance and senses. Or even just the instinct of a cat to chase a string is universal.
There’s something intrinsic to the structure of brains that seems to pre-encode a lot of evolutionarily useful content without a training phase.
I’d love to take a course on just this topic and what we know about it.
To be fair, it's not like baby animals pop into existence at birth, starting from scratch at that moment; they've been growing/incubating for quite some time. Who knows, maybe that's the actual "training phase" for the animals, as what you say is true: they seem to have a lot of instincts already at birth, while human babies seem to have almost "popped into existence at birth" without a whole lot of instincts yet, compared to other animals at least.
They’ll have heard noises, experienced gyroscopic forces and gravity. But a calf being born and standing up within minutes to an hour is pretty neat. Same with vision, going from no sensory input to seeing.
Apparently piglets have full motor control in 8 hours after birth. I went to a local agricultural museum the other day and saw some week old piglets climbing over each other and nursing with no problems.
As I said, I would love to have the time and go back to school to learn way more about all of this. Nature and evolution are pretty amazing.
Also illustrates an adaptability-ability trade-off. A human baby is supplied a SOTA brain and sensors and actuators it can make sense of given time. A deer baby is preprogrammed to handle its sensors and actuators. In time, the human baby surpasses the deer baby in general ability.
I generally do like the model, it’s not a great agent though.
It’s good for summarization tasks, small tool use, and has pretty good world knowledge, though it does hallucinate.