I know when I was coding an AVX2 matmul last month, keeping multiple independent dot-product dependency chains in flight was the single biggest change that brought it into the same league of performance as MKL. It was a night-and-day difference the first time I ran the program after doing that. Software prefetching into L1 didn't help me much, since it worked better to keep operands in registers and share loads across operations.