Show HN: Ghost Pepper – Local hold-to-talk speech-to-text for macOS

arkensaw · 2026-04-06T23:22:50 1775517770

This is great, and I'm not knocking it, but every time I see these apps it reminds me of my phone.

My 2021 Google Pixel 6, when offline, can transcribe speech to text, and also corrects things contextually. it can make a mistake, and as I continue to speak, it will go back and correct something earlier in the sentence. What tech does Google have shoved in there that predates Whisper and Qwen by five years? And why do we now need a 1Gb of transformers to do it on a more powerful platform?

pushedx · 2026-04-07T07:24:52 1775546692

It's the same model used for the WebSpeech API, which can operate entirely offline.

Google mostly funded the training of this model around 10 years ago, and it's quite good.

There are many websites that are simple frontends for this model which is built into Webkit and Blink based browsers. However to my knowledge the model is a blob packed into the apps which is not open source, hence the no Firefox support.

https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_...

https://www.google.com/intl/en/chrome/demos/speech.html

com2kid · 2026-04-06T23:38:54 1775518734

Microsoft OneNote had this back in 2007 or so, granted the speech to text model wasn't nearly as advanced as they are now.

I was actually on the OneNote team when they were transitioning to an online only transcription model because there was no one left to maintain the on device legacy system.

It wasn't any sort of planned technical direction, just a lack of anyone wanting to maintain the old system.

rudhdb773b · 2026-04-07T05:37:32 1775540252

I remember trying out some voice-to-text around 2002 that I believe was included with Windows XP.. or maybe Office?

You had to go through some training exercises to tune it to your voice, but then it worked fairly well for transcription or even interacting with applications.

silon42 · 2026-04-07T06:18:57 1775542737

OS/2 had it built in in 1996.

adamsmark · 2026-04-06T23:43:59 1775519039

The accuracy is much lower though.

I've switched away from Gboard to Futo on Android and exclusively use MacWhisper on MacOS instead of the default Apple transcription model.

dotancohen · 2026-04-07T03:23:50 1775532230

Any particular reason why you switched? I've been using Gboard for years, especially the text to speech in four languages. In the past few weeks, there was an update where the TTS feature is now in a separate "panel" of the keyboard, and it hardly works at all.

In English and Hebrew it stops after half a dozen words, and those words must be spoken slowly and mechanically for it to work at all. Russian and Arabic are right out - I can't coax any coherent sentence out of it.

I've gone through all permutations of relevant settings, such as "Faster Voice Dictation" (translated from Hebrew,I don't know what the original English option is called). I think there used to be an option for Online or Offline transcription, but that option is gone now.

This is ridiculous - I tried to copy the version information and there is no way to copy it in-app. Let's try the S24 OCR feature...

17.0.10.880768217 release-arm64-v8a 175712590 ראשית (en_GB) 2025090100 = גרסה עדכני Primary on-device: No packs Fallback on-device: Packs: ru-RU: 200

I'll try to install the English, Hebrew, and Arabic packs, though I'm certain that I've installed them already.

cootsnuck · 2026-04-06T23:45:37 1775519137

Interesting. My Pixel 7 transcription is barely usable for me. Makes way too many mistakes and defeats the purpose of me not having to type, but maybe that's just my experience.

The latest open source local STT models people are running on devices are significantly more robust (e.g. whisper models, parakeet models, etc.). So background noise, mumbling, and/or just not having a perfect audio environment doesn't trip up the SoTA models as much (all of them still do get tripped up).

I work in voice AI and am using these models (both proprietary and local open source) every day. Night and day different for me.

taffydavid · 2026-04-07T08:04:28 1775549068

I've built my own tts apps testing whisper and while it's good it does hallucinate quite a bit if there's noise, or just sometimes when the audio is perfectly clear.

It often gives the illusion of being very good but I could record a half hour of me speaking and discover some very random stuff in the middle that I did not say

cootsnuck · 2026-04-07T16:29:41 1775579381

Yup, you're absolutely right. The open source models do have their rough edges. I use NVIDIA's Parakeet v3 model a lot locally, and it will occasionally do this thing where it just repeats a word like a dozen times.

artdigital · 2026-04-07T06:20:57 1775542857

macOS and iOS can do that to with the baked in dictation. Globe key + D on Mac

dust42 · 2026-04-07T07:13:06 1775545986

When you activate it you agree that your voice input is sent to Apple. As far as I understand this project runs fully locally. Up to you to decide for whatever suits your needs best.

stingraycharles · 2026-04-07T09:27:47 1775554067

Where did you get from that the voice input is sent to Apple / the cloud?

As far as I understand Apple’s voice model runs locally for most languages.

Siri commands can be used for training, but is also executed locally and sent to Apple separately (and this can be disabled).

angristan · 2026-04-07T13:55:21 1775570121

I couldn't believe it either but when you enable it the settings of macOS you get this popup:

> When you dictate text, information like your voice input and contact names are sent to Apple to help your Mac recognize what you’re saying.

wat10000 · 2026-04-07T14:46:42 1775573202

Elsewhere it says:

"When you use Dictation, your device will indicate in Keyboard Settings if your audio and transcripts are processed on your device and not sent to Apple servers. Otherwise, the things you dictate are sent to and processed on the server, but will not be stored unless you opt in to Improve Siri and Dictation."

And:

"Dictation processes many voice inputs on your Mac. Information will be sent to Apple in some cases."

In conclusion... I think they're trying to cover all their bases, but it sounds like things are processed locally as long as the hardware can handle it.

victorbjorklund · 2026-04-07T11:38:28 1775561908

No, that is not correct. It is running one hundred percent local. You can try it by turning off internet on your phone and try running it then. However, the built in model isn't as good, so this is probably better.

dwayne_dibley · 2026-04-07T06:37:22 1775543842

yup, this is how I 'type'

nidnogg · 2026-04-07T12:01:08 1775563268

Nothing comes close to LLM transcription though. I just tried this. I said "globe key dictation, does this work?". Here's the transcription, verbatim:

"Fucking dictation, does this work"

arkensaw · 2026-04-09T13:44:37 1775742277

fun fact: voice typing also worked excellently on Windows Phone, although only in the SMS app

vharish · 2026-04-07T06:49:24 1775544564

IMO.. one of the best. It was surprisingly good. Yet they can't even replicate in on their own systems

atlgator · 2026-04-06T22:06:19 1775513179

This thread is a support group for people who have each independently built the same macOS speech-to-text app.

theturtletalks · 2026-04-07T01:40:17 1775526017

I'm tracking them all here:

https://opensource.builders/alternatives/superwhisper

Just added Ghost Pepper, and you can actually create a skill.md with the features you need to build your own

bytesandbits · 2026-04-07T02:38:53 1775529533

Handy with parakeet is pretty awesome by the way!

perelin · 2026-04-07T05:08:34 1775538514

Agree. Slept on.

Wish they would do an ios version, but the creator already kind of dismissed it.

sipjca · 2026-04-07T06:15:50 1775542550

I just don't have the bandwidth to run another project, maintaining Handy is hard enough on it's own, especially for free!

I didn't just dismiss for no reason, I am a human! I have needs and I can't just sleeplessly stay in front of the computer putting out code. If I had more time I would, but alas.

Someone could easily vibe code an iOS version in a few hours. I could do the same but I do not have time to support it.

xyos · 2026-04-07T07:31:48 1775547108

Thank you for your work, I highly appreciate it!

sipjca · 2026-04-07T08:23:55 1775550235

Thank you!!

Barbing · 2026-04-07T10:13:45 1775556825

Unlimited free Parakeet on iOS: VoiceInk

https://apps.apple.com/us/app/voiceink-ai-dictation/id675143...

(I was searching the same as you before I found this last month)

bytesandbits · 2026-04-08T21:09:29 1775682569

fine. I will port it myself. Real-time, sub 100ms latency. Here

https://testflight.apple.com/join/myNP5XvU

MegagramEnjoyer · 2026-04-07T02:54:11 1775530451

i like handy a lot, so clean

dnlzro · 2026-04-07T03:00:18 1775530818

Another one to add (1.5k stars on GitHub): https://github.com/kitlangton/Hex

theturtletalks · 2026-04-07T08:30:41 1775550641

Added Hex and its features

earthnail · 2026-04-07T07:36:49 1775547409

Please add wordbird as well: https://github.com/tillahoffmann/wordbird

It has all the usual features, plus you can add project specific vocabulary in your repo. It detects the working folder based on the active window, reads a WORDBIRD.md file in that folder and corrects terms accordingly.

(My friend Till built it)

theturtletalks · 2026-04-07T08:29:12 1775550552

Added Wordbird and its features

Barbing · 2026-04-07T06:20:25 1775542825

Very nice. Two great features I'd suggest highlighting in two apps, one app of which you have listed.

1: livestream transcript directly into the cursor in real time (just like native macOS dictation)

2: show realtime transcript live in an overlay (still has to paste when done, unlike #1, but can still read live while dictating)

1- localvoxtral, 2- FluidVoice (bumping it to 7 features on your list)

theturtletalks · 2026-04-07T08:30:32 1775550632

Thank you, I have added localvoxtral[0] and fixed FluidVoice

0. https://github.com/T0mSIlver/localvoxtral

Barbing · 2026-04-07T10:21:02 1775557262

Awesome, thanks. Now it looks like five features are table stakes and there's no need to filter for, for example, speech to text. So, it would be interesting to see the differentiation, the why would I choose which one.

I see promise trying to get a bit more into curating by showing the top one or two or three picks for a given standout feature.

zgougou123 · 2026-04-07T07:38:02 1775547482

You could add foxsay, a great one : https://github.com/skulkworks/foxsay

theturtletalks · 2026-04-07T08:29:23 1775550563

Added Foxsay and its features

raybb · 2026-04-07T07:14:10 1775546050

Do any of the apps support taking actions as you talk without having to hit stop?

Like telling it to edit the text or remove a word.

foltik · 2026-04-07T03:01:48 1775530908

So... a vibe slop index to keep track of all the vibe slop apps?

The cherry on top: it’s completely broken! Enable the Context Awareness filter, the list shrinks. Now enable the Auto-pasting filter, the list grows back.

mulquin · 2026-04-07T06:30:42 1775543442

I wouldn't call it completely broken; Pressing buttons still does something, it looks like an OR filter instead of an AND. It should be updated to be an AND filter as that's more intuitive.

foltik · 2026-04-07T16:09:21 1775578161

If you squint, it looks kinda maybe superficially useful? But if you actually critically look at it, it makes no sense.

The categories are clearly LLM generated from the GhostPepper codebase, with vague low level descriptions and links to code. Most categories apply to every listed project.

The UI is the same tiny bit of LLM generated information displayed five different confusing ways. Like seriously, click on a project and you first see a bunch of haphazard feature cards, then a bunch of “feature ... active” rows. Looks fancy, but actually just noise. Textbook slop.

Better would be a simple awesome-style markdown page, with a feature matrix having categories and descriptions curated by a human that actually understands and cares about the domain.

Sorry if this is harsh, but passing off LLM output as “curation” is particularly insulting to me.

dcreater · 2026-04-07T04:17:08 1775535428

Welcome to modern software

thefourthchime · 2026-04-07T04:29:43 1775536183

hahah. It's slop all the way down.

v4nn4 · 2026-04-07T06:54:38 1775544878

The filters selection seems to return a union not an intersection which is a bit confusing, at least to me.

theturtletalks · 2026-04-07T08:31:09 1775550669

I’ve fixed this issue, please try it again when you get a chance

lizhang · 2026-04-07T03:20:20 1775532020

Can you add mine https://github.com/vorpus/D-scribe

karimf · 2026-04-07T00:02:45 1775520165

In the /r/macapps subreddit, they have huge influx of new apps posts, and the "whisper dictation" is one of the most saturated category. [0]

>“Compare” - This is the most important part. Apps in the most saturated categories (whisper dictation, clipboard managers, wallpaper apps, etc.) must clearly explain their differentiation from existing solutions.

https://www.reddit.com/r/macapps/comments/1r6d06r/new_post_r...

stingraycharles · 2026-04-07T09:29:07 1775554147

Seems like there’s also a huge influx of these apps as they’re relatively easy to make with LLMs.

That whole list of requirements there is actually a good thing that anyone who wants to make a new application should ask themselves.

vunderba · 2026-04-07T14:33:01 1775572381

This and 100% client-side PDF editors/tools. It's the new "hello world" for vibe-coding.

tpowell · 2026-04-06T23:19:19 1775517559

I cobbled my own together one night before I came across the thoughtfully-built KeyVox and got to talking shop with its creator. Our cups runneth over. https://github.com/macmixing/keyvox/

aroman · 2026-04-07T02:58:19 1775530699

I did mine on nixOS with a nice little indicator built into Noctalia.

It's remarkable how similar its performance is to Wispr Flow... and it runs locally...

fragmede · 2026-04-07T00:24:39 1775521479

Yeah, but mine... Oh. Hello. sighs It's been three weeks since I tried to add feature to my version of the app. I don't miss it. I like this new life. Sober.

hbbio · 2026-04-07T03:52:27 1775533947

In the most possible Apple fashion, I am waiting for MacOS 27 or 28 to have this builtin.

colechristensen · 2026-04-07T01:45:59 1775526359

My name is Cole and I have a speech to text app.

When I most recently abandoned it, the trigger word would fire one time in five.

lxe · 2026-04-07T00:26:00 1775521560

hahaha I’m glad I’m just a procedurally generated NPC

I built one for cross platform — using parakeet mlx or faster whisper. :)

brcmthrowaway · 2026-04-06T22:09:18 1775513358

Oh to be 20-something and do a bunch of free work for your portfolio again

obrajesse · 2026-04-06T22:54:46 1775516086

I'll have you know that I'm Matt's top contributor to Ghost Pepper and I'm nearly fifty

But I did it because I wanted it to work exactly the way I wanted it.

Also, for kicks, I (codex) ported it to Linux. But because my Linux laptop isn't as fast, I've had to use a few tricks to make it fast. https://github.com/obra/pepper-x

dotancohen · 2026-04-07T03:08:50 1775531330

I'll look at this, thank you. I haven't yet gotten around to vibe coding my own itch yet so maybe your scratching will do.

perelin · 2026-04-07T05:12:16 1775538736

I recently attended a agentic SWE workshop and the starter project was this, whispr style, local voice dictation app. Took everybody around 30mins. tbh: i was kinda impressed.

dcreater · 2026-04-07T04:15:28 1775535328

Its gotten so bad that its a meme on the macapps subreddit.

This is the unfortunate real face of open source. So many devs each making little sandcastles on their own when if efforts were combined, we could have had something truly solid and sustainable, instead of a litany of 90% there apps each missing something or the other, leaving people ending up using WisprFlow etc.

nidnogg · 2026-04-07T12:02:46 1775563366

NGL had me chuckling a bit there when I remembered I had one of these to code on my backlog

pmarreck · 2026-04-07T00:06:19 1775520379

Are there any better than Superwhisper? Because I haven't found any.

Tsarp · 2026-04-07T03:06:07 1775531167

https://carelesswhisper.app

rmac · 2026-04-07T00:42:22 1775522542

checking in

windows (kotlin multi platform) => https://github.com/maceip/daydream

parakeet-tdt-0.6b-v2

rmac · 2026-04-07T03:11:46 1775531506

now using Moonshine v2 Medium

hotword dict so no more "clawd" "dash" "dot com"

jannniii · 2026-04-07T04:53:09 1775537589

github.com/randomm/kuiskaus

goodroot · 2026-04-06T20:08:38 1775506118

Nice one! For Linux folks, I developed https://github.com/goodroot/hyprwhspr.

On Linux, there's access to the latest Cohere Transcribe model and it works very, very well. Requires a GPU though. Larger local models generally shouldn't require a subordinate model for clean up.

Have you compared WhisperKit to faster-whisper or similar? You might be able to run turbov3 successfully and negate the need for cleanup.

Incidentally, waiting for Apple to blow this all up with native STT any day now. :)

VorpalWay · 2026-04-06T22:12:04 1775513524

How does it compare to the more well established https://github.com/cjpais/handy? Are there any stand out features (for either option)? What was the reason for writing your own rather than using or improving existing software?

goodroot · 2026-04-06T22:17:00 1775513820

Not sure I know what you mean by IR...

But in this case I built hyprwhspr for Linux (Arch at first).

The goal was (is) the absolute best performance, in both accuracy & speed.

Python, via CUDA, on a NVIDIA GPU, is where that exists.

For example:

The #1 model on the ASR (automatic speech recognition) hugging face board is Cohere Transcribe and it is not yet 2 weeks old.

The ecosystem choices allowed me to hook it up in a night.

Other hardware types also work great on Linux due to its adaptability.

In short, the local stt peak is Linux/Wayland.

VorpalWay · 2026-04-06T22:23:11 1775514191

IR was a typo, meant "it" (fixed it). I blame the phone keyboard plus insufficient proof reading on my part.

If this needs nvidia CPU acceleration for good performance it is not useful to me, I have Intel graphics and handy works fine.

goodroot · 2026-04-06T22:28:59 1775514539

It works well with anything. :)

That said: If handy works, no need whatsoever to change.

LuxBennu · 2026-04-06T20:48:34 1775508514

I've been running whisper large-v3 on an m2 max through a self-hosted endpoint and honestly the accuracy is good enough that i stopped bothering with cleanup models. The bigger annoyance for me was latency on longer chunks, like anything over 30 seconds starts feeling sluggish even with metal acceleration. Haven't tried whisperkit specifically but curious how it handles longer audio compared to the full model.

goodroot · 2026-04-06T21:06:45 1775509605

Ah yeah, longform is interesting.

Not sure how you're running it, via whichever "app thing", but...

On resource limited machines: "Continuous recording" mode outputs when silence is detected via a configurable threshold.

This outputs as you speak in more reasonable chunks; in aggregate "the same output" just chunked efficiently.

Maybe you can try hackin' that up?

LuxBennu · 2026-04-06T21:47:35 1775512055

Yeah that makes sense, chunking on silence would sidestep the latency issue pretty cleanly. I've been running it through a basic fastapi wrapper so it just takes whatever audio blob gets thrown at it, no chunking logic on the server side. Might be worth adding a vad pass before sending to whisper though, would cut down on processing dead air too.

znagengast · 2026-04-07T17:19:29 1775582369

Maintainer of WhisperKit here, confirming we do exactly that for longform. We search for the longest "low energy" silence in the second half of the audio window and set the chunking point to the middle of that silence. It uses a version of the webrtc vad algorithm, and significantly speeds up longform because we can run a large amount of concurrent inference requests through CoreML's async prediction api. Whisper is also pretty smart with silent portions since the encoder will tell it if there are any words at all in the chunk, and simply stop predicting tokens after the prefill step - although you could save the ~100ms encoder run entirely with a good vad model, which our recently opensourced pyannote CoreML pipeline can do.

LuxBennu · 2026-04-07T22:36:30 1775601390

Oh nice, the pyannote coreml port is interesting. Last time I looked at pyannote it was pytorch only so getting it to run efficiently on apple silicon was kind of a pain. Does the coreml version handle diarization or just activity detection?

ericd · 2026-04-07T04:12:58 1775535178

Nice, I've been using Hyprwhspr on Omarchy daily for a while now, it's been awesome, thanks very much.

goodroot · 2026-04-07T17:41:29 1775583689

Thanks ericd! Glad to hear.

hephaes7us · 2026-04-06T20:27:26 1775507246

Thanks for sharing! I was literally getting ready to build, essentially, this. Now it looks like I don't have to!

Have you ever considered using a foot-pedal for PTT?

Apple incidentally already has native STT, but for some reason they just don't use a decent model yet.

goodroot · 2026-04-06T20:39:13 1775507953

They do, and they even have that nice microphone F5 key for it, and an ideal OS level API making the input experience >perfect<.

Apparently they do have a better model, they just haven't exposed it in their own OS yet!

https://developer.apple.com/documentation/speech/bringing-ad...

Wonder what's the hold up...

For footpedal:

Yes, conceptually it’s just another evdev-trigger source, assuming the pedal exposes usable key/button events.

Otherwise we’d bridge it into the existing external control interface. Either way, hooks are there. :)

jiehong · 2026-04-06T21:18:27 1775510307

The only issue with Apple models is that they do not detect languages automatically, nor switch if you do between sentences.

Parakeet does both just fine.

chrisweekly · 2026-04-06T21:46:28 1775511988

sorry, PTT?

serf · 2026-04-06T21:49:51 1775512191

push-to-talk.

chrisweekly · 2026-04-07T02:49:43 1775530183

pmarreck · 2026-04-07T00:08:04 1775520484

looks like there's a nearly identically named one for Hyprland

Also, wish it was on nixpkgs, where at least it will be almost guaranteed to build forever =)

primaprashant · 2026-04-06T21:53:18 1775512398

Speech-to-text has become integral part of my dev flow especially for dictating detailed prompts to LLMs and coding agents.

I have collected the best open-source voice typing tools categorized by platform in this awesome-style GitHub repo. Hope you all find this useful!

https://github.com/primaprashant/awesome-voice-typing

ArlenBales · 2026-04-07T05:14:32 1775538872

Can you explain how exactly dictation is used for development? I type about 120 WPM so typing is always going to be way faster for me than talking. Aside for accessibility, is dictation development for slower typers or is it more so you can relax on a couch while vibe coding? If this comes off as condescension it's not intended, I am genuinely out of the loop here.

KerrickStaley · 2026-04-07T05:22:35 1775539355

I think most people can speak faster than 120 WPM. For example this site says I speak at 343 WPM https://www.typingmaster.com/speech-speed-test/, and I self-measure 222 WPM on dense technical text.

thakoppno · 2026-04-07T05:36:58 1775540218

Micro machines guy could be vibe coding at an absurd rate.

mememememememo · 2026-04-07T06:50:23 1775544623

My LLM types at 2k WPM. So I ise that to talk to my LLMs

primaprashant · 2026-04-07T08:25:27 1775550327

For me personally, it's not really about typing speed. While I can type pretty fast and most likely I speak faster than typing, but typing and dictating are just different way of doing things for me. While the end result of both is same, but for me it's just like different way of doing things and it's not a competition between the two.

I regularly just sit down and often just describe whatever I'm trying to do in detail and I speak out loud my entire thought process and what kind of trade-offs I'm thinking, all the concerns and any other edge cases and patterns I have in my mind. I just prefer to speak out loud all of those. I regularly speak out loud for 5 to 10 minutes while sometimes taking some breaks in between as well to think through things.

I am not doing it just for vibe coding, I'm using it for everything. So obviously for driving coding agents, but also for in general, describing my thoughts for brainstorming or having some kind of like a critique session with LLMs for my ideas and thoughts. So for everything, I'm just using dictation.

One other benefit I think for me personally is that since I'm interacting with coding agents and in general LLMs a lot again and again every day, I end up giving much more context and details if I'm speaking out loud compared to typing. Sometimes I might feel a little bit lazy to type one or two extra sentences. But while speaking, I don't really have that kind of friction.

Zizizizz · 2026-04-07T07:24:12 1775546652

Most English speakers speak faster than 120 wpm so that's probably why people, especially those who can't type at speeds like you can, prefer it.

sgt · 2026-04-07T10:26:52 1775557612

Typing is considerably less energy intensive than speaking. At least it is for me. I save the speaking for meetings, etc.

cupcake-unicorn · 2026-04-07T00:09:11 1775520551

https://handy.computer/ already exists?

semiquaver · 2026-04-07T00:20:28 1775521228

I have a few qualms with this app:

1. For a Linux user, you can already build such a system yourself quite trivially by getting an FTP account, mounting it locally with curlftpfs, and then using SVN or CVS on the mounted filesystem. From Windows or Mac, this FTP account could be accessed through built-in software.

2. It doesn't actually replace a USB drive. Most people I know e-mail files to themselves or host them somewhere online to be able to perform presentations, but they still carry a USB drive in case there are connectivity problems. This does not solve the connectivity issue.

3. It does not seem very "viral" or income-generating. I know this is premature at this point, but without charging users for the service, is it reasonable to expect to make money off of this?

Graziano_M · 2026-04-07T01:17:27 1775524647

I got that reference!

MegagramEnjoyer · 2026-04-07T02:56:47 1775530607

why does it need to generate money?

morelikeborelax · 2026-04-07T04:49:18 1775537358

This is the reply that was posted when Dropbox was first shown off on HN. It's a joke :)

MegagramEnjoyer · 2026-04-07T18:30:25 1775586625

ah I wasn't aware! thanks for explaining :D

smcleod · 2026-04-07T00:14:03 1775520843

Yeah props to Handy, really nice tool.

forbiddenvoid · 2026-04-07T00:17:45 1775521065

More than one solution can exist for the same problem.

itemize123 · 2026-04-07T09:04:16 1775552656

yes speech to text exists

ktimespi · 2026-04-07T01:02:20 1775523740

This is ideal for my use case, yeah. No need to fiddle around with another app's UI.

charlietran · 2026-04-06T20:00:54 1775505654

Thank you for sharing, I appreciate the emphasis on local speed and privacy. As a current user of Hex (https://github.com/kitlangton/Hex), which has similar goals, what are your thoughts on how they compare?

parhamn · 2026-04-06T21:03:22 1775509402

I see a lot of whisper stuff out there. Are these the same old OpenAI whispers or have they been updated heavily?

I've been using parakeet v3 which is fantastic (and tiny). Confused why we're still seeing whisper out there, there's been a lot of development.

daemonologist · 2026-04-06T21:20:19 1775510419

Whisper is still old reliable - I find that it's less prone to hallucinations than newer models, easier to run (on AMD GPU, via whisper.cpp), and only ~2x slower than parakeet. I even bothered to "port" Parakeet to Nemo-less pytorch to run it on my GPU, and still went back to Whisper after a couple of days.

71bw · 2026-04-07T06:56:56 1775545016

I'm also wondering whether or not it would be beneficiary for my workload to switch over to Parakeet. Problem is, I'm using a lot of lingo - and in Polish, as well! - so it's not exactly the best case and whisper (v3), so far, works.

goodroot · 2026-04-06T21:41:33 1775511693

Whisper is very good in many languages.

It's also in many flavours, from tiny to turbo, and so can fit many system profiles.

That's what makes it unique and hard to replace.

zackify · 2026-04-06T21:09:12 1775509752

same, even have kokoro for speech back to text for home assistant and parakeet on mac os through voice ink.

Also vibe coded a way to use parakeet from the same parakeet piper server on my grapheneos phone https://zach.codes/p/vibe-coding-a-wispr-clone-in-20-minutes

konaraddi · 2026-04-06T20:13:48 1775506428

That’s awesome! Do you know how it compares to Handy? Handy is open source and local only too. It’s been around a while and what I’ve been using.

https://github.com/cjpais/handy

JohnPDickerson · 2026-04-06T22:28:07 1775514487

Handy is an awesome project, highly recommended - many of our engineers and PMs use it! CJ, Handy's creator, recently joined us as a Builder in Residence at Mozilla.ai. So for those interested in deploying a more raw/lightweight approach to local speech-to-text (or other multimodal) models, feel free to check out llamafile - which includes whisperfile, a single-file whisper.cpp + cosmopolitan framework-based executable. We're hoping to build some bridges between the two projects as well. https://github.com/mozilla-ai/llamafile

cootsnuck · 2026-04-06T23:48:44 1775519324

Yup, Handy is the one that made me stop looking for local open source alternatives to Wispr Flow.

I'll give a shoutout as well to Glimpse: https://github.com/LegendarySpy/Glimpse

vunderba · 2026-04-06T21:52:13 1775512333

I’d also be interested to know what the impetus was for developing ghost-pepper, which looks relatively recent, given that Handy exists and has been pretty well received.

Extra bonus is that Handy lets add an automatic LLM post-processor. This is very handy for the Parakeet V3 model, which can sometimes have issues where it repeats words or makes recognition errors for example, duplicating the recognition of a single word a dozen dozen dozen dozen dozen dozen dozen dozen times.

rob · 2026-04-06T22:09:38 1775513378

Yep. Using Handy with Parakeet v3 + a custom coding-tailored prompt to post-process on my 2019 Intel Mac and it's been working great.

Once in a while it will only output a literal space instead of the actual translation, but if I go into the 'history' page the translation is there for me to copy and paste manually. Maybe some pasting bug.

alasano · 2026-04-06T23:11:10 1775517070

I think it's the same reasoning for anything these days.

"You know what would be useful?" followed by asking your LLM of choice to implement it.

Then again for a lot of scenarios it's your slop or someone else's slop.

I think the only difference is that I keep my own slop tools private.

swaptr · 2026-04-06T20:54:59 1775508899

Handy is awesome! I used it for quite a while before Claude Code added voice support. Solid software, very good linux and mac integration. Shoutout to Parakeet models as well, extremely fast and solid models for their relatively modest memory requirements.

kwakubiney · 2026-04-06T23:30:18 1775518218

I love it. I use it all the time to communicate to my agents via opencode.

youniverse · 2026-04-06T20:53:10 1775508790

I love and have been using handy for a while too, what we need is this for mobile apps I don't think there's any free apps and native dictation is not always fully local and not as good.

olup · 2026-04-06T22:08:51 1775513331

I use handy all day long as a software engineer, and recommended it to all of my team members. I love it.

stavros · 2026-04-06T21:08:05 1775509685

Handy is fantastic.

ericmcer · 2026-04-06T21:17:55 1775510275

I see quite a few of these, the killer feature to me will be one that fine tunes the model based on your own voice.

E.G. if your name is `Donold` (pronounced like Donald) there is not a transcription model in existence that will transcribe your name correctly. That means forget inputting your name or email ever, it will never output it correctly.

Combine that with any subtleties of speech you have, or industry jargon you frequently use and you will have a much more useful tool.

We have a ton of options for "predict the most common word that matches this audio data" but I haven't found any "predict MY most common word" setups.

sorenjan · 2026-04-06T21:25:46 1775510746

Whisper supports a prompt, you can put your "Donold" there.

https://developers.openai.com/cookbook/examples/whisper_prom...

MattHart88 · 2026-04-06T21:20:29 1775510429

I've found the "corrections" feature works well for most of the jargon and misspelling use cases. Can you give it a try and let me know edge cases?

bonkler59 · 2026-04-06T22:28:35 1775514515

My experience is that Aqua voice does a good job of this with custom dictionary and replacements.

ipsum2 · 2026-04-06T20:06:45 1775506005

Parakeet is significantly more accurate and faster than Whisper if it supports your language.

yeutterg · 2026-04-06T20:26:15 1775507175

Are you running Parakeet with VoiceInk[0]?

[0]: https://github.com/beingpax/VoiceInk

ipsum2 · 2026-04-06T22:23:14 1775514194

I'm using https://github.com/senstella/parakeet-mlx library.

zackify · 2026-04-06T21:07:33 1775509653

i am, working great for a long time now

rahimnathwani · 2026-04-06T20:38:01 1775507881

Right, and if you're on MacOS you can use it for free with Hex: https://github.com/kitlangton/Hex

lloyd-christmas · 2026-04-06T22:22:10 1775514130

Or write your own custom one with the library that backs it: https://github.com/FluidInference/FluidAudio

I did that so that I could record my own inputs and finetune parakeet to make it accurate enough to skip post-processing.

rahimnathwani · 2026-04-07T00:01:53 1775520113

There's a fork of FluidAudio that supports the recent Cohere model: https://github.com/altic-dev/FluidAudio/tree/B/cohere-coreml...

It's used by this dictation app: https://github.com/altic-dev/FluidVoice/

treetalker · 2026-04-06T20:26:29 1775507189

I have been using Parakeet with MacWhisper's hold-to-talk on a MacBook Neo and it's been awesome.

totetsu · 2026-04-07T02:53:44 1775530424

Parakeet supports japanese now, but I cant find a version ported to apple silicone yet.

obrajesse · 2026-04-06T21:31:51 1775511111

And indeed, Ghost Pepper supports parakeet v3

kushalpandya · 2026-04-07T01:48:04 1775526484

Speecg-to-text is basically AI version of Todo app that we used to build every week when new frontend framework would release.

__mharrison__ · 2026-04-06T21:34:04 1775511244

Cool, I've been doing a lot of "coding" (and other typing tasks) recently by tapping a button on my Stream Deck. It starts recording me until I tap it again. At which point, it transcribes the recording and plops it into the paste buffer.

The button next to it pastes when I press it. If I press it again, it hits the enter command.

You can get a lot done with two buttons.

coldfoundry · 2026-04-07T04:45:37 1775537137

This is exactly what I am building right now, Stream Deck with two buttons too (push to talk and enter)! It's a sweet little pet project, and has been a blast to build so far. Excited to finally add it to my workflow once its working well.

nidnogg · 2026-04-07T11:59:43 1775563183

This got me thinking that the smaller these local first LLMs get - the more they're gonna looking the next bread and butter of app dev. Reminds me how Electron gained a lot of traction for making it easy to package prettier apps. At the measly cost of gigabytes of RAM, give or take.

ljlolel · 2026-04-07T12:02:27 1775563347

Exactly, everything will become ClaudeVM https://jperla.com/blog/the-future-is-claudevm

fiatpandas · 2026-04-07T00:26:34 1775521594

The clean up prompt needs adjusting. If your transcription is first person and in the voice of talking to an AI assistant, it really wants to “answer” you, completing ignoring its instructions. I fiddled with the prompt but couldn’t figure out how to make it not want to act like an AI assistant.

mathis · 2026-04-06T20:40:58 1775508058

If you don't feel like downloading a large model, you can also use `yap dictate`. Yap leverages the built-in models exposed though Speech.framework on macOS 26 (Tahoe).

Project repo: https://github.com/finnvoor/yap

marktolson · 2026-04-07T09:12:50 1775553170

I got it to transcribe this: "Create tests and ensure all tests pass" and instead of transcribing exactly what I said it outputs nonsense around "I am a large language model and I cannot create and execute tests".

Other than that issue I like it.

raybb · 2026-04-06T23:37:07 1775518627

Would also like to know how it compares to https://github.com/openwhispr/openwhispr

I like that openwhisper lets me do on device and set a remote provider.

mft_ · 2026-04-08T14:39:56 1775659196

Does it show your spoken words on the screen live (i.e. streaming) or does it wait until you’ve finished speaking?

I find it very helpful to see my words live - for some reason it helps my simple brain structure what I’m saying, and I’m much more fluent as a result.

I went on a mission a few weeks ago and tried every freely available MacOS STT app I could find (and there are lots of them) - but none I tried had this feature and was otherwise satisfactory. (I vibe-coded a PoC which could do this, so it’s definitely possible.)

snickell · 2026-04-07T03:32:50 1775532770

Can somebody help me understand how they use these, I feel like I'm missing something or I'm bad at something?

I only spent 10 minutes with Handy, and a similar amount of time with SuperWhisper, so pretty ignorant. I tried it both with composing this comment, and in a programming session with Codex. I was slightly frustrated to not be hands free, instead of typing, my hands were having to press and release a talk button (option-space in handy, right-command in superwhisper), but then I couldn't submit, so I still had to click enter with Codex.

Additionally, for composing this message, I'm using the keyboard a ton because there's no way I can find to correct text I've typed. Do other people get really reliable and don't need backspace anymore? Or.... what text do you not care enough to edit? Notes maybe?

My point of comparison is using Dragon like 15 years ago. TBH, while the recognition is better (much better) on handy/superwhisper, everything else felt MUCH worse. With dragon, you are (were?) totally hands free, you see text as you say it, and you could edit text really easily vocally when it made a mistake (which it did a fair bit, admittedly). And you could press enter and pretty functionally navigate w/o a keyboard too.

Its weird to see all these apps, and they all have the same limitations?

acjacobson · 2026-04-07T14:00:48 1775570448

Nice app! Feedback since you asked: The most obvious must-have feature IMO is to paste automatically. Don't require me to hit a shortcut (or at least make it configurable)

The next most critical thing I think is speed and in my tests it's just a little bit slower than other solutions. That matters a lot when it comes to these tools.

The third thing, more of a nice to have is controlling formatting. By this I mean - say a few sentences, then "new line" and the model interprets "new line" as formatting, not as literal text.

hyperhello · 2026-04-06T20:55:01 1775508901

Feature request or beg: let me play a speech video and transcribe it for me.

MattHart88 · 2026-04-06T21:05:46 1775509546

I like this idea and it should work -- whatever microphone you have on should be able to hear the speaker. LMK if not (e.g., are you wearing headphones? if so, the mic can't hear the speaker)

rcarmo · 2026-04-06T21:53:45 1775512425

Not sure why I should use this instead of the baked-in OS dictation features (which I use almost daily--just double-tap the world key, and you're there). What's the advantage?

qq66 · 2026-04-06T21:56:17 1775512577

I haven't used this one but WisprFlow is vastly better than the built-in functionality on MacOS. Apple is way behind even startups, even for fundamental AI functionality like transcribing speech

ibero · 2026-04-06T21:58:40 1775512720

WisprFlow has a lot of good recommendations behind it but the fact they used Delve for SOC2 compliance gives me major pause.

janalsncm · 2026-04-06T22:28:56 1775514536

The fact that a company could slurp up all of your data and then use Delve for their SOC2 is a great reason to use local models.

jonwinstanley · 2026-04-06T22:00:55 1775512855

I use the baked in Apple transcription and haven't had any issues. But what I do is usually pretty simple.

What makes the others vastly better?

qq66 · 2026-04-10T18:29:23 1775845763

I'm speaking for >1 minute and including bulleted lists, etc. WisprFlow gets all of the bulleted lists formatted correctly, and I'm not saying things like "Bullet 1" -- just speaking as I'd speak to a person.

MattDamonSpace · 2026-04-06T22:25:12 1775514312

I’ve rarely had macOS TTS produce a sentence I didn’t have to edit

Whisper models I barely bother checking anymore

rane · 2026-04-07T05:38:19 1775540299

- Way more accurate, especially with technical jargon. Try saying JSON as part of a sentence to macOS dictation and see what comes out.

- macOS dictation mutes other sounds while it's running. This is a deal-breaker for me.

nidnogg · 2026-04-07T11:57:55 1775563075

I really like the project and am eager to try and fit this into some of my workflows. However, this bothered me a bit:

"All models run locally, no private data leaves your computer. And it's spicy to offer something for free that other apps have raised $80M to build."

I’d straight up drop the comparison to big AI labs. This isn’t rebellious or subversive, it’s downstream of a ton of already-funded work. Calling it “spicy” is a bit misframed.

ghm2199 · 2026-04-07T02:50:01 1775530201

I've been using handy since a month and its awesome. I mainly use it with coding agents or when I don't want to type into text boxes. How is this different?

Part of the reason handy is awesome is because it uses some of the same rust infra for integrating with the model, so that actually makes it possible to use the code as a library in android or iOS. I have an android app that runs on a local model on the phone too using this.

boudra · 2026-04-07T02:47:17 1775530037

Interesting, I'm surprised you went with Whisper, I found Parakeet (v2) to be a lot more accurate and faster, but maybe it's just my accent.

I implemented fully local hands free coding with Parakeet and Kokoro: https://github.com/getpaseo/paseo

jwr · 2026-04-07T07:17:45 1775546265

I currently use MacWhisper and it is quite good, but it's great to see an alternative, especially as I've been looking to use more recent models!

I hope there will be a way to plug in other models: I currently work mostly with Whisper Large. Parakeet is slightly worse for non-English languages. But there are better recent developments.

miki123211 · 2026-04-07T06:34:14 1775543654

What do you actually use for STT, particularly if you prize performance over privacy and are comfortable using your own API keys?

I was on WhisperFlow for a while until the trial ran out, and I'm really tempted to subscribe. I don't think I can go back to a local solution after that, the performance difference is insane.

k9294 · 2026-04-07T07:54:17 1775548457

Try ottex.ai - it has an OpenRouter like gateway with most STT models on the market (Gemini, OpenAI, Groq, Deepgram, Mistral, AssemblyAI, Soniox), so you can try them all and choose what works best for you.

My favorites are Gemini 3 Flash and Mistral Voxtral Transcribe 2. Gemini when I need special formatting and clean-up, and Voxtral when I need fast input (mostly when working with AI).

ianmurrays · 2026-04-07T06:52:12 1775544732

I had Claude make this hammerspoon config + daemon that does pretty much the same, in case anyone is interested.

https://github.com/ianmurrays/hammerspoon/blob/main/stt.lua

ezVoodoo · 2026-04-07T11:41:06 1775562066

Hi, nice project! Quick question, when I speak Chinese language, why it output English as translated output? I was using the multilingual (small) model. Do I need to use the Parakeet model to have Chinese output? Thx.

maxmorrish · 2026-04-07T01:50:59 1775526659

love seeing more local-first tools like this. feels like theres been a real shift since the codebeautify breach last year, people are actually thinking about where there data goes now. nice work on keeping it all on device

aristech · 2026-04-06T20:57:10 1775509030

Great job. How about the supported languages? System languages gets recognised?

MattHart88 · 2026-04-06T21:07:25 1775509645

Thanks! We currently have 2 multi-lingual options available: - Whisper small (multilingual) (~466 MB, supports many languages) - Parakeet v3 (25 languages) (~1.4 GB, supports 25 languages via FluidAudio)

janalsncm · 2026-04-06T22:24:29 1775514269

I think the jab at the bottom of the readme is referring to whispr flow?

https://wisprflow.ai/new-funding

pdyc · 2026-04-07T03:12:02 1775531522

interesting, i wanted something like this but i am on linux so i modified whisper example to run on cli. Its quite basic, uses ctrl+alt+s to start/stop, when you stop it copies text to clipboard that's it. Now its my daily driver https://github.com/newbeelearn/whisper.cpp

Supercompressor · 2026-04-06T21:16:40 1775510200

I've been looking for the opposite - wanting to dump text and it be read to me, coherently. Anyone have good recommendations?

realityfactchex · 2026-04-06T21:39:02 1775511542

Sure, Chatterbox TTS Server is rather high quality: https://github.com/devnen/Chatterbox-TTS-Server

You could hook it up to some workflow over the local API depending on how you want to dump the text, but the web UI is good too.

The Show HN by the author was at: https://news.ycombinator.com/item?id=44145564

Supercompressor · 2026-04-06T22:02:44 1775512964

Appreciated - thank you.

tito · 2026-04-06T22:16:17 1775513777

This is great. I'm typing this message now using Ghost Pepper. What benefits have you seen from the OCR screen sharing step?

guzik · 2026-04-06T21:00:17 1775509217

Sadly the app doesn't work. There is no popup asking for microphone permission.

EDIT: I see there is an open issue for that on github

ttul · 2026-04-06T21:17:43 1775510263

And many people are mailing in Codex and Claude Code generated PRs - myself included. Fingers crossed, I suppose.

MattHart88 · 2026-04-06T21:57:01 1775512621

Thanks to everyone who submitted PRs! The fix is merged, new version is up.

kingofbits · 2026-04-09T10:23:06 1775730186

Nicely done! Ive been abusing chatgpt's overlay window for this, until now

pmarreck · 2026-04-07T00:05:18 1775520318

How does this compare with Superwhisper, which is otherwise excellent but not cheap?

jannniii · 2026-04-07T04:52:31 1775537551

Oh dear, why does it not use apfel for cleanup? No model download necessary…

gegtik · 2026-04-06T21:15:25 1775510125

how does this compare to macos built in siri TTS, in quality and in privacy?

realityfactchex · 2026-04-06T21:36:30 1775511390

Exactly my question. I double-tap the control button and macOS does native, local TTS dictation pretty well. (Similar to Keyboard > Enable Dictation setting on iOS.)

The macOS built-in TTS (dictation) seems better than all the 3rd party, local apps I tried in the past that people raved about. I have tried several.

Is this better somehow?

If the 3rd party apps did streaming with typing in place and corrections within a reasonable window when they understand things better given more context, that would be cool. Theoretically, a custom model or UX could be "better" than what comes free built into macOS (more accurate or customizable).

But when I contacted the developer of my favorite one they said that would be pretty hard to implement due to having to go back and make corrections in the active field, etc.

I assume streaming STT in these utilities for Mac will get better at some point, but I haven't seen it yet (been waiting). It seems these tools generally are not streaming, e.g. they want you to finish speaking first before showing you anything. Which doesn't work for me when I'm dictating. I want to see what I've been saying lately, to jog my memory about what I've just said and help guide the next thing I'm about to say. I certainly don't want to split my attention by manually toggling the control (whether PTT or not) periodically to indicate "ok, you can render what I just said now".

I guess "hold-to-talk" tools are for delivering discrete, fully formed messages, not for longer, running dictation.

AFAICT, TFA is focused on hold-to-talk as the differentiator, over double-tap to begin speaking and double-tap to end speaking?

realityfactchex · 2026-04-07T03:21:52 1775532112

s/TTS/STT/

purplehat_ · 2026-04-06T21:35:41 1775511341

Hi Matt, there's lots of speech-to-text programs out there with varying levels of quality. 100% local is admirable but it's always a tradeoff and users have to decide for themselves what's worth it.

Would you consider making available a video showing someone using the app?

semiquaver · 2026-04-06T21:48:11 1775512091

therealdeal2020 · 2026-04-07T10:23:23 1775557403

btw I know at least a dozen doctors that still pay for software like this. I think doctors are THE profession that likes to use speech-to-text all day every day

leeeeep101 · 2026-04-09T08:57:58 1775725078

i also did this dictatorflow lee101/voicetype i open sourced it also. nice. might be good reference :)

imazio · 2026-04-07T02:57:18 1775530638

is this the support group for people building speech-to-text apps?

I built https://yakki.ai

No regrets so far! XP

vaulpann · 2026-04-07T01:23:20 1775525000

very cool - huge open source drop!

thatxliner · 2026-04-06T23:56:13 1775519773

why isn't the cleanup done on the transcription (as opposed to screen record)

dakila5 · 2026-04-06T22:58:57 1775516337

MacWhisper is also a good one

douglaswlance · 2026-04-06T21:49:33 1775512173

does it input the text as soon as it hears it? or does it wait until the end?

sorkhabi · 2026-04-07T04:20:40 1775535640

Well done

romeroej · 2026-04-06T21:51:19 1775512279

always mac. when windows? why can you just make things multios

SquareWheel · 2026-04-07T03:10:12 1775531412

Windows has a native (cloud-based) dictation software built-in[1], so there's likely less demand for it. Nonetheless, there are still a handful of community options available to choose from.

[1] https://support.microsoft.com/en-us/windows/use-voice-typing...

naikrovek · 2026-04-06T23:14:25 1775517265

Because like all other modern Macs, the GPU in my Mac uses the same API as the GPU in your Mac.

Also, on a Mac with 32GB of RAM, 24GB of that (75%) is available to the GPU, and that makes the models run much faster. On my 64GB MacBook Pro, 48GB is available to the GPU. Have you priced an nvidia GPU with 48GB of RAM? It’s simply cheaper to do this on Macs.

Macs are just better for getting started with this kind of thing.

patja · 2026-04-06T23:47:37 1775519257

Fair enough for GPU-intensive stuff like running Qwen locally. But do you really need a GPU for decent local TTS? I run parakeet just on CPU.

cootsnuck · 2026-04-06T23:50:51 1775519451

Handy has Windows support. https://handy.computer/

patja · 2026-04-06T23:44:40 1775519080

I've been using Chirp which uses parakeet on Windows. Learned about it here:

https://news.ycombinator.com/item?id=45930659

Works great for me!