Language models are deterministic unless you add random input. Most inference tools add random input (the seed value) because it makes for a more interesting user experience, but that is not a fundamental property of LLMs. I suspect determinism is not the issue you mean to highlight.
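A toy sketch of the distinction (the values and names are illustrative, not from any real inference stack): greedy decoding is a pure function of the context, while sampled decoding only varies because of the seed you feed it.

```python
import math
import random

# Toy next-token scores for some fixed context (illustrative values).
logits = {"cat": 2.1, "dog": 1.9, "mat": 0.3}

def greedy(logits):
    # Temperature-0 / argmax decoding: no randomness anywhere.
    return max(logits, key=logits.get)

def sample(logits, seed):
    # Softmax sampling: the only source of variation is the seed.
    rng = random.Random(seed)
    tokens = list(logits)
    weights = [math.exp(logits[t]) for t in tokens]
    return rng.choices(tokens, weights=weights)[0]

print(greedy(logits))                                      # always "cat"
print(sample(logits, seed=42) == sample(logits, seed=42))  # True: same seed, same token
```

Hold the seed fixed and the sampled path is just as reproducible as the greedy one; the randomness is an input, not a property of the model.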
Sort of. They are deterministic in the same way that flipping a coin is deterministic: predictable in principle, but too chaotic to predict in practice. Yes, you get the same predicted token every time for a given context. But why that token and not a different one? Too many factors to reliably abstract.
>Yes, you get the same predicted token every time for a given context. But why that token and not a different one? Too many factors to reliably abstract.
Fixed input-to-output mapping is determinism. Prompt instability is not determinism by any definition of this word. Too many people confuse the two for some reason. Also, determinism is a pretty niche property that only matters for reproducibility, and prompt instability/unpredictability is irrelevant for practical usage for the same reason it is with humans: if the model or the human misunderstands the input, you keep correcting the result until it meets your criteria. You never need to reroll the result, so you never see the stochastic side of LLMs.
>Fixed input-to-output mapping is determinism. Prompt instability is not determinism by any definition of this word
It really depends on your perspective.
In the real world, everything runs on physics, so short of invoking quantum indeterminacy, everything is deterministic - especially software, including things like /dev/random and programs with nasty race conditions. That makes the term useless.
The way we use "determinism" in practice depends contextually on how abstracted our view of the system is, how precise our description of our "inputs" can be, and whether a chunked model can predict the output. Many systems, while technically a fixed input/output mapping, exhibit an extreme and chaotic sensitivity to initial conditions. If the relevant features of those initial conditions are also difficult to measure, or cannot be described at our preferred level of abstraction, then actually predicting ("determining") the output is rendered impractical and we call it "non-deterministic". Coin tosses, race conditions, /dev/random - all fit this description.
And arguably so do LLMs. At the "token" level of abstraction, LLMs are indeed deterministic - given context C, you will always get token T. But at the "semantic" level they are chaotic, unstable - a single token changed in the input, perhaps even as minor as an extra space after a period, can entirely change the course of the output. You understand this, of course. You call it "prompt instability" and compare it to human performance. But no one would call humans deterministic either!
That is what people mean when they say LLMs are not deterministic. They are not misusing the word. It just depends on your perspective.
You mean "corporate inference infrastructure", not LLMs. The reason for different outputs at temperature 0 is mostly batching optimization. LLMs themselves are indifferent to that; you can run them in a deterministic manner any time if you don't care about optimal batching and the lowest possible inference cost. And even then, e.g. Gemini Flash is deterministic in practice even with batching, although DeepMind doesn't strictly guarantee it.
This is all currently irrelevant; making LLMs work well is a much bigger problem. As soon as there's paying demand for reproducibility, solutions will appear. It's a matter of business need, not a technical limitation.
It always feels like I just have to figure out and type the correct magical incantation, and that will finally make LLMs behave deterministically. Like, I have to get the right combination of IMPORTANT, ALWAYS, DON'T DEVIATE, CAREFUL, THOROUGH and suddenly this thing will behave like an actual computer program and not a distracted intern.
Actually, at the hardware level, floating-point operations are not associative. So even with a temperature of 0 you're not mathematically guaranteed the same response. Hence, not deterministic.
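A minimal illustration with IEEE-754 doubles (values chosen to make the absorption obvious):

```python
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c   # 0.0 + 1.0
right = a + (b + c)  # the 1.0 is absorbed: -1e16 + 1.0 rounds back to -1e16

print(left, right)   # 1.0 0.0
print(left == right) # False
```

This is why the order in which partial sums get reduced, which can vary between runs on parallel hardware, changes the final bits of the result.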
You are right that as commonly implemented, the evaluation of an LLM may be non-deterministic even when explicit randomization is eliminated, due to various race conditions in concurrent evaluation.
However, if you evaluate carefully the LLM core function, i.e. in a fixed order, you will obtain perfectly deterministic results (except on some consumer GPUs, where, due to memory overclocking, memory errors are frequent, which causes slightly erroneous results with non-deterministic errors).
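The fixed-order point can be illustrated with a plain reduction: summing the same values left-to-right is perfectly reproducible, while changing the reduction order (as a parallel kernel might between runs) generally changes the rounding. A small sketch, with magnitudes spread out to make order effects likely:

```python
import random

rng = random.Random(0)
# Values spanning many orders of magnitude, loosely mimicking real activations.
vals = [rng.uniform(-1.0, 1.0) * 10.0 ** rng.randint(0, 12) for _ in range(10_000)]

fixed = sum(vals)           # left-to-right: same order, same bits, every run
assert fixed == sum(vals)   # reproducible

shuffled = vals[:]
rng.shuffle(shuffled)
print(sum(shuffled) == fixed)  # typically False: different order, different rounding
```

The non-determinism lives in the scheduling, not in the arithmetic itself.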
So if you want deterministic LLM results, you must audit the programs that you are using and eliminate the causes of non-determinism, and you must use good hardware.
This may require some work, but it can be done, similarly to the work that must be done if you want to deterministically build a software package, instead of obtaining different executable files at each recompilation from the same sources.
Except that one is built to be deterministic and the other is built to be probabilistic. Sure, you can technically force determinism, but it is going to be very hard. Even just verifying that your GPU is doing what it should is hard, much like debugging a CPU. But again: one is built for determinism, the other for concurrency.
GPUs are deterministic. It's not that hard to ensure determinism when running the exact same program every time. Floating point isn't magic: execute the same sequence of instructions on the same values and you'll get the same output. The issue is that you're typically not executing the same sequence of instructions every time, because it's more efficient to run different sequences depending on load.
It's not even hard, just slow. You could do it on a single cheap server (compared to a rack full of GPUs): run a CPU LLM inference engine and limit it to a single thread.
I strongly encourage people to consider alternatives to cars. Personal electric vehicles have gotten insanely capable in the last few years. I live in a city, and last year I decided to purchase an electric scooter (Hiboy x300). It was so good that I decided to sell my car in favor of it. I can get around where I need to faster than I could by car in many cases, I don't have to worry about parking, and I can just "fill up" for cents in my apartment. I've even taken it out of state on the train to visit family.
This year I decided to pair it with an acoustic bike (Priority 600). The two main issues I have with the scooter are cargo capacity (I can only carry what fits in my backpack) and that it's not exactly waterproof, so I was hesitant to ride in bad weather. For me, an acoustic bike pairs well with the electric scooter because I can carry basically any normal cargo, and it reduces the weather issue to only having to worry about what I'm wearing.
And yes, I live somewhere it snows in the winter. I bought a heated jacket, it's much cheaper than a car.
If you need to type in a password to unlock your keychain (e.g. default behavior for gpg-agent), then signing commits one at a time constantly is annoying.
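One common mitigation, assuming gpg-agent (the TTL values here are arbitrary examples), is to let the agent cache the passphrase so you only type it once per session:

```
# ~/.gnupg/gpg-agent.conf
# Cache an entered passphrase for an hour, capped at a day
# (both values are in seconds).
default-cache-ttl 3600
max-cache-ttl 86400
```

Reload the agent (e.g. `gpgconf --reload gpg-agent`) for the change to take effect.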
Does "own" try to sign working copy snapshot commits too? That would greatly increase the number and frequency of signatures.
It feels like one of those things where if you think you want it and you can imagine how you'll actually use it, you'll use it a ton. I had been on the fence about getting an e-paper device for a long time. When I heard the details on the Daylight Computer, I knew it was exactly what I wanted. I pre-ordered it within hours and I have probably used it more than any other devices I own since it arrived a year ago :P
I checked it out and they conspicuously omit the thickness from the FAQ "dimensions" answer. They also avoid any photos of the product that clearly show the thickness. So, guessing it's pretty thick?
I guess? I don't know the exact thickness either, but I held it up sideways behind my Samsung S10 and it was maybe a millimeter or so thicker, so it's not huge. Like 9 mm to 1 cm. I have never thought much about its thickness.
I am currently typing from a Daylight Computer that I've been using as my primary mobile device (over a laptop or smartphone) for a bit over a year now. I've used it so much the edges have started to peel off a bit where I hold it. Easily worth the money for me. Days of battery life, buttery smooth animations, reflective e-paper display, full android with an unlockable bootloader, it's great.
Not to make an argument against parrots' understanding, but humans understand noises before they mimic them. Children are often able to learn and express themselves in sign language (if taught, obviously) earlier than they can learn to speak, and they can respond to spoken words in sign language before they can speak.
Language also has a lot to do with what we do. We do more complex things than animals, so we say more complex things than animals. The biggest difference in the evolution of human language versus the evolution of elephant language might just be that we have thumbs.