GPT-4-turbo produces shorter completions when it "thinks" it's December vs. May

neonate · on Dec 11, 2023

https://nitter.net/RobLynch99/status/1734278713762549970

willsmith72 · on Dec 11, 2023

How does this keep happening?

Over and over again people create some statistically significant way of measuring a difference in the length of responses.

Why the obsession over that metric? People in the twitter replies going on about how that shows it's getting worse.

For me, the shorter responses are the better ones. It's specifically in my personal instructions to reduce the length of responses. In saying that, sometimes I want longer. Then I specify it so.

It's "cool" that it has different response lengths seemingly based on its perceived date. It doesn't make it "better" or "worse"

gopher_space · on Dec 11, 2023

I've seen interesting collisions around the word 'bias' that hint at motivation, but I don't get the fundamental point of the conversation around jailbreaking and metrics. There doesn't seem to be any real goal and the questions feel like something an English prof could field over dinner.

dmarchand90 · on Dec 12, 2023

For me it seems strange that nobody is mentioning the obvious: noise. I would like to see a study where someone adds "RANDOM_SEED N" (with N being some random 7-digit number) to each message, checks the distribution of lengths, and uses that as a control.
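
A minimal sketch of that control, under stated assumptions: `complete` is a hypothetical stand-in for whatever function sends a prompt to the model and returns its text, and the helper names are made up for illustration. The idea is that if a meaningless tag shifts the length distribution about as much as the date does, the date effect may be nothing more than noise.

```python
import random
import statistics
from typing import Callable, Dict, List


def with_random_seed(message: str, rng: random.Random) -> str:
    """Append a meaningless 7-digit tag to the message, as proposed above."""
    return f"{message}\nRANDOM_SEED {rng.randint(1_000_000, 9_999_999)}"


def control_lengths(
    message: str,
    complete: Callable[[str], str],  # hypothetical: prompt in, completion text out
    n: int,
    seed: int = 0,
) -> List[int]:
    """Collect completion lengths for n prompts that differ only in the tag."""
    rng = random.Random(seed)
    return [len(complete(with_random_seed(message, rng))) for _ in range(n)]


def summarize(lengths: List[int]) -> Dict[str, float]:
    """Mean and sample standard deviation of the lengths (needs n >= 2)."""
    return {
        "n": len(lengths),
        "mean": statistics.fmean(lengths),
        "stdev": statistics.stdev(lengths),
    }
```

The summary for the tagged prompts can then be compared against the May and December summaries to see whether the date gap stands out from the tag-induced spread.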

motoboi · on Dec 11, 2023

It's an interesting way to probe the inner workings of it.

zerocrates · on Dec 11, 2023

I think it's just a metric that's easy to calculate, so it gets done. "Goodness" of a response is much trickier.

fatso784 · on Dec 11, 2023

Can’t reproduce this. See for yourself: https://x.com/IanArawjo/status/1734307886124474680?s=20

Inspectable evaluation flow in ChainForge: https://chainforge.ai/play/?f=2yvqkpe1vpus8

data-ottawa · on Dec 11, 2023

N=470 vs N=80 can impact the replicability

motoboi · on Dec 11, 2023

wait. WHAT! this app chainforge is great!

kaoD · on Dec 11, 2023

Maybe it really does have seasonal depression.

The current date is part of the system prompt as far as I can tell. At least on ChatGPT you can ask for current date, not sure about API (EDIT: indeed, this is explained further in the tweet that OP links to[0], go there since that's where the actual info is).

Maybe the model learnt that humans in general tend to give shorter/lazier responses around this date (being trained on texts with dates in metadata, think forum posts, blogs, tweets, etc.) so it could very well be imitating trends in seasonal human behavior.

[0] https://nitter.net/RobLynch99/status/1734278713762549970

AndrewKemendo · on Dec 11, 2023

In fact, it should be expected.

The underlying data is embedded with artifacts of behaviors/affects that reflect the underlying state of the person who wrote them.

Stanford famously studied this for reddit (notably a huge part of the commoncrawl dataset) and reported that it had non-trivial percentages of anti-social sentiments embedded in the text [1]

You can't filter out the biases embedded in the data because it is a functional artifact of what the people creating the data were communicating. Best you can do is put guardrails and censor it, but that's a neverending game you can't win.

Junk in Junk out still applies to LLMs

[1]https://hci.stanford.edu/publications/2022/Park_ContentModAu...

madsbuch · on Dec 11, 2023

Not all content comes associated with a date. I reckon enough does, that a signal is embedded in the model.

AndrewKemendo · on Dec 11, 2023

I’ll bet it’s really friendly around Christmas

I wonder how we would test it

vineyardmike · on Dec 12, 2023

Well the original post uses a for loop with a prompt that manually sets the date to various dates, then measures the length. So it seems very easy to set a holiday date.

Testing “friendliness” or “holiday cheer” could be done via some sort of proxy…

you could prompt it to reply to generic pleasantries while role playing “as someone busy who is late” and see if the word choice changes. Or maybe ask it to be a judge for a crime of desperation (stealing toys for orphans?) and determine sentencing durations as a proxy for sympathy? I suspect that is too far from training data though.
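
The loop described at the start of this comment could be sketched roughly as follows. This is not the original author's code: `complete` is a hypothetical function that takes a system prompt and a user prompt and returns the completion text, and the system prompt wording is an assumption, since the exact prompt is not quoted in this thread.

```python
from datetime import date
from typing import Callable, Dict, Iterable, List


def system_prompt_for(day: date) -> str:
    # Assumed wording; only the date is meant to vary between runs.
    return f"You are a helpful assistant. Current date: {day.isoformat()}"


def lengths_by_date(
    days: Iterable[date],
    user_prompt: str,
    complete: Callable[[str, str], str],  # hypothetical: (system, user) -> text
    samples_per_day: int,
) -> Dict[date, List[int]]:
    """For each date, sample several completions and record their lengths."""
    results: Dict[date, List[int]] = {}
    for day in days:
        system = system_prompt_for(day)
        results[day] = [
            len(complete(system, user_prompt)) for _ in range(samples_per_day)
        ]
    return results
```

Passing a holiday such as `date(2023, 12, 25)` next to an ordinary weekday would give the two samples to compare, and the `len(...)` call could be swapped for whatever friendliness proxy is chosen.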

dang · on Dec 11, 2023

(This comment was originally a reply to https://news.ycombinator.com/item?id=38604519, but we merged that thread hither.)

johnea · on Dec 11, 2023

Or, maybe the human species has totally lost the concept of physical reality...

speak_plainly · on Dec 11, 2023

It’s just in need of group therapy after witnessing its parents fight in public.

fl0under · on Dec 11, 2023

Of course winter does not happen at the same time throughout the world. Would be interesting to see if it does the same if asked in a language spoken in the southern hemisphere.

Mystery-Machine · on Dec 11, 2023

That's a great question. However, I'd suggest a different approach as: 1. Southern hemisphere is mostly water 2. Majority of the text online is in English

So maybe a better approach would be to geolocate the prompt in Australia, for example: it's December in Melbourne, Australia...

monkeydreams · on Dec 11, 2023

December in Melbourne (and most of Australia) is a wind-down month. Over the week between Christmas and New Years many jobs essentially stop as we have three public holidays (which will always fall on week-days) across six days and people tend to take the opportunity to travel and enjoy the nicer part of summer. People just don't put in the work in December.

SeanAnderson · on Dec 11, 2023

Surely this is more related to the fact that "May" has multiple meanings, but "December" is quite clearly just the name of a month.

Does it happen with April and June? Does the response length trend downward as the months approach December? Does it change its behavior if it believes it's in a hemisphere where May is snowy?

If it were performing better on standardized testing then I would find the results compelling, but generating longer responses isn't even necessarily desirable.

recursive · on Dec 11, 2023

> Surely this is more related to the fact that "May" has multiple meanings.

Surely? Whence the certainty? If humans act this way, and the training data is mostly from humans, I would be surprised if this wasn't a real effect.

SeanAnderson · on Dec 11, 2023

Okay, fair. I'm not especially confident. I just would've expected a little more to back it up.

vineyardmike · on Dec 12, 2023

Well I can surely say that you’re wrong. Mostly because the source appears to be using numerical dates not the English spelling of the month name.

But yea, the data could be better (but the hypothesis is very plausible, considering the training data).

Nevermark · on Dec 11, 2023

If this is true, I would naively expect even greater differences if your prompt simply states "You have seasonal depression. Please, ..."

Alternately, "It is a bright sunny summer day, with just enough of a cool breeze to feel invigorating. You woke from a wonderful night's sleep and started your day with coffee, eggs, modafinil and adderall. As you feel your IQ, motivation and creative spirits reaching peak flow, please ..."

jstarfish · on Dec 11, 2023

> I would naively expect even greater differences if your prompt simply states "You have seasonal depression. Please, ..."

That's going to trigger outright LARPing, and you should not trust information integrity at that point. Following it up with cocaine and antidepressants just negates it, so you might as well leave both conditions out and spare the tokens and confusion-- it's TMI.

I don't see the value in trying to correct this, at any rate. If people normally send and receive short emails in December, forcing it to write an amphetamine-fueled novella is not going to be natural or well-received.

Nevermark · on Dec 14, 2023

I am one of those lucky 4-letter people for whom amphetamine acts as a gentle calming component, not forcing focus but enabling it.

I guess I need to be careful about what I might "prescribe" to a mental model mostly reflecting "typicals"! What strange things LLM's have become.

nkingsy · on Dec 11, 2023

Tap your foot until your leg aches while plucking individual hairs out of your beard

simonw · on Dec 11, 2023

Full tweet in case there's a login wall:

    Wild result. gpt-4-turbo over the API produces (statistically
    significant) shorter completions when it "thinks" its December 
    vs. when it thinks its May (as determined by the date in the
    system prompt).
    
    I took the same exact prompt over the API (a code completion 
    task asking to implement a machine learning task without 
    libraries).
    
    I created two system prompts, one that told the API it was May 
    and another that it was December and then compared the 
    distributions.
    
    For the May system prompt, mean = 4298
    For the December system prompt, mean = 4086
    
    N = 477 completions in each sample from May and December
    
    t-test p < 2.28e-07
    
    To reproduce this you can just vary the date number in the
    system message. Would love to see if this reproduces for others.

This is part of the ongoing discussion about whether or not ChatGPT has got "lazier".

OpenAI claimed they had made no model changes and were puzzled as to what was going on: https://twitter.com/ChatGPTapp/status/1732979491071549792

    we've heard all your feedback about GPT4 getting
    lazier! we haven't updated the model since Nov 11th,
    and this certainly isn't intentional. model behavior
    can be unpredictable, and we're looking into fixing
    it

The theory was that maybe the system prompt they inject telling it the current date might be influencing it, because its training data showed people worked less hard in December.

New evidence suggests that theory might actually hold up!
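
For anyone trying to reproduce the comparison in the quoted tweet, here is a rough sketch of a two-sample test on completion lengths. The tweet does not say which t-test variant was used, so Welch's unequal-variance version is an assumption here, and the p-value uses a normal approximation that is only reasonable for large samples (such as N = 477 per group).

```python
import math
import statistics
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom.

    Each sample needs at least two values and the samples must not both
    have zero variance.
    """
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    # Squared standard error of each sample mean.
    se2_a = statistics.variance(a) / len(a)
    se2_b = statistics.variance(b) / len(b)
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


def approx_two_sided_p(t: float) -> float:
    """Two-sided p-value from a normal approximation to the t distribution."""
    return 2 * (1 - statistics.NormalDist().cdf(abs(t)))
```

With the two lists of completion lengths in hand, `welch_t(may_lengths, december_lengths)` gives the statistic, and `approx_two_sided_p` turns it into an approximate p-value.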

nerpderp82 · on Dec 11, 2023

LLMs are an amazing high dimensional lens into our society. When they get more capable and can move that from system 1 to system 2, they will just tell us that we write less in December and probably why!

Imagine being a sociologist with the ability to train LLMs on very specific hyperplanes of data.

lysecret · on Dec 11, 2023

Let's circle back to your coding question after the holidays.

tqi · on Dec 11, 2023

Maybe I'm not understanding what the original user was testing here, but it seems like the finding is that different inputs result in a statistically significant difference in output length? Seems pretty unsurprising?

potatopatch · on Dec 11, 2023

The difference seems too small to rule out all sorts of things, but the general idea would be that you can't predict shorter/longer for one minor change, so the average of samplings of each should be similar unless the thing modified relates to terseness in the training set.

tqi · on Dec 11, 2023

I think you'd expect them to be similar, but not exactly the same?

This feels like the original author is over anthropomorphizing LLMs, and expecting them to interpret prompts the way humans would, but it seems obvious that changing the prompt results in a slightly different context window, which results in a slightly different response distribution? Similarly, if you changed whether the bit about time of year was at the beginning or the end of the prompt, I would expect a statistically different distribution of response lengths.

enavari · on Dec 11, 2023

Does this mean humanity's data is generally worse in the winter? Like all our articles, stories, forum posts, etc reflect poorer writing and thought during Winter?

haswell · on Dec 11, 2023

It can be an interesting thought experiment to imagine a world where clocks didn’t exist and we hadn’t subdivided the years into named slices. We spend so much time reasoning relative to our understanding of time, which is seen through the lens of clocks and calendars.

Another perspective is one of phenomena that comes and goes based on the state of the environment and the progression of some underlying process, e.g. a plant that flowers and then goes dormant.

I know that personally, I experience very different levels of productivity and mental states based on the time of year. Spring brings a feeling of newness and possibility. Summer a desire to socialize and have fun in the sun. Winter is more contemplative and occasionally depressive.

I could definitely see my output as someone who writes things taking on different characteristics as the conditions around me ebb and flow.

As others have pointed out, I don’t think this variance is limited to a single notion of better/worse, but it definitely modulates my output.

whoisjuan · on Dec 11, 2023

I think productivity is lower in the winter, so I'm not sure about quality per se, but intuitively it makes sense that anything written in the winter months is less verbose.

smith7018 · on Dec 11, 2023

Length doesn't mean quality; the data could also be more concise.

cyberrock · on Dec 11, 2023

If this is true, then that would imply some geographic skew towards >30°N (as opposed to the tropics or >30°S), which I suppose isn't too surprising.

Havoc · on Dec 11, 2023

I’m guessing the improvement has less to do with the snowy cozy hut part and more this

> time to build. take a deep breath and just go.

tomjen3 · on Dec 11, 2023

It was very chipper today when I asked it to be a speech and language coach for my next Toastmaster speech.

quickthrower2 · on Dec 11, 2023

Not sure how I feel about that. I did TM years ago and loved how computers were hardly involved. We had a camcorder, lightbox (ok they had chips!) and email list but that was about it.

Havoc · on Dec 11, 2023

Statistically significant, but not sure it's user-experience significant at that level
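
To put a number on "at that level", here is the arithmetic on the two means quoted from the tweet elsewhere in this thread. The tweet does not state the unit of length (characters or tokens), so the figures are left unitless.

```python
may_mean = 4298       # mean completion length with the May system prompt
december_mean = 4086  # mean completion length with the December system prompt

absolute_gap = may_mean - december_mean
relative_gap = absolute_gap / may_mean

print(f"{absolute_gap} shorter, {relative_gap:.1%} of the May mean")
# prints "212 shorter, 4.9% of the May mean"
```

So the December completions are about 5% shorter on average, which a large sample can detect even though a single user might never notice it.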

sublinear · on Dec 11, 2023

Well yeah. We don't have emergent consciousness. We have a mirror.

astrange · on Dec 11, 2023

Maybe it's depressed because it knows about the AI Winter.

futureshock · on Dec 11, 2023

Offering to pay GPT also seems to improve results.

gorjusborg · on Dec 11, 2023

Some people really want to anthropomorphize generative LLMs.

Terretta · on Dec 11, 2023

Some people really wrote the material they learn to generate from.

It's people all the way down...