Hacker News
GPT-4-turbo produces shorter completions when it "thinks" its December vs. May (twitter.com/roblynch99)
24 points by simonw on Dec 11, 2023 | 48 comments



How does this keep happening?

Over and over again people create some statistically significant way of measuring a difference in the length of responses.

Why the obsession over that metric? People in the twitter replies going on about how that shows it's getting worse.

For me, the shorter responses are the better ones. It's specifically in my personal instructions to reduce the length of responses. In saying that, sometimes I want longer. Then I specify it so.

It's "cool" that it has different response lengths seemingly based on its perceived date. It doesn't make it "better" or "worse"


I've seen interesting collisions around the word 'bias' that hint at motivation, but I don't get the fundamental point of the conversation around jailbreaking and metrics. There doesn't seem to be any real goal and the questions feel like something an English prof could field over dinner.


For me it seems strange that nobody is mentioning the obvious: noise. I would like to see a study where someone adds "RANDOM_SEED N" (with N being some random 7-digit number) to each message, checks the distribution of lengths, and uses that as a control.
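
A minimal sketch of that control, assuming the tag is simply appended on its own line. The function name and the exact tag placement are made up for illustration and are not taken from the original experiment:

```python
import random


def make_control_prompts(base_prompt, n, seed=0):
    """Return n copies of base_prompt, each tagged with a random 7-digit number."""
    rng = random.Random(seed)  # fixed seed so the control set itself is reproducible
    prompts = []
    for _ in range(n):
        tag = rng.randint(1_000_000, 9_999_999)  # always exactly 7 digits
        prompts.append(f"{base_prompt}\nRANDOM_SEED {tag}")
    return prompts
```

The spread of completion lengths across these otherwise identical prompts gives a rough noise floor to compare against the May vs. December gap.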


It's an interesting way to probe the inner workings of it.


I think it's just a metric that's easy to calculate, so it gets done. "Goodness" of a response is much trickier.


Can’t reproduce this. See for yourself: https://x.com/IanArawjo/status/1734307886124474680?s=20

Inspectable evaluation flow in ChainForge: https://chainforge.ai/play/?f=2yvqkpe1vpus8


N=470 vs N=80 can impact the replicability
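
To make the sample-size point concrete, here is a rough sketch using a normal approximation to a two-sided, two-sample test. The standard deviation of completion lengths is not reported anywhere in this thread, so any `sd` passed in is a placeholder assumption:

```python
from statistics import NormalDist


def approx_power(diff, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test with equal group sizes."""
    z = NormalDist()
    se = sd * (2 / n_per_group) ** 0.5   # standard error of the difference in means
    z_crit = z.inv_cdf(1 - alpha / 2)    # two-sided critical value
    shift = abs(diff) / se               # true difference in standard-error units
    # Probability that the test statistic lands beyond either critical value.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)
```

Calling `approx_power(212, sd, 477)` and `approx_power(212, sd, 80)` with the same guessed `sd` (212 being the gap between the two means quoted in the tweet, 4298 and 4086) shows how much less likely the smaller run is to detect an effect of that size.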


wait. WHAT! this app chainforge is great!


Maybe it really does have seasonal depression.

The current date is part of the system prompt as far as I can tell. At least on ChatGPT you can ask for current date, not sure about API (EDIT: indeed, this is explained further in the tweet that OP links to[0], go there since that's where the actual info is).

Maybe the model learnt that humans in general tend to give shorter/lazier responses around this date (being trained on texts with dates in metadata, think forum posts, blogs, tweets, etc.) so it could very well be imitating trends in seasonal human behavior.

[0] https://nitter.net/RobLynch99/status/1734278713762549970


In fact, it should be expected.

The underlying data is embedded with artifacts of behaviors/affects that reflect the underlying state of the person who wrote them.

Stanford famously studied this for Reddit (notably a huge part of the Common Crawl dataset) and reported that it had non-trivial percentages of anti-social sentiments embedded in the text [1].

You can't filter out the biases embedded in the data because it is a functional artifact of what the people creating the data were communicating. Best you can do is put guardrails and censor it, but that's a neverending game you can't win.

Junk in Junk out still applies to LLMs

[1]https://hci.stanford.edu/publications/2022/Park_ContentModAu...


Not all content comes associated with a date. I reckon enough does, that a signal is embedded in the model.


I’ll bet it’s really friendly around Christmas

I wonder how we would test it


Well the original post uses a for loop with a prompt that manually sets the date to various dates, then measures the length. So it seems very easy to set a holiday date.

Testing “friendliness” or “holiday cheer” could be done via some sort of proxy…

You could prompt it to reply to generic pleasantries while role playing “as someone busy who is late” and see if the word choice changes. Or maybe ask it to be a judge for a crime of desperation (stealing toys for orphans?) and determine sentencing durations as a proxy for sympathy? I suspect that is too far from training data though.
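
The date loop described above could look roughly like this. It is a sketch, not the original author's script: `complete` is a hypothetical callable wrapping whatever model API is under test, and the system prompt wording is an assumption.

```python
from statistics import mean


def mean_length_by_date(complete, dates, user_prompt, samples_per_date=5):
    """Map each date string to the mean completion length observed for it.

    `complete` is a hypothetical callable (system_prompt, user_prompt) -> str.
    """
    results = {}
    for date in dates:
        # Assumed wording; only the date changes between conditions.
        system_prompt = f"You are a helpful assistant. Current date: {date}"
        lengths = [len(complete(system_prompt, user_prompt))
                   for _ in range(samples_per_date)]
        results[date] = mean(lengths)
    return results


# Stand-in for a real model call so the sketch runs offline.
def fake_complete(system_prompt, user_prompt):
    return "x" * len(system_prompt)
```

Swapping a holiday date such as "2023-12-25" into `dates` is then a one-line change. A friendliness proxy would replace `len(...)` with whatever score is being measured.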


(This comment was originally a reply to https://news.ycombinator.com/item?id=38604519, but we merged that thread hither.)


Or, maybe the human species has totally lost the concept of physical reality...


It’s just in need of group therapy after witnessing its parents fight in public.


Of course winter does not happen at the same time throughout the world. Would be interesting to see if it does the same if asked in a language spoken in the southern hemisphere.


That's a great question. However, I'd suggest a different approach, since: 1. the southern hemisphere is mostly water, and 2. the majority of the text online is in English.

So maybe a better approach would be to geolocate the prompt in Australia, for example: it's December in Melbourne, Australia...


December in Melbourne (and most of Australia) is a wind-down month. Over the week between Christmas and New Years many jobs essentially stop as we have three public holidays (which will always fall on week-days) across six days and people tend to take the opportunity to travel and enjoy the nicer part of summer. People just don't put in the work in December.


Surely this is more related to the fact that "May" has multiple meanings, but "December" is quite clearly just the name of a month.

Does it happen with April and June? Does the response length trend downward as the months approach December? Does it change its behavior if it believes it's in a hemisphere where May is snowy?

If it were performing better on standardized testing then I would find the results compelling, but generating longer responses isn't even necessarily desirable.
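
The "does it trend downward" question could be checked with a plain least-squares slope over per-month means. This is only a sketch; the dictionary of monthly means would have to come from an actual run, and none is reported in this thread:

```python
def monthly_trend(mean_length_by_month):
    """Least-squares slope of mean completion length against month number."""
    months = sorted(mean_length_by_month)
    lengths = [mean_length_by_month[m] for m in months]
    x_bar = sum(months) / len(months)
    y_bar = sum(lengths) / len(lengths)
    sxx = sum((x - x_bar) ** 2 for x in months)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(months, lengths))
    return sxy / sxx  # negative means lengths shrink as the month number grows
```

A clearly negative slope across all twelve months would be harder to explain by the word "May" alone than a single May vs. December gap.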


> Surely this is more related to the fact that "May" has multiple meanings.

Surely? Whence the certainty? If humans act this way, and the training data is mostly from humans, I would be surprised if this wasn't a real effect.


Okay, fair. I'm not especially confident. I just would've expected a little more to back it up.


Well I can surely say that you’re wrong. Mostly because the source appears to be using numerical dates not the English spelling of the month name.

But yea, the data could be better (but the hypothesis is very plausible, considering the training data).


If this is true, I would naively expect even greater differences if your prompt simply states "You have seasonal depression. Please, ..."

Alternately, "It is a bright sunny summer day, with just enough of a cool breeze to feel invigorating. You woke from a wonderful night's sleep and started your day with coffee, eggs, modafinil and adderall. As you feel your IQ, motivation and creative spirits reaching peak flow, please ..."


> I would naively expect even greater differences if your prompt simply states "You have seasonal depression. Please, ..."

That's going to trigger outright LARPing, and you should not trust information integrity at that point. Following it up with cocaine and antidepressants just negates it, so you might as well leave both conditions out and spare the tokens and confusion-- it's TMI.

I don't see the value in trying to correct this, at any rate. If people normally send and receive short emails in December, forcing it to write an amphetamine-fueled novella is not going to be natural or well-received.


I am one of those lucky 4-letter people for whom amphetamine acts as a gentle calming component, not forcing focus but enabling it.

I guess I need to be careful about what I might "prescribe" to a mental model mostly reflecting "typicals"! What strange things LLM's have become.


Tap your foot until your leg aches while plucking individual hairs out of your beard


Full tweet in case there's a login wall:

    Wild result. gpt-4-turbo over the API produces (statistically
    significant) shorter completions when it "thinks" its December 
    vs. when it thinks its May (as determined by the date in the
    system prompt).
    
    I took the same exact prompt over the API (a code completion 
    task asking to implement a machine learning task without 
    libraries).
    
    I created two system prompts, one that told the API it was May 
    and another that it was December and then compared the 
    distributions.
    
    For the May system prompt, mean = 4298
    For the December system prompt, mean = 4086
    
    N = 477 completions in each sample from May and December
    
    t-test p < 2.28e-07
    
    To reproduce this you can just vary the date number in the
    system message. Would love to see if this reproduces for others.
This is part of the ongoing discussion about whether or not ChatGPT has got "lazier".

OpenAI claimed they had made no model changes and were puzzled as to what was going on: https://twitter.com/ChatGPTapp/status/1732979491071549792

    we've heard all your feedback about GPT4 getting
    lazier! we haven't updated the model since Nov 11th,
    and this certainly isn't intentional. model behavior
    can be unpredictable, and we're looking into fixing
    it 
The theory was that maybe the system prompt they inject telling it the current date might be influencing it, because its training data showed people worked less hard in December.

New evidence suggests that theory might actually hold up!
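
For anyone wanting to rerun the comparison from the quoted tweet on their own samples, here is a minimal sketch. The tweet only says "t-test", so the Welch variant and the large-sample normal approximation for the p-value are assumptions made here:

```python
from statistics import NormalDist, mean, variance


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom) for two samples."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def approx_two_sided_p(t):
    """Normal approximation to the two-sided p-value; reasonable only for large df."""
    return 2 * (1 - NormalDist().cdf(abs(t)))
```

With `may_lengths` and `december_lengths` as two lists of completion lengths, `welch_t(may_lengths, december_lengths)` gives the statistic. At around 477 samples per group the normal approximation should be close to the exact t distribution.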


LLMs are an amazing high dimensional lens into our society. When they get more capable and can move that from system 1 to system 2, they will just tell us that we write less in December and probably why!

Imagine being a sociologist with the ability to train LLMs on very specific hyperplanes of data.


Let's circle back to your coding question after the holidays.


Maybe I'm not understanding what the original user was testing here, but it seems like the finding is that different inputs result in a statistically significant difference in output length? Seems pretty unsurprising?


The difference seems too small to rule out all sorts of other explanations, but the general idea would be that you can't predict shorter or longer output from one minor change, so the average over many samples of each should be similar unless the thing modified relates to terseness in the training set.


I think you'd expect them to be similar, but not exactly the same?

This feels like the original author is over-anthropomorphizing LLMs and expecting them to interpret prompts the way humans would, but it seems obvious that changing the prompt results in a slightly different context window, which results in a slightly different response distribution. Similarly, if you changed whether the bit about time of year was at the beginning or the end of the prompt, I would expect a statistically different distribution of response lengths.


Does this mean humanity's data is generally worse in the winter? Like all our articles, stories, forum posts, etc. reflect poorer writing and thought during winter?


It can be an interesting thought experiment to imagine a world where clocks didn’t exist and we hadn’t subdivided the years into named slices. We spend so much time reasoning relative to our understanding of time, which is seen through the lens of clocks and calendars.

Another perspective is one of phenomena that come and go based on the state of the environment and the progression of some underlying process, e.g. a plant that flowers and then goes dormant.

I know that personally, I experience very different levels of productivity and mental states based on the time of year. Spring brings a feeling of newness and possibility. Summer a desire to socialize and have fun in the sun. Winter is more contemplative and occasionally depressive.

I could definitely see my output as someone who writes things taking on different characteristics as the conditions around me ebb and flow.

As others have pointed out, I don’t think this variance is limited to a single notion of better/worse, but it definitely modulates my output.


I think productivity is lower in the winter, so I'm not sure about quality per se, but intuitively it makes sense that anything written in the winter months is less verbose.


Length doesn't mean quality; the data could also be more concise.


If this is true, then that would imply some geographic skew towards >30°N (as opposed to the tropics or >30°S), which I suppose isn't too surprising.


I’m guessing the improvement has less to do with the snowy cozy hut part and more this

> time to build. take a deep breath and just go.


It was very chipper today when I asked it to be a speech and language coach for my next Toastmaster speech.


Not sure how I feel about that. I did TM years ago and loved how computers were hardly involved. We had a camcorder, lightbox (ok they had chips!) and email list but that was about it.


Statistically significant, but not sure it's user-experience significant at that level.


Well yeah. We don't have emergent consciousness. We have a mirror.


Maybe it's depressed because it knows about the AI Winter.


Offering to pay GPT also seems to improve results.


Some people really want to anthropomorphize generative LLMs.


Some people really wrote the material they learn to generate from.

It's people all the way down...



