Over and over again people create some statistically significant way of measuring a difference in the length of responses.
Why the obsession over that metric? People in the Twitter replies are going on about how that shows it's getting worse.
For me, the shorter responses are the better ones. It's specifically in my personal instructions to reduce the length of responses. That said, sometimes I want a longer one, and then I ask for it.
It's "cool" that it has different response lengths seemingly based on its perceived date. It doesn't make it "better" or "worse".
I've seen interesting collisions around the word 'bias' that hint at motivation, but I don't get the fundamental point of the conversation around jailbreaking and metrics. There doesn't seem to be any real goal and the questions feel like something an English prof could field over dinner.
For me it seems strange that nobody is mentioning the obvious: noise. I would like to see a study where someone adds "RANDOM_SEED N" (with N being some random 7-digit number) to each message, checks the distribution of lengths, and uses that as a control.
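To make the control idea concrete, here is a minimal sketch of how the tagged messages could be generated. The function names are made up for illustration, and putting the tag on its own line at the end of the message is my assumption; the comment above only specifies the "RANDOM_SEED N" text.

```python
import random
from typing import List


def add_seed_tag(message: str, rng: random.Random) -> str:
    """Append a meaningless 'RANDOM_SEED N' tag, N being a random 7-digit number."""
    n = rng.randint(1_000_000, 9_999_999)  # inclusive bounds, always 7 digits
    # Assumption: the tag goes on its own line after the message.
    return f"{message}\nRANDOM_SEED {n}"


def control_prompts(message: str, k: int, seed: int = 0) -> List[str]:
    """Build k copies of the same message that differ only in the noise tag."""
    rng = random.Random(seed)  # seeded so the control set is reproducible
    return [add_seed_tag(message, rng) for _ in range(k)]
```

If response lengths for two arbitrary halves of such a set already differ about as much as May vs. December, the date effect is probably just prompt noise.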
The current date is part of the system prompt as far as I can tell. At least on ChatGPT you can ask for current date, not sure about API (EDIT: indeed, this is explained further in the tweet that OP links to[0], go there since that's where the actual info is).
Maybe the model learnt that humans in general tend to give shorter/lazier responses around this date (being trained on texts with dates in metadata, think forum posts, blogs, tweets, etc.) so it could very well be imitating trends in seasonal human behavior.
The underlying data is embedded with artifacts of behaviors/affects that reflect the underlying state of the person who wrote them.
Stanford famously studied this for Reddit (notably a huge part of the Common Crawl dataset) and reported that it had non-trivial percentages of anti-social sentiments embedded in the text [1].
You can't filter out the biases embedded in the data because they are a functional artifact of what the people creating the data were communicating. The best you can do is put up guardrails and censor it, but that's a never-ending game you can't win.
Well the original post uses a for loop with a prompt that manually sets the date to various dates, then measures the length. So it seems very easy to set a holiday date.
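Roughly, that loop could look like the sketch below. This is not the original poster's code: the system prompt wording, the function names, and measuring length in characters are all assumptions, and `complete` is a placeholder for whatever function actually calls the model.

```python
from statistics import mean
from typing import Callable, Dict, List


def system_prompt_for(date_str: str) -> str:
    # Hypothetical wording; the original post's exact system prompt is not shown here.
    return f"You are a helpful assistant. Current date: {date_str}"


def lengths_by_date(
    complete: Callable[[str, str], str],  # (system_prompt, user_prompt) -> completion text
    user_prompt: str,
    dates: List[str],
    n: int,
) -> Dict[str, List[int]]:
    """Sample n completions per date and record each completion's length."""
    results: Dict[str, List[int]] = {}
    for date_str in dates:
        system_prompt = system_prompt_for(date_str)
        # Assumption: length is counted in characters; tokens would work the same way.
        results[date_str] = [len(complete(system_prompt, user_prompt)) for _ in range(n)]
    return results


def mean_lengths(results: Dict[str, List[int]]) -> Dict[str, float]:
    """Collapse each date's samples to a mean for a quick comparison."""
    return {date_str: mean(lengths) for date_str, lengths in results.items()}
```

Swapping in a holiday date is then just another entry in `dates`.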
Testing “friendliness” or “holiday cheer” could be done via some sort of proxy…
you could prompt it to reply to generic pleasantries while role playing “as someone busy who is late” and see if the word choice changes. Or maybe ask it to be a judge for a crime of desperation (stealing toys for orphans?) and determine sentencing durations as a proxy for sympathy? I suspect that is too far from training data though.
Of course winter does not happen at the same time throughout the world. Would be interesting to see if it does the same if asked in a language spoken in the southern hemisphere.
That's a great question. However, I'd suggest a different approach as:
1. Southern hemisphere is mostly water
2. Majority of the text online is in English
So maybe a better approach would be to geolocate the prompt in Australia, for example: it's December in Melbourne, Australia...
December in Melbourne (and most of Australia) is a wind-down month. Over the week between Christmas and New Year's many jobs essentially stop as we have three public holidays (which will always fall on weekdays) across six days and people tend to take the opportunity to travel and enjoy the nicer part of summer. People just don't put in the work in December.
Surely this is more related to the fact that "May" has multiple meanings, but "December" is quite clearly just the name of a month.
Does it happen with April and June? Does the response length trend downward as the months approach December? Does it change its behavior if it believes it's in a hemisphere where May is snowy?
If it were performing better on standardized testing then I would find the results compelling, but generating longer responses isn't even necessarily desirable.
If this is true, I would naively expect even greater differences if your prompt simply states "You have seasonal depression. Please, ..."
Alternately, "It is a bright sunny summer day, with just enough of a cool breeze to feel invigorating. You woke from a wonderful night's sleep and started your day with coffee, eggs, modafinil and adderall. As you feel your IQ, motivation and creative spirits reaching peak flow, please ..."
> I would naively expect even greater differences if your prompt simply states "You have seasonal depression. Please, ..."
That's going to trigger outright LARPing, and you should not trust information integrity at that point. Following it up with cocaine and antidepressants just negates it, so you might as well leave both conditions out and spare the tokens and confusion; it's TMI.
I don't see the value in trying to correct this, at any rate. If people normally send and receive short emails in December, forcing it to write an amphetamine-fueled novella is not going to be natural or well-received.
Wild result. gpt-4-turbo over the API produces (statistically
significant) shorter completions when it "thinks" it's December
vs. when it thinks it's May (as determined by the date in the
system prompt).
I took the same exact prompt over the API (a code completion
task asking to implement a machine learning task without
libraries).
I created two system prompts, one that told the API it was May
and another that it was December and then compared the
distributions.
For the May system prompt, mean = 4298
For the December system prompt, mean = 4086
N = 477 completions in each sample from May and December
t-test p < 2.28e-07
To reproduce this you can just vary the date number in the
system message. Would love to see if this reproduces for others.
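For anyone who wants to check the significance claim on their own samples, here is a stdlib-only sketch. The tweet only says "t-test", so using Welch's unequal-variance version is my assumption, and the p-value uses a normal approximation that is only reasonable for large samples like the 477 per group quoted above.

```python
from math import sqrt
from statistics import NormalDist, mean, variance
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return Welch's t statistic and its approximate degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def approx_two_sided_p(t: float) -> float:
    """Two-sided p-value using the normal approximation (large samples only)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))
```

Feed it the two lists of completion lengths, e.g. `t, df = welch_t(may_lengths, december_lengths)`. The quoted p-value can't be recomputed from the tweet alone, since it gives the means and N but not the variances.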
This is part of the ongoing discussion about whether or not ChatGPT has got "lazier".
we've heard all your feedback about GPT4 getting
lazier! we haven't updated the model since Nov 11th,
and this certainly isn't intentional. model behavior
can be unpredictable, and we're looking into fixing
it
The theory was that maybe the system prompt they inject telling it the current date might be influencing it, because its training data showed people worked less hard in December.
New evidence suggests that theory might actually hold up!
LLMs are an amazing high dimensional lens into our society. When they get more capable and can move that from system 1 to system 2, they will just tell us that we write less in December and probably why!
Imagine being a sociologist with the ability to train LLMs on very specific hyperplanes of data.
Maybe I'm not understanding what the original user was testing here, but it seems like the finding is that different inputs result in a statistically significant difference in output length? Seems pretty unsurprising?
The difference seems too small to rule out all sorts of other explanations, but the general idea would be that you can't predict shorter/longer for one minor change, so the averages over samples from each prompt should be similar unless the thing modified relates to terseness in the training set.
I think you'd expect them to be similar, but not exactly the same?
This feels like the original author is over-anthropomorphizing LLMs, and expecting them to interpret prompts the way humans would, but it seems obvious that changing the prompt results in a slightly different context window, which results in a slightly different response distribution? Similarly, if you changed whether the bit about time of year was at the beginning or the end of the prompt, I would expect a statistically different distribution of response lengths.
Does this mean humanity's data is generally worse in the winter? Like all our articles, stories, forum posts, etc. reflect poorer writing and thought during winter?
It can be an interesting thought experiment to imagine a world where clocks didn’t exist and we hadn’t subdivided the years into named slices. We spend so much time reasoning relative to our understanding of time, which is seen through the lens of clocks and calendars.
Another perspective is one of phenomena that come and go based on the state of the environment and the progression of some underlying process, e.g. a plant that flowers and then goes dormant.
I know that personally, I experience very different levels of productivity and mental states based on the time of year. Spring brings a feeling of newness and possibility. Summer a desire to socialize and have fun in the sun. Winter is more contemplative and occasionally depressive.
I could definitely see my output as someone who writes things taking on different characteristics as the conditions around me ebb and flow.
As others have pointed out, I don’t think this variance is limited to a single notion of better/worse, but it definitely modulates my output.
I think productivity is lower in the winter, so I'm not sure about quality per se, but intuitively it makes sense that anything written in the winter months is less verbose.
Not sure how I feel about that. I did TM years ago and loved how computers were hardly involved. We had a camcorder, lightbox (ok they had chips!) and email list but that was about it.