I do realize the benefits of the 'compression' model of ML. Sometimes you can even use compression directly, like here: https://arxiv.org/abs/cs/0312044.
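If I remember that paper right, the core trick is the normalized compression distance, which you can sketch in a few lines with any off-the-shelf compressor. A minimal sketch, with zlib as my stand-in for whatever compressors the authors actually benchmarked:

    import zlib

    # Normalized compression distance: near 0 for similar inputs,
    # near 1 for unrelated ones.
    def ncd(x: bytes, y: bytes) -> float:
        cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
        cxy = len(zlib.compress(x + y))
        return (cxy - min(cx, cy)) / max(cx, cy)

    # Similar texts compress better together, so their distance is lower.
    doc_a = b"the cat sat on the mat and purred at the cat flap"
    doc_b = b"a cat on a mat purred near the cat flap"
    doc_c = b"stochastic gradient descent updates the model parameters"
    print(ncd(doc_a, doc_b))  # comparatively low: shared structure
    print(ncd(doc_a, doc_c))  # higher: little shared structure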
I suppose you're right that a few substantively similar outputs could already be enough to get you sued (depending on who's scrutinizing you).
While talking with you, it occurred to me that so far we've ignored the output set o, which is the set of all images output by, say, Stable Diffusion. n can then be defined as n = m ∩ o.
And we know m is much larger than n, and o is for practical purposes infinite [1] (you can generate as many unique images as you like), so o >> m >> n. [2]
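To pin those definitions down, here's a toy sketch (strings stand in for images, all names are made up, and m is the training set as I've understood it from upthread):

    # Toy model of the three sets; strings stand in for images.
    m = {"train_001", "train_002", "train_003"}         # training set
    o = {"train_002", "gen_aaa", "gen_bbb", "gen_ccc"}  # everything the model can output

    n = m & o  # n = m ∩ o: training images the model can reproduce
    print(n)   # {'train_002'} -- exactly the 'overfit' cases

    # The claim is |o| >> |m| >> |n|, with n ideally empty.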
Already at this point, I think calling SD a compression algorithm is a little odd: compression doesn't look like the goal at all, especially when the authors treat n like a bug ('overfit') and keep trying to shrink it.
And that's before even looking at the "compression ratio" and "loss ratio" of this algorithm, so maybe I can save myself some maths in future. It's an interesting approach to the argument that I might try again. (Thank you for helping me think in this direction.)
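Though actually, the back-of-envelope version is short enough to do right here. All three figures below are rough assumptions of mine (a LAION-2B-scale training set, ~4 GB of fp32 weights for SD v1, ~100 kB per stored JPEG), not measurements:

    # Order-of-magnitude "compression ratio" for SD v1, under assumed figures.
    n_train_images = 2e9     # ~LAION-2B-scale training set (assumption)
    weights_bytes = 4e9      # ~4 GB of fp32 weights (assumption)
    avg_image_bytes = 100e3  # ~100 kB per JPEG (assumption)

    per_image = weights_bytes / n_train_images
    print(f"~{per_image:.0f} bytes of model per training image")  # ~2 bytes

    ratio = n_train_images * avg_image_bytes / weights_bytes
    print(f"implied 'compression ratio' of ~{ratio:,.0f}:1")      # ~50,000:1

Two bytes of model per training image is not a ratio any real codec could survive; no decoder reconstructs the originals from that.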
* I think Getty might have a bit of a point in their lawsuit, if the model really was overfitted on some of their images. Though I wonder whether in some cases the model merely added Getty watermarks to novel images; I'm fairly sure that had something to do with setting Getty off.
* I am deeply suspicious of the NYT case. In a large chunk of their examples they used ChatGPT to browse their own website, which makes me wonder whether the rest of the examples are only slightly more subtle. IIRC I couldn't replicate them trivially. (YMMV; we can revisit if you're really interested.)
[1] Though in practice, floating-point precision does appear to impose limits.
[2] I'm using >> as "much greater than"