They are trained and evaluated on correctness benchmarks. But correctness on benchmark questions is only loosely coupled to correctness outside the benchmark, in part because LLMs aren't grounded in the same biological reality as humans. You can't easily convince an average person to cut off their own hand, and that resistance has little to do with higher-level thought. By contrast, it takes only a bit of creativity to convince an LLM to say or do almost anything.