Since most needle-in-a-haystack tests inject a line into a predefined book text (which may itself be part of the original dataset), one could hypothesize that the LLM is simply "smelling" for something that does not fit the context.
So, is it possible to create a "haystack" that is a mix of multiple articles, or just a list of one-liners, so that the model cannot guess which line is the needle simply because it looks out of place?
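
As a rough illustration of the one-liner variant, a minimal Python sketch could look like the following. The function name `build_haystack`, the filler sentences, and the needle are all made up for illustration; nothing here comes from an existing benchmark.

```python
import random


def build_haystack(filler_lines, needle, seed=0):
    """Shuffle unrelated filler lines and insert the needle at a random position.

    Returns a tuple (haystack_text, needle_index), where needle_index is the
    zero-based line number of the needle inside the haystack.
    """
    rng = random.Random(seed)          # seeded so a run can be reproduced
    lines = list(filler_lines)         # copy, so the caller's list is untouched
    rng.shuffle(lines)
    # randint is inclusive on both ends, so the needle may also land last
    position = rng.randint(0, len(lines))
    lines.insert(position, needle)
    return "\n".join(lines), position


# Hypothetical filler: unrelated one-liners with no shared topic
filler = [
    "The ferry to the island leaves every forty minutes.",
    "Basalt columns form when thick lava cools slowly.",
    "A standard violin has four strings tuned in fifths.",
    "The library closes early on public holidays.",
    "Sourdough starter needs regular feeding with flour and water.",
]
needle = "The access code for the archive room is 7421."

haystack, needle_index = build_haystack(filler, needle, seed=42)
```

The intent is that, because no line is related to its neighbors, the needle should not stand out by topic alone. Whether that actually removes the "smell" would still have to be tested.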