stephschie 11 hours ago [-]
Hmm, I'm not convinced that is the direction we want to go in. It's not like we have all the context of everything we ever learned present when making decisions. Heck, even for CPUs and GPUs we have a strict hierarchy of L1, L2, and L3/shared caches in front of larger memory units, with constant management of those. Feel free to surprise me, but I believe having a similar stack for LLMs is the better way to go, where we will have short-term memory (system prompt, prompt, task), mid-term memory (session knowledge, preferences), long-term memory (project knowledge, tech/stack insights), and intuition memory (stemming from language, physics, rules). But right now we haven't developed best practices yet for what information should go into what layer at what times. Increasing the overall context window is nice, but IMHO won't help us much.
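To make the layering concrete, something like this is the shape I have in mind (purely an illustrative sketch; the layer names and the crude token estimate are made up, not an existing library):

    from dataclasses import dataclass, field

    @dataclass
    class MemoryStack:
        # Illustrative layers; "intuition memory" lives in the weights, not the context.
        short_term: list = field(default_factory=list)  # system prompt, prompt, task
        mid_term: list = field(default_factory=list)    # session knowledge, preferences
        long_term: list = field(default_factory=list)   # project knowledge, tech/stack insights

        def assemble_context(self, token_budget: int) -> str:
            """Fill the window layer by layer, most volatile first, until the budget runs out."""
            chunks, used = [], 0
            for layer in (self.short_term, self.mid_term, self.long_term):
                for item in layer:
                    cost = len(item) // 4  # crude estimate: ~4 characters per token
                    if used + cost > token_budget:
                        return "\n\n".join(chunks)
                    chunks.append(item)
                    used += cost
            return "\n\n".join(chunks)

The data structure is the easy part; the missing best practices are the promotion/demotion policies between the layers.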
user2722 10 hours ago [-]
I have a simple and brittle system to track people, facts, and associations in newspapers, which is basically: "LLM, extract people, places, projects, and structure, and save them as an Obsidian-compatible graph network." (Rough sketch at the end of this comment.)
For 2 or 3 newspapers it works; my idea was to use it as grounding to discover relationships between people, companies and jobs.
As for the "everyone's life", I have always assumed that there would be a graph system to point to "forgotten" documents.
Gemini said my idea was amazing and new in its implementation, even if not in spirit, but I'm assuming it was being sycophantic as usual.
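The core of it is roughly this shape (a minimal sketch; the model name, the prompt wording, and the note layout are placeholders rather than exactly what I run):

    import json, pathlib
    from openai import OpenAI  # any LLM client would do; this one is only for illustration

    client = OpenAI()
    VAULT = pathlib.Path("vault")  # the Obsidian vault directory (placeholder)

    def extract(article_text: str) -> dict:
        """Ask the model for entities plus relations, as JSON."""
        prompt = (
            "Extract people, places, projects and organisations from this article, "
            "and how they relate. Reply as JSON with keys 'entities' (name, type) "
            "and 'relations' (source, target, kind).\n\n" + article_text
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
        )
        return json.loads(resp.choices[0].message.content)

    def write_notes(data: dict) -> None:
        """One markdown note per entity; [[wikilinks]] are what give Obsidian the graph."""
        VAULT.mkdir(exist_ok=True)
        for ent in data["entities"]:
            links = [r["target"] for r in data["relations"] if r["source"] == ent["name"]]
            body = f"type:: {ent['type']}\n" + "\n".join(f"- [[{t}]]" for t in links)
            (VAULT / f"{ent['name']}.md").write_text(body)

The brittleness is mostly in getting the model to return clean JSON and consistent entity names across articles.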
asielen 3 hours ago [-]
Interesting, how do you use that system? I could see it being useful for genealogy research.
altmanaltman 9 hours ago [-]
I always find it better to ask LLMs why something is bad and to have them explain why they think so. Sometimes they might hallucinate stuff, but forcing them to find the negatives is better than asking for an opinion, since I am guessing they found early in training that an agreeable LLM is better received than one which is constantly truthful and considers you to be pretty dumb.
johnmaguire 9 hours ago [-]
> I am guessing they found early in training that an agreeable LLM is better received than one which is constantly truthful and considers you to be pretty dumb
My sense is that this is sort of accurate, but more likely it's a result of two things:
1. LLMs are still next-token predictors, and they are trained on text written by humans, who mostly collaborate. Staying on topic is more likely than diverging into a new idea.
2. LLMs are trained via RLHF which involves human feedback. Humans probably do prefer agreeable LLMs, which causes reinforcement at this stage.
So yes, kinda. But I'm not sure it's as clear-cut as "the researchers found humans prefer agreeableness and programmed it in."
user2722 5 hours ago [-]
I usually ask it to give me two batches of answers: 1) normal; 2) not sycophantic.
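Roughly like this, appended to whatever I'm actually asking (the wording is just an example, nothing magic about it):

    # Illustrative prompt suffix; phrasing is made up, not a recommendation.
    TWO_BATCH_SUFFIX = """
    Answer twice:
    1) Normal answer.
    2) Non-sycophantic answer: assume my idea is flawed, list the strongest
       objections, and do not soften them.
    """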
itissid 9 hours ago [-]
But in-context learning could be better. One important thing here is also the ability to align on what to pay more or less attention to, no matter the knowledge base. These are the highest-leverage points that need to be exposed to a human to think and reason over. Constrained/guardrailed development tasks work fine*, but exploring new directions, as opposed to exploiting local minima, is still an Achilles' heel: even with all this knowledge, unless there is sufficient steering and exploration, the minima-seeking behavior "tries" hard to win.
* With Claude's 1-million context window I have been doing some slightly longer-range tasks (~1-3 days of work) with RPI/QRSPI frameworks (see my comments elsewhere on HN from the last few days) in one context window. They involve a grill-me session with 20-60, sometimes more, questions per task to get alignment, which produces the design and the plan in one window.
johnmaguire 9 hours ago [-]
> They involve a grill-me session with 20-60, sometimes more, questions per task to get alignment, which produces the design and the plan in one window.
My experience with this has been that it front-loads a lot of the LLM interactions, which can be exhausting without a reward (i.e. output.) And then, when I get the output, it's so large as to be hard to review/grok.
In other words, it feels a bit like when my coworker delivers me a month's worth of work in a single PR.
RugnirViking 8 hours ago [-]
"It's not like we have all the context of everything we ever learned present when making decisions."
We don't, no. But wouldn't it be great if we did? I'd sure love to be able to hold the entirety of the code of my organisation's monolith in my head at once. It would make everything so much easier. It would definitely also cut down on the bugs I write!
Similarly if I could recall all of my organisation's Confluence pages. I'd probably be a lot better at my job. Same with all the Slack history. All the HR documents, press releases, meeting transcripts. There's practically no end to useful context even just in text form, and even if much of it is not relevant to any one task, having all of it in working memory would be fantastic, if only it were possible. I could probably find incredible cross-organisational efficiencies and probably be far wealthier if I were some savant that could hold all of this in my head at once.
I get that we have agent harnesses to try and fetch only the relevant information. But most of the problems result from either failures in this process or previous things falling out of context. I very rarely see failures where the agent forgets stuff already in context. The harnesses are making up for this exact limitation!
palmotea 7 hours ago [-]
> Similarly if I could recall all of my organisation's Confluence pages. I'd probably be a lot better at my job. Same with all the Slack history. All the HR documents, press releases, meeting transcripts. There's practically no end to useful context even just in text form, and even if much of it is not relevant to any one task, having all of it in working memory would be fantastic, if only it were possible. I could probably find incredible cross-organisational efficiencies and probably be far wealthier if I were some savant that could hold all of this in my head at once.
That sounds like the beginning of a sci-fi story where the conclusion is forgetting is not such a bad thing.
bhouston 10 hours ago [-]
> Hmm, I'm not convinced that is the direction we want to go in. It's not like we have all the context of everything we ever learned present when making decisions.
I do not think it is the direction for everything.
Generally, we need consolidation of experiences and memories to just remember the important conclusions, ideas, and concepts, and then the ability to remember the full details if they are relevant (which they usually are not.)
But for some applications I am sure a billion token context would be useful.
It is likely most people need a 10 core CPU or whatever for most tasks, but for some applications you want a supercomputer with 1M cores.
lubujackson 9 hours ago [-]
I think we are wending toward a solution here for context, because no matter how big a context window is, there needs to be a way to navigate and prioritize that context, a way to handle contradictory info, etc.
So we need a taxonomy, we need memory layers, we need summary/details. If there is one thing I have learned about how these LLMs work, it's that if you give them a few flexible tools they can work the shit out of them to achieve objectives. We just need the right tools and the right structure for context.
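By "a few flexible tools" I mean something with roughly this shape (a hypothetical tool schema; the names and fields are invented for illustration):

    # Hypothetical memory tools exposed to the model; the model decides when to call them.
    MEMORY_TOOLS = [
        {
            "name": "memory_search",
            "description": "Search a memory layer; returns short summaries with ids.",
            "parameters": {"query": "string", "layer": "session | project | org", "top_k": "int"},
        },
        {
            "name": "memory_expand",
            "description": "Fetch the full detail behind a summary id.",
            "parameters": {"id": "string"},
        },
        {
            "name": "memory_write",
            "description": "Store a conclusion, tagged so later contradictions can be resolved.",
            "parameters": {"text": "string", "tags": "list[string]", "supersedes": "id | null"},
        },
    ]

The taxonomy and the contradiction handling live in the store behind these tools; the model just works them.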
lumost 10 hours ago [-]
Currently, it is difficult to live update the model’s parameters in response to new information. This difficulty applies at both an infrastructural level and an optimization level.
We simply don’t know how to reliably incorporate new information without losing old capabilities. Humans handle this through extensive evaluation, heuristics, and experience.
What we do know is that models can adapt to their context, and extending the context window is an infrastructure and capex problem first. A billion useful tokens would obviate the need for any out of band memory structures.
wat10000 9 hours ago [-]
I definitely see why effort is being put into this. But it seems inherently limiting. It's like having someone sit down in a library each day with a notebook containing all their prior work, none of which they can actually remember. At the end of the day, they write out their notes, then go home and get their memory wiped for the next day. Making that notebook longer is an obvious way to improve the system, but it seems like it's going to bump into fundamental limits.
bjourne 4 hours ago [-]
You can't compare context to memory. Context is simply all the text the LLM can use to generate a likely continuation. Imagine you're a relationship expert and I'm asking you for relationship advice. You don't know me, so the best you can give me is "be yourself!" or "be confident!". It doesn't matter how good you are; lack of information about me is your limit. Now imagine you have a complete view of my dating history, including in-depth reviews from ex-girlfriends and whatnot. You could come up with some sharp and very fine-tuned advice just for me. Or maybe it still would be "be yourself!" because dating advice is pseudoscience, but you get my point.
Schlagbohrer 13 hours ago [-]
Amazing that they are trying to solve this with hardware rather than with a new software architecture, but I suppose the current technology underlying LLM software must be far and away the best theoretically or the most established, or the time taken to seek a new model isn't worth it for the big companies.
I know Yann LeCun is trying to do a completely different architecture and I think that's expected to take 2-3 years before showing commercial results, right? Is that why they're finding it quicker to change the hardware?
aurareturn 12 hours ago [-]
It is both a software and hardware problem. Software because you can train LLMs that get better at very large contexts. Hardware because no matter what you do in software, you still need faster and bigger chips.
Yann LeCun has been very wrong in the past about LLMs.[0] The approach he wants to take is to train using sensor data in the physical world. I think it's going to fail because there's a near-infinite amount of physical data, down to Schrödinger's equation on how particles behave. The signal-to-noise ratio is too low. My guess is that they'll need orders of magnitude more compute to even get something useful, but they do not have more compute than OpenAI and Anthropic. In other words, I think LLMs will generate revenue as a stepping stone for OpenAI and Anthropic, such that they will be the ones who ultimately train the AI that LeCun dreams of.
[0] https://old.reddit.com/r/LovingAI/comments/1qvgc98/yann_lecu...
I don't know. Some of those statements still look correct for the time they were made, and then people found out how to work around them. I don't think anyone has really shown his general assumption is wrong. The issue is we don't know what the ceiling for these things is yet, because we haven't hit it. But that doesn't mean there is no ceiling.
aurareturn 12 hours ago [-]
His generation assumptions were wrong. That's the point.
zaphar 11 hours ago [-]
I haven't seen any indication that they are. Can you point me at some?
in-silico 6 hours ago [-]
People are trying to solve it with software too, even if you don't hear about it.
The most high-profile example is the latest set of Qwen models, which replace most of the attention mechanisms with Gated DeltaNet (which uses constant memory with respect to sequence length; rough sketch below).
Test-time training architectures are also getting a lot of attention, and have shown great performance in the academic setting. It's only a matter of time before we start getting open TTT models.
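The constant-memory part is easiest to see in code: instead of a KV cache that grows with the sequence, you keep one fixed-size state matrix and update it per token with a gated delta rule. A toy sketch of the idea (not the actual Qwen/Gated DeltaNet implementation):

    import numpy as np

    def gated_delta_step(S, q, k, v, alpha, beta):
        """One token of a toy gated delta rule. S is a fixed (d_v, d_k) state matrix,
        so per-token cost is O(d_v * d_k) no matter how long the sequence gets.
        alpha is a forget gate, beta a write strength, both in [0, 1]."""
        k = k / (np.linalg.norm(k) + 1e-6)           # normalised key
        S = alpha * (S - beta * np.outer(S @ k, k))  # decay state and erase the old value at key k
        S = S + beta * np.outer(v, k)                # write the new value at key k
        return S, S @ q                              # read with the query

    d_k = d_v = 64
    S = np.zeros((d_v, d_k))
    for _ in range(100_000):                         # 100k tokens, and the memory never grows
        q, k, v = np.random.randn(d_k), np.random.randn(d_k), np.random.randn(d_v)
        S, out = gated_delta_step(S, q, k, v, alpha=0.98, beta=0.5)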
AntiUSAbah 13 hours ago [-]
Nvidia has so much money, it would be a waste if they didn't attack current problems at multiple points at once.
People, researchers, investors, etc. probably also want to see what would be possible, and someone has to do it.
I can also imagine that an inference-optimized system like this could split the context across different requests if it doesn't need to use the full context.
Could also be that they have internal use cases which require this amount of context.
Schlagbohrer 13 hours ago [-]
What does this mean: "In addition, because most AI models are not trained uniformly across their maximum context length, their reasoning quality tends to degrade gradually near the limit rather than fail abruptly."
Models aren't trained across their context; their context is their short-term memory at runtime, right? Nothing to do with training. They are trained on a static dataset.
anon373839 10 hours ago [-]
When you read technical papers on various models, you’ll find that they often did most of the pretraining and even the supervised fine tuning using relatively short context data; then they “extended” the context window by training on a little bit of long context data. I think this is what is meant by not being trained uniformly.
However, now that RL environments and long-horizon agentic performance have taken such a prominent role in model development, I wonder if that practice still holds. I know that the most recent Gemma and Qwen models are incomparably more reliable at long contexts than their predecessors, even though, e.g. Qwen already had a 256k context. It just didn’t work like it does now.
vessenes 11 hours ago [-]
Context is the vector of tokens (numbers) that goes into the first layers of the neural network.
When you train, you teach the model to, among other things, 'self attend' to the input vector, ultimately projecting that vector into a large embedding space.
Thought experiment: if 99% of the time the last 100,000 entries of your vector were zero, how likely is it that you'd have high-quality embeddings trained by doing gradient descent on those outputs?
That’s what the paper is referring to.
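You can make the same point with a back-of-the-envelope script (the length mix is completely made up, just to show the shape of the problem):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    # Hypothetical training mix: 95% of samples under 8k tokens, 5% up to 128k.
    lengths = np.where(rng.random(n) < 0.95,
                       rng.integers(1, 8_192, n),
                       rng.integers(8_192, 131_072, n))

    for pos in (1_000, 8_000, 64_000, 120_000):
        seen = (lengths > pos).mean()
        print(f"position {pos:>7}: appears in {seen:.1%} of training samples")
    # Positions near the end of the window receive a tiny fraction of the gradient
    # signal, which is what "not trained uniformly across the context" means here.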
andai 13 hours ago [-]
Not sure how it is now, but a while back most of the training data was short interactions.
I noticed that the longer a chat gets, the more unpredictable the model's behavior becomes (and I think that's still a common jailbreak technique too).
(I think it might also have something to do with RoPE, but that's beyond me.)
Jabrov 11 hours ago [-]
They absolutely are. The “maximum context window” of a model is a byproduct of the context length it was trained on.
If your model only ever sees 8K-token samples during training, it won't be as good at 128K context length as if you had trained on samples ranging from 8K to 128K.
AntiUSAbah 12 hours ago [-]
So for the context to work well, you need some attention mechanism which makes sure that details are not getting lost due to the amount of context.
Or, to put it differently: the LLM gets trained on static data, but also on the capability of handling context itself.
Kimi introduced this https://github.com/MoonshotAI/Attention-Residuals but I'm pretty sure closed labs like Google have had something like this for a while.
The attention residuals paper uses attention across layers for the same token, in addition to the usual case of attention across tokens within the same layer, but it doesn't do anything to address the "lost in too much context" problem. At least the number of layers is currently still low enough that there's probably no equivalent "lost in too many layers" problem yet.
AntiUSAbah 11 hours ago [-]
Seems you are right, I have to re-read a few things.
smallerize 12 hours ago [-]
I think it means most of the training data is short. And a lot of the long-context examples are conversations where the middle turns are less important.
alansaber 11 hours ago [-]
They mean input token quantity
schnitzelstoat 14 hours ago [-]
Is such a large context window even desirable? It seems like it would consume an awful lot of tokens and, unless one was very careful to curate the context, could even result in worse performance.
AureliusMA 13 hours ago [-]
I remember when a large context was 8k! Nowadays that would seem extremely small, because we have new use-cases that require much larger context sizes. Maybe in the future, we will invent ways to use inference on very large contexts that we cannot even imagine today.
vessenes 11 hours ago [-]
Yes. That is, it is if you imagine a magically good self attention mechanism that could decide what in the context to attend to at any one moment. Then it would be like working with a polymath that has incredible memory. Or bringing in that aged but still senior Chief of Staff of a large company that knows where every body was buried, and why every decision was made at the time it was made, or a professor of film that has seen and can remember thousands of films.
Shockingly, we seem to have found a self attention mechanism of that quality, it just has the sad property of growing at O(N^2) where N is the context length.
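Writing vanilla attention out makes the N^2 obvious; the score matrix alone is N x N (a single-head toy sketch):

    import numpy as np

    def naive_attention(Q, K, V):
        """Single-head attention: time and memory are O(N^2) in the sequence length N."""
        scores = Q @ K.T / np.sqrt(Q.shape[-1])       # (N, N) score matrix is the quadratic part
        scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V                            # (N, d)

    # At N = 1e9 tokens the score matrix alone would have ~1e18 entries, which is why a
    # billion-token window is as much a hardware problem as a software one.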
AntiUSAbah 12 hours ago [-]
That's either the R&D part of this chip, or Nvidia has the use case.
Nvidia uses ML for fine-tuning and architecting their chips; this might be one use case.
Another one would be to put EVERYTHING from your company into this context window. It would be easier to create 'THE' model for every company or person. It might also be safer than having a model trained on your data, because you don't end up with a model containing all your data, only memory.
alansaber 11 hours ago [-]
Yes, because scaling tends to pay off unexpectedly.
faangguyindia 12 hours ago [-]
Imagine if you were making database software and you could fit the source code of all existing databases and their GitHub issues in context.
withinboredom 13 hours ago [-]
For larger codebases ... maybe it will cut down on "let me create a random number wrapper for the 15th time" type problems.
Weryj 13 hours ago [-]
You should already have skills which mention these utilities.
But maybe that’s enough tokens to feed an entire lifetime of user behaviour in for the digital twin dystopia?
withinboredom 13 hours ago [-]
"type problems" was doing the heavy lifting there, not literally "this utility".
__alexs 13 hours ago [-]
Does having 1 billion tokens mean more total tokens in the context window are actually good quality, or do we just get more dumb tokens?
RugnirViking 13 hours ago [-]
The article is almost entirely about this, yes.
Current approaches require fancy tricks to fit tokens into memory, and spread attention thinner over larger numbers of tokens. The new approach tries to find a way to keep everything in a single shared memory and process the tokens in parallel using multiple GPUs.
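The reason sharding across GPUs can still give exact attention is that softmax can be merged from per-shard partial results. A toy sketch of that combine step (my reading of the general technique, not NVIDIA's actual implementation):

    import numpy as np

    def partial_attention(q, K_shard, V_shard):
        """One query against one shard of keys/values; returns the pieces needed to merge later."""
        s = K_shard @ q / np.sqrt(q.size)
        m = s.max()                       # shard-local max, for numerical stability
        p = np.exp(s - m)
        return p @ V_shard, m, p.sum()    # (unnormalised output, local max, local denominator)

    def combine(parts):
        """Merge per-shard partials into the exact full-sequence softmax attention."""
        m_all = max(m for _, m, _ in parts)
        num = sum(o * np.exp(m - m_all) for o, m, _ in parts)
        den = sum(z * np.exp(m - m_all) for _, m, z in parts)
        return num / den

    # Each shard would live on its own GPU in the real system; here they are just slices.
    rng = np.random.default_rng(0)
    q = rng.standard_normal(64)
    K, V = rng.standard_normal((1000, 64)), rng.standard_normal((1000, 64))
    shards = [(K[i:i + 250], V[i:i + 250]) for i in range(0, 1000, 250)]
    out = combine([partial_attention(q, Ks, Vs) for Ks, Vs in shards])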
Havoc 12 hours ago [-]
Having it would be useful, but I'd say long before you get there, one should think about structuring the data in a more meaningful sense: breaking tasks out into subagents, etc.