danpalmer 20 hours ago [-]
I'm glad that we're making progress towards a deeper understanding of what LLMs are inherently good at and what they're inherently bad at (not to say incapable of doing, but stuff that is less likely to work due to fundamental limitations).
There's similarity here with, for example, defining the architecture of software, but letting an LLM write the functions. Or asking an LLM to write you the SQL query for your data analysis, rather than asking it to do your data analysis for you.
What I'd really like to see is a more well defined taxonomy of work and studies on which bits work well with LLMs and which don't. I understand some of this intuitively, but am still building my intuition, and I see people tripping up on this all the time.
locknitpicker 16 hours ago [-]
> There's similarity here with, for example, defining the architecture of software, but letting an LLM write the functions.
Not so long ago, this was what early adopters of LLM coding assistants claimed was the right way to use them for coding tasks: prompt to draft the outline, and then prompt to implement each function. There were even a few posts on HN linking blog posts that showed off this approach, with terms inspired by animation work.
Sammi 13 hours ago [-]
In short, LLMs are pretty great at working at a single level of abstraction at a time.
You can go from the highest level and all the way down to the lowest level with LLMs, you just have to work at it iteratively one level at a time.
danpalmer 16 hours ago [-]
I'm not necessarily suggesting always getting down to literally the function level, although I think that gives you excellent quality control, but having a code-level understanding is clearly an important factor.
p-e-w 16 hours ago [-]
> due to fundamental limitations
People keep throwing this phrase around in relation to LLMs, when not a single “fundamental limitation” has been rigorously demonstrated to exist, and many tasks that were claimed to be impossible for LLMs two years ago supposedly due to “fundamental limitations” (e.g. character counting or phonetics) are non-issues for them today even without tools.
aDyslecticCrow 8 hours ago [-]
> character counting
The models now waste a vast number of neurons memorising character counts for the entire English language, so that people can ask how many r's are in strawberry and tick a box in a benchmark.
The architecture cannot efficiently or consistently represent counting letters in words. We should never have force-trained them to do it.
This goes for other, more important "skills" that are unsuited to transformer models.
Most models can now do decent arithmetic. But if you knew how that ability is encoded in their neurons, you would never ever ever ever trust any arithmetic they output, even when they seem to "know" it (unless they called a calculator MCP to get it).
There are fundamental limitations, but we're currently brute forcing ourselves through problems we could trivially solve with a different tool.
p-e-w 7 hours ago [-]
> The models now waste a vast number of neurons memorising character counts for the entire English language
No they don’t. They only need to know the character count for each token, and with typical vocabularies having around 250k entries, that’s an insignificant number for all but the tiniest LLMs.
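For scale, here's a rough sketch of the lookup table that implies, assuming the open tiktoken tokenizer (cl100k_base, ~100k entries) as a stand-in for a frontier vocabulary:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    # One small integer per token: how many 'r' characters its text contains.
    r_counts = {}
    for token_id in range(enc.n_vocab):
        try:
            text = enc.decode([token_id])
        except Exception:
            continue  # skip the handful of unused/special ids
        r_counts[token_id] = text.count("r")

    print(len(r_counts))  # on the order of 100k entries: a trivial table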
aDyslecticCrow 4 hours ago [-]
In a very simplified view:
Those "tokens" humans "count" are translated to a ~2048-dimensional (depends on the model) floating point vector.
bird => {mammal, english, noun, vertebrate, avian} has one r, but what if you make it 20% more "french"? Is it still 1 r? That could be the word "bird" in French, or it could be a French-speaking bird, or a bird species common in France.
If nearest-neighbour distance to the vocabulary of every language makes the vector no longer map to "bird", then the number of r's must change, via a series of trained conditional checks (with some efficiency where languages share general spelling patterns).
That is such an unreasonable amount of compute that it is likely far cheaper, easier and more reliable to train the model to memorise the output:
    {"MCP": "python", "content": "sum(1 for c in 'strawberry' if c == 'r')"}
The attention mechanism allows LLMs to learn these kinds of absurdly inefficient calculations. But we really shouldn't use LLMs where they're outperformed by trivial existing solutions.
topham 7 hours ago [-]
Nope. Tokens aren't what you think they are.
coldtea 14 hours ago [-]
>People keep throwing this phrase around in relation to LLMs, when not a single “fundamental limitation” has been rigorously demonstrated to exist
Some limitations have not been rigorously demonstrated to be fundamental, but they have been continuously present since the earliest LLMs, yes. Shouldn't the burden of proof be on those who say it can be done?
And some limitations are fundamental, and have been rigorously demonstrated, e.g.:
https://arxiv.org/abs/2401.11817?utm_source=chatgpt.com
That paper’s abstract doesn’t carry its title, to put it mildly.
coldtea 11 hours ago [-]
What part of "Specifically, we define a formal world where hallucination is defined as inconsistencies between a computable LLM and a computable ground truth function. By employing results from learning theory, we show that LLMs cannot learn all the computable functions and will therefore inevitably hallucinate if used as general problem solvers. " doesn't carry the title, to ask mildly?
red75prime 5 hours ago [-]
As with all the works that use too broad a definition of an LLM they prove too much. This work defines an "LLM" as a computable function obtained by applying a finite number of steps of a generic algorithm to an initial computable function.
What they really prove is that it's impossible to extrapolate an unconstrained non-continuous function from a finite subset of its values. Good for them, I guess.
It's like saying that the no-free-lunch theorems prove that LLMs can't be the best optimizers, when what they prove (roughly) is that the best optimizer doesn't exist. That is, even people aren't the best optimizers, but we manage somehow, so LLMs can too.
p-e-w 10 hours ago [-]
I don’t agree with that definition of “hallucination”, for starters.
MarkusQ 6 hours ago [-]
So substitute another phrase, if you prefer. It doesn't change the logic.
"Specifically, we define a formal world where bungling is defined as inconsistencies between a computable LLM and a computable ground truth function. By employing results from learning theory, we show that LLMs cannot learn all the computable functions and will therefore inevitably bungle if used as general problem solvers."
red75prime 5 hours ago [-]
Their diagonalization argument applies to any system that uses finite training data. Calling such a system "LLM" is an (unintentional) red herring.
MarkusQ 4 hours ago [-]
Yeah. IMHO this is the more serious objection.
dijit 16 hours ago [-]
Character counting remains a huge issue without tools.
Are you using only frontier models that are gated behind openai/anthropic/google APIs? Those use tools to help them out behind the scenes. It remains no less impressive, but I think we should be clear.
girvo 12 hours ago [-]
The literal best public models still fail to count characters consistently in practice so I’m not sure what you mean. It’s literally a problem we’re still trying to solve at work
outofpaper 11 hours ago [-]
What's amazing is that they can even fairly reliably appear to count characters. I mean, we're talking about systems that infer sequences, not character counters or calculators. They are amazing in unrelated ways, and we need to accept this so we can use them effectively.
jameshart 8 hours ago [-]
I suspect character counting - counting small numbers in general in fact - is something that multimodal models will gradually learn through their visual capabilities. We have generative systems that are capable of generating an image of the word ‘strawberry’, and of counting how many strawberries are visible in an image; seems likely it’s possible for an LLM to ‘imagine’ what the word strawberry looks like and count the ‘Rs’ it can ‘see’.
girvo 10 hours ago [-]
Of course, they’re shockingly powerful, just in an incredibly “spiky” way
3form 11 hours ago [-]
Your comment, after removing the particulars, has a shape of:
People have an <opinion> which hasn't been rigorously proven, while <not rigorously proven counteropinion>.
As such, I am not sure what you're trying to achieve here.
3form 12 hours ago [-]
Is character counting actually not an issue anymore? Do you know somewhere where I can read more about this?
mrob 12 hours ago [-]
Character counting errors are a side effect of tokenization, which is a performance optimization. If we scaled the hardware big enough we could train on raw bytes and avoid it.
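A quick illustration of what a tokenized model actually "sees", again assuming tiktoken's cl100k_base vocabulary:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("strawberry")
    print([enc.decode([i]) for i in ids])  # chunks like ['str', 'aw', 'berry']
    # The r's are spread across opaque token ids that carry no spelling
    # information; a byte-level model would see each character directly,
    # at the cost of a roughly 4x longer sequence.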
teiferer 10 hours ago [-]
No, tokenization is not the only reason. A next-word predictor has fundamentally a hard time executing algorithms, even as simple as counting.
mitthrowaway2 7 hours ago [-]
Counting is one of the algorithms that can be expressed by a RASP program, which transformers closely approximate.
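For the curious, a rough Python emulation of the RASP select/aggregate idiom applied to counting (not real RASP syntax, just the shape of the computation):

    tokens = list("strawberry")
    is_r = [1.0 if t == "r" else 0.0 for t in tokens]

    def select(keys, queries, predicate):
        # Attention pattern: which key positions each query attends to.
        return [[predicate(k, q) for k in keys] for q in queries]

    def aggregate(selector, values):
        # Uniform attention: average the selected values at each position.
        return [sum(v for v, s in zip(values, row) if s) / max(1, sum(row))
                for row in selector]

    # Attend everywhere, average the 'r' indicator, then rescale by length.
    frac = aggregate(select(tokens, tokens, lambda k, q: True), is_r)
    print(round(frac[0] * len(tokens)))  # 3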
MarkusQ 6 hours ago [-]
Close famously counts in horseshoes and hand grenades. Algorithms, just as famously, are a domain where off-by-one is still wrong.
danpalmer 15 hours ago [-]
This is kind of my point, we need to get better at describing the limitations and study them. It seems extremely clear that there are limitations, and not just temporary ones, but structural limitations that existed at the beginning and continue to persist.
ijidak 12 hours ago [-]
Yeah I think it was the word "fundamental" he took issue with.
Marazan 13 hours ago [-]
If you remove the auxiliary tools and just leave the core LLM then strawberry still has an undefined number of `r`s in it.
p-e-w 13 hours ago [-]
That’s false. Larger LLMs learn token decompositions through their training, and in fact modern training pipelines are designed to occasionally produce uncommon tokenizations (including splitting words into individual characters) for this reason. Frontier models have no trouble spelling words even without tools. Even many mid-sized models can do that.
kilpikaarna 12 hours ago [-]
Wait, where can I learn more about this? I don't doubt that varying the tokenization during training improves results, but how does/would that enable token introspection?
p-e-w 10 hours ago [-]
Because LLMs can learn that different token sequences represent the same character sequence from training context. Just like they learn much more complex patterns from context.
You can try this out locally with any mid-sized current-gen LLM. You’ll find that it can spell out most atomic tokens from its input just fine. It simply learned to do so.
raincole 12 hours ago [-]
Drawing five-fingered humans was a fundamental limitation... until it wasn't.
rimliu 15 hours ago [-]
Of course, if you choose to ignore all the limitations, they indeed have no limitations.
mkbosmans 15 hours ago [-]
Nobody says they have no limitations. The question is whether those limitations are fundamental, i.e. whether we can expect improvement, say, within a year.
danpalmer 15 hours ago [-]
When I talk about fundamental limitations, I mean limitations that can't be solved, even if they could be improved.
We have improved hallucinations significantly, and yet it seems clear that they are inherent to the technology and so will always exist to some extent.
p-e-w 13 hours ago [-]
“Seems clear” based on what?
pegasus 13 hours ago [-]
For one, based on continuously frustrated hopes (and promises!) that hallucinations will go away.
coldtea 14 hours ago [-]
As a general architecture, an LLM also has limitations that can't be improved unless we switch to another, fundamentally different AI design that's non LLM based.
There are also limitations due to maths and/or physics that aren't fixable under any design. Outside science fiction, there is no technology whose limitations are all fixable.
Here's one: https://arxiv.org/abs/2401.11817?utm_source=chatgpt.com
Am I misreading that paper? They define hallucinations as anything other than the correct answer, and prove that there are infinitely many questions an LLM can't answer correctly. But that's true of any architecture: there are infinitely many problems a team of geniuses with supercomputers can't answer. If an LLM can be made to reliably say "I don't know" when it doesn't know, hallucinations are solved; they contend that this doesn't matter because you can keep drawing from your pile of infinitely many unanswerable questions and the LLM will either never answer or will make something up. Seems like a result that is technically true but not usefully true.
IdiotSavage 13 hours ago [-]
> Transform this image into a photographed claymation diorama of assorted artisan chocolates and candies […] viewed from a low-angle
Side note: whenever I read prompts for image generation, I notice very specific details which the model obviously ignored. Here the chocolates / candies in the last two images look anything but artisanal. They look very "sterile" and mass-produced. The viewing angle is also not accurate.
Why do we even bother writing such elaborate prompts, when the model ignores most of it anyway?
jameshart 8 hours ago [-]
I loved the example where he requested ‘studio lighting’ and it put a bunch of studio lights in the picture.
bonesss 6 hours ago [-]
The candies aren’t trying to look artisanal, they’re trying to match training data marketed and labelled by companies as artisanal.
Rustic, homemade, amateur, etc might align better with the tagging.
8-prime 12 hours ago [-]
I have noticed the same thing. The few times I wanted to use image generation, it always failed me in exactly these aspects. I always put it down to a lack of prompting skill on my end. Once you start to keep an eye out for these inconsistencies, they turn out to be very common.
ryanthedev 12 hours ago [-]
I believe most detailed prompts are AI generated.
vunderba 5 hours ago [-]
This is 100% true. There are entire nodes/pipelines in programs like ComfyUI that are designed to take a simple prompt and "enhance" it, which usually involves making it more verbose, adding detail, etc., depending on the target model.
Original Prompt: "Man with Trapezoid Head"
AI Expansion:
Portrait of a man with a trapezoid-shaped head, sharp geometric facial structure, angular jawline wider at the top and narrowing toward the chin, realistic skin texture, detailed pores, dramatic studio lighting, ultra-detailed, 85mm lens, shallow depth of field, dark neutral background, cinematic, photorealistic, 8k resolution.
Note: most people (outside the generative space) won't pick up on this, but in many cases, if you don't prompt otherwise, you'll often end up with a prompt that's better suited to older, keyword-based models like Stable Diffusion, which rely heavily on specific sets of positive and negative prompt keywords, more akin to magical incantations, to improve the output.
IdiotSavage 3 hours ago [-]
Yes, this is exactly the kind of prompt I often see. Then you have stuff like "8k resolution". WTF? The output is fixed anyway.
IdiotSavage 11 hours ago [-]
That's funny if it's true. I'd like to see the prompt which generates the prompt.
ErroneousBosh 12 hours ago [-]
I wonder how long it took to come up with all this?
Because if I wanted a spiral of little "buttons" like the last one at the end (and they don't look very much like sweets) I'd be able to knock that out in Blender in an afternoon, and I'm not very good at Blender.
HotHotLava 12 hours ago [-]
I think you're vastly overestimating the average person's ability to use Blender if you can do that in an afternoon; just figuring out how to place a colored cube and the camera probably takes an afternoon if you're picking up Blender for the first time.
spookie 10 hours ago [-]
And knowing the little tricks to get what you want out of image generation models also takes time. Not to mention you need some knowledge of other software just to make the underlying layout.
mrec 4 hours ago [-]
Yeah, I've bounced off Blender twice now. And I've written a (basic) 3D modeller.
I think part of the problem is that pretty much all the tutorial material for Blender seems to be in video form, which is easily my least effective way to learn, even leaving aside the "I've only got one screen" issue.
PaulHoule 8 hours ago [-]
I'd just do it with Pillow.
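Something like this, roughly (all the layout constants are arbitrary choices):

    import math
    from PIL import Image, ImageDraw

    img = Image.new("RGB", (800, 800), "white")
    draw = ImageDraw.Draw(img)
    cx, cy = 400, 400

    for i in range(50):
        theta = 0.7 * i                # angle along the spiral
        r = 60 + 5.5 * theta           # Archimedean: radius grows with angle
        x = cx + r * math.cos(theta)
        y = cy + r * math.sin(theta)
        draw.ellipse([x - 18, y - 18, x + 18, y + 18],
                     fill="chocolate", outline="black")
        draw.text((x - 6, y - 6), str(i + 1), fill="white")

    img.save("spiral.png")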
ErroneousBosh 11 hours ago [-]
I guess I'm coming at it from having used Blender for an afternoon or so, and already knowing Python.
If you were good at GLSL you could do it in that maybe.
Someone somewhere is going to write something that directly draws it to a framebuffer in Brainfuck, you just know it, don't you?
samcollins 5 hours ago [-]
OP here. It took me an afternoon to try different methods and test the limits. But now that we know how it works, it's very fast to create new ones:
1. Prompt to make SVG - review in browser, iterate.
2. Prompt to write image prompt - review in editor, refine
3. Send to Gemini, get image
So maybe 5-10 mins.
I don’t know how to use Blender.
Also this method can be done over WhatsApp/telegram which is another plus over Blender type approach.
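For reference, steps 1-3 collapse into a very small script. A sketch, assuming the google-genai SDK, cairosvg for rasterising, and an image-capable Gemini model (the model name below is a placeholder for whichever one you have access to):

    import cairosvg
    from PIL import Image
    from google import genai

    # Step 1 output: rasterize the reviewed SVG underdrawing.
    cairosvg.svg2png(url="layout.svg", write_to="layout.png")

    # Step 3: send the raster plus the styling prompt to the image model.
    client = genai.Client()  # expects an API key in the environment
    resp = client.models.generate_content(
        model="gemini-2.5-flash-image",  # placeholder model name
        contents=[Image.open("layout.png"),
                  "Render this layout as a photographed claymation diorama "
                  "of artisan chocolates; keep every number exactly in place."],
    )
    for part in resp.candidates[0].content.parts:
        if part.inline_data:  # the generated image comes back as bytes
            open("render.png", "wb").write(part.inline_data.data)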
Brendinooo 10 hours ago [-]
I remember opening Blender for the first time years ago and thinking it had the steepest learning curve of any software I'd ever used.
sebastianmestre 6 hours ago [-]
It's not perfect, but it's been vastly improved in recent years. If you lost interest in 3D art because of Blender's bad UX in the past, I recommend you give it another shot.
Also, there might be other new 3D software with better UX. I am not a Blender fanboy, but I do love 3D art and graphics programming and want as many people as possible to get into it :^)
ErroneousBosh 3 hours ago [-]
Yeah somewhere between 2 and 3 it got a very much improved UI.
samcollins 3 days ago [-]
I found a simple technique to get reliable text and numbers in AI generated images.
I'm surprised the image models aren't already doing this, so I wanted to share, since I'm finding it so useful.
bsenftner 7 hours ago [-]
In some ways, this is similar to the use of a ControlNet. I've been doing this same technique for a while, using only SVGs as the base image. It works well.
jere 7 hours ago [-]
Very impressive, simple, and reliable. I'm sure it will be picked up by image generation labs soon.
choppaface 8 hours ago [-]
Isn’t this sort of just “chain of thought” (i.e. the seminal https://arxiv.org/abs/2201.11903 ) where the user is helping the model 1-shot or k-shot the solution instead of 0-shot? I’ve used a similar technique to great effect. I feel things are so new / moving so fast that it’s hard to have common lingo. So very helpful to have a blog / example! But I wonder if the phenomena has been seen / understood before and just in smaller circles / different name.
samcollins 20 hours ago [-]
TLDR: use SVG to outline image correctly first, then send that image with your text prompt to get Gemini 3.0 Pro to render with correct numbers and text
smusamashah 18 hours ago [-]
This is just img2img where first image with correct structure was generated by code.
vunderba 16 hours ago [-]
Yup, that’s exactly what this is. If you’ve been using generative models since the early Stable Diffusion days, it’s a pretty common (and useful!) technique: using a sketch (SVG, drawn, etc) as an ad-hoc "controlnet" to guide the generative model’s output.
Example: In the past I'd use a similar approach to lay out architectural visualizations. If you wanted a couch, chair, or other furniture in a very specific location, you could use a tool like Poser to build a simple scene as an approximation of where you wanted the major "set pieces". From there, you could generate a depth map and feed that into the generative model, at the time SDXL, to guide where objects should be placed.
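The modern equivalent is a few lines with diffusers. A sketch, assuming the stock SDXL depth ControlNet weights on Hugging Face and a CUDA GPU:

    import torch
    from PIL import Image
    from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline

    controlnet = ControlNetModel.from_pretrained(
        "diffusers/controlnet-depth-sdxl-1.0", torch_dtype=torch.float16)
    pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        controlnet=controlnet, torch_dtype=torch.float16).to("cuda")

    depth = Image.open("room_depth.png")  # depth map exported from the 3D scene
    out = pipe("a sunlit living room, photorealistic interior render",
               image=depth).images[0]
    out.save("room.png")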
jasonjmcghee 17 hours ago [-]
Pretty much what the author said; just gave some context for the uninitiated.
philsnow 17 hours ago [-]
Right, but you can use a different (codegen) model to make that code.
xigoi 15 hours ago [-]
The standard objection: if the LLM is supposedly intelligent, why can’t it figure out on its own that this two-step process would achieve a better result?
petercooper 10 hours ago [-]
Because image models at the basic level are just text tokens in, image tokens out. You'd need an agentic process on top to come up with a strategy, review output, try again, and so on.
I believe Nano Banana and gpt-image-2 have a little of this going on, but it's like asking a model to one-shot some code vs having an agentic harness with tools do it. Even the most basic agent can produce better code than ChatGPT can.
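The harness doesn't need to be fancy. A toy sketch, where generate_image and critique_image are hypothetical stand-ins for whatever model calls you actually have:

    def generate_with_review(prompt, max_attempts=3):
        feedback = ""
        for _ in range(max_attempts):
            image = generate_image(prompt + feedback)  # text -> image call
            problems = critique_image(image, prompt)   # VLM checks output vs intent
            if not problems:
                return image
            feedback = "\nFix these issues: " + problems  # fold critique back in
        return image  # best effort once the attempt budget runs out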
Sharlin 12 hours ago [-]
Because the LLM is more or less hardcoded to just pass "create image" style prompts to a separate model, possibly with some embellishment.
pyrolistical 14 hours ago [-]
You don’t know what you don’t know
airstrike 8 hours ago [-]
They are not, in fact, intelligent.
nine_k 15 hours ago [-]
Nobody asked it to!
xigoi 15 hours ago [-]
If it's asked to generate an image, it should do everything in its power to make the image good.
andruby 14 hours ago [-]
> it should do everything in its power
That's a scary thought.
Hey Claude, why haven't you finished yet? ... Because the human I'm holding hostage hasn't finished the drawing yet.
lacksjoian 13 hours ago [-]
LLMs have no concept of what makes the output "good".
Or to put it another way, if the LLM generates an image with jumbled numbers it's because that was the most likely output, hence it was a "good" image according to its weights.
cubefox 15 hours ago [-]
Part of the problem is that it isn't the LLM making the image directly itself, it's the LLM repeatedly prompting edits for a separate edit diffusion model. The Gemini reasoning summary shows part of this. The style of some of the images makes it also clear that it uses an Imagen 4 derived diffusion model underneath.
xigoi 15 hours ago [-]
Every decent human artist knows to draw a sketch before painting something.
hirako2000 14 hours ago [-]
Humans even have the creativity to come up with sketching.
Models don't have intelligence, even less so creative thinking.
xigoi 13 hours ago [-]
Exactly, that’s my point.
jrapdx3 14 hours ago [-]
Of course many, even most, painters do sketch what they intend to paint, likely that's the predominant technique.
But it's not universally true, particularly among artists working in the last 100 years or so. Certainly Jackson Pollock (whether one regards his work as good or not) didn't sketch out how he was going to distribute paint onto canvas. Another example is Morris Louis (and other "stain painters"), who didn't sketch out how he applied paint to canvas.
Your comment is largely correct; just pointing out that more than a few "decent artists" didn't (or don't) work that way.
sparuchuri 3 days ago [-]
This hack definitely falls in the “duh, why didn’t I think of that” category of tricks, but glad to now have it next time imagegen comes up short
manmal 17 hours ago [-]
Even the original Stable Diffusion app had img2img. It just didn't work as well. I'm not sure why this is supposed to be novel.
ludwik 17 hours ago [-]
It’s obviously not a new model capability. But using this well-known, existing capability to solve this particular issue is only obvious after the fact.
It’s a useful trick to have in one’s toolbox, and I’m grateful to the author for sharing it.
Finbel 17 hours ago [-]
It's not novel in the sense that nobody knew about img2img. It's novel in the sense that nobody thought of using img2img to solve this problem in this way.
TeMPOraL 13 hours ago [-]
It's novel if you never played with img2img, including especially several forms of (text+img)2img. Or, if you never tried editing images by text prompt in recent multimodal LLMs.
That said, I spent plenty of time doing both, and yet it would probably take me a while to arrive at this approach. For some reason, the "draw a sketch, have a model flesh it out" approach got bucketed with Stable Diffusion in my mind, and multimodal LLMs with "take detailed content, make targeted edits to it". So I'm glad the OP posted it.
vunderba 5 hours ago [-]
They’re actually quite good at it. I’ve had a number of situations where I’ve wanted to re-render some of my older comics. You can basically tell any SOTA multimodal model (NB, GPT-Image-X) to treat them as storyboards and prompt for a specific style: newprint, crosshatching, monochromatic ink sketch, etc.
Another thing I've gotten very used to doing is avoiding the "one-shot" approach. If I generate something and don't like the results, I bring it into Krita, move things around, redraw some elements, and then send it back in with instructions to just clean it up (remove any smudges or imperfections). The state-of-the-art models can do an astonishing job with that workflow.
https://imgpb.com/eGDJIb
OK, it might just be me then. I view Nvidia's DLSS as a similar thing. There was even this meme that video games will in the future only output basic geometry and an AI layer will transform it into stunning graphics.
Geonode 13 hours ago [-]
We've been doing this for a long time now, it's similar to using a depth map or a line drawing to control the silhouette.
utopiah 12 hours ago [-]
Love the concluding note: it works, but not really.
Such is the LLM/GenAI craze: an entire article to show that it's nearly there, yet it's not, despite convoluted efforts to make it just so on a very, very niche example.
Al-Khwarizmi 12 hours ago [-]
But if it works part of the time, it's useful. It's easy for a human to check that the numbers are correct, and if they aren't, just regenerate the image. Orders of magnitude easier than creating the image from scratch without the model.
dllu 15 hours ago [-]
I was thinking about doing the opposite for the common task of "SVG of a pelican riding a bike". Obviously, directly spitting out the SVG is gonna be bad. But image gen can produce a really stunning photorealistic image easily. Probably a good way to get an LLM to produce a decent bike-pelican SVG is to generate an image first and then get the model to trace it into an SVG. After all, few human beings can generate SVG works of art by just typing out numbers into Notepad. At the core of it, we still rely on looking at it and thinking about it as an image.
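A sketch of the tracing half of that idea, assuming the potrace CLI is installed (a real pipeline would trace per colour layer rather than one crude threshold):

    import subprocess
    from PIL import Image

    # Threshold the generated raster to 1-bit; potrace wants PBM/BMP input.
    img = Image.open("pelican.png").convert("L")
    img.point(lambda p: 255 if p > 128 else 0).convert("1").save("pelican.pbm")

    # -s selects potrace's SVG backend.
    subprocess.run(["potrace", "pelican.pbm", "-s", "-o", "pelican.svg"],
                   check=True)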
oh_no 7 hours ago [-]
Interesting that GPT Image-2 managed to 2-shot this with thinking turned on. I didn't save a copy and it disappeared from my window, but I first got a failure very similar to the one in the article; it saw the issue and said it was going to use a reference image, after which it came out with https://i.imgur.com/hlWpQNT.jpeg
npilk 6 hours ago [-]
Still missing 49 - humans are safe, for now!
petercooper 10 hours ago [-]
This seems analogous to how a human would do it accurately. If you asked an artist to paint stones in a large circular arrangement with the numbers in order in one shot, with no fixes or sketching allowed, it wouldn't be surprising to end up with problems in the arrangement.
teiferer 10 hours ago [-]
I hope this kind of stuff puts to rest the idea that we're close to actual AGI. Outsourcing this kind of basic stuff, which a real intelligence would be able to do "internally", is a hack that works for this specific case but would prevent further generalizations of the task at hand.
But I'm foreseeing the opposite. This kind of tool use will soon be integrated and hidden, such that people will eventually say "see, we solved the problem that AI can't do 123+456, now we are really, really close to AGI". Yeah, no: with an AGI, it would have been the AGI itself that came up with needing a tool, building the tool, and then using the tool. But that's not what LLMs are. They are statistical machines that predict tokens. They are very good at it, but that's not AGI.
kfarr 6 hours ago [-]
I work on a platform, 3dstreet.com, that does "underdrawing" but in 3D space, which image models also struggle with. Another company, intangible.ai, does this as well: low-poly 3D, then an image-to-image model.
It seems to be a very effective pattern. Curious if there are other examples out there. Or other names for this?
elil17 14 hours ago [-]
I wonder whether this could be used to fine-tune image models to provide better outputs. Something like this:
1. Algorithmically generate an underdrawing (e.g. place numbers and shapes randomly in the underdrawing)
2. Algorithmically generate a description of the underdrawing (e.g. for each shape, output text like "there is a square with the number three in the top left corner"). You might fuzz this by having an LLM rewrite the descriptions in a variety of ways.
3. Generate a "ground truth" image using the underdrawing and an image+text-to-image model.
4. Use the generated description and the generated "ground truth" image as training data for a text-to-image model.
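A minimal sketch of steps 1 and 2 with Pillow (shape sizes and phrasing are arbitrary choices):

    import random
    from PIL import Image, ImageDraw

    def make_example(size=512, n_shapes=5):
        img = Image.new("RGB", (size, size), "white")
        draw = ImageDraw.Draw(img)
        parts = []
        for i in range(1, n_shapes + 1):
            x = random.randint(40, size - 100)
            y = random.randint(40, size - 100)
            kind = random.choice(["square", "circle"])
            box = [x, y, x + 60, y + 60]
            (draw.rectangle if kind == "square" else draw.ellipse)(
                box, outline="black", width=3)
            draw.text((x + 26, y + 24), str(i), fill="black")
            parts.append(f"a {kind} containing the number {i}")
        return img, "An image with " + ", ".join(parts) + "."

    underdrawing, description = make_example()
    underdrawing.save("underdrawing.png")  # pairs with step 3's render
    print(description)                     # the caption for step 4's training set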
vunderba 5 hours ago [-]
This is closer to a world model - kind of similar to how one might use a realistic or semi‑realistic simulation engine to model the environment like GTA in order to train a self-driving model.
hirako2000 14 hours ago [-]
That would complicate the architecture of a model to solve a finite set of cases. That's an argument for specialised/fine-tuned models, though.
nottorp 15 hours ago [-]
LLMs are like a box of chocolates...
barbazoo 4 hours ago [-]
I've had a lot of success at work breaking down tasks that are supposed to be "done by the agent" into small LLM calls orchestrated deterministically via boring queues and messages. That's why this really resonates with me in a world where we're lured deep into the ecosystem by the model vendors.
At the end of the day we can get so much done just by breaking down a problem into smaller problems.
mncharity 2 hours ago [-]
> deterministically
But won't it be fun when we can cloud burst-parallel a grid/tiled sampling of multiple code implementations/architectures, and interactively explore navigate/blend-points-in the latent design space. Multiples embodying different trade-offs, styles, clarity vs performance, etc. Code as generative art. What might the software engineering equivalent of designer moodboards be?
BobbyTables2 18 hours ago [-]
How is it that LLMs aren’t good at rendering the sequence of numbers but can reliably put the supplied pieces all in the right order?
mk_stjames 18 hours ago [-]
Because the image generation is powered by a diffusion model that is only guided by the transformer model and still has a somewhat vague spatial representation, especially when it comes to coupling things like counting and complex positioning.
But by using the LLM to generate code, like the SVG markup a graphic is made up of, and then feeding a rasterized image of that SVG to the diffusion model, the raster takes the place of the raw noise input and guides the denoising process to put the numerical parts in the right spots.
The LLM gets the SVG in the right order because what drives the SVG is just that, code, and the numerical order is easily defined there, even if it has to follow something like a spiral.
Edit: LLMs may now also be using thinking modes, with feedback during generation, to help with complex positioning when drawing something like an SVG. I just asked Claude to generate one such spiral-number SVG and it did so interactively via thinking, and the generated code is incredibly explicit about positions, so that must help. But the underlying two-step SVG-to-diffusion idea is the real key here.
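A concrete sketch of that two-step idea with an open img2img pipeline, assuming diffusers, cairosvg and a CUDA GPU (strength controls how much of the rasterized layout survives denoising):

    import torch, cairosvg
    from PIL import Image
    from diffusers import StableDiffusionImg2ImgPipeline

    # The rasterized SVG replaces pure noise as the denoising starting point.
    cairosvg.svg2png(url="spiral.svg", write_to="init.png")
    init = Image.open("init.png").convert("RGB").resize((768, 768))

    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
    ).to("cuda")

    out = pipe(
        prompt="numbered artisan chocolates arranged in a spiral, photo",
        image=init,
        strength=0.55,  # low enough to keep the layout, high enough to restyle
    ).images[0]
    out.save("final.png")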
nine_k 15 hours ago [-]
It's normal to first create a plan, then allow agents to write code. But it seems to be surprising for many to first create a draft / outline of a picture, then go for a final render.
wg0 16 hours ago [-]
Has anyone had good luck with making consistent game art and assets?
choeger 17 hours ago [-]
Transformers are great translators. So, yeah, starting with structured output like SVG is probably the best way to start.
It should be fairly trivial to fix any logic errors in the structured output, too.
SomaticPirate 15 hours ago [-]
inb4 this technique is subsumed into the next MoE model release
LLMs are evolving so fast I wouldn’t be surprised if this technique was not needed in <6 months
krackers 15 hours ago [-]
I don't think the MoE part has anything to do with it, but the current gen of multimodal models can do thinking interleaved with autoregressive(?) image-gen, so it's probably not long before they bake this into the RL process, the same way native thought obviated the need for "think carefully step by step" prompts.
rimliu 15 hours ago [-]
LLMs are rather devolving at this point.
cheekyant 14 hours ago [-]
Has anyone built a platform which has image to image pipelines and lets you use prompt to SVG generation from SOTA LLMs?
TeMPOraL 13 hours ago [-]
ComfyUI?
tracerbulletx 20 hours ago [-]
I've been doing charts for slides like this for a while. I noticed HTML viz was super reliable, but I could style it with a diffusion model. It's very useful for data viz.
docheinestages 13 hours ago [-]
And what happens if the model can't come up with a good enough SVG to begin with?
igtztorrero 8 hours ago [-]
Tasks like this are the reason AI is driving up memory and CPU prices.
Melamune 15 hours ago [-]
I wondered why I was losing all passion for creating.
These tips and tricks are part of the answer.
globular-toast 14 hours ago [-]
Wait, where did it get the "Sweet Path//Trail of treats" thing from in the SVG? It wasn't about sweets at that point. Something missing here, I think.
jeffrallen 17 hours ago [-]
I wish the opposite were true: that when I tell Gemini I want "a diagram of X", it immediately breaks out Python and matplotlib, instead of wasting my time with Nano Banana.
globular-toast 8 hours ago [-]
Like many AI things, it would have been considerably easier just to learn to edit images in GIMP or something. Instead of learning a valuable skill, you spent time working with a model that will be obsolete in a few months. Sunk cost fallacy, I guess.
nullc 18 hours ago [-]
Inpainting/guiding from a sketch is how I've always used diffusion models. I thought everyone did that, or at least everyone who wasn't just trying to get some arbitrary filler material without much care of what the output looked like.
foxes 14 hours ago [-]
I feel sorry for the recipient.
psychoslave 15 hours ago [-]
A few months ago I tried to make Mistral's Le Chat output French poetry in alexandrines (12 syllables). Disastrous at first. Then, adding to the specification that each line also had to be transposed into IPA and each syllable counted, it went better.
Still emotionally unrelatable, but it definitely provided something that matched the specifications where they were explicit and systematically enforced through deterministic means. For now my takeaway is that LLM limitations are such that they can't seize the ineffable, and they are so untrustworthy that they can only be employed under very clear and inescapable constraints, or they will go awry just as surely as water is wet.
gwern 19 hours ago [-]
tldr: do a standard img2img workflow where you lay out a sketch or skeleton or low-res version, and then turn it into the final high-quality photorealistic version, instead of trying to zero-shot it purely from a text prompt.