Stable Diffusion

Making a better CLIP interrogator with the FLUX T5 encoder? (lemmy.world)

submitted 2 months ago* (last edited 2 months ago) by AdComfortable1514@lemmy.world to c/stable_diffusion@lemmy.dbzer0.com


This is an open-ended question.

I'm not looking for a specific answer, just what people know about this topic.

I've asked this question on Huggingface discord as well.

But hey, asking on lemmy is always good, right? No need to answer here. This is a repost, essentially.

This might serve as an "update" of sorts from the previous post: https://lemmy.world/post/19509682

//---//

Question:

The FLUX model uses a combination of CLIP and T5 to create a text_encoding.

CLIP is capable of doing both image_encoding and text_encoding.

The T5 model seems to be strictly text-to-text.

So I can't use the T5 to create image_encodings. Right?

https://huggingface.co/docs/transformers/model_doc/t5

But nonetheless, the T5 encoder is used in text-to-image generation.

So surely, there must be good uses for the T5 in creating a better CLIP interrogator?

Ideas/examples on how to do this?

I have 0% knowledge of the T5, so feel free to just send me a link someplace if you don't want to type out an essay.

//----//

For context:

I'm making my own version of a CLIP interrogator: https://colab.research.google.com/#fileId=https%3A//huggingface.co/codeShare/JupyterNotebooks/blob/main/sd_token_similarity_calculator.ipynb

The key difference is that this one samples the CLIP-vit-large-patch14 tokens directly instead of using pre-written prompts.

I text_encode the tokens individually and store them in a list for later use.

I'm using the method shown in this paper, "NND" (nearest neighbor decoding).

Methods for making better CLIP interrogators: https://arxiv.org/pdf/2303.03032

T5 encoder paper: https://arxiv.org/pdf/1910.10683
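For readers unfamiliar with the idea, here is a minimal sketch of what a nearest-neighbor lookup over stored token encodings could look like. It assumes cosine similarity as the metric and uses small random vectors in place of real CLIP encodings. The function name `nearest_tokens` and all the toy data are hypothetical and are not taken from the notebook or the paper.

```python
import numpy as np

def nearest_tokens(query, token_embeddings, tokens, k=5):
    """Return the k tokens whose embeddings are most cosine-similar to query."""
    # Normalize so that a dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    t = token_embeddings / np.linalg.norm(token_embeddings, axis=1, keepdims=True)
    sims = t @ q
    # Indices of the k highest similarities, best first.
    top = np.argsort(-sims)[:k]
    return [(tokens[i], float(sims[i])) for i in top]

# Toy stand-ins for real encodings (assumption: real ones would come from CLIP).
rng = np.random.default_rng(0)
tokens = ["blue", "war", "storm", "cat", "tree"]
token_embeddings = rng.normal(size=(len(tokens), 8))

# Pretend image encoding: the "storm" vector plus a little noise.
image_embedding = token_embeddings[2] + 0.01 * rng.normal(size=8)

ranked = nearest_tokens(image_embedding, token_embeddings, tokens, k=3)
print(ranked)
```

With real data, `token_embeddings` would hold one text encoding per vocabulary token and `image_embedding` would be the image encoding, both taken from the same CLIP model so that they live in the same space.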

Example from the notebook where I'm using the NND method on 49K CLIP tokens (Roman girl image):

Most similar suffix tokens: "{vfx |cleanup |warcraft |defend |avatar |wall |blu |indigo |dfs |bluetooth |orian |alliance |defence |defenses |defense |guardians |descendants |navis |raid |avengersendgame }"

Most similar prefix tokens: "{imperi-|blue-|bluec-|war-|blau-|veer-|blu-|vau-|bloo-|taun-|kavan-|kair-|storm-|anarch-|purple-|honor-|spartan-|swar-|raun-|andor-}"
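The prefix/suffix grouping above can be sketched as follows. This assumes the split follows the CLIP BPE vocabulary convention, where tokens that end a word carry a `</w>` marker and all other tokens are word beginnings or middles. The helper names `split_vocab` and `format_group` and the tiny vocabulary are hypothetical; the notebook's actual logic may differ.

```python
END_OF_WORD = "</w>"  # marker CLIP's BPE vocab puts on word-final tokens

def split_vocab(vocab):
    """Split tokens into prefix ("imperi-") and suffix ("vfx ") display forms."""
    prefix, suffix = [], []
    for tok in vocab:
        if tok.endswith(END_OF_WORD):
            # Word-final token: drop the marker, show a trailing space.
            suffix.append(tok[: -len(END_OF_WORD)] + " ")
        else:
            # Token that continues into another token: show a trailing hyphen.
            prefix.append(tok + "-")
    return prefix, suffix

def format_group(items):
    """Join items into the {a|b|c} form used in the output above."""
    return "{" + "|".join(items) + "}"

# Tiny made-up vocabulary for illustration.
vocab = ["blue</w>", "imperi", "war</w>", "blu"]
prefix, suffix = split_vocab(vocab)
print(format_group(suffix))
print(format_group(prefix))
```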


[–] erenkoylu@lemmy.ml 1 points 1 month ago (1 children)

Aya-based LLMs are extremely powerful. Same with Qwen.

[–] AdComfortable1514@lemmy.world 2 points 1 month ago (1 children)

That's good to know. I'll try them out. Thanks.

[–] erenkoylu@lemmy.ml 1 points 1 month ago (1 children)

Qwen2.5 just came out and looks amazing. You can try it with Ollama.

[–] AdComfortable1514@lemmy.world 2 points 1 month ago

Wow, yeah, I found a demo here: https://huggingface.co/spaces/Qwen/Qwen2.5

A whole host of LLM models seem to have been released. Thanks for the tip!

I'll see if I can turn them into something useful 👍