Resolves spatial discrepancies by geometrically matching the structure image to the appearance anchor via a two-step scale-and-translate operation, establishing a stable foundation for fusion.
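The two-step scale-and-translate alignment can be sketched on binary masks. This is a minimal illustration, not the paper's implementation: the function names (`bbox`, `scale_and_translate`) and the nearest-neighbour resize are assumptions; the key idea is that the structure's bounding box is rescaled to the anchor's size, then translated to the anchor's position.

```python
import numpy as np

def bbox(mask):
    """Return (top, left, bottom, right) of the nonzero region of a binary mask."""
    ys, xs = np.nonzero(mask)
    return int(ys.min()), int(xs.min()), int(ys.max()) + 1, int(xs.max()) + 1

def scale_and_translate(structure, anchor):
    """Hypothetical pre-alignment sketch: rescale the structure mask so its
    bounding box matches the anchor's, then translate it into place."""
    t0, l0, b0, r0 = bbox(structure)
    t1, l1, b1, r1 = bbox(anchor)
    crop = structure[t0:b0, l0:r0]
    # Step 1 (scale): nearest-neighbour resize of the cropped structure region
    # to the anchor's bounding-box size.
    h1, w1 = b1 - t1, r1 - l1
    rows = (np.arange(h1) * crop.shape[0] / h1).astype(int)
    cols = (np.arange(w1) * crop.shape[1] / w1).astype(int)
    resized = crop[rows][:, cols]
    # Step 2 (translate): paste the resized region at the anchor's position.
    out = np.zeros_like(anchor)
    out[t1:b1, l1:r1] = resized
    return out
```

After alignment, the structure's bounding box coincides with the anchor's, so later fusion steps operate on spatially consistent inputs.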
Directly injects fine-grained identity without corrupting the target structure by caching appearance Key and Value pairs and selectively infusing them into self-attention blocks based on semantic matching.
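The injection can be illustrated with a toy single-head attention layer. This is a simplified numpy sketch under stated assumptions, not the actual U-Net hook: `inject_appearance` and the boolean `match` mask standing in for semantic matching are hypothetical names; the point is that matched query positions attend over cached appearance K/V while unmatched positions keep the target's own K/V, leaving the structure untouched there.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Standard scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def inject_appearance(q, k, v, k_app, v_app, match):
    """Hypothetical KV-injection sketch: query positions flagged by `match`
    (a boolean mask from semantic matching) attend over the cached
    appearance Keys/Values; all other positions use the target's own K/V."""
    out = attention(q, k, v)
    out[match] = attention(q[match], k_app, v_app)
    return out
```

Because only the matched rows are overwritten, the unmatched positions are bit-identical to plain self-attention, which is how the target structure is preserved in this sketch.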
Reinterprets the semantic discrepancy between the text and appearance as a controllable correction signal ($\Delta$), smoothly mediating conflicts by steering generation purely toward the textual changes without destroying the base identity.
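One way to realize such a correction on top of classifier-free guidance is sketched below. This is an assumed formulation, not the paper's exact update rule: `corrected_guidance`, the `delta_weight` knob, and the specific combination of noise predictions are illustrative. The idea is to keep the appearance-conditioned prediction as the guidance base (preserving identity) and add the text-minus-appearance gap as the correction $\Delta$.

```python
import numpy as np

def corrected_guidance(eps_uncond, eps_text, eps_app, scale=7.5, delta_weight=1.0):
    """Hypothetical guidance-correction sketch: the discrepancy between the
    text-conditioned and appearance-conditioned noise predictions is treated
    as a correction signal Delta added on top of standard CFG."""
    delta = eps_text - eps_app                           # semantic discrepancy
    base = eps_uncond + scale * (eps_app - eps_uncond)   # CFG anchored on appearance
    return base + delta_weight * delta                   # steer only by the textual change
```

When the text and appearance agree, $\Delta$ vanishes and the update reduces to ordinary guidance, so the base identity is untouched; `delta_weight` then controls how strongly the textual change is applied.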
We have introduced SeAl, a novel training-free framework that achieves comprehensive semantic alignment by integrating geometric pre-processing, U-Net attention control, and guidance manipulation during sampling. The framework sequentially resolves structural misalignment between the control images, infuses the subject's identity via direct attention injection, and finally incorporates the user's textual intent through guidance correction.
SeAl resolves the long-standing trade-off in controllable T2I diffusion between text-prompt alignment, structural fidelity, and appearance identity, generating high-quality images that accurately reflect user intent. Notably, SeAl remains robust and stable in preserving both structure and appearance, even for challenging subjects with distinct identities where prior methods have struggled. SeAl thus provides a foundation for future advances in image synthesis that align more faithfully with complex user intent.
Coming soon.