Resolves spatial discrepancies by geometrically matching the structure image to the appearance anchor via a two-step scale-and-translate operation, establishing a stable foundation for fusion.
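The two-step scale-and-translate alignment can be sketched on binary masks. This is a minimal illustration, not the paper's implementation: the function names (`bbox`, `scale_and_translate`) and the nearest-neighbour resize are assumptions; the key idea is that the structure's bounding box is rescaled to the anchor's size, then translated to the anchor's position.

```python
import numpy as np

def bbox(mask):
    """Return (top, left, bottom, right) of the nonzero region of a binary mask."""
    ys, xs = np.nonzero(mask)
    return int(ys.min()), int(xs.min()), int(ys.max()) + 1, int(xs.max()) + 1

def scale_and_translate(structure, anchor):
    """Hypothetical pre-alignment sketch: rescale the structure mask so its
    bounding box matches the anchor's, then translate it into place."""
    t0, l0, b0, r0 = bbox(structure)
    t1, l1, b1, r1 = bbox(anchor)
    crop = structure[t0:b0, l0:r0]
    # Step 1 (scale): nearest-neighbour resize of the cropped structure region
    # to the anchor's bounding-box size.
    h1, w1 = b1 - t1, r1 - l1
    rows = (np.arange(h1) * crop.shape[0] / h1).astype(int)
    cols = (np.arange(w1) * crop.shape[1] / w1).astype(int)
    resized = crop[rows][:, cols]
    # Step 2 (translate): paste the resized region at the anchor's position.
    out = np.zeros_like(anchor)
    out[t1:b1, l1:r1] = resized
    return out
```

After alignment, the structure's bounding box coincides with the anchor's, so later fusion steps operate on spatially consistent inputs.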
Directly injects fine-grained identity without corrupting the target structure by caching appearance Key and Value pairs and selectively infusing them into self-attention blocks based on semantic matching.
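The injection can be illustrated with a toy single-head attention layer. This is a simplified numpy sketch under stated assumptions, not the actual U-Net hook: `inject_appearance` and the boolean `match` mask standing in for semantic matching are hypothetical names; the point is that matched query positions attend over cached appearance K/V while unmatched positions keep the target's own K/V, leaving the structure untouched there.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Standard scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def inject_appearance(q, k, v, k_app, v_app, match):
    """Hypothetical KV-injection sketch: query positions flagged by `match`
    (a boolean mask from semantic matching) attend over the cached
    appearance Keys/Values; all other positions use the target's own K/V."""
    out = attention(q, k, v)
    out[match] = attention(q[match], k_app, v_app)
    return out
```

Because only the matched rows are overwritten, the unmatched positions are bit-identical to plain self-attention, which is how the target structure is preserved in this sketch.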
Reinterprets the semantic discrepancy between the text and appearance as a controllable correction signal ($\Delta$), smoothly mediating conflicts by steering generation purely toward the textual changes without destroying the base identity.
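One way to realize such a correction on top of classifier-free guidance is sketched below. This is an assumed formulation, not the paper's exact update rule: `corrected_guidance`, the `delta_weight` knob, and the specific combination of noise predictions are illustrative. The idea is to keep the appearance-conditioned prediction as the guidance base (preserving identity) and add the text-minus-appearance gap as the correction $\Delta$.

```python
import numpy as np

def corrected_guidance(eps_uncond, eps_text, eps_app, scale=7.5, delta_weight=1.0):
    """Hypothetical guidance-correction sketch: the discrepancy between the
    text-conditioned and appearance-conditioned noise predictions is treated
    as a correction signal Delta added on top of standard CFG."""
    delta = eps_text - eps_app                           # semantic discrepancy
    base = eps_uncond + scale * (eps_app - eps_uncond)   # CFG anchored on appearance
    return base + delta_weight * delta                   # steer only by the textual change
```

When the text and appearance agree, $\Delta$ vanishes and the update reduces to ordinary guidance, so the base identity is untouched; `delta_weight` then controls how strongly the textual change is applied.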
We have introduced SeAl, a novel training-free framework that achieves comprehensive semantic alignment by integrating geometric pre-processing, U-Net attention control, and guidance manipulation during sampling. The framework sequentially resolves structural misalignment between the control images, infuses the subject's identity via direct attention injection, and finally incorporates the user's textual intent through guidance correction.
SeAl resolves the long-standing trade-off in controllable T2I diffusion between text-prompt alignment, structural fidelity, and appearance identity, generating high-quality images that accurately reflect user intent. Notably, SeAl remains robust and stable in preserving both structure and appearance, even for challenging subjects with distinct identities where prior methods have struggled. SeAl thus provides a foundation for future advances in image synthesis that align more faithfully with complex user intent.
Coming soon.