A curated collection of papers, models, and resources for the field of Text-to-Image Generation.
Note
This repository is proudly maintained by the frontline research mentors at QuenithAI (应达学术). It aims to provide the most comprehensive and cutting-edge map of papers and technologies in the field of Text-to-Image generation.
Your contributions are also vital: feel free to open an issue or submit a pull request to become a collaborator on this repository. We look forward to your participation!
If you require expert 1-on-1 guidance on your submissions to top-tier conferences and journals, we invite you to contact us via WeChat or E-mail.
This repository is built and continuously maintained by the frontline research mentor team at QuenithAI (应达学术), aiming to bring you the most comprehensive and up-to-date collection of papers in the text-to-image field.
Your contributions are vital to us and to the community. We sincerely invite you to open an issue or submit a pull request to become a collaborator on this project; we look forward to having you join us!
⚡ Latest Updates
- (Sep 21st, 2025): Added a new direction: 🔄 Unified Generation and Understanding.
- (Aug 21st, 2025): Added a new direction: 🎨 Personalized Image Generation.
- (Aug 20th, 2025): Initial commit and repository structure established.
- [CVPR 2025] PreciseCam: Precise Camera Control for Text-to-Image Generation
- [CVPR 2025] Type‑R: Automatically Retouching Typos for Text‑to‑Image Generation
- [CVPR 2025] Compass Control: Multi Object Orientation Control for Text‑to‑Image Generation
- [CVPR 2025] Generative Photography: Scene‑Consistent Camera Control for Realistic Text‑to‑Image Synthesis
- [CVPR 2025] One‑Way Ticket: Time‑Independent Unified Encoder for Distilling Text‑to‑Image Diffusion Models
- [CVPR 2025] Text Embedding is Not All You Need: Attention Control for Text‑to‑Image Semantic Alignment with Text Self‑Attention Maps
- [CVPR 2025] Towards Uncertainty: Understanding and Quantifying Uncertainty for Text‑to‑Image Generation
- [CVPR 2025] Responsible Diffusion: Plug‑and‑Play Interpretable Responsible Text‑to‑Image Generation via Dual‑Space Multi‑faceted Concept Control
- [CVPR 2025] Make It Count: Text‑to‑Image Generation with an Accurate Number of Objects
- [CVPR 2025] MCCD: Multi‑Agent Collaboration‑based Compositional Diffusion for Complex Text‑to‑Image Generation
- [CVPR 2025] Debias‑SD: Rethinking Training for De‑biasing Text‑to‑Image Generation: Unlocking the Potential of Stable Diffusion
- [CVPR 2025] ShapeWords: Guiding Text‑to‑Image Synthesis with 3D Shape‑Aware Prompts
- [CVPR 2025] SnapGen: Taming High‑Resolution Text‑to‑Image Models for Mobile Devices with Efficient Architectures and Training
- [CVPR 2025] STORM: Spatial Transport Optimization by Repositioning Attention Map for Training‑Free Text‑to‑Image Synthesis
- [CVPR 2025] Focus‑N‑Fix: Region‑Aware Fine‑Tuning for Text‑to‑Image Generation
- [CVPR 2025] SILMM: Self‑Improving Large Multimodal Models for Compositional Text‑to‑Image Generation
- [CVPR 2025] GLoCE: Localized Concept Erasure for Text‑to‑Image Diffusion Models Using Training‑Free Gated Low‑Rank Adaptation
- [CVPR 2025] Self‑Cross Guidance: Self‑Cross Diffusion Guidance for Text‑to‑Image Synthesis of Similar Subjects
- [CVPR 2025] Noise Diffusion: Enhancing Semantic Faithfulness in Text‑to‑Image Synthesis
- [CVPR 2025] PromptSampler: Learning to Sample Effective and Diverse Prompts for Text‑to‑Image Generation
- [CVPR 2025] STEREO: A Two‑Stage Framework for Adversarially Robust Concept Erasing from Text‑to‑Image Diffusion Models
- [CVPR 2025] MinorityPrompt: Minority‑Focused Text‑to‑Image Generation via Prompt Optimization
- [CVPR 2025] DistillT5: Scaling Down Text Encoders of Text‑to‑Image Diffusion Models
- [CVPR 2025] TIU: The Illusion of Unlearning: The Unstable Nature of Machine Unlearning in Text‑to‑Image Diffusion Models
- [CVPR 2025] Fuse‑DiT: Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text‑to‑Image Synthesis
- [CVPR 2025] Detect‑and‑Guide: Self‑regulation of Diffusion Models for Safe Text‑to‑Image Generation via Guideline Token Optimization
- [CVPR 2025] Multi‑Group T2I: Multi‑Group Proportional Representations for Text‑to‑Image Models
- [CVPR 2025] VODiff: Controlling Object Visibility Order in Text‑to‑Image Generation
- [CVPR 2025] Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator
- [CVPR 2025] Six‑CD: Benchmarking Concept Removals for Text-to-image Diffusion Models
- [CVPR 2025] ConceptGuard: Continual Personalized Text-to-Image Generation with Forgetting and Confusion Mitigation
- [CVPR 2025] ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting
- [ICLR 2025] Improving Long‑Text Alignment: Improving Long‑Text Alignment for Text‑to‑Image Diffusion Models
- [ICLR 2025] ITTA: Information Theoretic Text‑to‑Image Alignment
- [ICLR 2025] Meissonic: Revitalizing Masked Generative Transformers for Efficient High‑Resolution Text‑to‑Image Synthesis
- [ICLR 2025] PaRa: Personalizing Text‑to‑Image Diffusion via Parameter Rank Reduction
- [ICLR 2025] Fluid: Scaling Autoregressive Text‑to‑image Generative Models with Continuous Tokens
- [ICLR 2025] Prompt‑Pruning: Not All Prompts Are Made Equal – Prompt‑based Pruning of Text‑to‑Image Diffusion Models
- [ICLR 2025] Denoising AR Transformers: Denoising Autoregressive Transformers for Scalable Text‑to‑Image Generation
- [ICLR 2025] Progressive Compositionality: Progressive Compositionality in Text‑to‑Image Generative Models
- [ICLR 2025] Classifier Scores: Mining your own secrets: Diffusion Classifier Scores for Continual Personalization of Text‑to‑Image Diffusion Models
- [ICLR 2025] Engagement: Measuring and Improving Engagement of Text‑to‑Image Generation Models
- [ICLR 2025] Residual Gate Eraser: Concept Pinpoint Eraser for Text‑to-image Diffusion Models via Residual Attention Gate
- [ICLR 2025] Random Seeds: Enhancing Compositional Text‑to‑Image Generation with Reliable Random Seeds
- [ICLR 2025] One‑Prompt‑One‑Story: Free‑Lunch Consistent Text‑to‑Image Generation Using a Single Prompt
- [ICLR 2025] You Only Sample Once: Taming One‑Step Text‑to‑Image Synthesis by Self‑Cooperative Diffusion GANs
- [ICLR 2025] Copyright Revisiting: Rethinking Artistic Copyright Infringements in the Era of Text‑to‑Image Generative Models
- [ICLR 2025] Concept Combination Erasing: Erasing Concept Combination from Text‑to‑Image Diffusion Model
- [ICLR 2025] Cross‑Attention Patterns: Cross‑Attention Head Position Patterns Can Align with Human Visual Concepts in Text‑to‑Image Generative Models
- [ICLR 2025] TIGeR: Unifying Text‑to‑Image Generation and Retrieval with Large Multimodal Models
- [ICLR 2025] DGQ: Distribution‑Aware Group Quantization for Text‑to‑Image Diffusion Models
- [ICLR 2025] Jacobi Decoding: Accelerating Auto‑regressive Text‑to‑Image Generation with Training‑free Speculative Jacobi Decoding
- [ICLR 2025] PT‑T2I/V: An Efficient Proxy‑Tokenized Diffusion Transformer for Text‑to‑Image/Video Task
- [ICLR 2025] Gecko Evaluation: Revisiting Text‑to‑Image Evaluation with Gecko: on Metrics, Prompts, and Human Rating
- [ICLR 2025] SANA: Efficient High‑Resolution Text‑to‑Image Synthesis with Linear Diffusion Transformers
- [ICLR 2025] Rectified Flow: Text‑to‑Image Rectified Flow as Plug‑and‑Play Priors
- [ICLR 2025] Human Feedback Filtering: Automated Filtering of Human Feedback Data for Aligning Text‑to‑Image Diffusion Models
- [ICLR 2025] SAFREE: Training‑Free and Adaptive Guard for Safe Text‑to‑Image and Video Generation
- [ICLR 2025] IterComp: Iterative Composition‑Aware Feedback Learning from Model Gallery for Text‑to‑Image Generation
- [ICLR 2025] ScImage: How good are multimodal large language models at scientific text‑to‑image generation?
- [ICLR 2025] Score Distillation: Guided Score Identity Distillation for Data‑Free One‑Step Text‑to‑Image Generation
- [ICLR 2025] Causal Variation: Evaluating Semantic Variation in Text‑to‑Image Synthesis: A Causal Perspective
- Plot'n Polish: Zero‑shot Story Visualization and Disentangled Editing with Text‑to‑Image Diffusion Models
- Skywork UniPic 2.0: Building Kontext Model with Online RL for Unified Multimodal Model
- PromptEnhancer: A Simple Approach to Enhance Text‑to‑Image Models via Chain‑of‑Thought Prompt Rewriting
- From Editor to Dense Geometry Estimator
- Noisy Label Refinement with Semantically Reliable Synthetic Images
- MEPG: Multi‑Expert Planning and Generation for Compositionally‑Rich Image Generation
- Easier Painting Than Thinking: Can Text‑to‑Image Models Set the Stage, but Not Direct the Play?
- Fidelity‑preserving enhancement of ptychography with foundational text‑to‑image models
- Exploring Diffusion Models for Generative Forecasting of Financial Charts
- Data‑Driven Loss Functions for Inference‑Time Optimization in Text‑to‑Image Generation
- Palette Aligned Image Diffusion
- Draw‑In‑Mind: Learning Precise Image Editing via Chain‑of‑Thought Imagination
- Discrete Noise Inversion for Next‑scale Autoregressive Text‑based Image Editing
- Q‑Sched: Pushing the Boundaries of Few‑Step Diffusion Models with Quantization‑Aware Scheduling
- RealMat: Realistic Materials with Diffusion and Reinforcement Learning
- CompSlider: Compositional Slider for Disentangled Multiple‑Attribute Image Generation
- Prompting Away Stereotypes? Evaluating Bias in Text‑to‑Image Models for Occupations
- Multi‑Level CLS Token Fusion for Contrastive Learning in Endoscopy Image Classification
- HADIS: Hybrid Adaptive Diffusion Model Serving for Efficient Text‑to‑Image Generation
- AMCR: A Framework for Assessing and Mitigating Copyright Risks in Generative Models
- Reusing Computation in Text‑to‑Image Diffusion for Efficient Generation of Image Sets
- Understanding and evaluating computer vision models through the lens of counterfactuals
- Pref‑GRPO: Pairwise Preference Reward‑based GRPO for Stable Text‑to‑Image Reinforcement Learning
- Persode: Personalized Visual Journaling with Episodic Memory‑Aware AI Agent
- Describe, Don't Dictate: Semantic Image Editing with Natural Language Intent
- Safe‑Control: A Safety Patch for Mitigating Unsafe Content in Text‑to‑Image Generation Models
- Not Every Gift Comes in Gold Paper or with a Red Ribbon: Exploring Color Perception in Text‑to‑Image Models
- MonoRelief V2: Leveraging Real Data for High‑Fidelity Monocular Relief Recovery
- All‑in‑One Slider for Attribute Manipulation in Diffusion Models
- Visual‑CoG: Stage‑Aware Reinforcement Learning with Chain of Guidance for Text‑to‑Image Generation
- CEIDM: A Controlled Entity and Interaction Diffusion Model for Enhanced Text‑to‑Image Generation
- Instant Preference Alignment for Text‑to‑Image Diffusion Models
- T2I‑ReasonBench: Benchmarking Reasoning‑Informed Text‑to‑Image Generation
- Bias Amplification in Stable Diffusion's Representation of Stigma Through Skin Tones and Their Homogeneity
- An LLM‑LVLM Driven Agent for Iterative and Fine‑Grained Image Editing
- HiCache: Training‑free Acceleration of Diffusion Models via Hermite Polynomial‑based Feature Caching
- Delta‑SVD: Efficient Compression for Personalized Text‑to‑Image Models
- Improving Performance, Robustness, and Fairness of Radiographic AI Models with Finely‑Controllable Synthetic Data
- A Framework for Benchmarking Fairness‑Utility Trade‑offs in Text‑to‑Image Models via Pareto Frontiers
- A‑FloPS: Accelerating Diffusion Sampling with Adaptive Flow Path Sampler
- UniEM‑3M: A Universal Electron Micrograph Dataset for Microstructural Segmentation and Generation
- RAGSR: Regional Attention Guided Diffusion for Image Super‑Resolution
- Scaling Group Inference for Diverse and High‑Quality Generation
- Waver: Wave Your Way to Lifelike Video Generation
- GenTune: Toward Traceable Prompts to Improve Controllability of Image Refinement in Environment Design
- Side Effects of Erasing Concepts from Diffusion Models
- CurveFlow: Curvature‑Guided Flow Matching for Image Generation
- SATURN: Autoregressive Image Generation Guided by Scene Graphs
- MUSE: Multi‑Subject Unified Synthesis via Explicit Layout Semantic Expansion
- CTA‑Flux: Integrating Chinese Cultural Semantics into High‑Quality English Text‑to‑Image Communities
- Sealing The Backdoor: Unlearning Adversarial Text Triggers In Diffusion Models Using Knowledge Distillation
- Inference Time Debiasing Concepts in Diffusion Models
- Pixels Under Pressure: Exploring Fine‑Tuning Paradigms for Foundation Models in High‑Resolution Medical Imaging
- SAGA: Learning Signal‑Aligned Distributions for Improved Text‑to‑Image Generation
- UniECS: Unified Multimodal E‑Commerce Search Framework with Gated Cross‑modal Fusion
- DiffIER: Optimizing Diffusion Models with Iterative Error Reduction
- 7Bench: a Comprehensive Benchmark for Layout‑guided Text‑to‑image Models
- S²‑Guidance: Stochastic Self Guidance for Training‑Free Enhancement of Diffusion Models
- Single‑Reference Text‑to‑Image Manipulation with Dual Contrastive Denoising Score
- DeCoT: Decomposing Complex Instructions for Enhanced Text‑to‑Image Generation with Large Language Models
- Navigating the Exploration‑Exploitation Tradeoff in Inference‑Time Scaling of Diffusion Models
- SafeCtrl: Region‑Based Safety Control for Text‑to‑Image Diffusion via Detect‑Then‑Suppress
- LoRAtorio: An intrinsic approach to LoRA Skill Composition
- SPG: Style‑Prompting Guidance for Style‑Specific Content Creation
- Match & Choose: Model Selection Framework for Fine‑tuning Text‑to‑Image Diffusion Models
- NextStep‑1: Toward Autoregressive Image Generation with Continuous Tokens at Scale
- CountCluster: Training‑Free Object Quantity Guidance with Cross‑Attention Map Clustering for Text‑to‑Image Generation
- NanoControl: A Lightweight Framework for Precise and Efficient Control in Diffusion Transformer
- Translation of Text Embedding via Delta Vector to Suppress Strongly Entangled Content in Text‑to‑Image Diffusion Models
- High Fidelity Text to Image Generation with Contrastive Alignment and Structural Guidance
- Echo‑4o: Harnessing the Power of GPT‑4o Synthetic Images for Improved Image Generation
- WeDesign: Generative AI‑Facilitated Community Consultations for Urban Public Space Design
- Images Speak Louder Than Scores: Failure Mode Escape for Enhancing Generative Quality
- Dual Recursive Feedback on Generation and Appearance Latents for Pose‑Robust Text‑to‑Image Diffusion
- Understanding Dementia Speech Alignment with Diffusion‑Based Image Generation
- Per‑Query Visual Concept Learning
- TARA: Token‑Aware LoRA for Composable Personalization in Diffusion Models
- Exploring Palette based Color Guidance in Diffusion Models
- SafeFix: Targeted Model Repair via Controlled Image Generation
- CLUE: Leveraging Low‑Rank Adaptation to Capture Latent Uncovered Evidence for Image Forgery Localization
- CoAR: Concept Injection into Autoregressive Models for Personalized Text‑to‑Image Generation
- Multi‑task Adversarial Attacks against Black‑box Model with Few‑shot Queries
- Explainability‑in‑Action: Enabling Expressive Manipulation and Tacit Understanding by Bending Diffusion Models in ComfyUI
- Trustworthy Medical Imaging with Large Language Models: A Study of Hallucinations Across Modalities
- HiMat: DiT‑based Ultra‑High Resolution SVBRDF Generation
- CannyEdit: Selective Canny Control and Dual‑Prompt Guidance for Training‑Free Image Editing
- AR‑GRPO: Training Autoregressive Image Generation Models via Reinforcement Learning
- Talk2Image: A Multi‑Agent System for Multi‑Turn Image Generation and Editing
- Towards Effective Prompt Stealing Attack against Text‑to‑Image Diffusion Models
- Restage4D: Reanimating Deformable 3D Reconstruction from a Single Video
- VISTAR: A User‑Centric and Role‑Driven Benchmark for Text‑to‑Image Evaluation
- NEP: Autoregressive Image Editing via Next Editing Token Prediction
- Learning 3D Texture‑Aware Representations for Parsing Diverse Human Clothing and Body Parts
- UnGuide: Learning to Forget with LoRA‑Guided Diffusion Models
- Whose Truth? Pluralistic Geo‑Alignment for (Agentic) AI
- UNCAGE: Contrastive Attention Guidance for Masked Generative Transformers in Text‑to‑Image Generation
- Textual Inversion for Efficient Adaptation of Open‑Vocabulary Object Detectors Without Forgetting
- ACM Multimedia Grand Challenge on ENT Endoscopy Analysis
- [CVPR 2024] DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models
- [CVPR 2024] InstanceDiffusion: Instance-level Control for Image Generation
- [CVPR 2024] ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations
- [CVPR 2024] Instruct-Imagen: Image Generation with Multi-modal Instruction
- [CVPR 2024] Continuous 3D Words: Learning Continuous 3D Words for Text-to-Image Generation
- [CVPR 2024] HanDiffuser: Text-to-Image Generation With Realistic Hand Appearances
- [CVPR 2024] Rich Human Feedback: Rich Human Feedback for Text-to-Image Generation
- [CVPR 2024] MarkovGen: Structured Prediction for Efficient Text-to-Image Generation
- [CVPR 2024] Customization Assistant: Customization Assistant for Text-to-image Generation
- [CVPR 2024] ADI: Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation
- [CVPR 2024] UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs
- [CVPR 2024] Interpret Diffusion: Self-Discovering Interpretable Diffusion Latent Directions for Responsible Text-to-Image Generation
- [CVPR 2024] Tailored Visions: Enhancing Text-to-Image Generation with Personalized Prompt Rewriting
- [CVPR 2024] CoDi: Conditional Diffusion Distillation for Higher-Fidelity and Faster Image Generation
- [CVPR 2024] Arbitrary‑Scale Diffusion: Arbitrary-Scale Image Generation and Upsampling using Latent Diffusion Model and Implicit Neural Decoder
- [CVPR 2024] Human-Centric Priors: Towards Effective Usage of Human-Centric Priors in Diffusion Models for Text-based Human Image Generation
- [CVPR 2024] ElasticDiffusion: Training-free Arbitrary Size Image Generation
- [CVPR 2024] CosmicMan: A Text-to-Image Foundation Model for Humans
- [CVPR 2024] PanFusion: Taming Stable Diffusion for Text to 360° Panorama Image Generation
- [CVPR 2024] Intelligent Grimm: Open-ended Visual Storytelling via Latent Diffusion Models
- [CVPR 2024] Scalability: On the Scalability of Diffusion-based Text-to-Image Generation
- [CVPR 2024] MuLAn: A Multi Layer Annotated Dataset for Controllable Text-to-Image Generation
- [CVPR 2024] Multi-dimensional Preferences: Learning Multi-dimensional Human Preference for Text-to-Image Generation
- [CVPR 2024] Dynamic Prompts: Dynamic Prompt Optimizing for Text-to-Image Generation
- [CVPR 2024] Reinforcement Diversification: Training Diffusion Models Towards Diverse Image Generation with Reinforcement Learning
- [CVPR 2024] HypercGAN: Adversarial Text to Continuous Image Generation
- [CVPR 2024] EmoGen: Emotional Image Content Generation with Text-to-Image Diffusion Models
- [ECCV 2024] LaVi‑Bridge: Bridging Different Language Models and Generative Vision Models for Text-to-Image Generation
- [ECCV 2024] DiffPNG: Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model
- [ECCV 2024] SPRIGHT: Getting it Right: Improving Spatial Consistency in Text-to-Image Models
- [ECCV 2024] IndicTTI: Navigating Text-to-Image Generative Bias across Indic Languages
- [ECCV 2024] Safeguard T2I: Safeguard Text-to-Image Diffusion Models with Human Feedback Inversion
- [ECCV 2024] Reality-and-Fantasy: The Fabrication of Reality and Fantasy: Scene Generation with LLM-Assisted Prompt Interpretation
- [ECCV 2024] RECE: Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models
- [ECCV 2024] StyleTokenizer: Defining Image Style by a Single Instance for Controlling Diffusion Models
- [ECCV 2024] PEA-Diffusion: Parameter-Efficient Adapter with Knowledge Distillation in non-English Text-to-Image Generation
- [ECCV 2024] Skewed Relations T2I: Skews in the Phenomenon Space Hinder Generalization in Text-to-Image Generation
- [ECCV 2024] Parrot: Pareto-optimal Multi-Reward Reinforcement Learning Framework for Text-to-Image Generation
- [ECCV 2024] MobileDiffusion: Instant Text-to-Image Generation on Mobile Devices
- [ECCV 2024] PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
- [ECCV 2024] CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion
- [ICLR 2024] Patched Diffusion Models: Patched Denoising Diffusion Models For High-Resolution Image Synthesis
- [ICLR 2024] Relay Diffusion: Unifying diffusion process across resolutions for image synthesis
- [ICLR 2024] SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
- [ICLR 2024] Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis
- [ICLR 2024] PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
- [SIGGRAPH 2024] RGB↔X: Image Decomposition and Synthesis Using Material- and Lighting-aware Diffusion Models
- [AAAI 2024] Semantic-aware Augmentation: Semantic-aware Data Augmentation for Text-to-image Synthesis
- [AAAI 2024] Abstract Concepts: Text-to-Image Generation for Abstract Concepts
- Text-to-Image GAN with Pretrained Representations
- VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control
- INFELM: In-depth Fairness Evaluation of Large Text-To-Image Models
- Is Your Text-to-Image Model Robust to Caption Noise?
- DebiasDiff: Debiasing Text-to-image Diffusion Models with Self-discovering Latent Attribute Directions
- Explaining in Diffusion: Explaining a Classifier Through Hierarchical Semantics with Text-to-Image Diffusion Models
- FameBias: Embedding Manipulation Bias Attack in Text-to-Image Models
- EvalMuse-40K: A Reliable and Fine-Grained Benchmark with Comprehensive Human Annotations for Text-to-Image Generation Model Evaluation
- AEIOU: A Unified Defense Framework against NSFW Prompts in Text-to-Image Models
- Self-Corrected Flow Distillation for Consistent One-Step and Few-Step Text-to-Image Generation
- PromptLA: Towards Integrity Verification of Black-box Text-to-Image Diffusion Models
- GALOT: Generative Active Learning via Optimizable Zero-shot Text-to-image Generation
- What makes a good metric? Evaluating automatic metrics for text-to-image consistency
- Maybe you are looking for CroQS: Cross-modal Query Suggestion for Text-to-Image Retrieval
- CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models
- ArtAug: Enhancing Text-to-Image Generation through Synthesis-Understanding Interaction
- A Framework for Critical Evaluation of Text-to-Image Models: Integrating Art Historical Analysis, Artistic Exploration, and Critical Prompt Engineering
- Efficient Scaling of Diffusion Transformers for Text-to-Image Generation
- VersaGen: Unleashing Versatile Visual Control for Text-to-Image Synthesis
- Finding a Wolf in Sheep's Clothing: Combating Adversarial Text-To-Image Prompts with Text Summarization
- AlignGuard: Scalable Safety Alignment for Text-to-Image Generation
- SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training
- Context Canvas: Enhancing Text-to-Image Diffusion Models with Knowledge Graph-Based RAG
- DECOR: Decomposition and Projection of Text Embeddings for Text-to-Image Customization
- Fast Prompt Alignment for Text-to-Image Generation
- FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models
- Preference Adaptive and Sequential Text-to-Image Generation
- Boosting Alignment for Post-Unlearning Text-to-Image Generative Models
- Proactive Agents for Multi-Turn Text-to-Image Generation Under Uncertainty
- SILMM: Self-Improving Large Multimodal Models for Compositional Text-to-Image Generation
- Evaluating Hallucination in Text-to-Image Diffusion Models with Scene-Graph based Question-Answering Agent
- SleeperMark: Towards Robust Watermark against Fine-Tuning Text-to-image Diffusion Models
- LayerFusion: Harmonized Multi-Layer Text-to-Image Generation with Generative Priors
- T2I-FactualBench: Benchmarking the Factuality of Text-to-Image Models with Knowledge-Intensive Concepts
- BodyMetric: Evaluating the Realism of Human Bodies in Text-to-Image Generation
- Safeguarding Text-to-Image Generation via Inference-Time Prompt-Noise Optimization
- DynamicControl: Adaptive Condition Selection for Improved Text-to-Image Generation
- The Role of Text-to-Image Models in Advanced Style Transfer Applications: A Case Study with DALL-E 3
- Towards Understanding and Quantifying Uncertainty for Text-to-Image Generation
- ShapeWords: Guiding Text-to-Image Synthesis with 3D Shape-Aware Prompts
- ScImage: How Good Are Multimodal Large Language Models at Scientific Text-to-Image Generation?
- Cross-Attention Head Position Patterns Can Align with Human Visual Concepts in Text-to-Image Generative Models
- Generative Photography: Scene-Consistent Camera Control for Realistic Text-to-Image Synthesis
- Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis
- Continuous Concepts Removal in Text-to-image Diffusion Models
- Blind Inverse Problem Solving Made Easy by Text-to-Image Latent Diffusion
- Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models
- Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation
- QUOTA: Quantifying Objects with Text-to-Image Models for Any Domain
- DreamBlend: Advancing Personalized Fine-tuning of Text-to-Image Diffusion Models
- EFSA: Episodic Few-Shot Adaptation for Text-to-Image Retrieval
- Bridging the Gap: Aligning Text-to-Image Diffusion Models with Specific Feedback
- Self-Cross Diffusion Guidance for Text-to-Image Synthesis of Similar Subjects
- All Seeds Are Not Equal: Enhancing Compositional Text-to-Image Generation with Reliable Random Seeds
- An indicator for effectiveness of text-to-image guardrails utilizing the Single-Turn Crescendo Attack (STCA)
- Enhancing MMDiT-Based Text-to-Image Models for Similar Subject Generation
- Type-R: Automatically Retouching Typos for Text-to-Image Generation
- Reward Incremental Learning in Text-to-Image Generation
- ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting
- Relations, Negations, and Numbers: Looking for Logic in Generative Text-to-Image Models
- Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis
- Unlocking the Potential of Text-to-Image Diffusion with PAC-Bayesian Theory
- CoCoNO: Attention Contrast-and-Complete for Initial Noise Optimization in Text-to-Image Synthesis
- Text-to-Image Synthesis: A Decade Survey
- In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models
- [CVPR 2023] GigaGAN: Scaling Up GANs for Text-to-Image Synthesis
- [CVPR 2023] ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model With Knowledge-Enhanced Mixture-of-Denoising-Experts
- [CVPR 2023] Shifted Diffusion: Shifted Diffusion for Text-to-image Generation
- [CVPR 2023] GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis
- [CVPR 2023] Specialist Diffusion: Plug-and-Play Sample-Efficient Fine-Tuning of Text-to-Image Diffusion Models to Learn Any Unseen Style
- [CVPR 2023] Verifiable Evaluation: Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation
- [CVPR 2023] RIATIG: Reliable and Imperceptible Adversarial Text-to-Image Generation with Natural Prompts
- [CVPR 2023] Custom Diffusion: Multi-Concept Customization of Text-to-Image Diffusion
- [ICCV 2023] DiffFit: Unlocking Transferability of Large Diffusion Models via Simple Parameter-Efficient Fine-Tuning
- [NeurIPS 2023] ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation
- [NeurIPS 2023] RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths
- [NeurIPS 2023] Linguistic Binding: Linguistic Binding in Diffusion Models: Enhancing Attribute Correspondence through Attention Map Alignment
- [NeurIPS 2023] DenseDiffusion: Dense Text-to-Image Generation with Attention Modulation
- [ICLR 2023] Structured Diffusion Guidance: Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis
- [ICML 2023] StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis
- [ICML 2023] Muse: Text-To-Image Generation via Masked Generative Transformers
- [ICML 2023] UniDiffusers: One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale
- [ACM MM 2023] SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models
- [ACM MM 2023] ControlStyle: Text-Driven Stylized Image Generation Using Diffusion Priors
- [SIGGRAPH 2023] Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models
- LaCon: Late-Constraint Diffusion for Steerable Guided Image Synthesis
- RenAIssance: A Survey into AI Text‑to‑Image Generation in the Era of Large Model
- Intriguing Properties of Diffusion Models: An Empirical Study of the Natural Attack Capability in Text‑to‑Image Generative Models
- Dense Text‑to‑Image Generation with Attention Modulation
- AltDiffusion: A Multilingual Text‑to‑Image Diffusion Model
- Noisy‑Correspondence Learning for Text‑to‑Image Person Re‑identification
- Likelihood‑Based Text‑to‑Image Evaluation with Patch‑Level Perceptual and Semantic Credit Assignment
- Learning to Generate Semantic Layouts for Higher Text‑Image Correspondence in Text‑to‑Image Synthesis
- MarkovGen: Structured Prediction for Efficient Text‑to‑Image Generation
- IP‑Adapter: Text Compatible Image Prompt Adapter for Text‑to‑Image Diffusion Models
- Masked‑Attention Diffusion Guidance for Spatially Controlling Text‑to‑Image Generation
- PromptPaint: Steering Text‑to‑Image Generation Through Paint Medium‑like Interactions
- LayoutLLM‑T2I: Eliciting Layout Guidance from LLM for Text‑to‑Image Generation
- Circumventing Concept Erasure Methods for Text‑to‑Image Generative Models
- The Bias Amplification Paradox in Text‑to‑Image Generation
- BAGM: A Backdoor Attack for Manipulating Text‑to‑Image Generative Models
- Subject‑Diffusion: Open Domain Personalized Text‑to‑Image Generation without Test‑time Fine‑tuning
- BoxDiff: Text‑to‑Image Synthesis with Training‑Free Box‑Constrained Diffusion
- Beyond the ML Model: Applying Safety Engineering Frameworks to Text‑to‑Image Development
- Distilling Knowledge from Text‑to‑Image Generative Models Improves Visio‑Linguistic Reasoning in CLIP
- Text‑guided Image Restoration and Semantic Enhancement for Text‑to‑Image Person Retrieval
- PromptMagician: Interactive Prompt Engineering for Text‑to‑Image Creation
- PromptCrafter: Crafting Text‑to‑Image Prompt through Mixed‑Initiative Dialogue with LLM
- Image Captions are Natural Prompts for Text‑to‑Image Models
- Analysing Gender Bias in Text‑to‑Image Models using Object Detection
- Can Pre‑Trained Text‑to‑Image Models Generate Visual Goals for Reinforcement Learning?
- Fast Adaptation with Bradley‑Terry Preference Models in Text‑To‑Image Classification and Generation
- HyperDreamBooth: HyperNetworks for Fast Personalization of Text‑to‑Image Models
- Domain‑Agnostic Tuning‑Encoder for Fast Personalization of Text‑To‑Image Models
- T2I‑CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text‑to‑image Generation
- Towards Safe Self‑Distillation of Internet‑Scale Text‑to‑Image Diffusion Models
- TIAM – A Metric for Evaluating Alignment in Text‑to‑Image Generation
- Articulated 3D Head Avatar Generation using Text‑to‑Image Diffusion Models
- Divide, Evaluate, and Refine: Evaluating and Improving Text‑to‑Image Alignment with Iterative VQA Feedback
- AnimateDiff: Animate Your Personalized Text‑to‑Image Diffusion Models without Specific Tuning
- Augmenters at SemEval‑2023 Task 1: Enhancing CLIP in Handling Compositionality and Ambiguity for Zero‑Shot Visual WSD through Prompt Augmentation and Text‑To‑Image Diffusion
- Typology of Risks of Generative Text‑to‑Image Models
- DIAGNOSIS: Detecting Unauthorized Data Usages in Text‑to‑image Diffusion Models
- On the Cultural Gap in Text‑to‑Image Generation
- Counting Guidance for High Fidelity Text‑to‑Image Synthesis
- CLIPAG: Towards Generator‑Free Text‑to‑Image Generation
- Localized Text‑to‑Image Generation for Free via Cross Attention Control
- A‑STAR: Test‑time Attention Segregation and Retention for Text‑to‑image Synthesis
- Text‑Anchored Score Composition: Tackling Condition Misalignment in Text‑to‑Image Diffusion Models
- Zero‑shot spatial layout conditioning for text‑to‑image diffusion models
- The Cultivated Practices of Text‑to‑Image Generation
- Point‑Cloud Completion with Pretrained Text‑to‑image Diffusion Models
- Energy‑Efficient Downlink Semantic Generative Communication with Text‑to‑Image Generators
- WOUAF: Weight Modulation for User Attribution and Fingerprinting in Text‑to‑Image Diffusion Models
- ConceptBed: Evaluating Concept Learning Abilities of Text‑to‑Image Diffusion Models
- Composition and Deformance: Measuring Imageability with a Text‑to‑Image Model
- Detector Guidance for Multi‑Object Text‑to‑Image Generation
- Word‑Level Explanations for Analyzing Bias in Text‑to‑Image Models
- Multilingual Conceptual Coverage in Text‑to‑Image Models
- Video Colorization with Pre‑trained Text‑to‑Image Diffusion Models
- StyleDrop: Text‑to‑Image Generation in Any Style
- StableRep: Synthetic Images from Text‑to‑Image Models Make Strong Visual Representation Learners
- SnapFusion: Text‑to‑Image Diffusion Model on Mobile Devices within Two Seconds
- ViCo: Plug‑and‑play Visual Condition for Personalized Text‑to‑image Generation
- T2IAT: Measuring Valence and Stereotypical Biases in Text‑to‑Image Generation
- ReFACT: Updating Text‑to‑Image Models by Editing the Text Encoder
- Wuerstchen: An Efficient Architecture for Large‑Scale Text‑to‑Image Diffusion Models
- RealignDiff: Boosting Text‑to‑Image Diffusion Model with Coarse‑to‑fine Semantic Re‑alignment
- Translation‑Enhanced Multilingual Text‑to‑Image Generation
- Controllable Text‑to‑Image Generation with GPT‑4
- RAPHAEL: Text‑to‑Image Generation via Large Mixture of Diffusion Paths
- VA3: Virtually Assured Amplification Attack on Probabilistic Copyright Protection for Text‑to‑Image Generative Models
- [AAAI 2025] Simple-ControlNet: Simplifying Control Mechanism in Text-to-Image Diffusion
- [AAAI 2025] EMControl: Adding Conditional Control to Text-to-Image Diffusion Models via EM
- [AAAI 2025] Local Conditional Controlling for Text-to-Image Diffusion Models
- [AAAI 2025] VersaGen: Versatile Visual Control for Text-to-Image Diffusion
- [ICLR 2025] IFAdapter: Instance Feature Control for Grounded T2I
- [ICLR 2025] LayerFusion / Harmonized Multi-Layer T2I (Foreground+Background)
- [ICLR 2025] Enhancing Compositional T2I with Reliable Random Seeds
- Condition Weaving Meets Expert Modulation: Towards Universal and Controllable Image Generation
- SafeFix: Targeted Model Repair via Controlled Image Generation
- MultiRef: Controllable Image Generation with Multiple Visual References
- DivControl: Knowledge Diversion for Controllable Image Generation
- A Practical Investigation of Spatially-Controlled Image Generation with Transformers
- ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning
- Dual‑Process Image Generation
- AP‑CAP: Advancing High‑Quality Data Synthesis for Animal Pose Estimation via a Controllable Image Generation Pipeline
- STAY Diffusion: Styled Layout Diffusion Model for Diverse Layout‑to‑Image Generation
- Contract‑Inspired Contest Theory for Controllable Image Generation in Mobile Edge Metaverse
- Grounding Text‑to‑Image Diffusion Models for Controlled High‑Quality Image Generation
- Test‑time Controllable Image Generation by Explicit Spatial Constraint Enforcement
- EliGen: Entity‑Level Controlled Image Generation with Regional Attention
- TIDE: Achieving Balanced Subject‑Driven Image Generation via Target‑Instructed Diffusion Enhancement
- LEARN: A Story‑Driven Layout‑to‑Image Generation Framework for STEM Instruction
- Locality‑aware Parallel Decoding for Efficient Autoregressive Image Generation
- Hyperspectral Image Generation with Unmixing Guided Diffusion Model
- Conditional Panoramic Image Generation via Masked Autoregressive Modeling
- Context‑Aware Autoregressive Models for Multi‑Conditional Image Generation
- [CVPR 2024] PLACE: Adaptive Layout‑Semantic Fusion for Semantic Image Synthesis
- [CVPR 2024] One‑Shot Structure‑Aware Stylized Image Synthesis
- [CVPR 2024] Attention Refocusing: Grounded Text‑to‑Image Synthesis with Attention Refocusing
- [CVPR 2024] CFLD: Coarse‑to‑Fine Latent Diffusion for Pose‑Guided Person Image Synthesis
- [CVPR 2024] DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception
- [CVPR 2024] CAN: Condition‑Aware Neural Network for Controlled Image Generation
- [CVPR 2024] SceneDiffusion: Move Anything with Layered Scene Diffusion
- [CVPR 2024] Zero‑Painter: Training‑Free Layout Control for Text‑to‑Image Synthesis
- [CVPR 2024] MIGC: Multi‑Instance Generation Controller for Text‑to‑Image Synthesis
- [CVPR 2024] FreeControl: Training‑Free Spatial Control of Any Text‑to‑Image Diffusion Model with Any Condition
- [ECCV 2024] PreciseControl: Enhancing Text‑To‑Image Diffusion Models with Fine‑Grained Attribute Control
- [ECCV 2024] AnyControl: Create Your Artwork with Versatile Control on Text‑to‑Image Generation
- [NeurIPS 2024] Ctrl‑X: Controlling Structure and Appearance for Text‑To‑Image Generation Without Guidance
- [ICLR 2024] PCDMs: Advancing Pose‑Guided Image Synthesis with Progressive Conditional Diffusion Models
- [WACV 2024] Layout Control with Cross‑Attention Guidance: Training‑Free Layout Control with Cross‑Attention Guidance
- [AAAI 2024] SSMG: Spatial‑Semantic Map Guided Diffusion Model for Free‑form Layout‑to‑image Generation
- [AAAI 2024] Attention Map Control: Compositional Text‑to‑Image Synthesis with Attention Map Control of Diffusion Models
- UNIC‑Adapter: Unified Image‑instruction Adapter with Multi‑modal Transformer for Image Generation
- Steering Rectified Flow Models in the Vector Field for Controlled Image Generation
- Enhancing Weakly Supervised Semantic Segmentation for Fibrosis via Controllable Image Generation
- CtrLoRA: An Extensible and Efficient Framework for Controllable Image Generation
- ControlAR: Controllable Image Generation with Autoregressive Models
- BrainDreamer: Reasoning‑Coherent and Controllable Image Generation from EEG Brain Signals via Language Guidance
- CSGO: Content‑Style Composition in Text‑to‑Image Generation
- MUSES: 3D‑Controllable Image Generation via Multi‑Modal Agent Collaboration
- Prompt‑Consistency Image Generation (PCIG): A Unified Framework Integrating LLMs, Knowledge Graphs, and Controllable Diffusion Models
- Controllable Image Generation With Composed Parallel Token Prediction
- Condition‑Aware Neural Network for Controlled Image Generation
- Refining Text‑to‑Image Generation: Towards Accurate Training‑Free Glyph‑Enhanced Image Generation
- GazeFusion: Saliency‑Guided Image Generation
- TCIG: Two‑Stage Controlled Image Generation with Quality Enhancement through Diffusion
- Text2Street: Controllable Text‑to‑Image Generation for Street Views
- Spatial‑Aware Latent Initialization for Controllable Image Generation
- PIXART‑δ: Fast and Controllable Image Generation with Latent Consistency Models
- OmniControlNet: Dual‑stage Integration for Conditional Image Generation
- [CVPR 2023] GLIGEN: Open-Set Grounded Text-to-Image Generation
- [CVPR 2022] Autoregressive Image Generation using Residual Quantization
- [CVPR 2023] SpaText: Spatio-Textual Representation for Controllable Image Generation
- [CVPR 2022] Text to Image Generation with Semantic-Spatial Aware GAN
- [CVPR 2023] ReCo: Region-Controlled Text-to-Image Generation
- [CVPR 2023] LayoutDiffusion: Controllable Diffusion Model for Layout-to-Image Generation
- [ICLR 2023] Ctrl-U: Robust Conditional Image Generation via Uncertainty-aware Reward Modeling
- [ICCV 2023] ControlNet: Adding Conditional Control to Text-to-Image Diffusion Models
- [ICCV 2023] SceneGenie: Scene Graph Guided Diffusion Models for Image Synthesis
- [ICCV 2023] ZestGuide: Zero-Shot Spatial Layout Conditioning for Text-to-Image Diffusion Models
- [ICML 2023] Composer: Creative and Controllable Image Synthesis with Composable Conditions
- [ICML 2023] MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation
- [SIGGRAPH 2023] Sketch-Guided Text-to-Image Diffusion Models
- [NeurIPS 2023] Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models
- [NeurIPS 2023] Prompt Diffusion: In-Context Learning Unlocked for Diffusion Models
- [WACV 2023] More Control for Free!: Image Synthesis with Semantic Diffusion Guidance
- [ACM MM 2023] LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation
- Diffusion Self‑Guidance for Controllable Image Generation
- Robust Image Ordinal Regression with Controllable Image Generation
- Controllable Image Generation via Collage Representations
- Diagnostic Benchmark and Iterative Inpainting for Layout‑Guided Image Generation
- NoisyTwins: Class‑Consistent and Diverse Image Generation through StyleGANs
- GlueGen: Plug and Play Multi‑modal Encoders for X‑to‑image Generation
- MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation
- [CVPR 2025] SerialGen: Personalized Image Generation by First Standardization Then Personalization
- [CVPR 2025] PatchDPO: Patch-level DPO for Finetuning-free Personalized Image Generation
- [CVPR 2025] DreamCache: Finetuning-Free Lightweight Personalized Image Generation via Feature Caching
- [NeurIPS 2025] MS-Diffusion: Multi-Subject Zero-shot Image Personalization with Layout Guidance
- [NeurIPS 2025] ClassDiffusion: More Aligned Personalization Tuning with Explicit Class Guidance
- [NeurIPS 2025] DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation
- [NeurIPS 2025] TweedieMix: Improving Multi-Concept Fusion for Diffusion-based Image/Video Generation
- FocusDPO: Dynamic Preference Optimization for Multi-Subject Personalized Image Generation via Adaptive Focus
- MM-R1: Unleashing the Power of Unified Multimodal Large Language Models for Personalized Image Generation
- Anti-Tamper Protection for Unauthorized Individual Image Generation
- Improving Personalized Image Generation through Social Context Feedback
- A Training-Free Style‑Personalization via Scale‑wise Autoregressive Model
- Personalized Image Generation from an Author Writing Style
- TaleForge: Interactive Multimodal System for Personalized Story Creation
- AlignGen: Boosting Personalized Image Generation with Cross‑Modality Prior Alignment
- RAGAR: Retrieval Augmented Personalized Image Generation Guided by Recommendation
- DRC: Enhancing Personalized Image Generation via Disentangled Representation Composition
- Personalized Text‑to‑Image Generation with Auto‑Regressive Models
- AC‑LoRA: Auto Component LoRA for Personalized Artistic Style Image Generation
- Single Image Iterative Subject‑driven Generation and Editing
- Personalize Anything for Free with Diffusion Transformer
- Towards More Accurate Personalized Image Generation: Addressing Overfitting and Evaluation Bias
- Conceptrol: Concept Control of Zero‑shot Personalized Image Generation
- Personalized Image Generation with Deep Generative Models: A Decade Survey
- Beyond Fine‑Tuning: A Systematic Study of Sampling Techniques in Personalized Image Generation
- Enhanced Multi‑Scale Cross‑Attention for Person Image Generation
- SceneBooth: Diffusion‑based Framework for Subject‑preserved Text‑to‑Image Generation
- [CVPR 2024] Cross Initialization: Personalized Text‑to‑Image Generation
- [CVPR 2024] When StyleGAN Meets Stable Diffusion: a W+ Adapter for Personalized Image Generation
- [CVPR 2024] Style Aligned: Image Generation via Shared Attention
- [CVPR 2024] InstantBooth: Personalized Text‑to‑Image Generation without Test‑Time Finetuning
- [CVPR 2024] High Fidelity: Person‑centric Subject‑to‑Image Synthesis
- [CVPR 2024] RealCustom: Narrowing Real Text Word for Real‑Time Open‑Domain Text‑to‑Image Customization
- [CVPR 2024] DisenDiff: Attention Calibration for Disentangled Text‑to‑Image Personalization
- [CVPR 2024] FreeCustom: Tuning‑Free Customized Image Generation for Multi‑Concept Composition
- [CVPR 2024] Personalized Residuals for Concept‑Driven Text‑to‑Image Generation
- [CVPR 2024] Subject‑Agnostic Guidance: Improving Subject‑Driven Image Synthesis
- [CVPR 2024] JeDi: Joint‑Image Diffusion Models for Finetuning‑Free Personalized Text‑to‑Image Generation
- [CVPR 2024] Influence Watermarks: Countering Personalized Text‑to‑Image Generation
- [CVPR 2024] PIA: Your Personalized Image Animator via Plug‑and‑Play Modules in Text‑to‑Image Models
- [CVPR 2024] SSR‑Encoder: Encoding Selective Subject Representation for Subject‑Driven Generation
- [ECCV 2024] Be Yourself: Bounded Attention for Multi‑Subject Text‑to‑Image Generation
- [ECCV 2024] Powerful and Flexible: Personalized Text‑to‑Image Generation via Reinforcement Learning
- [ECCV 2024] TIGC: Tuning‑Free Image Customization with Image and Text Guidance
- [ECCV 2024] MasterWeaver: Taming Editability and Face Identity for Personalized Text‑to‑Image Generation
- [NeurIPS 2024] RectifID: Personalizing Rectified Flow with Anchored Classifier Guidance
- [NeurIPS 2024] AttnDreamBooth: Towards Text‑Aligned Personalized Image Generation
- [AAAI 2024] Decoupled Textual Embeddings for Customized Image Generation
- HyperNet Fields: Efficiently Training Hypernetworks without Ground Truth by Learning Weight Trajectories
- PersonaMagic: Stage-Regulated High-Fidelity Face Customization with Tandem Equilibrium
- LoRACLR: Contrastive Adaptation for Customization of Diffusion Models
- Learning Flow Fields in Attention for Controllable Person Image Generation
- PatchDPO: Patch-level DPO for Finetuning-free Personalized Image Generation
- SerialGen: Personalized Image Generation by First Standardization Then Personalization
- Refine-by-Align: Reference-Guided Artifacts Refinement through Semantic Alignment
- DreamBlend: Advancing Personalized Fine-tuning of Text-to-Image Diffusion Models
- DreamCache: Finetuning-Free Lightweight Personalized Image Generation via Feature Caching
- Personalized Image Generation with Large Multimodal Models
- FaceChain-FACT: Face Adapter with Decoupled Training for Identity-preserved Personalization
- Resolving Multi-Condition Confusion for Finetuning-Free Personalized Image Generation (MIP-Adapter)
- Imagine Yourself: Tuning-Free Personalized Image Generation
- StoryMaker: Towards Holistic Consistent Characters in Text-to-Image Generation
- TextBoost: Towards One-Shot Personalization of Text-to-Image Models via Fine-tuning Text Encoder
- EZIGen: Enhancing Zero-shot Personalized Image Generation with Precise Subject Encoding and Decoupled Guidance
- ViPer: Visual Personalization of Generative Models via Individual Preference Learning
- Layout-and-Retouch: A Dual-stage Framework for Improving Diversity in Personalized Image Generation
- RectifID: Personalizing Rectified Flow with Anchored Classifier Guidance
- FreeTuner: Any Subject in Any Style with Training-free Diffusion
- InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation
- MoA: Mixture-of-Attention for Subject-Context Disentanglement in Personalized Image Generation
- CAT: Contrastive Adapter Training for Personalized Image Generation
- MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation
- MM-Diff: High-Fidelity Image Personalization via Multi-Modal Condition Integration
- IDAdapter: Learning Mixed Features for Tuning-Free Personalization of Text-to-Image Models
- Fast Personalized Text-to-Image Syntheses With Attention Injection
- Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition
- Beyond Inserting: Learning Identity Embedding for Semantic-Fidelity Personalized Diffusion Generation (SeFi-IDE)
- BootPIG: Bootstrapping Zero-shot Personalized Image Generation Capabilities in Pretrained Diffusion Models
- [CVPR 2023] Custom Diffusion: Multi-Concept Customization of Text-to-Image Diffusion
- [CVPR 2023] DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation
- [ICCV 2023] ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation
- [ICLR 2023] Textual Inversion: An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
- [SIGGRAPH Asia 2023] Break-A-Scene: Extracting Multiple Concepts from a Single Image
- [SIGGRAPH 2023] Encoder‑Based Domain Tuning for Fast Personalization of Text‑to‑Image Models
- [SIGGRAPH 2023] LayerDiffusion: Layered Controlled Image Editing with Diffusion Models
- PortraitBooth: A Versatile Portrait Model for Fast Identity-preserved Personalization
- Disentangled Representation Learning for Controllable Person Image Generation
- HiFi Tuner: High-Fidelity Subject-Driven Fine-Tuning for Diffusion Models
- When StyleGAN Meets Stable Diffusion: a W+ Adapter for Personalized Image Generation
- CatVersion: Concatenating Embeddings for Diffusion-Based Text-to-Image Personalization
- FaceChain: A Playground for Human-centric Artificial Intelligence Generated Content
- Subject-Diffusion: Open Domain Personalized Text-to-Image Generation without Test-time Fine-tuning
- FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention
- UPGPT: Universal Diffusion Model for Person Image Generation, Editing and Pose Transfer
- Identity Encoder for Personalized Diffusion
- InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning
- Semantically Consistent Person Image Generation
- Learning Invariance from Generated Variance for Unsupervised Person Re-identification
- [CVPR 2025] FDS: Frequency‑Aware Denoising Score for Text‑Guided Latent Diffusion Image Editing
- [CVPR 2025] Reference‑Based 3D‑Aware Image Editing with Triplanes
- [CVPR 2025] MoEdit: On Learning Quantity Perception for Multi‑object Image Editing
- [ICLR 2025] Lightning‑Fast Image Inversion and Editing for Text‑to‑Image Diffusion Models
- [ICLR 2025] Multi‑Reward as Condition for Instruction‑based Image Editing
- [ICLR 2025] HQ‑Edit: A High‑Quality Dataset for Instruction‑based Image Editing
- [ICLR 2025] CLIPDrag: Combining Text‑based and Drag‑based Instructions for Image Editing
- [ICLR 2025] Semantic Image Inversion and Editing using Rectified Stochastic Differential Equations
- [ICLR 2025] PostEdit: Posterior Sampling for Efficient Zero‑Shot Image Editing
- [ICLR 2025] OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision
- Inpaint4Drag: Repurposing Inpainting Models for Drag-Based Image Editing via Bidirectional Warping
- From Editor to Dense Geometry Estimator
- Improved 3D Scene Stylization via Text-Guided Generative Image Editing with Region-Based Control
- Fidelity-preserving enhancement of ptychography with foundational text-to-image models
- Draw-In-Mind: Learning Precise Image Editing via Chain-of-Thought Imagination
- Discrete Noise Inversion for Next-scale Autoregressive Text-based Image Editing
- Delta Velocity Rectified Flow for Text-to-Image Editing
- Neural Scene Designer: Self-Styled Semantic Image Manipulation
- LatentEdit: Adaptive Latent Control for Consistent Semantic Editing
- Webly-Supervised Image Manipulation Localization via Category-Aware Auto-Annotation
- Describe, Don't Dictate: Semantic Image Editing with Natural Language Intent
- Not Every Gift Comes in Gold Paper or with a Red Ribbon: Exploring Color Perception in Text-to-Image Models
- SpotEdit: Evaluating Visually-Guided Image Editing Methods
- An LLM-LVLM Driven Agent for Iterative and Fine-Grained Image Editing
- Defending Deepfake via Texture Feature Perturbation
- PosBridge: Multi-View Positional Embedding Transplant for Identity-Aware Image Editing
- Visual Autoregressive Modeling for Instruction-Guided Image Editing
- Sketch3DVE: Sketch-based 3D-Aware Scene Video Editing
- Single-Reference Text-to-Image Manipulation with Dual Contrastive Denoising Score
- PEdger++: Practical Edge Detection via Assembling Cross Information
- TimeMachine: Fine-Grained Facial Age Editing with Identity Preservation
- NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale
- A Segmentation-driven Editing Method for Bolt Defect Augmentation and Detection
- TweezeEdit: Consistent and Efficient Image Editing with Path Regularization
- Follow-Your-Shape: Shape-Aware Image Editing via Trajectory-Guided Region Control
- Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation
- X2Edit: Revisiting Arbitrary-Instruction Image Editing through Self-Constructed Data and Task-Aware Representation Learning
- Exploring Multimodal Diffusion Transformers for Enhanced Prompt-based Image Editing
- CLUE: Leveraging Low-Rank Adaptation to Capture Latent Uncovered Evidence for Image Forgery Localization
- CannyEdit: Selective Canny Control and Dual-Prompt Guidance for Training-Free Image Editing
- Talk2Image: A Multi-Agent System for Multi-Turn Image Generation and Editing
- UGD-IML: A Unified Generative Diffusion-based Framework for Constrained and Unconstrained Image Manipulation Localization
- DreamVE: Unified Instruction-based Image and Video Editing
- NEP: Autoregressive Image Editing via Next Editing Token Prediction
- InstantEdit: Text-Guided Few-Step Image Editing with Piecewise Rectified Flow
- Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation
- Zero Shot Domain Adaptive Semantic Segmentation by Synthetic Data Generation and Progressive Adaptation
- LORE: Latent Optimization for Precise Semantic Control in Rectified Flow-based Image Editing
- UniEdit-I: Training-free Image Editing for Unified VLM via Iterative Understanding, Editing and Verifying
- Transport-Guided Rectified Flow Inversion: Improved Image Editing Using Optimal Transport Theory
- Qwen-Image Technical Report
- The Promise of RL for Autoregressive Image Editing
- Towards Robust Semantic Correspondence: A Benchmark and Insights
- Training-free Geometric Image Editing on Diffusion Models
- UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing
- GPT-IMAGE-EDIT-1.5M: A Million-Scale, GPT-Generated Image Dataset
- Lumina-mGPT 2.0: Stand-Alone AutoRegressive Image Modeling
- ADCD-Net: Robust Document Image Forgery Localization via Adaptive DCT Feature and Hierarchical Content Disentanglement
- Scale Your Instructions: Enhance the Instruction-Following Fidelity of Unified Image Generation Model by Self-Adaptive Attention Scaling
- LMM4Edit: Benchmarking and Evaluating Multimodal Image Editing with LMMs
- Light Future: Multimodal Action Frame Prediction via InstructPix2Pix
- NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining
- Moodifier: MLLM-Enhanced Emotion-Driven Image Editing
- EditGen: Harnessing Cross-Attention Control for Instruction-Based Auto-Regressive Audio Editing
- Sparse Fine-Tuning of Transformers for Generative Tasks
- LayLens: Improving Deepfake Understanding through Simplified Explanations
- FlowDrag: 3D-aware Drag-based Image Editing with Mesh-guided Deformation Vector Flow Fields
- ADIEE: Automatic Dataset Creation and Scorer for Instruction-Guided Image Editing Evaluation
- 2D Instance Editing in 3D Space
- Neural-Driven Image Editing
- Beyond Simple Edits: X‑Planner for Complex Instruction-Based Image Editing
- S$^2$Edit: Text-Guided Image Editing with Precise Semantic and Spatial Control
- Pose-Star: Anatomy-Aware Editing for Open-World Fashion Images
- LACONIC: A 3D Layout Adapter for Controllable Image Creation
- Reasoning to Edit: Hypothetical Instruction-Based Image Editing with Visual Reasoning
- ReFlex: Text-Guided Editing of Real Images in Rectified Flow via Mid-Step Feature Extraction and Attention Adaptation
- QC-OT: Optimal Transport with Quasiconformal Mapping
- A Unified Framework for Stealthy Adversarial Generation via Latent Optimization and Transferability Enhancement
- TAG-WM: Tamper-Aware Generative Image Watermarking via Diffusion Inversion Sensitivity
- OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions
- Ovis-U1 Technical Report
- Towards Explainable Bilingual Multimodal Misinformation Detection and Localization
- GenEscape: Hierarchical Multi-Agent Generation of Escape Room Puzzles
- Controllable 3D Placement of Objects with Scene-Aware Diffusion Models
- Improving Diffusion-Based Image Editing Faithfulness via Guidance and Scheduling
- M2SFormer: Multi-Spectral and Multi-Scale Attention with Edge-Aware Difficulty Guidance for Image Forgery Localization
- FaSTA$^*$: Fast-Slow Toolpath Agent with Subroutine Mining for Efficient Multi-turn Image Editing
- EditP23: 3D Editing via Propagation of Image Prompts to Multi-View
- Towards Efficient Exemplar Based Image Editing with Multimodal VLMs
- SceneCrafter: Controllable Multi-View Driving Scene Editing
- Inverse-and-Edit: Effective and Fast Image Editing by Cycle Consistency Models
- OmniGen2: Exploration to Advanced Multimodal Generation
- CPAM: Context-Preserving Adaptive Manipulation for Zero-Shot Real Image Editing
- Instability in Diffusion ODEs: An Explanation for Inaccurate Image Reconstruction
- FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation
- Arch-Router: Aligning LLM Routing with Human Preferences
- VectorEdits: A Dataset and Benchmark for Instruction-Based Editing of Vector Graphics
- AttentionDrag: Exploiting Latent Correlation Knowledge in Pre-trained Diffusion Models for Image Editing
- Balancing Preservation and Modification: A Region and Semantic Aware Metric for Instruction-Based Image Editing
- ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies
- SphereDrag: Spherical Geometry-Aware Panoramic Image Editing
- VINCIE: Unlocking In-context Image Editing from Video
- Edit360: 2D Image Edits to 3D Assets from Any Angle
- EditInspector: A Benchmark for Evaluation of Text-Guided Image Edits
- ELBO-T2IAlign: A Generic ELBO-Based Method for Calibrating Pixel-level Text-Image Alignment in Diffusion Models
- Ming-Omni: A Unified Multimodal Model for Perception and Generation
- Fine-Grained Spatially Varying Material Selection in Images
- Do Concept Replacement Techniques Really Erase Unacceptable Concepts?
- RoboSwap: A GAN-driven Video Diffusion Framework For Unsupervised Robot Arm Swapping
- Highly Compressed Tokenizer Can Generate Without Training
- PairEdit: Learning Semantic Variations for Exemplar-based Image Editing
- Diffusion Counterfactual Generation with Semantic Abduction
- DragNeXt: Rethinking Drag-Based Image Editing
- Bootstrapping World Models from Dynamics Models in Multimodal Foundation Models
- FADE: Frequency-Aware Diffusion Model Factorization for Video Editing
- Towards Reliable Identification of Diffusion-based Image Manipulations
- SeedEdit 3.0: Fast and High-Quality Generative Image Editing
- Invisible Backdoor Triggers in Image Editing Model via Deep Watermarking
- [CVPR 2024] InfEdit: Inversion‑Free Image Editing with Natural Language
- [CVPR 2024] CrossSelfAttention: Towards Understanding Cross and Self‑Attention in Stable Diffusion for Text‑Guided Image Editing
- [CVPR 2024] DAC: Doubly Abductive Counterfactual Inference for Text‑based Image Editing
- [CVPR 2024] FoI: Focus on Your Instruction: Fine‑grained and Multi‑instruction Image Editing by Attention Modulation
- [CVPR 2024] CDS: Contrastive Denoising Score for Text‑guided Latent Diffusion Image Editing
- [CVPR 2024] DragDiffusion: Harnessing Diffusion Models for Interactive Point‑based Image Editing
- [CVPR 2024] DiffEditor: Boosting Accuracy and Flexibility on Diffusion‑based Image Editing
- [CVPR 2024] FreeDrag: Feature Dragging for Reliable Point‑based Image Editing
- [CVPR 2024] Learnable Regions: Text‑Driven Image Editing via Learnable Regions
- [CVPR 2024] LEDITS++: Limitless Image Editing using Text‑to‑Image Models
- [CVPR 2024] SmartEdit: Exploring Complex Instruction‑based Image Editing with Large Language Models
- [CVPR 2024] Edit One for All: Interactive Batch Image Editing
- [CVPR 2024] DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing
- [CVPR 2024] TiNO‑Edit: Timestep and Noise Optimization for Robust Diffusion‑Based Image Editing
- [CVPR 2024] Person in Place: Generating Associative Skeleton‑Guidance Maps for Human‑Object Interaction Image Editing
- [CVPR 2024] Referring Image Editing: Object‑level Image Editing via Referring Expressions
- [CVPR 2024] Prompt Augmentation: Prompt Augmentation for Self‑supervised Text‑guided Image Manipulation
- [CVPR 2024] StyleFeatureEditor: The Devil is in the Details — StyleFeatureEditor for Detail‑Rich StyleGAN Inversion and High Quality Image Editing
- [ECCV 2024] RegionDrag: Fast Region‑Based Image Editing with Diffusion Models
- [ECCV 2024] InstructGIE: Towards Generalizable Image Editing
- [ECCV 2024] StableDrag: Stable Dragging for Point‑based Image Editing
- [ECCV 2024] Eta Inversion: Designing an Optimal Eta Function for Diffusion‑based Real Image Editing
- [ECCV 2024] SwapAnything: Enabling Arbitrary Object Swapping in Personalized Image Editing
- [ECCV 2024] Guide‑and‑Rescale: Self‑Guidance Mechanism for Effective Tuning‑Free Real Image Editing
- [ECCV 2024] FreeDiff: Progressive Frequency Truncation for Image Editing with Diffusion Models
- [ECCV 2024] Lazy Diffusion Transformer: Lazy Diffusion Transformer for Interactive Image Editing
- [ECCV 2024] ByteEdit: Boost, Comply and Accelerate Generative Image Editing
- [ICLR 2024] MGIE: Guiding Instruction‑based Image Editing via Multimodal Large Language Models
- [ICLR 2024] SDE‑Drag: The Blessing of Randomness — SDE Beats ODE in General Diffusion‑based Image Editing
- [ICLR 2024] Motion Guidance: Diffusion‑Based Image Editing with Differentiable Motion Estimators
- [ICLR 2024] OIR: Object‑Aware Inversion and Reassembly for Image Editing
- [ICLR 2024] Noise Map Guidance: Inversion with Spatial Context for Real Image Editing
- [AAAI 2024] TIC: Tuning‑Free Inversion‑Enhanced Control for Consistent Image Editing
- [AAAI 2024] BARET: Balanced Attention based Real Image Editing driven by Target‑text Inversion
- [AAAI 2024] CacheEdit: Accelerating Text‑to‑Image Editing via Cache‑Enabled Sparse Diffusion Inference
- [AAAI 2024] High‑Fidelity Editing: High‑Fidelity Diffusion‑based Image Editing
- [AAAI 2024] AdapEdit: Spatio‑Temporal Guided Adaptive Editing Algorithm for Text‑Based Continuity‑Sensitive Image Editing
- [AAAI 2024] TexFit: Text‑Driven Fashion Image Editing with Diffusion Models
- Edicho: Consistent Image Editing in the Wild
- Unforgettable Lessons from Forgettable Images: Intra-Class Memorability Matters in Computer Vision
- MADiff: Text-Guided Fashion Image Editing with Mask Prediction and Attention-Enhanced Diffusion
- DRDM: A Disentangled Representations Diffusion Model for Synthesizing Realistic Person Images
- Fashionability-Enhancing Outfit Image Editing with Conditional Diffusion Models
- The Superposition of Diffusion Models Using the Itô Density Estimator
- Mapping the Mind of an Instruction-based Image Editing using SMILE
- Diffusion-Based Conditional Image Editing through Optimized Inference with Guidance
- UIP2P: Unsupervised Instruction-based Image Editing via Cycle Edit Consistency
- Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion
- Text2Relight: Creative Portrait Relighting with Text Guidance
- Prompt Augmentation for Self-supervised Text-guided Image Manipulation
- Unsupervised Region-Based Image Editing of Denoising Diffusion Models
- Pattern Analogies: Learning to Perform Programmatic Image Edits by Analogy
- Dual-Schedule Inversion: Training- and Tuning-Free Inversion for Real Image Editing
- BrushEdit: All-In-One Image Inpainting and Editing
- Learning Complex Non-Rigid Image Edits from Multimodal Conditioning
- Context Canvas: Enhancing Text-to-Image Diffusion Models with Knowledge Graph-Based RAG
- FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers
- PrEditor3D: Fast and Precise 3D Shape Editing
- MoViE: Mobile Diffusion for Video Editing
- GraPE: A Generate-Plan-Edit Framework for Compositional T2I Synthesis
- Text-to-3D Generation by 2D Editing
- [CVPR 2023] Diffusion Disentanglement: Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models
- [CVPR 2023] SINE: SINgle Image Editing with Text-to-Image Diffusion Models
- [CVPR 2023] Imagic: Text-Based Real Image Editing with Diffusion Models
- [CVPR 2023] InstructPix2Pix: Learning to Follow Image Editing Instructions
- [CVPR 2023] Null-text Inversion: Null-text Inversion for Editing Real Images using Guided Diffusion Models
- [ICCV 2023] MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing
- [ICCV 2023] Local Prompt Mixing: Localizing Object-level Shape Variations with Text-to-Image Diffusion Models
- [ICLR 2022] SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations
- ZONE: Zero-Shot Instruction-Guided Local Editing
- UniHuman: A Unified Model for Editing Human Images in the Wild
- AppAgent: Multimodal Agents as Smartphone Users
- Lightning-Fast Image Inversion and Editing for Text-to-Image Diffusion Models
- MAG-Edit: Localized Image Editing in Complex Scenarios via Mask-Based Attention-Adjusted Guidance
- CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update
- Latent Space Editing in Transformer-Based Flow Matching
- VidToMe: Video Token Merging for Zero-Shot Video Editing
- Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation
- SHAP-EDITOR: Instruction-guided Latent 3D Editing in Seconds
- Clockwork Diffusion: Efficient Generation With Model-Step Distillation
- SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models
- Inversion-Free Image Editing with Natural Language
- BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models
- Customize your NeRF: Adaptive Source Driven 3D Scene Editing via Local-Global Iterative Training
- Diffusion Handles: Enabling 3D Edits for Diffusion Models by Lifting Activations to 3D
- Adversarial Score Distillation: When score distillation meets GAN
- Motion-Conditioned Image Animation for Video Editing
- Contrastive Denoising Score for Text-guided Latent Diffusion Image Editing
- On Exact Inversion of DPM-Solvers
- COLE: A Hierarchical Generation Framework for Multi-Layered and Editable Graphic Design
- LEDITS++: Limitless Image Editing using Text-to-Image Models
- Text-Driven Image Editing via Learnable Regions
- Self-correcting LLM-controlled Diffusion Models
- EditShield: Protecting Unauthorized Image Editing by Instruction-guided Diffusion Models
- EditVal: Benchmarking Diffusion Based Text-Guided Image Editing Methods
- ImagenHub: Standardizing the evaluation of conditional image generation models
- TokenFlow: Consistent Diffusion Features for Consistent Video Editing
- FreeDrag: Feature Dragging for Reliable Point-based Image Editing
- DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models
- LEDITS: Real Image Editing with DDPM Inversion and Semantic Guidance
- DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing
- MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing
- MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing
- [CVPR 2025] OmniFlow: Any‑to‑Any Generation with Multi‑Modal Rectified Flows
- [CVPR 2025] TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation
- [CVPR 2025] UNIC‑Adapter: Unified Image‑instruction Adapter with Multi‑modal Transformer for Image Generation
- [CVPR 2025] MergeVQ: A Unified Framework for Visual Generation and Representation with Token Merging and Quantization
- [ICLR 2025] Show‑o: One Single Transformer to Unify Multimodal Understanding and Generation
- [ICLR 2025] Transfusion: Predict the Next Token and Diffuse Images with One Multi‑Modal Model
- [CVPRW 2025] UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding
- Selftok: Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning
- TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
- Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
- UGen: Unified Autoregressive Multimodal Model with Progressive Vocabulary Learning
- Bifrost‑1: Bridging Multimodal LLMs and Diffusion Models with Patch‑level CLIP Latents
- Qwen‑Image Technical Report
- X‑Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again
- Ovis‑U1 Technical Report
- UniCode$^2$: Cascaded Large‑scale Codebooks for Unified Multimodal Understanding and Generation
- OmniGen2: Exploration to Advanced Multimodal Generation
- Vision as a Dialect: Unifying Visual Understanding and Generation via Text‑Aligned Representations
- UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation
- UniWorld: High‑Resolution Semantic Encoders for Unified Visual Understanding and Generation
- Pisces: An Auto‑regressive Foundation Model for Image Understanding and Generation
- DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies
- UniTok: A Unified Tokenizer for Visual Generation and Understanding
- QLIP: Text‑Aligned Visual Tokenization Unifies Auto‑Regressive Multimodal Understanding and Generation
- TBAC‑UniImage: Unified Understanding and Generation by Ladder‑Side Diffusion Tuning
- UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing
- Ming‑Omni: A Unified Multimodal Model for Perception and Generation
- OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation
- BLIP3‑o: A Family of Fully Open Unified Multimodal Models—Architecture, Training and Dataset
- Ming‑Lite‑Uni: Advancements in Unified Architecture for Natural Multimodal Interaction
- Nexus‑Gen: A Unified Model for Image Understanding, Generation, and Editing via Prefilled Autoregression in Shared Embedding Space
- MetaQueries: Transfer between Modalities with MetaQueries
- Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation
- MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO
- UniFluid: Unified Autoregressive Visual Generation and Understanding with Continuous Tokens
- OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models
- Janus‑Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
- VARGPT‑v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning
- ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement
- SemHiTok: A Unified Image Tokenizer via Semantic‑Guided Hierarchical Codebook for Multimodal Understanding and Generation
- VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model
- BAGEL: Emerging Properties in Unified Multimodal Pretraining
- Mogao: An Omni Foundation Model for Interleaved Multi‑Modal Generation
- M2‑omni: Advancing Omni‑MLLM for Comprehensive Modality Support with Competitive Performance
- [CVPR 2024] Unified‑IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio and Action
- [CVPR 2024] Emu2: Generative Multimodal Models are In‑Context Learners
- [ICLR 2024] LWM: World Model on Million‑Length Video And Language With Blockwise RingAttention
- [ICLR 2024] VILA‑U: a Unified Foundation Model Integrating Visual Understanding and Generation
- [ICLR 2024] DreamLLM: Synergistic Multimodal Comprehension and Creation
- [ICLR 2024] LaVIT: Unified Language‑Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization
- [ICLR 2024] SEED‑LLaMA: Making LLaMA SEE and Draw with SEED Tokenizer
- [ICML 2024] Video‑LaVIT: Unified Video‑Language Pre‑training with Decoupled Visual‑Motional Tokenization
- SynerGen‑VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
- Liquid: Language Models are Scalable and Unified Multi‑modal Generators
- Orthus: Autoregressive Interleaved Image‑Text Generation with Modality‑Specific Heads
- MMAR: Towards Lossless Multi‑Modal Auto‑Regressive Probabilistic Modeling
- Emu3: Next‑Token Prediction is All You Need
- ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image‑Text Generation
- Chameleon: Mixed‑Modal Early‑Fusion Foundation Models
- MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
- ILLUME: Illuminating Your LLMs to See, Draw, and Self‑Enhance
- PUMA: Empowering Unified MLLM with Multi‑granular Visual Generation
- Mini‑Gemini: Mining the Potential of Multi‑modality Vision Language Models
- MM‑Interleaved: Interleaved Image‑Text Generative Modeling via Multi‑modal Feature Synchronizer
- Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
- MUSE‑VL: Modeling Unified VLM through Semantic Discrete Encoding
- LMFusion: Adapting Pretrained Language Models for Multimodal Generation
- MonoFormer: One Transformer for Both Diffusion and Autoregression
- JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
- Spider: Any‑to‑Many Multimodal LLM
- MIO: A Foundation Model on Multimodal Tokens
- X‑VILA: Cross‑Modality Alignment for Large Language Model
- AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
QuenithAI is a professional organization composed of top researchers, dedicated to providing high-quality 1-on-1 research mentoring for university students worldwide. Our mission is to help students bridge the gap from theoretical knowledge to cutting-edge research and publish their work in top-tier conferences and journals.
Maintaining this Awesome Text-to-Image Generation list requires significant effort, just as completing a high-quality paper requires focused dedication and expert guidance. If you would like one-on-one support from top scholars on your own research project, from quickly identifying innovative ideas to publishing the results, we invite you to contact us.
➡️ Contact us via WeChat or E-mail to start your research journey.
「应达学术」(QuenithAI) is a professional organization of leading researchers dedicated to providing high-quality 1-on-1 research mentoring for university students worldwide. Our mission is to help students develop outstanding research skills and publish their work in top-tier conferences and journals.
Maintaining a GitHub survey repository takes significant effort, just as completing a high-quality paper is impossible without focused dedication and expert guidance. If you would like one-on-one support from top scholars on your own research project, we sincerely invite you to get in touch.
➡️ Contact us via WeChat or E-mail to start your research journey.
Contributions are welcome! Please see our Contribution Guidelines for details on how to add new papers, correct information, or improve the repository.