stablediffusion - High-Resolution Image Synthesis with Latent Diffusion Models

Created at: 2022-11-24 07:59:50
Language: Python
License: MIT



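This repository contains Stable Diffusion models trained from scratch and will be continuously updated with new checkpoints. The following list provides an overview of all currently available models. More details coming soon.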



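Version 2.1

  • New Stable Diffusion models at 768x768 resolution (Stable Diffusion 2.1-v, Hugging Face) and at 512x512 resolution (Stable Diffusion 2.1-base, Hugging Face). Both have the same number of parameters and the same architecture as 2.0 and were fine-tuned from 2.0. By default, the model's attention operation is evaluated at full precision when xformers is not installed. To enable fp16 (which can cause numerical instabilities with the vanilla attention module on the v2.1 model), run your script with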
    ATTN_PRECISION=fp16 python <>
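As a rough illustration of how a script could honor this variable, the sketch below reads ATTN_PRECISION from the environment and falls back to full precision. The helper name `attention_precision`, the `fp32` label for full precision, and the rejection of other values are assumptions made for this example, not a description of the repository's code.

```python
import os


def attention_precision(env=None):
    """Return "fp16" or "fp32" based on ATTN_PRECISION (hypothetical helper)."""
    env = os.environ if env is None else env
    # Assumption: full precision is labeled "fp32" and is the default.
    value = env.get("ATTN_PRECISION", "fp32")
    if value not in ("fp16", "fp32"):
        raise ValueError(f"unsupported ATTN_PRECISION: {value!r}")
    return value


if __name__ == "__main__":
    # Prints "fp16" when launched as: ATTN_PRECISION=fp16 python <this file>
    print(attention_precision())
```

Passing a plain dictionary as `env` makes the lookup easy to try without touching the real environment.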


Version 2.0



High-Resolution Image Synthesis with Latent Diffusion Models
Robin Rombach*, Andreas Blattmann*, Dominik Lorenz, Patrick Esser, Björn Ommer
CVPR '22 Oral | GitHub | arXiv | Project page





conda install pytorch==1.12.1 torchvision==0.13.1 -c pytorch
pip install transformers==4.19.2 diffusers invisible-watermark
pip install -e .
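After installing, it can be useful to confirm that the pinned versions are the ones actually in the environment. The following is a minimal sketch using the standard library's `importlib.metadata`. The helper name `check_pins` is hypothetical, and it assumes the conda package `pytorch` is registered under the distribution name `torch`.

```python
from importlib import metadata

# Pins copied from the install commands above.
PINS = {"torch": "1.12.1", "torchvision": "0.13.1", "transformers": "4.19.2"}


def check_pins(pins=PINS, get_version=metadata.version):
    """Return {package: (expected, found)} for every pin that does not match."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Local build tags such as "+cu113" are ignored when comparing.
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in check_pins().items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means every pinned package was found at the expected version.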


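For more efficiency and speed on GPUs, we highly recommend installing the xformers library.

Tested on an A100 with CUDA 11.4. Installation requires a somewhat recent version of nvcc and gcc/g++, which can be obtained as follows: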

export CUDA_HOME=/usr/local/cuda-11.4
conda install -c nvidia/label/cuda-11.4.0 cuda-nvcc
conda install -c conda-forge gcc
conda install -c conda-forge gxx_linux-64=9.5.0

Then, run the following (compiling takes up to 30 min).

cd ..
git clone
cd xformers
git submodule update --init --recursive
pip install -r requirements.txt
pip install -e .
cd ../stablediffusion

Upon successful installation, the code will automatically default to memory efficient attention for the self- and cross-attention layers in the U-Net and autoencoder.
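The automatic fallback described above can be pictured as a simple availability check. The sketch below is only an illustration of that idea: the function name `pick_attention_backend` and the returned labels are hypothetical, and the repository's own detection logic may differ.

```python
import importlib.util


def pick_attention_backend(module_name="xformers"):
    """Return "memory-efficient" if the module can be found, else "vanilla"."""
    # find_spec only looks the module up; it does not import it.
    available = importlib.util.find_spec(module_name) is not None
    return "memory-efficient" if available else "vanilla"


if __name__ == "__main__":
    print(pick_attention_backend())
```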

General Disclaimer

Stable Diffusion models are general text-to-image diffusion models and therefore mirror biases and (mis-)conceptions that are present in their training data. Although efforts were made to reduce the inclusion of explicit pornographic material, we do not recommend using the provided weights for services or products without additional safety mechanisms and considerations. The weights are research artifacts and should be treated as such. Details on the training procedure and data, as well as the intended use of the model can be found in the corresponding model card. The weights are available via the StabilityAI organization at Hugging Face under the CreativeML Open RAIL++-M License.

Stable Diffusion v2

Stable Diffusion v2 refers to a specific configuration of the model architecture that uses a downsampling-factor 8 autoencoder with an 865M UNet and OpenCLIP ViT-H/14 text encoder for the diffusion model. The SD 2-v model produces 768x768 px outputs.
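To make the downsampling factor concrete, the sketch below computes the latent tensor shape for a given image size. The factor 8 comes from the text above. The 4 latent channels and the helper name `latent_shape` are assumptions for this example.

```python
def latent_shape(height, width, factor=8, channels=4):
    """Return (channels, latent_height, latent_width) for an input image size."""
    if height % factor or width % factor:
        raise ValueError(f"height and width must be multiples of {factor}")
    # Assumption: the autoencoder produces 4 latent channels.
    return (channels, height // factor, width // factor)


if __name__ == "__main__":
    print(latent_shape(768, 768))  # latent shape for the 768x768 model
    print(latent_shape(512, 512))  # latent shape for the 512x512 base model
```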

Evaluations with different classifier-free guidance scales (1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0) and 50 DDIM sampling steps show the relative improvements of the checkpoints:

[Figure: SD evaluation results]
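For readers unfamiliar with the guidance scale, classifier-free guidance is commonly written as a linear extrapolation from the unconditional noise prediction toward the conditional one. The sketch below shows that formula on plain Python lists. The function name `guided_noise` is hypothetical, and real implementations operate on tensors.

```python
def guided_noise(eps_uncond, eps_cond, scale):
    """Combine unconditional and conditional noise predictions elementwise."""
    # scale = 1.0 reproduces the conditional prediction; larger values
    # push the result further away from the unconditional prediction.
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


if __name__ == "__main__":
    for scale in (1.5, 3.0, 8.0):
        print(scale, guided_noise([0.0, 1.0], [1.0, 3.0], scale))
```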


[Figure: txt2img-stable2 samples]

Stable Diffusion 2 is a latent diffusion model conditioned on the penultimate text embeddings of a CLIP ViT-H/14 text encoder. We provide a reference script for sampling.

Reference Sampling Script

This script adds an invisible watermark to the outputs to help viewers identify the images as machine-generated. We provide the configs for the SD2-v (768px) and SD2-base (512px) models.

First, download the weights for SD2.1-v and SD2.1-base.

To sample from the SD2.1-v model, run the following:

python scripts/ --prompt "a professional photograph of an astronaut riding a horse" --ckpt <path/to/768model.ckpt/> --config configs/stable-diffusion/v2-inference-v.yaml --H 768 --W 768  

or try out the Web Demo: Hugging Face Spaces.
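When launching the sampler from Python rather than a shell, the same arguments can be assembled as a list. This is a small sketch under stated assumptions: the helper name `build_sampling_command` is hypothetical, and the script path is left as a parameter for the caller to supply. The flags are the ones shown in the command above.

```python
import shlex


def build_sampling_command(script, prompt, ckpt, config, height=768, width=768):
    """Assemble the sampling command as an argument list for subprocess.run."""
    return [
        "python", script,
        "--prompt", prompt,
        "--ckpt", ckpt,
        "--config", config,
        "--H", str(height),
        "--W", str(width),
    ]


if __name__ == "__main__":
    cmd = build_sampling_command(
        script="<path/to/sampling_script.py>",  # placeholder, not a real path
        prompt="a professional photograph of an astronaut riding a horse",
        ckpt="<path/to/768model.ckpt>",
        config="configs/stable-diffusion/v2-inference-v.yaml",
    )
    # shlex.join quotes the prompt so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

Passing a list to `subprocess.run` avoids shell quoting problems with prompts that contain spaces or quotes.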

To sample from the base model, use

python scripts/ --prompt "a professional photograph of an astronaut riding a horse" --ckpt <path/to/model.ckpt/> --config <path/to/config.yaml/>  

By default, this uses the DDIM sampler, and renders images of size 768x768 (which it was trained on) in 50 steps. Empirically, the v-models can be sampled with higher guidance scales.
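To give a sense of what sampling "in 50 steps" means, DDIM visits only a subset of the timesteps used during training. The sketch below picks that subset with a uniform stride. The 1000 training timesteps, the uniform spacing, and the helper name `ddim_timesteps` are assumptions for illustration.

```python
def ddim_timesteps(num_sampling_steps=50, num_train_steps=1000):
    """Return the timesteps visited during sampling, from noisiest to cleanest."""
    if num_sampling_steps <= 0 or num_train_steps % num_sampling_steps:
        raise ValueError("num_sampling_steps must divide num_train_steps")
    stride = num_train_steps // num_sampling_steps
    # Uniformly spaced timesteps, reversed because sampling runs backwards.
    return list(range(0, num_train_steps, stride))[::-1]


if __name__ == "__main__":
    steps = ddim_timesteps()
    print(len(steps), steps[:3], steps[-3:])
```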

Note: The inference config for all model versions is designed to be used with EMA-only checkpoints. For this reason, the corresponding option is set in the configuration; otherwise the code will try to switch from non-EMA to EMA weights.
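As background on the term, EMA weights are an exponential moving average of the training weights, and an EMA-only checkpoint stores just that average. The sketch below shows the usual update rule on plain lists. The function name `ema_update` and the default decay of 0.9999 are assumptions, not values taken from this repository.

```python
def ema_update(ema_weights, weights, decay=0.9999):
    """Blend the current weights into the running average, elementwise."""
    # A decay close to 1.0 makes the average change slowly.
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


if __name__ == "__main__":
    ema = [0.0]
    for _ in range(3):
        ema = ema_update(ema, [1.0], decay=0.5)
    print(ema)
```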

Image Modification with Stable Diffusion


Depth-Conditional Stable Diffusion

To augment the well-established img2img functionality of Stable Diffusion, we provide a shape-preserving stable diffusion model.

Note that the original method for image modification introduces significant semantic changes w.r.t. the initial image. If that is not desired, download our depth-conditional stable diffusion model and the MiDaS model weights, place the latter in a folder, and sample via

python scripts/gradio/ configs/stable-diffusion/v2-midas-inference.yaml <path-to-ckpt>


streamlit run scripts/streamlit/ configs/stable-diffusion/v2-midas-inference.yaml <path-to-ckpt>

This method can be used on the samples of the base model itself. For example, take this sample generated by an anonymous Discord user. Using the gradio or streamlit script, the MiDaS model first infers a monocular depth estimate given this input, and the diffusion model is then conditioned on the (relative) depth output.
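Because the depth estimate is relative, it has no fixed unit and is typically rescaled before being used as conditioning. One common choice is min-max normalization to the range [-1, 1], sketched below on a flat list of depth values. Whether this repository uses exactly this normalization is an assumption, as are the helper name `normalize_depth` and the handling of a constant depth map.

```python
def normalize_depth(depth):
    """Rescale relative depth values linearly to the range [-1, 1]."""
    lo, hi = min(depth), max(depth)
    if hi == lo:
        # Assumption: a constant depth map carries no structure, so map it to 0.
        return [0.0 for _ in depth]
    return [2.0 * (d - lo) / (hi - lo) - 1.0 for d in depth]


if __name__ == "__main__":
    print(normalize_depth([0.0, 5.0, 10.0]))
```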


This model is particularly useful for a photorealistic style; see the examples. For a maximum strength of 1.0, the model removes all pixel-based information and only relies on the text prompt and the inferred monocular depth estimate.


Classic Img2Img

For running the "classic" img2img, use

python scripts/ --prompt "A fantasy landscape, trending on artstation" --init-img <path-to-img.jpg> --strength 0.8 --ckpt <path/to/model.ckpt>

and adapt the checkpoint and config paths accordingly.
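The strength value can be read as the fraction of the sampling schedule that is applied to the input image. The sketch below converts it to a step count. The helper name `steps_to_run` and the default of 50 sampling steps are assumptions for this example.

```python
def steps_to_run(strength, num_steps=50):
    """Return how many of the sampling steps are applied to the input image."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0.0 and 1.0")
    # strength = 1.0 uses the full schedule, so no pixel information survives.
    return int(strength * num_steps)


if __name__ == "__main__":
    print(steps_to_run(0.8))
```

Lower values keep the result closer to the initial image, while 1.0 corresponds to the maximum strength mentioned for the depth-conditional model.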

Image Upscaling with Stable Diffusion

After downloading the weights, run

python scripts/gradio/ configs/stable-diffusion/x4-upscaling.yaml <path-to-checkpoint>


streamlit run scripts/streamlit/ -- configs/stable-diffusion/x4-upscaling.yaml <path-to-checkpoint>

for a Gradio or Streamlit demo of the text-guided x4 superresolution model.
This model can be used both on real inputs and on synthesized examples. For the latter, we recommend setting a higher

, e.g.

Image Inpainting with Stable Diffusion


Download the SD 2.0-inpainting checkpoint and run

python scripts/gradio/ configs/stable-diffusion/v2-inpainting-inference.yaml <path-to-checkpoint>


streamlit run scripts/streamlit/ -- configs/stable-diffusion/v2-inpainting-inference.yaml <path-to-checkpoint>

for a Gradio or Streamlit demo of the inpainting model. This script adds invisible watermarking to the demo in the RunwayML repository, but both should work interchangeably with the checkpoints/configs.



The code in this repository is released under the MIT License.

The weights are available via the StabilityAI organization at Hugging Face, and released under the CreativeML Open RAIL++-M License.


      title={High-Resolution Image Synthesis with Latent Diffusion Models}, 
      author={Robin Rombach and Andreas Blattmann and Dominik Lorenz and Patrick Esser and Björn Ommer},