This repository contains Stable Diffusion models trained from scratch and will be continuously updated with new checkpoints. The following list provides an overview of all currently available models. More are coming soon.
December 7, 2022
Version 2.1
New Stable Diffusion checkpoints (SD 2.1-v at 768x768 and SD 2.1-base at 512x512 resolution), fine-tuned from the 2.0 checkpoints.
By default, the attention operation of the model is evaluated at full precision when xformers is not installed. To enable fp16 (which can cause numerical instabilities with the vanilla attention module on the v2.1 model), run the script with
ATTN_PRECISION=fp16 python <thescript.py>
November 24, 2022
Version 2.0
New Stable Diffusion model (Stable Diffusion 2.0-v) at 768x768 resolution. It has the same number of U-Net parameters as 1.5, but uses OpenCLIP-ViT/H as the text encoder and is trained from scratch. SD 2.0-v is a so-called v-prediction model.
The above model is fine-tuned from SD 2.0-base, which was trained as a standard noise-prediction model on 512x512 images and is also made available.
Added an x4 upscaling latent text-guided diffusion model.
New depth-guided Stable Diffusion model, fine-tuned from SD 2.0-base. The model is conditioned on monocular depth estimates inferred via MiDaS and can be used for structure-preserving img2img and shape-conditional synthesis.
A text-guided inpainting model, fine-tuned from SD 2.0-base.
Following the original repository, we provide basic inference scripts to sample from the models.
The original Stable Diffusion model was created in collaboration with CompVis and RunwayML and builds upon the following work:
High-Resolution Image Synthesis with Latent Diffusion Models
Robin Rombach*, Andreas Blattmann*, Dominik Lorenz, Patrick Esser, Björn Ommer
CVPR '22 Oral | GitHub | arXiv | Project Page
and many others.
Stable Diffusion is a latent text-to-image diffusion model.
To update an existing latent diffusion environment, run
conda install pytorch==1.12.1 torchvision==0.13.1 -c pytorch
pip install transformers==4.19.2 diffusers invisible-watermark
pip install -e .
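To confirm the update took effect, a quick sanity check of the pinned versions and CUDA visibility can help; the following minimal sketch assumes only the packages installed above:

# Quick environment sanity check for the pinned dependencies above.
import torch
import torchvision
import transformers

print("torch:", torch.__version__)                  # expected 1.12.1
print("torchvision:", torchvision.__version__)      # expected 0.13.1
print("transformers:", transformers.__version__)    # expected 4.19.2
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))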
For more efficiency and speed on GPUs, we highly recommend installing the xformers library.
Tested on an A100 with CUDA 11.4. The installation requires a fairly recent version of nvcc and gcc/g++, which can be obtained, e.g., via
export CUDA_HOME=/usr/local/cuda-11.4
conda install -c nvidia/label/cuda-11.4.0 cuda-nvcc
conda install -c conda-forge gcc
conda install -c conda-forge gxx_linux-64=9.5.0
Then, run the following (compiling takes up to 30 min).
cd ..
git clone https://github.com/facebookresearch/xformers.git
cd xformers
git submodule update --init --recursive
pip install -r requirements.txt
pip install -e .
cd ../stablediffusion
Upon successful installation, the code will automatically default to memory efficient attention
for the self- and cross-attention layers in the U-Net and autoencoder.
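To confirm that the memory-efficient kernels are actually usable on your GPU, a small smoke test can be run; a minimal sketch, assuming a CUDA device, fp16 inputs, and arbitrary attention shapes:

# Smoke test: verify xformers' memory-efficient attention runs on this GPU.
import torch

try:
    import xformers
    import xformers.ops as xops
except ImportError:
    raise SystemExit("xformers is not installed; the code will fall back to vanilla attention.")

print("xformers:", xformers.__version__)
if torch.cuda.is_available():
    # Arbitrary small shapes: (batch, seq_len, num_heads, head_dim).
    q = torch.randn(1, 64, 8, 40, device="cuda", dtype=torch.float16)
    k = torch.randn(1, 64, 8, 40, device="cuda", dtype=torch.float16)
    v = torch.randn(1, 64, 8, 40, device="cuda", dtype=torch.float16)
    out = xops.memory_efficient_attention(q, k, v)
    print("memory_efficient_attention output shape:", tuple(out.shape))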
General Disclaimer
Stable Diffusion models are general text-to-image diffusion models and therefore mirror biases and (mis-)conceptions that are present
in their training data. Although efforts were made to reduce the inclusion of explicit pornographic material, we do not recommend using the provided weights for services or products without additional safety mechanisms and considerations.
The weights are research artifacts and should be treated as such.
Details on the training procedure and data, as well as the intended use of the model can be found in the corresponding model card.
The weights are available via the StabilityAI organization at Hugging Face under the CreativeML Open RAIL++-M License.
Stable Diffusion v2
Stable Diffusion v2 refers to a specific configuration of the model
architecture that uses a downsampling-factor 8 autoencoder with an 865M UNet
and OpenCLIP ViT-H/14 text encoder for the diffusion model. The SD 2-v model produces 768x768 px outputs.
Evaluations with different classifier-free guidance scales (1.5, 2.0, 3.0, 4.0,
5.0, 6.0, 7.0, 8.0) and 50 DDIM sampling steps show the relative improvements of the checkpoints:
Text-to-Image
Stable Diffusion 2 is a latent diffusion model conditioned on the penultimate text embeddings of a CLIP ViT-H/14 text encoder.
We provide a reference script for sampling.
Reference Sampling Script
This script incorporates invisible watermarking of the outputs, to help viewers identify the images as machine-generated (a sketch for decoding the watermark follows at the end of this section).
We provide the configs for the SD2-v (768px) and SD2-base (512px) model.
First, download the weights for SD2.1-v and SD2.1-base.
To sample from the SD2.1-v model, run the following:
python scripts/txt2img.py --prompt "a professional photograph of an astronaut riding a horse" --ckpt <path/to/768model.ckpt/> --config configs/stable-diffusion/v2-inference-v.yaml --H 768 --W 768
To sample from the base model, use
python scripts/txt2img.py --prompt "a professional photograph of an astronaut riding a horse" --ckpt <path/to/model.ckpt/> --config <path/to/config.yaml/>
By default, this uses the DDIM sampler, and renders images of size 768x768 (which it was trained on) in 50 steps.
Empirically, the v-models can be sampled with higher guidance scales.
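As an alternative to the scripts in this repository, the released checkpoints can also be sampled through the diffusers library; a minimal sketch, assuming the SD 2.1-v weights are published on the Hugging Face Hub under the id stabilityai/stable-diffusion-2-1 and using a guidance scale on the higher end, as suggested above:

# Hypothetical diffusers-based alternative to scripts/txt2img.py.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",   # assumed Hub model id for SD 2.1-v
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    "a professional photograph of an astronaut riding a horse",
    height=768, width=768,
    num_inference_steps=50,
    guidance_scale=9.0,                   # v-models tolerate higher guidance scales
).images[0]
image.save("astronaut.png")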
Note: The inference config for all model versions is designed to be used with EMA-only checkpoints. For this reason, use_ema=False is set in the configuration; otherwise the code will try to switch from non-EMA to EMA weights.
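The embedded watermark can be checked with the invisible-watermark package that the scripts rely on; a minimal sketch, assuming the default 'dwtDct' method and a 4-byte payload (both are assumptions about the script's settings; adjust them to match your version):

# Decode the invisible watermark embedded by the sampling script.
# Assumes the 'dwtDct' method and a 4-byte payload (32 bits).
import cv2
from imwatermark import WatermarkDecoder

bgr = cv2.imread("outputs/txt2img-samples/00000.png")  # example output path
decoder = WatermarkDecoder("bytes", 32)                # 32 bits = 4-byte payload
watermark = decoder.decode(bgr, "dwtDct")
print(watermark.decode("utf-8", errors="replace"))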
Image Modification with Stable Diffusion
Depth-Conditional Stable Diffusion
To augment the well-established img2img functionality of Stable Diffusion, we provide a shape-preserving stable diffusion model.
Note that the original method for image modification introduces significant semantic changes w.r.t. the initial image.
If that is not desired, download our depth-conditional stable diffusion model and the dpt_hybrid MiDaS model weights, place the latter in a folder midas_models, and sample via
python scripts/gradio/depth2img.py configs/stable-diffusion/v2-midas-inference.yaml <path-to-ckpt>
or
streamlit run scripts/streamlit/depth2img.py configs/stable-diffusion/v2-midas-inference.yaml <path-to-ckpt>
This method can be used on the samples of the base model itself.
For example, take this sample generated by an anonymous discord user.
Using the gradio or streamlit script depth2img.py, the MiDaS model first infers a monocular depth estimate given this input, and the diffusion model is then conditioned on the (relative) depth output.
This model is particularly useful for a photorealistic style; see the examples.
For a maximum strength of 1.0, the model removes all pixel-based information and only relies on the text prompt and the inferred monocular depth estimate.
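The same depth conditioning is also available through a diffusers pipeline; a minimal sketch, assuming the depth model is published on the Hub under the id stabilityai/stable-diffusion-2-depth (the pipeline runs the depth estimator internally):

# Hypothetical diffusers-based alternative to scripts/gradio/depth2img.py.
import torch
from diffusers import StableDiffusionDepth2ImgPipeline
from PIL import Image

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth",   # assumed Hub model id
    torch_dtype=torch.float16,
).to("cuda")

init_image = Image.open("input.png").convert("RGB")
image = pipe(
    prompt="a photorealistic render of the same scene at sunset",
    image=init_image,
    strength=1.0,   # at strength 1.0 only the prompt and the inferred depth are used
).images[0]
image.save("depth2img_out.png")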
Classic Img2Img
For running the "classic" img2img, use
python scripts/img2img.py --prompt "A fantasy landscape, trending on artstation" --init-img <path-to-img.jpg> --strength 0.8 --ckpt <path/to/model.ckpt>
and adapt the checkpoint and config paths accordingly.
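For comparison, the classic img2img call can be reproduced with diffusers as well; a minimal sketch, assuming the SD 2.1-base weights are published on the Hub under the id stabilityai/stable-diffusion-2-1-base:

# Hypothetical diffusers-based equivalent of the classic img2img call above.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base",   # assumed Hub model id
    torch_dtype=torch.float16,
).to("cuda")

init_image = Image.open("sketch.jpg").convert("RGB").resize((512, 512))
image = pipe(
    prompt="A fantasy landscape, trending on artstation",
    image=init_image,
    strength=0.8,   # same strength as the command-line example
).images[0]
image.save("img2img_out.png")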
Image Upscaling with Stable Diffusion
After downloading the weights, run
python scripts/gradio/superresolution.py configs/stable-diffusion/x4-upscaling.yaml <path-to-checkpoint>
or
streamlit run scripts/streamlit/superresolution.py -- configs/stable-diffusion/x4-upscaling.yaml <path-to-checkpoint>
for a Gradio or Streamlit demo of the text-guided x4 superresolution model.
This model can be used both on real inputs and on synthesized examples. For the latter, we recommend setting a higher noise_level, e.g. noise_level=100.
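The noise_level parameter is exposed in the same way by the diffusers upscaling pipeline; a minimal sketch, assuming the upscaler is published on the Hub under the id stabilityai/stable-diffusion-x4-upscaler:

# Hypothetical diffusers-based use of the x4 upscaler with a raised noise_level.
import torch
from diffusers import StableDiffusionUpscalePipeline
from PIL import Image

pipe = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler",   # assumed Hub model id
    torch_dtype=torch.float16,
).to("cuda")

low_res = Image.open("low_res.png").convert("RGB")
image = pipe(
    prompt="a white cat",
    image=low_res,
    noise_level=100,   # higher noise_level recommended for synthesized inputs
).images[0]
image.save("upscaled.png")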
Image Inpainting with Stable Diffusion
Download the SD 2.0-inpainting checkpoint and run
python scripts/gradio/inpainting.py configs/stable-diffusion/v2-inpainting-inference.yaml <path-to-checkpoint>
or
streamlit run scripts/streamlit/inpainting.py -- configs/stable-diffusion/v2-inpainting-inference.yaml <path-to-checkpoint>
for a Gradio or Streamlit demo of the inpainting model.
This script adds invisible watermarking to the demo in the RunwayML repository, but both should work interchangeably with the checkpoints/configs.
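An equivalent inpainting call is available through diffusers; a minimal sketch, assuming the checkpoint is published on the Hub under the id stabilityai/stable-diffusion-2-inpainting and that image and mask have matching sizes:

# Hypothetical diffusers-based equivalent of the inpainting demo above.
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",   # assumed Hub model id
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("photo.png").convert("RGB")
mask = Image.open("mask.png").convert("RGB")       # white = region to repaint
result = pipe(
    prompt="a vase of flowers on the table",
    image=image,
    mask_image=mask,
).images[0]
result.save("inpainted.png")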
Shout-Outs
img2img is an application of SDEdit by Chenlin Meng from the Stanford AI Lab.
The code in this repository is released under the MIT License.
The weights are available via the StabilityAI organization at Hugging Face, and released under the CreativeML Open RAIL++-M License.
@misc{rombach2021highresolution,
      title={High-Resolution Image Synthesis with Latent Diffusion Models},
      author={Robin Rombach and Andreas Blattmann and Dominik Lorenz and Patrick Esser and Björn Ommer},
      year={2021},
      eprint={2112.10752},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}