Multi ControlNet Video to Video Showcase (EbSynth + ControlNets) [EN/日本語]


ENGLISH

THE SHOWCASE IS AT THE BOTTOM OF THIS ARTICLE

Mostly translated with DeepL

Information added in later edits can be found at the end of the page.

Getting Started

Note: This is not a complete guide that explains every single step of the process.

This is a brief description of what kind of video-to-video can be done with Stable Diffusion using currently available tools, and how to do it.

Criticisms, comments, and questions are welcome.

The method shown here is video2video at maximum denoising strength using multiple ControlNets.

Tools

  1. EbSynth:

    This allows for the creation of smooth videos with fewer frames, using AI to map images to specified frames in the target video.

  2. EbSynth Utility, an extension for auto1111:

    This facilitates setup for EbSynth.

  3. FlowFrames:

    Makes videos smoother using frame interpolation AI models such as RIFE and DAIN.

  4. stable diffusion automatic1111 UI + controlnets

Links for the tools are below:

EbSynth: https://ebsynth.com

EbSynth Utility (auto1111 extension): https://github.com/s9roll7/ebsynth_utility

FlowFrames: https://nmkd.itch.io/flowframes

You also need to install the ControlNet extension in auto1111.

ControlNet models: https://huggingface.co/lllyasviel/ControlNet-v1-1/tree/main

TemporalNet (used as a ControlNet): https://huggingface.co/CiaraRowles/TemporalNet

To use multiple ControlNets, change the value in Settings -> ControlNet -> unit number.

Workflow

The process can be roughly divided into the following steps:

  1. Split the target video into frames using EbSynth Utility and create a mask for background transparency (a manual alternative for the frame split is sketched below).

  2. Create an image with img2img for reference.

  3. Use the image generated in step 2 as the ControlNet reference and run an img2img batch together with the other ControlNets.

  4. Synthesize with EbSynth.

  5. Use EbSynth Utility to assemble the video from the frames generated in step 4, then make the background transparent if needed.

If you use CLIPSeg in the Configuration menu of Stage 1 of EbSynth Utility, you can mask only the face or only the torso.

In the EbSynth Utility stages, skip stage 3 and do the img2img yourself so that you can use multiple ControlNets.
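As for step 1, EbSynth Utility's stage 1 normally does the frame splitting for you. If you prefer to prepare the frames manually, or want to check what the utility produced, here is a minimal sketch using OpenCV; the paths and the five-digit file naming are assumptions, so match them to whatever your project layout expects.

```python
# Minimal sketch: split a video into numbered PNG frames with OpenCV.
# Assumptions: input/output paths and the five-digit naming are placeholders;
# EbSynth Utility normally does this for you in stage 1.
import os
import cv2

video_path = "target.mp4"          # hypothetical input video
out_dir = "video_frame"            # hypothetical output folder
os.makedirs(out_dir, exist_ok=True)

cap = cv2.VideoCapture(video_path)
index = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    index += 1
    cv2.imwrite(os.path.join(out_dir, f"{index:05d}.png"), frame)
cap.release()
print(f"wrote {index} frames to {out_dir}")
```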

Time

Times for an 8~15 second video with an Intel 12900KF + RTX 4090 24GB:

img2img batch processing takes 30 to 60 minutes

(the time to img2img keyframes amounting to about 1/10 of the total frame count)

EbSynth synthesis takes 10~15 minutes

Together with the smaller amounts of time for the other steps, the whole process takes about 90 minutes at most.

The only manual work is the single img2img in step 2; the rest is just following the tools, pressing buttons, and waiting for the CPU and GPU to do the work.

img2img process

To keep the workload low, you only choose one img2img image yourself. After extracting the keyframes in Stage 2 of EbSynth Utility, img2img the first frame or any frame you like. At this point, adjust the ControlNet parameters and prompts to get the desired clothing, hair, and face.

Once you have the desired single frame, simply run the batch img2img, remembering to set the input for canny, depth, openpose, etc. to Batch rather than Single Image. The reference, by contrast, should be a Single Image. If the video has little motion or the frames are close together, you can add TemporalNet as an additional ControlNet. When using TemporalNet, give it the frame that should come first as its Single Image input and check the "loop back" box.

For the reference ControlNet, Style Fidelity should be set to 0.
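If you would rather script the batch than click through the UI, the same multi-ControlNet img2img can be driven through the web UI's API (launched with --api). The sketch below only illustrates the idea: the /sdapi/v1/img2img endpoint is part of auto1111, but the exact "alwayson_scripts" payload accepted by the ControlNet extension varies between versions, so treat the field names, folder names, and unit settings as assumptions and check the extension's API documentation.

```python
# Sketch only: one img2img call per keyframe with several ControlNet units.
# Assumptions: the web UI is running locally with --api; folder names
# ("video_key", "img2img_key") and the ControlNet "alwayson_scripts" field
# names may differ in your extension version (older versions use
# "input_image" instead of "image").
import base64
import glob
import os
import requests

URL = "http://127.0.0.1:7860/sdapi/v1/img2img"  # default local endpoint

def b64(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def controlnet_units(frame_b64, reference_b64):
    # Per-frame inputs for depth/openpose, a single fixed image for the
    # reference, as described in the text. Weights and ranges are examples.
    return [
        {"module": "depth_zoe", "model": "control_v11f1p_sd15_depth [cfd03158]",
         "image": frame_b64, "weight": 0.8, "guidance_start": 0.0, "guidance_end": 0.6},
        {"module": "openpose_full", "model": "control_v11p_sd15_openpose [cab727d4]",
         "image": frame_b64, "weight": 0.8, "guidance_start": 0.0, "guidance_end": 0.3},
        {"module": "reference_only", "model": "None",
         "image": reference_b64, "weight": 1.0, "guidance_start": 0.0, "guidance_end": 0.2},
    ]

reference = b64("reference.png")            # the single img2img result from step 2
os.makedirs("img2img_key", exist_ok=True)

for frame_path in sorted(glob.glob("video_key/*.png")):
    frame = b64(frame_path)
    payload = {
        "init_images": [frame],
        "denoising_strength": 1.0,
        "sampler_name": "Euler a",
        "steps": 30,
        "width": 576,
        "height": 1024,
        "prompt": "a young woman, best quality",        # example prompt
        "negative_prompt": "worst quality, low quality",
        "alwayson_scripts": {"controlnet": {"args": controlnet_units(frame, reference)}},
    }
    r = requests.post(URL, json=payload, timeout=600)
    r.raise_for_status()
    out_b64 = r.json()["images"][0]          # generated image, base64-encoded
    out_path = os.path.join("img2img_key", os.path.basename(frame_path))
    with open(out_path, "wb") as f:
        f.write(base64.b64decode(out_b64))
```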

If you want to apply a LoRA to the face, use ADetailer (https://github.com/Bing-su/adetailer ), which automates inpainting of the face for each image in the img2img batch.

If you are planning to upload to TikTok, use a resolution of 576x1024.
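If your source frames are not already 9:16, here is a small sketch of center-cropping and resizing them to 576x1024 with Pillow; the folder names are placeholders, and you can of course also just set 576x1024 as the img2img output size instead.

```python
# Sketch: center-crop and resize frames to the 576x1024 resolution mentioned above.
# Pillow's ImageOps.fit crops to the target aspect ratio, then resizes.
import glob
import os
from PIL import Image, ImageOps

out_dir = "frames_576x1024"          # hypothetical output folder
os.makedirs(out_dir, exist_ok=True)

for path in sorted(glob.glob("video_frame/*.png")):
    img = Image.open(path)
    fitted = ImageOps.fit(img, (576, 1024), method=Image.LANCZOS)
    fitted.save(os.path.join(out_dir, os.path.basename(path)))
```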

If you want to improve the quality, keep the good img2img results and regenerate the others (but this takes more of your time).

If you know of better parameters, please let me know.

SHOWCASE

I did img2img for about 1/10 of the total number of target video frames.

The following results were not hand-picked after the img2img batch process.

If you want to see them in high resolution, please see TikTok: https://www.tiktok.com/@suzur420

PLEASE SEE THE SECTION BELOW FOR THE SHOWCASE

JAPANESE

Getting Started

Note: This is not a complete guide that explains every single step of the process.

This is a brief description of what kind of video-to-video can be done with Stable Diffusion using currently available tools, and how to do it.

Criticisms, comments, and questions are welcome.

The method introduced here is video2video at maximum denoising strength using multiple ControlNets.

Tools Used

  1. EbSynth:

    This makes it possible to create smooth videos from fewer frames. It uses AI to map images onto specified frames of the target video.

  2. EbSynth Utility, an extension for auto1111:

    This facilitates the setup for EbSynth.

  3. FlowFrames:

    Makes videos smoother using frame interpolation AI models such as RIFE and DAIN.

  4. stable diffusion automatic1111 UI + controlnets

Links for each tool:

EbSynth: https://ebsynth.com

EbSynth Utility (auto1111 extension): https://github.com/s9roll7/ebsynth_utility

FlowFrames: https://nmkd.itch.io/flowframes

Also install the ControlNet extension in the auto1111 UI.

ControlNet models: https://huggingface.co/lllyasviel/ControlNet-v1-1/tree/main

TemporalNet (used as a ControlNet): https://huggingface.co/CiaraRowles/TemporalNet

To use multiple ControlNets, change Settings -> ControlNet -> unit number in auto1111.

Workflow

The process can be roughly divided into the following steps:

  1. Split the target video into frames with EbSynth Utility and generate masks for background transparency.

  2. Create one image with img2img to use as the reference.

  3. Use that image as the ControlNet reference and run an img2img batch together with the other ControlNets.

  4. Synthesize with EbSynth.

  5. Use EbSynth Utility to assemble the video from the frames generated in step 4, then make the background transparent.

If you use CLIPSeg in the Configuration of EbSynth Utility's Stage 1, you can also mask only the face or only the torso.

In EbSynth Utility, skip stage 3; to use multiple ControlNets, do the img2img yourself.

Time Required

Times for an 8~15 second video with an Intel 12900KF + RTX 4090 24GB:

  1. img2img batch processing takes 30 to 60 minutes

    (the time to img2img keyframes amounting to about 1/10 of the total frame count)

  2. EbSynth synthesis takes 10~15 minutes

Together with the smaller amounts of time for the other steps, it takes about 90 minutes at most.

The only work you do yourself is the single img2img in step 2; the rest is just following the tools, pressing buttons, and waiting for the CPU and GPU.

IMG2IMG Process

To keep the workload low, you pick only one image to img2img yourself. After extracting the keyframes in Stage 2 of EbSynth Utility, img2img the first frame or a frame that shows the whole body. At this point, adjust the ControlNet parameters and prompts until you get the desired clothing, hair, and face.

Once you have the desired single frame, all that is left is the batch img2img. Remember to set the input for canny, depth, openpose, etc. to Batch rather than Single Image; the reference, by contrast, should be a Single Image. If there is little motion or the frames are close together, you can also add TemporalNet as an extra ControlNet. When using TemporalNet, give it the frame that should come first as its Single Image input and check the "loop back" box.

If you know better parameters, please let me know.

For the reference ControlNet, Style Fidelity is better set to 0.

If you want to apply a LoRA to the face, use ADetailer (https://github.com/Bing-su/adetailer ). It automates inpainting of the face for each image from the batch img2img.

If you plan to upload to TikTok, use a resolution of 576x1024.

If you want to spend more time for better quality, hand-pick the keyframes after img2img.

SHOWCASE

I ran img2img on keyframes amounting to about 1/10 of the total frame count.

The results below were not hand-picked after the img2img batch process.

If you want to see them in high resolution, please see TikTok: https://www.tiktok.com/@suzur420

Parameters Used

model: henmix (for real), meina mix (for anime)

sampler: euler a 30 steps

denoising strength: 1.0

controlnet name, weight, steps (start~end; see the note after this list):

depth_zoe, 0.8, 0~0.6

openpose_full, 0.8, 0~0.3

reference_adain+attn, 1, 0.3, Style Fidelity 0.5

openpose, 0.9, 0~0.6

openpose_faceonly, 0.9, 0.6~0.95

reference_only, 1, 0~0.2

temporalnet, 0.3, 0~0.5
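For clarity, the third column above is the ControlNet starting/ending control step, given as a fraction of the sampling steps (the "starting/ending" field in the generation info further below). The same schedule written out as data is sketched here; the 0~0.3 range for reference_adain+attn is an assumption, since it is listed only as "0.3" above.

```python
# The ControlNet schedule above as (preprocessor, weight, start, end) tuples.
# start/end are fractions of the total sampling steps, i.e. the
# "starting/ending control step" fields in the ControlNet UI.
CONTROLNET_SCHEDULE = [
    ("depth_zoe",            0.8, 0.0, 0.6),
    ("openpose_full",        0.8, 0.0, 0.3),
    ("reference_adain+attn", 1.0, 0.0, 0.3),   # Style Fidelity 0.5; range assumed
    ("openpose",             0.9, 0.0, 0.6),
    ("openpose_faceonly",    0.9, 0.6, 0.95),
    ("reference_only",       1.0, 0.0, 0.2),
    ("temporalnet",          0.3, 0.0, 0.5),
]
```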

Denoising strength 1.0

noise multiplier 0.5

sampler: Euler a 40 steps

reference_adain Pixel Perfect weight 1 step: 0~0.2 first image

lineart_realistic Pixel Perfect weight 0.8 step: 0~0.8 batch

openpose_full Pixel Perfect weight: 0.8 step: 0~0.8 batch

ADetailer (lora: koreandolllikeness)

Steps: 30, Sampler: Euler a, CFG scale: 7, Seed: 2402960993, Size: 576x1024, Model hash: 77b7dc4ef0, Model: meinamix_meinaV10, Denoising strength: 1, Clip skip: 2, ADetailer model: face_yolov8n.pt, ADetailer confidence: 0.3, ADetailer dilate/erode: 4, ADetailer mask blur: 4, ADetailer denoising strength: 0.3, ADetailer inpaint only masked: True, ADetailer inpaint padding: 32, ADetailer ControlNet model: control_v11p_sd15_openpose [cab727d4], ADetailer version: 23.6.2, ControlNet 0: "preprocessor: lineart_realistic, model: control_v11p_sd15_lineart [43d4be0d], weight: 1, starting/ending: (0, 0.22), resize mode: Crop and Resize, pixel perfect: True, control mode: Balanced, preprocessor params: (512, 100, 200)", ControlNet 1: "preprocessor: openpose_full, model: control_v11p_sd15_openpose [cab727d4], weight: 0.7, starting/ending: (0, 0.9), resize mode: Crop and Resize, pixel perfect: True, control mode: Balanced, preprocessor params: (512, -1, -1)", ControlNet 4: "preprocessor: depth_zoe, model: control_v11f1p_sd15_depth [cfd03158], weight: 0.8, starting/ending: (0, 0.8), resize mode: Crop and Resize, pixel perfect: True, control mode: Balanced, preprocessor params: (512, -1, -1)", Noise multiplier: 0.5, Version: v1.3.2

reference_only, 1, 0~1, SF 0

temporalnet, 0.1, 0~0.1

[: (anime: 1.2), best quality, masterpiece, sharp focus : 1], a young woman, [: green hair: 3], pink skirt,

Negative prompt: [: real, photo, photography, realistic, dslr, worst quality, low quality, normal quality: 1], (disfigured, mutation, deformed)

Steps: 30, Sampler: Euler a, CFG scale: 7, Seed: 1961110465, Size: 576x1024, Model hash: cbfba64e66, Model: CounterfeitV30_v30, Denoising strength: 1, Clip skip: 2, ADetailer model: face_yolov8n.pt, ADetailer confidence: 0.3, ADetailer dilate/erode: 4, ADetailer mask blur: 4, ADetailer denoising strength: 0.3, ADetailer inpaint only masked: True, ADetailer inpaint padding: 32, ADetailer ControlNet model: control_v11p_sd15_openpose [cab727d4], ADetailer version: 23.6.2, ControlNet 0: "preprocessor: openpose_full, model: control_v11p_sd15_openpose [cab727d4], weight: 1.0, starting/ending: (0.0, 1.0), resize mode: ResizeMode.INNER_FIT, pixel perfect: True, control mode: ControlMode.BALANCED, preprocessor params: (-1, -1, -1)", ControlNet 1: "preprocessor: openpose_full, model: control_v11p_sd15_openpose [cab727d4], weight: 0.8, starting/ending: (0, 0.8), resize mode: Crop and Resize, pixel perfect: False, control mode: Balanced, preprocessor params: (512, -1, -1)", Noise multiplier: 0.5, Version: v1.3.2

reference_only, 1, 0~1, SF 0

temporalnet, 0.1, 0~0.1


I hope you will find this guide useful and hope that it will help you create better videos.

If you find a better method or tool, please let me know.

[edit: 2023-06-25]

To obtain good results with less effort, choosing a good source video is key.

The points are the following:

  1. less complicated movement, less camerawork

  2. less movement of the hands in front of the body

(1 is for reference_only)

(2 is for EbSynth)


Comments

NCO2:

I had a problem with the clothes changing when using EbSynth; thanks to this post, the problem is resolved. Thanks suzur420

suzur420:

@NCO2 I'm glad I could be of help :)

lon9:

@suzur420 Do you apply any post-processing like Topaz AI or DaVinci Resolve Studio? I can't stabilize my videos as well as yours.

suzur420:

@lon9 hi, thanks for your comment. The post-processing tool I used after EbSynth is FlowFrames (https://nmkd.itch.io/flowframes), as in the Tools section.
What do you mean by "stabilize"? If you're talking about the background, it's EbSynth Utility's stage 8.

lon9:

Thank you for the reply. I mean face position not background. Do you have any techniques for stabilizing face position?

suzur420:

@lon9 please use openpose_full, openpose_face, or openpose_faceonly as the preprocessor, which detects the face position. And if you're using ADetailer, lower the inpaint denoising strength (below 0.4).

lon9:

@suzur420 I used ADetailer with default settings, so I must try it, thanks.