Talking Head(?) Anime
from a Single Image 3:
Now the Body Too

Pramook Khungurn

 


The characters are corporate/independent virtual YouTubers. Images and videos in this article are fan art of them.

 

Abstract. I present my third iteration of a neural network system that can animate an anime character, given just a single image of it. While the previous iteration can only animate the head, this iteration can animate the upper body as well. In particular, I added the following three new types of movements.

  • Body rotation around the y-axis
  • Body rotation around the z-axis
  • Breathing

With the new system, I updated an existing tool of mine that can transfer human movement to anime characters. The following is what it looks like with the expanded capabilities.

I also experimented with making the system smaller and faster. I was able to significantly reduce its memory requirements (an 18-times reduction in size and a 3-4 times reduction in RAM usage) and make it slightly faster while incurring little deterioration in image quality.

 

1   Introduction

Since 2018, I have been a fan of virtual YouTubers (VTubers). In fact, I like them so much that, starting from 2019, I have been doing two personal AI projects whose aims were to make it much easier to become a VTuber. In the 2021 version of the project, I created a neural network system that can animate the face of any existing anime character, given only its head shot as input. My system lets users animate characters without having to create controllable models (either 3D models by using software such as 3ds Max, Maya, or Blender, or 2.5D ones by using software such as Live2D or Spine) beforehand. It has the potential to greatly reduce the cost of avatar creation and character animation.

While my system can rotate the face and generate rich facial expressions, it is still far from practical as a streaming and/or content creation tool. One reason is that all movement is limited to the head. A typical VTuber model, however, can rotate the upper body to some extent. It also features a breathing motion in which the character's chest or the entire upper body would rhythmically wobble up and down even if the human performer is not actively controlling the character.

The system also has another major problem: it is resource intensive. It is about 600 MB in size and requires a powerful desktop GPU to run. In order to enable usage on less powerful computers, I must optimize the system's size, memory usage, and speed.

In this article, I report my attempt to address the above two problems.

For the problem of upper body movement, I extended my latest system by adding 3 types of movements: rotation of the hip around the y-axis, rotation of the hip around the z-axis, and breathing. The new system can now animate the upper body in addition to the head, making its features close to those of professionally-made VTuber models. I accomplished this without significantly increasing the network size or processing time.

For the problem of high resource requirements, I experimented with two techniques to optimize my neural networks. The first is using depthwise separable convolutions [Sifre 2014] instead of standard convolutions. The second is representing numbers with the 16-bit floating point type (aka half) instead of the 32-bit one (aka float). Using both techniques, I was able to reduce the system's size in bytes by a factor of 18 and its GPU RAM usage by a factor of 3 to 4. The techniques also provided a small improvement in speed.

2   Background

I created a deep neural network system whose purpose was to animate the head of an anime character. The system takes as input (1) an image of the character's head in the front facing configuration with its eyes wide open and (2) a 42-dimensional vector called the pose vector that specifies the character's desired pose. It then proceeds to output another image of the same character after being posed accordingly. The system can rotate the character's face by up to 15° around three axes. Moreover, it can change the shapes of the eyebrows, eyelids, irises, and mouth, allowing the character to show various emotions and convincingly imitate (Japanese) speech.

The system is largely decomposed into two main subnetworks. The face morpher is tasked with changing the character's facial expression, and its design is documented in the write-up of the 2021 project. The face rotator is tasked with rotating the face, and its design is available in the write-up of the 2019 project. Figure 2.1 illustrates how the networks are put together.

Figure 2.1 An overview of my neural network system.

For this article, the face rotator is especially relevant because it is the network that I have to redesign in order to expand the system's capability. The network itself is made up of two subnetworks. The two-algorithm face rotator uses two image transformation techniques to generate images of the character's rotated face, and the combiner merges the two generated images to create the final output.

Figure 2.2 The face rotator.

The two image transformation techniques are warping and partial image change.

When the face is rotated by a few degrees, most changes to the input image can be thought of as moving existing pixels to new locations. Warping can thus handle these changes very well, and the generated image is sharp because existing pixels are faithfully reproduced. Nevertheless, warping cannot generate new pixels, which are needed when unseen parts of the head become visible after rotation. Partial image change, on the other hand, can generate new pixels from scratch, but they tend to be blurry. By combining both approaches, we can use pixels hallucinated by partial image change to fill areas that warping cannot handle, thus getting the best of both worlds.

3   Moving the Body

In this large section, I discuss how I extended my 2021 system so that it can move the body as well. I will start by defining exactly the problem I would like to solve (Section 3.1). Then, I will give a brief overview of the whole system. In particular, I will discuss which networks from the previous projects I reused and which I created anew (Section 3.2). Next, I will elaborate on how I generated the datasets to train the new networks (Section 3.3). I will then describe the networks' architectures and training procedures (Section 3.4), and lastly I will evaluate the networks' performance (Section 3.5).

3.1   Problem Specification

As with the previous version of the system, the new version in this article takes as input an image of a character and a pose vector. The image is now of resolution 512×512 in order to fully show the upper body. The character should be standing approximately upright, and the head should be roughly contained in the 128×128 box in the middle of the top half of the image.

Figure 3.1.1 A valid input to the new version of the system. The character is Kizuna AI (© Kizuna AI).

The character's eyes must be wide open, but the mouth can be either completely closed or wide open. However, while the character's head must be front facing in the old version, the new version relaxes this constraint. The head and the body can be rotated by a few degrees. The arms can be posed rather freely, but they should generally be below and far from the face. Allowing these variations makes the system more versatile because it is hard to find images of anime characters in the wild whose face is perfectly front facing and whose body is perfectly upright.

Figure 3.1.2 Examples of valid input images to the system.

The input image must have an alpha channel. Moreover, for every pixel that is not a part of the character, the RGBA value must be (0,0,0,0).

The pose vector now has 45 dimensions, and you can see the semantics of each parameter in Appendix A. Of these, 42 parameters have mostly the same semantics as those in the last version of the system. The only changes from the old version are the ranges of the parameters for head rotation around the y- and z-axes. In the old version, these parameters correspond to rotation angles in the range [−15°, 15°]. In the new version, the range shrinks to [−10°, 10°] for a reason that will momentarily become apparent.

There are three new parameters, and they control the body: rotation of the hip around the y-axis, rotation of the hip around the z-axis, and breathing.

With the above three parameters, it becomes possible to move a character's upper body like how typical VTubers move theirs.

Note that I previously mentioned that I reduced the range of the head rotation around the y-axis and the z-axis from [−15°, 15°] to [−10°, 10°]. I did so because rotating the hip causes the face to move as well, and I would like to preserve the [−15°, 15°] range in which the face can be oriented. That is, because the hip can be rotated by angles in the range [−5°, 5°], I reduced the face's angle range to [−10°, 10°] because 10° + 5° = 15°.

Lastly, let us recall the output's specification. After being given an image of a character and a pose vector, the system must produce a new image of the same character, posed according to the pose vector.

3.2   System Overview

Figure 3.2.1 gives an overview of the new version of the system. It is similar to the old one (Figure 2.1), but now it deals with the upper body instead of just the face. It still has two steps, and the first step still modifies the character's facial expression. For this step, I reuse the face morpher network that is the centerpiece of the previous year's project. The second step must not only rotate both the face and the body but also make the character breathe, so the old face rotator from 2019 cannot be used. The network for the second step is now called the body rotator, and it must be designed and trained anew.

Figure 3.2.1 An overview of the new version of the system.

3.3   Data

We must now prepare datasets to train the body rotator. Continuing the practice I adopted in previous projects, I created them by rendering 3D models created for the animation software MikuMikuDance (MMD), and I reused a collection of around 8,000 MMD models I manually downloaded and annotated. Details on how I created the collection can be found in the write-ups of my previous projects.

A dataset's format must follow the specification of the body rotator's input and output. In particular, the input consists of two objects. One is a 512×512 image of a character whose facial expression has been modified by the face morpher. The other is the part of the pose vector that controls (1) the rotation of the face and body and (2) the breathing motion. This part has 6 dimensions: 3 for face rotation, 2 for body rotation, and 1 for the breathing motion. The output, of course, is another 512×512 image of the same character, but now its pose has been modified according to the 6-dimensional pose vector.

3.3.1   Posing for the Input Image

One main difference between the body rotator and the face rotator from the 2021 project is the character's body pose in the input image. For the face rotator, the character must be in the "rest" pose. In other words, the face must be looking forward and must not be tilted or rotated sideways. Moreover, the body must be perfectly upright. The arms must stretch straight sideways and point diagonally downward. (See Figure 3.3.1.1.) On the other hand, as stated in Section 3.1, the new body rotator must be able to accept variations in the initial pose like in Figure 3.1.2.

Figure 3.3.1.1 The MMD model of Kizuna AI in the rest pose.

This requirement makes data generation harder. For the old version, I only had to render an MMD model without posing it because MMD modelers almost always create their models in the rest pose to begin with. In contrast, the new version requires an MMD model to be posed twice: it must take a non-rest pose in the input image, and then that pose must be altered according to the pose vector to produce the output image.

One must then figure out what poses to use in the input images, and my answer is to use poses shared by the MMD community. I downloaded pose data in VPD format, created specifically for MMD models, from websites such as Nico Nico and BowlRoll, and ended up collecting 4,731 poses in total. However, a pose may not be usable for several reasons.

  1. It is not a standing pose.
  2. The face turns too much sideways, upward, or downward.
  3. After the model is posed, the face or a large part of it is not contained in the middle 128×128 box described in Section 3.1.

I created a tool that allowed me to manually classify whether a pose is usable or not through visual inspection. With it, I identified about 832 usable poses (a yield of 19.1%). You can see the tool in action in the video below.

One way to specify the pose in the input image is to uniformly sample a usable pose from the collection above. However, I felt that using just 832 poses would not provide enough diversity, so I augmented the sampled pose further. After sampling a pose from the collection, I sampled a "rest pose" by drawing the angle the arms should make with the y-axis from the range [12.5°, 30°] and rotating the model's arms accordingly. I then blended the pose from the collection with the rest pose, using a blending factor α sampled from the range [0,1]. This process is depicted in the figure below.

Figure 3.3.1.2 The process to sample a pose to be used in the input image.

Note that a pose of an MMD model is a collection of two types of values: blendshape (morph) weights and bone transformations, whose rotational parts are represented by quaternions.

Blending two poses together thus involves interpolating these values. More specifically, we perform linear interpolation on the blendshape weights and spherical linear interpolation (slerp) on the quaternions.
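To make the blending step concrete, the following is a minimal sketch of how two poses might be combined. The pose container and its fields are hypothetical (they are not the actual data structures of my pipeline); only the interpolation rules match the description above.

```python
import numpy as np

def slerp(q0, q1, alpha):
    # Spherical linear interpolation between two unit quaternions.
    q0, q1 = np.asarray(q0, dtype=float), np.asarray(q1, dtype=float)
    dot = float(np.dot(q0, q1))
    if dot < 0.0:                      # take the shorter arc
        q1, dot = -q1, -dot
    if dot > 0.9995:                   # nearly parallel: fall back to lerp
        q = q0 + alpha * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(np.clip(dot, -1.0, 1.0))
    return (np.sin((1.0 - alpha) * theta) * q0
            + np.sin(alpha * theta) * q1) / np.sin(theta)

def blend_poses(pose_a, pose_b, alpha):
    # pose_a, pose_b: dicts with 'morphs' (name -> blendshape weight) and
    # 'bones' (name -> unit quaternion). Returns the blended pose.
    morphs = {name: (1.0 - alpha) * w + alpha * pose_b['morphs'][name]
              for name, w in pose_a['morphs'].items()}
    bones = {name: slerp(q, pose_b['bones'][name], alpha)
             for name, q in pose_a['bones'].items()}
    return {'morphs': morphs, 'bones': bones}
```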

3.3.2   Semantics of the 6 Pose Parameters

In order to generate training examples, one must determine what each component of the pose vector means in terms of MMD models. For example, when the breathing parameter is, say, 0.75, what bone(s) in an MMD model does one modify and what should the modification be? I shall now discuss the semantics of the 6 parameters in turn.

3.3.2.1   The Hip Rotation Parameters

Let us start with the one that is the easiest to describe: the hip y-rotation parameter. In this case, one must rotate two bones around the y-axis by the same angle. One bone is called "upper body" (上半身), and the other is called "lower body" (下半身). According to the specification in Section 3.1, the angle is v × 5°, where v is the parameter value.

For the semantics of the hip z-rotation parameter, just replace the y-axis with the z-axis.

3.3.2.2   The Breathing Parameter

The semantics of the breathing parameter is more involved, as most MMD models do not have bones or morphs for specifically controlling breathing. As a result, I had to define what breathing means on my own, and I chose to modify the translation parameters of 5 bones in the chest area.

Figure 3.3.2.2.1 Bones modified to enact the breathing motion.

When we inhale, our lungs expand, pushing our chest both outward and upward. To simulate this movement, I set the translation parameter of the "upper body" bone to the vector (0, 0, v×D) to make the chest protrude outward and that of the "upper body 2" bone to (0, v×D, 0) to make it extend upward. Here, v is the breathing parameter value, and D is the maximum displacement, a per-model constant that we shall discuss later. The effect of the modification can be seen in the following video.

However, we can also see that this has the side effect of making the head and the shoulders move diagonally back and forth, whereas, when we breathe, our head and shoulders rarely move. To keep them stationary, I also set the translation parameters of the three remaining bones (i.e., the left shoulder, the right shoulder, and the neck) to (0, −v×D, −v×D) to cancel the translations of the two upper body bones. The effect of the cancellation can be seen in the video below.

The maximum displacement D is set to 1/64 of the height of the character's head.

When the model is viewed from the front, we can see that the chest moves up and down while the head and the shoulders remain stationary. This movement gives the impression that the character is breathing, as we wanted.
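The following sketch restates the breathing semantics above as code. The bone-manipulation API is hypothetical, and the Japanese bone names for the shoulders and neck (左肩, 右肩, 首) follow the usual MMD conventions rather than being taken from my actual data pipeline.

```python
def apply_breathing(model, v, head_height):
    # v: breathing parameter in [0, 1]; D: maximum displacement.
    D = head_height / 64.0
    model.set_bone_translation("上半身",  (0.0, 0.0, v * D))   # chest protrudes outward
    model.set_bone_translation("上半身2", (0.0, v * D, 0.0))   # chest extends upward
    # Cancel the two translations above so the head and shoulders stay put.
    for bone in ("左肩", "右肩", "首"):
        model.set_bone_translation(bone, (0.0, -v * D, -v * D))
```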

3.3.2.3   The Head Rotation Parameters

There are three head rotation parameters, corresponding to rotations around the x-, y-, and z-axes.

There are no changes to the bones and the axes that these parameters control. However, I changed how the parameters affect the model's shape.

Typically, when a bone of a 3D model is rotated, bones that are children of that bone and vertices that these bones influence move with it. For example, when one rotates the neck bone, vertices on the neck, the whole head, and also the whole hair mass rotate with the neck, as can be seen in the video below.

Figure 3.3.2.3.1 The typical result of rotating the neck bone around the z-axis. Notice that the whole hair mass moves like a single rigid body, following the head. The character is Suou Patra (© HoneyStrap), and the 3D model was created by OTUKI.

This behavior, however, makes it very hard for a neural network to animate characters with long hair. First, it must identify correctly which part of the input image is the hair and which part is the body and the arms that are in front of it. This is a hard segmentation problem that must be done 100% correctly. Otherwise, disturbing errors such as severed body parts or clothing might show up. Second, as the hair mass moves, parts that were occluded by the body can become visible, and the network must hallucinate these parts correctly. Note that these difficulties do not exist in the previous version of my system because it could only animate headshots. We cannot see long hair in the input to begin with!

Figure 3.3.2.3.2 To generate the video on the right, I used a model to animate the image on the left, but it was trained on a dataset where the whole hair mass moves with the head like in Figure 3.3.2.3.1. In the bottom half of the video, we can see that the details of the hair strands are lost. Moreover, the model seemed to think that the character's hands were a part of the hair, so it cut the fingers off when the hair moved. The character is Enomiya Milk (© Noripro).

While it may be possible to solve the above problems with current machine learning techniques, I realized that it was much easier to avoid them and still generate plausible and beautiful animations. The difficulty in our case comes from long-range dependency: a small movement of the head leads to large movements elsewhere, far away. The situation becomes much easier if hair pixels far from the head are kept stationary.

I thus modified the skinning algorithm for MMD models so that the neck and the head bones can only influence vertices that are not too far below the vertical position of the neck. The new algorithm's effect can be seen in the following video.

Figure 3.3.2.3.3 Hair movement after limiting the influence of the neck and head bones. We can see that the hair mass below the shoulders does not move at all, and this makes the system's job much easier.

To recap, the head rotation parameters still correspond to rotating the same bones around the same axes. However, the influence of these bones is limited to vertices that are not too far below the neck so that head movement cannot cause large movements elsewhere. This greatly simplifies animating characters with very long hair, which are quite common in illustrations in the wild.

3.3.3   Augmenting Renderings with Simulated Neck Shadows

Character illustrations in the wild often depict shadows cast by the head on the neck. We can clearly see that the skin just below the chin is often much darker than the face.

Figure 3.3.3.1 Three drawn characters with neck shadows. The characters are Fushimi Gaku, Hayase Sou, and Honma Himawari. They are © ANYCOLOR, Inc.

However, in 3D models, the neck and face skin often have exactly the same tone. Thus, a neck shadow would be absent if a model is rendered without shadow mapping or other shadow-producing techniques. I chose not to implement such an algorithm because it would require much effort and would greatly complicate my data generation pipeline. As a result, my previous datasets do not have neck shadows and are quite different from real-world data.

When a character turns its face upward, the area of the neck previously occluded by the chin becomes visible, and the network must hallucinate the pixels there. Ideally, if a neck shadow is present, the hallucinated pixels should have the same color as the surrounding shadow. However, a network trained on my previous datasets tends to make these pixels brighter than the surrounding shadow because, on shadow-free training data, filling them with the face skin's color is perfectly correct. The figure below shows two such failure cases.

Figure 3.3.3.2 Failure cases in which hallucinated neck pixels are brighter than the surrounding neck shadows. (Left: input image. Right: after the face is turned upward.) The characters are Suzuhara Lulu (top) and Ex Albio (bottom). They are © ANYCOLOR, Inc.

To alleviate the problem without having to implement a full-blown shadow algorithm, I simulated neck shadows by simply rendering the neck under a different lighting configuration than the rest of the body. Like the previous versions of the project, two light sources are present in the scene. As such, when implementing the fragment shader of my renderer, I only had to condition their intensities on whether the fragment being shaded belongs to the neck or not. The result of this rendering method can be seen in the figure below.

Figure 3.3.3.3 An MMD model rendered (a) conventionally and (b) with a simulated neck shadow. The character is Yamakaze from the game Kantai Collection. The 3D model was created by cham.

When generating training examples, we must then provide two sets of lighting intensities so that one can be used to render the body, and the other can be used to render the neck. In the dataset I generated, I sampled the intensities so that the following properties hold:

The sampling method above, I believe, would allow the network to deal with the variety of character illustrations in the wild.

3.3.4   Generating a Training Example

A dataset is a collection of training examples, and a training example in our case is a triple (Iin, p, Iout), where Iin is the input image, p is the 6-dimensional pose vector, and Iout is the output image: the same character as in Iin, posed according to p.

Figure 3.3.4.1 A training example: an input image Iin, a pose vector p (for example, [0.45, 0.09, 0.60, 0.06, 0.30, 0.80]), and the corresponding output image Iout.

The process of generating the above data items is rather involved. It requires sampling an MMD model, an input pose as in Section 3.3.1, a 6-dimensional pose vector p, and two sets of light intensities as in Section 3.3.3. The input pose and p must then be combined using the specification in Section 3.3.2 to obtain the pose to be used in the output image. The model, the poses, and the lighting configurations are then combined to render Iin and Iout using the rendering algorithm in Section 3.3.3. For completeness, I lay out the complete generation algorithm in the listing below. The reader, however, is advised to skip the description unless they are interested in reproducing it.

Listing 3.3.4.2 Algorithm for generating a training example
  1. A model from my collection of MMD models is sampled. For this step, I made sure that each of the models would have roughly the same number of training examples using it.
  2. The process for determining the input pose, detailed in Section 3.3.1, is invoked. In particular:
    • A pose in VPD format is sampled from the collection of 832 poses.
    • A rest pose is sampled by sampling an arm angle from [12.5°, 30°].
    • A blending factor α is sampled from the range [0,1].
    • The input pose is computed by blending the sampled pose with the rest pose according to α. For later reference, let us call this pose Pin.
  3. A pose vector p is sampled component by component, independently.
    • The 3 head rotation parameters are each sampled according to the probability distribution I used in the previous version of the project.
    • The 2 body rotation parameters are sampled uniformly from the range [−1, 1].
    • The breathing parameter is sampled according to a linear probability density p(x) on [0,1] with p(0)=0.3 and p(1)=1.7. (A sketch of this sampling step is given after this listing.)
  4. Two sets of light intensities are sampled according to the specification in Section 3.3.3.
  5. The input pose Pin is altered according to the sampled pose vector p. This involves modifying the bones according to the semantics described in Section 3.3.2. Let us call the result of this modification Pout.
  6. The sampled model is posed according to Pin and is then rendered under the sampled lighting intensities as described in Section 3.3.3 to produce the input image Iin.
  7. The sampled model is posed one more time according to Pout and is then rendered to produce the output image Iout.
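The sketch below illustrates step 3 of the listing, in particular inverse-CDF sampling for the linear breathing density p(x) = 0.3 + 1.4x. The head-rotation sampler is left as a placeholder because its distribution is defined in the previous project's write-up.

```python
import math
import random

def sample_breathing():
    # Inverse-CDF sampling for the linear density p(x) = 0.3 + 1.4x on [0, 1],
    # which satisfies p(0) = 0.3, p(1) = 1.7, and integrates to 1.
    # CDF: F(x) = 0.3 x + 0.7 x^2; solve F(x) = u for x.
    u = random.random()
    return (-0.3 + math.sqrt(0.09 + 2.8 * u)) / 1.4

def sample_pose_vector(sample_head_rotation):
    # `sample_head_rotation` is a placeholder for the distribution used in the
    # previous version of the project.
    head = [sample_head_rotation() for _ in range(3)]       # head rotations
    body = [random.uniform(-1.0, 1.0) for _ in range(2)]    # hip y- and z-rotation
    return head + body + [sample_breathing()]               # the 6-dimensional p
```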

3.3.5   The Datasets

I followed the same dataset generation process as in the previous versions of the project. Before data generation, I divided the models I downloaded into three groups according to their source materials (i.e., what animes/mangas/games they came from) so that no two groups had models of the same character. I then used the three groups to generate the training, validation, and test datasets. The number of training examples and the number of models used to generate them are given in the table below.

  Training set Validation set Test set
# models used 7,827 79 68
# examples 500,000 10,000 10,000

3.4   Networks

I have just described the datasets for training the body rotator. In this section, I turn to its subnetworks' architectures and training procedures.

3.4.1   Overall Design

In my first attempt to design the body rotator, I reused the face rotator's architecture. There would be two subnetworks. The first one would produce two outputs, and the second one would then combine them into the final output image.

Figure 3.4.1.1 The face rotator's architecture. (This figure is the same as Figure 2.2. It is reproduced here for the reader's convenience.)

The difficulty, however, is the input's size: a 512×512 image has 4 times as many pixels as a 256×256 one. The networks above were designed to work with 256×256 images. Hence, if I used them without modification, they would become 4 times slower, which is clearly not fast enough for interactive applications. The 2021 system could only achieve between 10 and 20 fps even on a Titan RTX graphics card, and I do not want the new system to be much slower.

My strategy, then, is to scale the input image down to 256×256 and perform body rotation on it first. For this step, I can use a subnetwork whose architecture is similar to those in my previous projects without any performance penalty. I call this network the half-resolution rotator because it has the same functionality as the whole body rotator but operates on half-resolution images. Its outputs, of course, are half-sized and so not immediately usable. Scaling them up by a factor of 2 would provide images with the right resolution, but these images are "coarse" in the sense that they lack high-frequency details. I thus add another subnetwork called the editor, whose task is to combine the scaled-up outputs into one image and edit it to improve quality.

Note that the editor is the only network that operates on full-resolution images, but it can afford to have lower capacity per input pixel because its task is much easier than that of the half-resolution rotator. We will see later that this design keeps the body rotator fast enough for real-time applications despite the fact that the input is now 4 times larger.
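A minimal sketch of this two-stage flow is shown below. The module interfaces, and the convention that the half-resolution rotator returns a warped image, a directly generated image, and an appearance flow offset, are assumptions made for illustration; the offset is assumed to be expressed in normalized grid coordinates, so upsampling it does not require rescaling its values.

```python
import torch.nn.functional as F

def body_rotator_forward(half_res_rotator, editor, image_512, pose):
    # image_512: (N, 4, 512, 512) RGBA input image; pose: (N, 6) pose vector.
    image_256 = F.interpolate(image_512, size=(256, 256),
                              mode='bilinear', align_corners=False)
    # The directly generated image is a training-time auxiliary output;
    # at inference time it is discarded.
    warped_256, _direct_256, flow_256 = half_res_rotator(image_256, pose)
    warped_512 = F.interpolate(warped_256, size=(512, 512),
                               mode='bilinear', align_corners=False)
    flow_512 = F.interpolate(flow_256, size=(512, 512),
                             mode='bilinear', align_corners=False)
    # The editor refines the flow, re-warps the original image, and retouches it.
    return editor(image_512, pose, warped_512, flow_512)
```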

The two networks do not follow the old design exactly. Like the two-algorithm face rotator, the half-resolution rotator still uses two image transformation techniques to produce output images, but they are not the same as the old ones: partial image change is replaced with direct generation (see Section 3.4.2).

The editor is similar to the combiner. However, instead of taking in outputs from both image transformation techniques of the half-resolution rotator, it only takes those from the warping one. The image created by direct generation is always discarded.

Figure 3.4.1.2 The overall architecture of the body rotator.

While direct generation seems wasteful and superfluous, it serves as an auxiliary task at training time, and I found that it improved the whole pipeline's performance. This counterintuitive design is a result of evaluating many design alternatives and choosing the best one. (More on this later in Section 3.5.)

I will now discuss each of the subnetworks in more detail.

3.4.2   The Half-Resolution Rotator

The half-resolution rotator's architecture is derived from that of the two-algorithm face rotator from my 2019 project. So, it is built with components that I previously used. These include the image features (alpha mask, image change, and appearance flow offset), the image transformation units (partial image change unit, combining unit, and warping unit), and various units such as Conv3, Conv7, ConvDown, Tanh, and so on. I refer the reader to the previous write-up for details.

Figure 3.4.2.1 An overview of the half-resolution rotator's architecture.

As the figure above shows, the half-resolution rotator has an encoder-decoder main body, which takes in a 256×256 input image and a pose vector and produces a feature tensor. It then applies two image transformation units to the result to generate the outputs.

Figure 3.4.2.2 The direct generation unit.

The outputs of these units are treated as the outputs of the half-resolution rotator.

Recall that, in the two-algorithm face rotator of the 2019 project, the partial image change unit is used because, in the 2019 problem specification, the body does not move at all, so the network only has to change pixels belonging to the head. However, for the current problem specification, if any of the parameters that control the hip rotation is not zero, then every pixel would change. As a result, it becomes more economical to generate all output pixels directly, and so I replaced partial image change with direct generation.

The specification of the encoder-decoder network is given in the table below.

Tensors Shape
A0= input image 4×256×256
A1= pose vector 6
A2=A1 turned into a 2D tensor 6×256×256
A3=Concat(A0,A2) 10×256×256
B1=LeakyReLU(InstNorm(Conv7(A3))) 64×256×256
B2=LeakyReLU(InstNorm(ConvDown(B1))) 128×128×128
B3=LeakyReLU(InstNorm(ConvDown(B2))) 256×64×64
B4=LeakyReLU(InstNorm(ConvDown(B3))) 512×32×32
C1=ResNetBlock(B4) 512×32×32
C2=ResNetBlock(C1) 512×32×32
⋮ ⋮
C6=ResNetBlock(C5) 512×32×32
D1=LeakyReLU(InstNorm(Conv3(UpsampleNn(C6)))) 256×64×64
D2=LeakyReLU(InstNorm(Conv3(UpsampleNn(D1)))) 128×128×128
D3=LeakyReLU(InstNorm(Conv3(UpsampleNn(D2)))) 64×256×256
Table 3.4.2.3 Specification of the encoder-decoder network that is the main body of the half-resolution rotator. Note that D3 is the feature tensor that is fed to the image transformation units in order to generate the final outputs.
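As a small illustration of how A2 and A3 in the table might be formed, here is a PyTorch sketch of broadcasting the pose vector into constant feature maps and concatenating them with the image. It is a generic sketch, not the exact code of my networks.

```python
import torch

def append_pose_channels(image, pose):
    # image: (N, 4, H, W) RGBA tensor; pose: (N, 6) pose vector.
    n, _, h, w = image.shape
    # A2: each pose component becomes a constant H×W plane.
    pose_planes = pose.view(n, -1, 1, 1).expand(n, pose.shape[1], h, w)
    # A3: concatenate along the channel axis, giving an (N, 10, H, W) tensor.
    return torch.cat([image, pose_planes], dim=1)
```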

The encoder-decoder network above is an upgraded version of the one used in my 2021 project. Differences from the old design include:

  1. Instead of the rectified linear unit (ReLU), I now use the leaky ReLU with slope 0.1 as the activation function.
  2. In the upscaling portion of the encoder-decoder, I use nearest-neighbor upscaling by a factor of 2 followed by a Conv3 instead of a transposed convolution unit. This is done to combat checkerboard artifacts in the outputs [Odena et al. 2016]. (A sketch of this replacement is given after this list.)
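Below is a generic PyTorch sketch of the replacement mentioned in item 2: nearest-neighbor upscaling followed by a 3×3 convolution, contrasted with a transposed convolution. The exact normalization and activation settings are assumptions for illustration.

```python
import torch.nn as nn

def upsample_block(in_channels, out_channels):
    # Nearest-neighbor upscaling by 2 followed by a 3x3 convolution; this avoids
    # the checkerboard artifacts that transposed convolutions can introduce.
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode='nearest'),
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
        nn.InstanceNorm2d(out_channels, affine=True),
        nn.LeakyReLU(0.1, inplace=True),
    )

# The unit it replaces (prone to checkerboard artifacts):
# nn.ConvTranspose2d(in_channels, out_channels, kernel_size=4, stride=2, padding=1)
```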

The units used to build the network are largely the same, but the semantics of some have slightly changed, and I also introduced a number of new ones.

The half-resolution rotator is about 128 MB in size.

Training procedure. I trained the half-resolution rotator using a process similar to that of the two-algorithm face rotator. Training is divided into two phases. In the first phase, the loss function was the sum of the L1-norms of the differences between the two generated images and the ground-truth image:

$$\mathcal{L}_{\mathrm{HRR},1} = \mathbb{E}_{(I_{\mathrm{in}},\, p,\, I_{\mathrm{out}}) \sim p_{\mathrm{data}}}\Big[ \|I_{\mathrm{warped}} - I_{\mathrm{out}}\|_1 + \|I_{\mathrm{direct}} - I_{\mathrm{out}}\|_1 \Big].$$

In the second phase, I added a perceptual feature reconstruction loss [Johnson et al. 2016] on $I_{\mathrm{direct}}$ and adjusted the weights of the existing terms:

$$\mathcal{L}_{\mathrm{HRR},2} = \mathbb{E}_{(I_{\mathrm{in}},\, p,\, I_{\mathrm{out}}) \sim p_{\mathrm{data}}}\Big[ 20\,\|I_{\mathrm{warped}} - I_{\mathrm{out}}\|_1 + \|I_{\mathrm{direct}} - I_{\mathrm{out}}\|_1 + \tfrac{4 \times 256 \times 256}{5}\,\Phi(I_{\mathrm{out}}, I_{\mathrm{direct}}) \Big].$$

Here,

$$\Phi(I_1, I_2) = \sum_{i=1}^{3} \lambda_i \Big( \|\phi_i(I_1^{\mathrm{rgb}}) - \phi_i(I_2^{\mathrm{rgb}})\|_1 + \|\phi_i(I_1^{\mathrm{aaa}}) - \phi_i(I_2^{\mathrm{aaa}})\|_1 \Big),$$

where $\phi_i(\cdot)$ denotes the feature tensor extracted from the $i$-th chosen layer of a pretrained image classification network, $\lambda_i$ is the weight of the $i$-th term, $I^{\mathrm{rgb}}$ denotes the RGB channels of image $I$, and $I^{\mathrm{aaa}}$ denotes its alpha channel replicated into three channels.

The reader may notice the imbalance between the weights. The L1 losses, $20\,\|I_{\mathrm{warped}} - I_{\mathrm{out}}\|_1$ and $\|I_{\mathrm{direct}} - I_{\mathrm{out}}\|_1$, have small weights, but the weight of the perceptual loss, $\tfrac{4 \times 256 \times 256}{5}\,\Phi(I_{\mathrm{out}}, I_{\mathrm{direct}})$, is much larger. The reason for this imbalance is that $\Phi(I_{\mathrm{out}}, I_{\mathrm{direct}})$ is a sum of L1-norms that have been normalized by tensor sizes, whereas the other two losses are "unnormalized" L1-norms of 4×512×512 tensors. The factor $4 \times 256 \times 256$ scales $\Phi(I_{\mathrm{out}}, I_{\mathrm{direct}})$ up so that each of its terms becomes an unnormalized L1-norm of a 4×256×256 tensor. This puts the perceptual loss on roughly the same order of magnitude as the two L1 losses.
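For concreteness, here is a sketch of a feature reconstruction loss in the spirit of Φ. The choice of VGG16 layers and uniform layer weights is an assumption made for illustration; the actual ϕ_i and λ_i follow my previous write-up.

```python
import torch
import torchvision

class PerceptualLoss(torch.nn.Module):
    # A sketch of the feature reconstruction loss Φ for RGBA images.
    def __init__(self, layers=(3, 8, 15), weights=(1.0, 1.0, 1.0)):
        super().__init__()
        self.vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features.eval()
        for param in self.vgg.parameters():
            param.requires_grad_(False)
        self.layers = list(layers)
        self.weights = dict(zip(layers, weights))

    def _features(self, x):
        feats = {}
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.weights:
                feats[i] = x
            if i >= max(self.layers):
                break
        return feats

    def forward(self, image_a, image_b):
        # Compare features of the RGB channels and of the alpha channel
        # replicated into three channels (the "aaa" image).
        pairs = [
            (image_a[:, :3], image_b[:, :3]),
            (image_a[:, 3:4].repeat(1, 3, 1, 1), image_b[:, 3:4].repeat(1, 3, 1, 1)),
        ]
        loss = 0.0
        for a, b in pairs:
            fa, fb = self._features(a), self._features(b)
            for i in self.layers:
                # Size-normalized L1 difference of the feature tensors.
                loss = loss + self.weights[i] * (fa[i] - fb[i]).abs().mean()
        return loss
```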

I trained the network with the Adam algorithm, setting $\beta_1 = 0.5$ and $\beta_2 = 0.999$. The learning rate was $10^{-4}$ for both phases, and the batch size was 8. The first phase lasted 1 epoch (500,000 examples shown), and the second phase lasted 12 epochs (6,000,000 examples shown).

3.4.3   The Editor

Recall from Figure 3.4.1.2 that the outputs of the half-resolution rotator are scaled up by a factor of 2. Then, they are fed to the editor along with the original 512×512 input image and the pose vector. The editor's job, then, is to produce an output image from these data.

Unlike the half-resolution rotator, the editor's main body is a U-Net [Ronneberger et al. 2015] instead of an encoder-decoder. I made this choice because of the folk wisdom that U-Nets are good for tasks where the input and output images are aligned pixel-to-pixel. Here, I assume that the half-resolution rotator should have moved the pixels to roughly the right locations, and so the editor's output would align pixel-to-pixel with the half-resolution rotator's outputs.

After being fed all the inputs, the main body produces a feature tensor, which is then fed to a number of image processing steps, leading to the final output. The steps are:

  1. Warping the input image. From the feature tensor, a new appearance flow offset is created. It is then added to the input appearance flow offset, and the result is then used to warp the original 512×512 input image.
  2. Partially changing the warped image. We simply apply a partial image change to the warped image generated in the last step. The resulting image is treated as the output of the editor. Let us denote it by $I_{\mathrm{final}}$.

In other words, the editor further modifies the appearance flow offset created by the half-resolution rotator. Ideally, it should add high-frequency details that the rotator could not generate. The editor then "retouches" the warped image generated by the improved appearance flow offset through partial image changes. The whole process is summarized in the figure below.

Figure 3.4.3.1 An overview of the editor's architecture.

The specification of the U-Net is given in the table below.

Tensors Shape
A0= original input image 4×512×512
A1= pose vector 6
A2= scaled-up warped image
(generated by the half-resolution rotator)
4×512×512
A3= scaled-up appearance flow offset
(generated by the half-resolution rotator)
2×512×512
A4=A1 turned into a 2D tensor 6×512×512
A5=Concat(A0,A4,A2,A3) 16×512×512
B1=LeakyReLU(InstNorm(Conv3(A5))) 32×512×512
B2=LeakyReLU(InstNorm(ConvDown(B1))) 64×256×256
B3=LeakyReLU(InstNorm(ConvDown(B2))) 128×128×128
B4=LeakyReLU(InstNorm(ConvDown(B3))) 256×64×64
C1=ResNetBlock(B4) 256×64×64
C2=ResNetBlock(C1) 256×64×64
⋮ ⋮
C6=ResNetBlock(C5) 256×64×64
D1=Concat(UpsampleNn(C6),B3) 384×128×128
D2=LeakyReLU(InstNorm(Conv3(D1))) 128×128×128
D3=Concat(UpsampleNn(D2),B2) 192×256×256
D4=LeakyReLU(InstNorm(Conv3(D3))) 64×256×256
D5=Concat(UpsampleNn(D4),B1) 96×512×512
D6=LeakyReLU(InstNorm(Conv3(D5))) 32×512×512
Table 3.4.3.2 Specification of the U-Net network that is the main body of the editor. D6 is the feature vector that the U-Net outputs.

Let us note that the U-Net has lower capacity per input pixel than the encoder-decoder of the half-resolution rotator (Table 3.4.2.3). This can be seen from the number of channels of B1, which is the first feature tensor both networks compute from the input. For the encoder-decoder, each pixel is allocated 64 channels, but the number is 32 for the U-Net.

Recall that the time complexity of a convolution on a $C \times H \times W$ tensor is $O(C^2 H W)$. Hence, halving the number of channels speeds a convolution up by a factor of 4, while doubling the image size (i.e., $H \to 2H$ and $W \to 2W$) slows it down by a factor of 4. Because the U-Net operates on tensors that are 2 times larger in height and width but have 2 times fewer channels, its convolutions have the same time complexity as those in the encoder-decoder network: the two changes cancel each other out. As a result, we can say that the U-Net's time complexity is of the same order of magnitude as that of the half-resolution rotator (modulo, of course, the more complex calculations when scaling feature tensors up after the bottleneck part). Capacity reduction thus keeps the editor fast despite the fact that it operates on images that have 4 times more pixels.
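As a quick sanity check of this cancellation, using the first-layer numbers from the two tables (64 channels on 256×256 for the encoder-decoder versus 32 channels on 512×512 for the U-Net):

```latex
% Per-layer convolution cost, up to constant factors, with C = 64 and H = W = 256:
\underbrace{\Big(\tfrac{C}{2}\Big)^{2}(2H)(2W)}_{\text{U-Net: } 32 \times 512 \times 512}
  \;=\; \tfrac{C^{2}}{4}\cdot 4HW
  \;=\; \underbrace{C^{2}HW}_{\text{encoder-decoder: } 64 \times 256 \times 256}
```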

The editor is about 33 MB in size, which is about one fourth the size of the half-resolution rotator. This is because a convolution unit that operates on a $C \times H \times W$ tensor requires $O(C^2)$ space to store its parameters, so halving the number of channels reduces the size by a factor of 4.

Training procedure. I used a loss function with 4 terms:

$$\mathcal{L}_{\mathrm{editor}} = \mathbb{E}_{(I_{\mathrm{in}},\, p,\, I_{\mathrm{out}}) \sim p_{\mathrm{data}}}\Big[ \lambda_{L1} \mathcal{L}_{L1} + \lambda_{\mathrm{percept}} \mathcal{L}_{\mathrm{percept}} + \lambda_{L1}^{\mathrm{neck}} \mathcal{L}_{L1}^{\mathrm{neck}} + \lambda_{\mathrm{percept}}^{\mathrm{neck}} \mathcal{L}_{\mathrm{percept}}^{\mathrm{neck}} \Big].$$

Here, $\mathcal{L}_{L1}$ is the L1 difference between the ground-truth image and the final output:

$$\mathcal{L}_{L1} = \|I_{\mathrm{out}} - I_{\mathrm{final}}\|_1.$$

$\mathcal{L}_{\mathrm{percept}}$ is the perceptual feature reconstruction loss between $I_{\mathrm{out}}$ and $I_{\mathrm{final}}$. I did not evaluate the loss on the whole 512×512 images because I found doing so to be too slow. Instead, I divided the images into 4 quadrants of size 256×256 and evaluated

$$\mathcal{L}_{\mathrm{percept}} = \tfrac{1}{4}\Big[ \Phi(I_{\mathrm{out}}^{Q_1}, I_{\mathrm{final}}^{Q_1}) + \Phi(I_{\mathrm{out}}^{Q_2}, I_{\mathrm{final}}^{Q_2}) + \Phi(I_{\mathrm{out}}^{Q_3}, I_{\mathrm{final}}^{Q_3}) + \Phi(I_{\mathrm{out}}^{Q_4}, I_{\mathrm{final}}^{Q_4}) \Big],$$

where $I^{Q_1}, I^{Q_2}, I^{Q_3}, I^{Q_4}$ denote the 4 quadrants of image $I$. To speed up the computation of the above expression, I estimated it by uniformly sampling a quadrant and only evaluating the loss for that quadrant.
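A sketch of this random-quadrant estimate is given below, where `phi` stands for any callable that evaluates Φ on a pair of 256×256 RGBA image batches (the names here are illustrative).

```python
import random

def quadrant_perceptual_loss(phi, out_image, final_image):
    # Estimate L_percept by uniformly sampling one of the four 256x256 quadrants
    # of the 512x512 images and evaluating Φ only on that quadrant.
    i, j = random.randrange(2), random.randrange(2)
    rows = slice(256 * i, 256 * (i + 1))
    cols = slice(256 * j, 256 * (j + 1))
    return phi(out_image[:, :, rows, cols], final_image[:, :, rows, cols])
```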

The $\mathcal{L}_{L1}^{\mathrm{neck}}$ and $\mathcal{L}_{\mathrm{percept}}^{\mathrm{neck}}$ terms were added to alleviate the neck color problem that I encountered in Figure 3.3.3.2. They are the same as $\mathcal{L}_{L1}$ and $\mathcal{L}_{\mathrm{percept}}$ except that they operate on the 64×64 subimage around the neck of the character.

Figure 3.4.3.3 The neck subimage that is used to evaluate $\mathcal{L}_{L1}^{\mathrm{neck}}$ and $\mathcal{L}_{\mathrm{percept}}^{\mathrm{neck}}$.

Let us denote the neck subimage of $I$ by $I^{\mathrm{neck}}$. The two neck loss terms are given by:

$$\mathcal{L}_{L1}^{\mathrm{neck}} = \|I_{\mathrm{out}}^{\mathrm{neck}} - I_{\mathrm{final}}^{\mathrm{neck}}\|_1, \qquad \mathcal{L}_{\mathrm{percept}}^{\mathrm{neck}} = \Phi(I_{\mathrm{out}}^{\mathrm{neck}}, I_{\mathrm{final}}^{\mathrm{neck}}).$$

Because $I_{\mathrm{out}}^{\mathrm{neck}}$ and $I_{\mathrm{final}}^{\mathrm{neck}}$ are only 64×64 in resolution, directly evaluating the perceptual feature reconstruction loss on them was fast enough that I did not have to split them into quadrants like I did for $\mathcal{L}_{\mathrm{percept}}$.

The coefficients of the terms were $\lambda_{L1} = \tfrac{1}{4}$, $\lambda_{\mathrm{percept}} = \tfrac{4 \times 256 \times 256}{5}$, $\lambda_{L1}^{\mathrm{neck}} = 16$, and $\lambda_{\mathrm{percept}}^{\mathrm{neck}} = \tfrac{4 \times 256 \times 256}{5}$.

Training the editor requires the half-resolution rotator because we have to use it to generate two of the editor's inputs. However, I froze its parameters so that only the editor's parameters were updated during training. Again, I used the Adam algorithm with $\beta_1 = 0.5$, $\beta_2 = 0.999$, a learning rate of $10^{-4}$, and a batch size of 8. Training lasted for 6 epochs (3,000,000 examples shown).

3.5   Results

3.5.1   Comparison Against Other Design Variations

The design of the body rotator I presented in the last section is rather counterintuitive: the half-resolution rotator has an output that is always discarded. I arrived at this design by picking the best one out of many variations.

I evaluated 2 designs for the half-resolution rotator and 6 designs for the editor. I also considered a design in which a standalone network performs the whole body rotation task. Because some half-resolution rotator designs are not compatible with certain editor designs, there were 10 valid variations in total: 9 two-network combinations plus the single-network design.

Half-resolution rotator designs. Of course, one of the designs is the one I presented in Section 3.4.2. The other design is a variation of it in which the direct generation branch is removed. Let us refer to the simpler design as "Rotator A" and the design in Section 3.4.2 as "Rotator B."

Rotator A Rotator B
Figure 3.5.1.1 Half-resolution rotator designs.

Rotator B was trained with the process described in Section 3.4.2. Rotator A's process was similar; the only changes were the loss functions. The first phase's loss function was

$$\mathcal{L}_{\mathrm{RA},1} = \mathbb{E}_{(I_{\mathrm{in}},\, p,\, I_{\mathrm{out}}) \sim p_{\mathrm{data}}}\Big[ \|I_{\mathrm{out}} - I_{\mathrm{warped}}\|_1 \Big],$$

and the second phase's loss function was

$$\mathcal{L}_{\mathrm{RA},2} = \mathbb{E}_{(I_{\mathrm{in}},\, p,\, I_{\mathrm{out}}) \sim p_{\mathrm{data}}}\Big[ 20\,\|I_{\mathrm{out}} - I_{\mathrm{warped}}\|_1 + \tfrac{4 \times 256 \times 256}{5}\,\Phi(I_{\mathrm{out}}, I_{\mathrm{warped}}) \Big].$$

Editor designs. I considered 6 designs for the editor. All of them have the same U-Net (Table 3.4.3.2) as their main bodies, but they differ in which inputs they take in and how they generate the final output image.

As for the inputs, there are 5 data items that the editor can take in: the original input image Iin, the pose vector p, the scaled-up warped image Iwarped, the scaled-up directly generated image Idirect, and the scaled-up appearance flow offset ΔF.

So, an editor's inputs must form a subset of {Iin, p, Iwarped, Idirect, ΔF}. I explored the four subsets listed in Table 3.5.1.2.

In other words, all editors must take in Iin, p, and Iwarped, but ΔF and Idirect are optional.

As for ways to generate the final output image Ifinal, I explored three approaches.

Note that Approach γ requires ΔF, but the other two approaches do not need it. Based on this observation, there are 6 possible designs as listed in the table and figure below.

Name Inputs How to Generate Ifinal
Editor U {Iin,p,Iwarped} Approach α
Editor V {Iin,p,Iwarped,Idirect} Approach α
Editor W {Iin,p,Iwarped} Approach β
Editor X {Iin,p,Iwarped,Idirect} Approach β
Editor Y {Iin,p,Iwarped,ΔF} Approach γ
Editor Z {Iin,p,Iwarped,ΔF,Idirect} Approach γ
Table 3.5.1.2 Editor designs.

Editor U Editor V
Editor W Editor X
Editor Y Editor Z
Figure 3.5.1.3 Editor designs.

Note that Editor Y is the one previously presented in Section 3.4.3.

To form a complete body rotator, we must connect a half-resolution rotator with an editor. We can see that Rotator A cannot work with any editor that takes Idirect as an input, but Rotator B is compatible with all editors. As a result, there are 3 + 6 = 9 possible two-network designs.

          Editor U  Editor V  Editor W  Editor X  Editor Y  Editor Z
Rotator A    ✓         —         ✓         —         ✓         —
Rotator B    ✓         ✓         ✓         ✓         ✓         ✓
Table 3.5.1.3 Compatibility between half-resolution rotator designs and editor designs. There are 9 valid designs for the body rotator in total.

All editors were trained with the process in Section 3.4.3. Note that I had to train two copies each of Editors U, W, and Y because they each belong to two different body rotator designs: one copy was trained with Rotator A and the other with Rotator B.

Single network design. I also evaluated an architecture where a single network is responsible for performing the body rotation task end-to-end. The network has an encoder-decoder main body whose construction is similar to that in Table 3.4.2.3. However, because the input is 512×512 rather than 256×256, the encoder-decoder features one extra downsampling step in the yellow section and one extra upsampling step in the red section. The first feature tensor created from the input is of size 32×512×512 instead of 64×256×256. The feature tensor outputted by the encoder-decoder is used to warp the input image and then partially change it. The network's architecture is depicted below.

Figure 3.5.1.4 The single network rotator design.

The network was trained in two phases. In the first phase, the loss function was

$$\mathcal{L}_{\mathrm{SNR},1} = \mathbb{E}_{(I_{\mathrm{in}},\, p,\, I_{\mathrm{out}}) \sim p_{\mathrm{data}}}\Big[ \|I_{\mathrm{out}} - I_{\mathrm{final}}\|_1 \Big].$$

In the second phase, perceptual feature reconstruction losses for the whole image and the neck subimage were added, and the loss function became

$$\mathcal{L}_{\mathrm{SNR},2} = \mathbb{E}_{(I_{\mathrm{in}},\, p,\, I_{\mathrm{out}}) \sim p_{\mathrm{data}}}\Big[ \tfrac{1}{4}\,\|I_{\mathrm{out}} - I_{\mathrm{final}}\|_1 + \lambda\,\mathcal{L}_{\mathrm{percept}} + \lambda\,\mathcal{L}_{\mathrm{percept}}^{\mathrm{neck}} \Big],$$

where $\lambda = \tfrac{4 \times 256 \times 256}{5}$, and $\mathcal{L}_{\mathrm{percept}}$ and $\mathcal{L}_{\mathrm{percept}}^{\mathrm{neck}}$ are as defined in Section 3.4.3. Other settings related to the training process (optimization algorithm, batch size, learning rate, and the phase lengths) were exactly the same as those in Section 3.4.2.

Quantitative evaluation. I fed each design the test dataset and had it produce one output image per test example. I then computed the similarity between the output and the ground-truth image with three similarity metrics: the root mean square error (RMSE), the Structural Similarity (SSIM) metric [Wang et al. 2004], and the Learned Perceptual Image Patch Similarity (LPIPS) metric [Zhang et al. 2018]. I report the metric values averaged over the whole dataset in the table below.

Body Rotator Design RMSE (↓) SSIM (↑) LPIPS (↓)
Rotator A + Editor U 0.15529800 0.90527600 0.06305300
Rotator A + Editor W 0.15511400 0.90652400 0.06088700
Rotator A + Editor Y 0.16160900 0.90324800 0.05088500
Rotator B + Editor U 0.15586100 0.90644100 0.05237600
Rotator B + Editor V 0.15574700 0.90663700 0.05211400
Rotator B + Editor W 0.15550700 0.90823300 0.05051800
Rotator B + Editor X 0.15545500 0.90821200 0.05051200
Rotator B + Editor Y 0.15437500 0.90950700 0.04874800
Rotator B + Editor Z 0.15870300 0.90583800 0.05042200
Single network rotator 0.17180100 0.89646500 0.05791900
Table 3.5.1.5 Quantitative evaluation of the body rotator designs. (↓) means "lower is better," and (↑) means "higher is better."

First, we can see that the single-network design performed worse than all two-network designs according to the RMSE and SSIM metrics. Moreover, its LPIPS score is also much higher than that of the best two-network design. This result tells us that two networks are better than one.

Next, one can see that "Rotator B + Editor Y" was the best design: it achieved the best score on all three metrics. Interestingly, "Rotator B + Editor Z" has slightly more capacity than "Rotator B + Editor Y," yet it performed worse according to all metrics. An explanation for this result might be that taking the direct image as input diverted the network's attention away from how to process the warped image and its appearance flow offset.

"Rotator B + Editor Y" is also better than "Rotator A + Editor Y" on all metrics. In other words, although we discard direct images generated by Rotator B, having the rotator generate them actually leads to better inputs for the editor. This is an example of improving a task's performance by training a network to also solve related auxiliary tasks

[Ruder 2017].

Qualitative evaluation. I created a sequence of pose vectors that contains all 6 types of movements controllable by the body rotator. I then used each of the 10 designs to animate pictures of eight MMD models according to the pose vector sequence. I also converted the pose sequence into ones that were applicable to the MMD models and animated the models to create ground-truth videos. The videos, arranged side by side for comparison, are available in Figure 3.5.1.6. Another version of the videos, in which only the faces are shown, is available in Figure 3.5.1.7.

Figure 3.5.1.6 Comparison of animations generated by the 10 body rotator designs evaluated in this section. Again, for each character, the ground-truth animation was created by rendering its 3D model. The other animations were generated by using the body rotator designs to animate the first frame of the ground-truth animation.

Figure 3.5.1.7 The same animations in Figure 3.5.1.6 but with the faces being zoomed in.

From the animations, we can see the reasons why the chosen design (Rotator B + Editor Y) was clearly better than some alternatives.

Comparison against single network design. The single network design can produce black contours that are too large.

Ground truth    Single network    Rotator B + Editor Y (Section 3.4)

It can also erase thin structures too aggressively.

Ground truth    Single network    Rotator B + Editor Y (Section 3.4)

Comparison against "Rotator A" designs. "Rotator A + Editor U" and "Rotator W" can produce aliasing artifacts and remove high frequency details from the outputs. (See the character's eyes in the figure below.)

Ground truth    Rotator A + Editor U    Rotator A + Editor W    Rotator B + Editor Y (Section 3.4)

Moreover, I also observed that designs with "Rotator A" could produce very incorrect distortions.

Ground truth    Rotator B + Editor Y (Section 3.4)
Rotator A + Editor U    Rotator A + Editor W    Rotator A + Editor Y

There were, however, no major differences between the outputs of the chosen design (Rotator B + Editor Y) and other designs with "Rotator B." So, the choice of Editor Y was mainly informed by the quantitative comparison.

In conclusion, I chose the "Rotator B + Editor Y" design because its performance, as indicated by the similarity metrics, was the best. Moreover, its outputs also looked better than ones produced by the single network design and those with "Rotator A."

3.5.2   Comparison with Other Work

There are many previous works that can generate animation from a single image, and I surveyed them in detail in the write-up of my 2021 project. All of my VTuber-related projects solve the problem of parameter-based posing, where the input consists of a single image of the character and a pose vector, and the task is to pose the character accordingly. There is a related problem called motion transfer, where we are given an image or a video of a subject (the source), and we have to make another subject (the target) imitate the pose of the source.

In the 2021 project, I compared my system to those proposed by Averbuch-Elor et al. [2017] and Siarohin et al. [2019]. While I was able to show that my system outperformed them when animating anime characters, I was not comparing apples to apples because those systems solve motion transfer, but mine solves parameter-based posing. In particular, the other systems were at a disadvantage when the source character was not the same as the target character.

There were previous works on parameter-based posing, but they were either already a part of my system (such as Pumarola et al.'s work) or not convenient to compare against (such as Ververas and Zafeiriou's [2020]). Fortunately, later in 2021, Ren et al. proposed a new neural network system for parameter-based posing called PIRenderer [Ren et al. 2021]. So, in this article, I will compare my new system against it.

PIRenderer's overall design is similar to that of my body rotator. There is a network that produces a low-resolution appearance flow, and it is followed by a network that refines the resulting warped image. However, the authors seem to have been inspired by StyleGAN [Karras et al. 2019]: they use a mapping network to convert the input pose into a latent code that modulates the other networks through adaptive instance normalization (AdaIN) [Huang and Belongie 2017]. My networks, on the other hand, do not use any of these structures. Both PIRenderer and my system were trained with the perceptual losses introduced by Johnson et al. [2016]. The difference is that PIRenderer uses both the content loss and the style loss, but my system only uses the former.

While the system's source code is publicly available, I did not use it directly because it was easier for me to reimplement and adapt it to my coding framework. I also introduced the following changes.

First, to make PIRenderer compatible with my system, I made the mapping network take a single pose vector as input instead of a window of pose vectors inside an animation sequence. This change simplifies its architecture because the "center crop" unit is no longer needed. The mapping network thus became a multi-layer perceptron (MLP) that turns a 6-dimensional pose vector into a 256-dimensional latent vector. Each of its hidden layers has 256 units.

Second, to make a fair comparison between PIRenderer and my system, I adjusted PIRenderer's hyperparameters to make its networks roughly the same size as mine. In particular, I raised the maximum number of channels in each layer from 256 to 512 and then made the following adjustments.

Third, I changed PIRenderer's training process to be similar to the way I trained my networks. Training has two phases. In the first phase, the mapper and the warper were trained together for 12 epochs (6,000,000 examples). In the second phase, the mapper and the warper were frozen, and the editor was trained for 6 epochs (3,000,000 examples). Both phases used a batch size of 8, a learning rate of $10^{-4}$, and the Adam algorithm with $\beta_1 = 0.5$ and $\beta_2 = 0.999$. The main reason for not training all the networks together was that my GPU's memory could not accommodate a batch size of 8 in that setting.

Quantitative evaluation. I evaluated my system against PIRenderer with the three image similarity metrics used in the last subsection. The scores, computed with the test dataset, are given in the table below.

System RMSE (↓) SSIM (↑) LPIPS (↓)
PIRenderer 0.16772200 0.89525700 0.05257600
My system (Section 3.4) 0.15446600 0.90975400 0.04807400
Table 3.5.2.1 Quantitative comparison between PIRenderer and my system. (↓) means "lower is better," and (↑) means "higher is better."

Qualitative evaluation. I used PIRenderer and my system to generate animations of the eight 3D models used to quantitatively compare my system against alternative architectures. The results are given in Figure 3.5.2.2 (full body) and Figure 3.5.2.3 (face zoom).

Figure 3.5.2.2 Comparison between the ground truth 3D animations and videos generated by PIRenderer and my system. The characters are Kizuna AI (© Kizuna AI),
Tokino Sora (© Tokino Sora Ch.), Minato Aqua (© 2019 cover corp.), Akiyama Rentarou (© ひま食堂), Suou Patra (© HoneyStrap), Inaba Haneru (© 有閑喫茶あにまーれ), Kitakami Futaba (© Appland, Inc.), and Kagura Suzu (© Appland, Inc.).

Figure 3.5.2.3 The same animations in Figure 3.5.2.2 but with the faces being zoomed in.

Figure 3.5.2.4 lists some differences I could observe from the animations. In general, my system produced blurrier images than PIRenderer did, and it could also erase thin structures such as hair strands and ribbons while PIRenderer preserved them better. However, I preferred my system over PIRenderer because the latter could deform faces in undesirable ways while my system preserved their shapes better.

Ground truth PIRenderer My System
My system generated images that were blurrier than those generated by PIRenderer. It also tended to blur out or completely remove thin structures such as the ribbons above. On the other hand, PIRenderer's images were sharper, and thin structures were better preserved.
Ground truth PIRenderer My System
However, I observed that PIRenderer's outputs contained more aliasing artifacts (i.e., jagginess) than those of my system (which tended to be smoother and sometimes overly blurry). Aliasing is very noticeable in animation, and so I preferred my system over PIRenderer in this regard.
Ground truth PIRenderer My System
When the angles were large, PIRenderer sometimes rotated the face less accurately than my system did.
Ground truth PIRenderer My System
Moreover, PIRenderer sometimes disfigured the character's face, but I did not observe my system doing so.
Figure 3.5.2.4 Some observed qualitative differences between outputs generated by PIRenderer and my system.

A major problem with PIRenderer is its overuse of warping. This is most noticeable when a character wears clothing with a turtleneck that, seen from the front, extends upward to just below the chin. When the character turns its face up, PIRenderer uses warping to lift the chin, but the warping also drags the turtleneck up with the chin. My system, on the other hand, can use partial image change to hallucinate the neck skin that is supposed to become visible, resulting in a much more plausible output.

Figure 3.5.2.5 PIRenderer mainly transforms the input image with warping and so did not produce sensible outputs when a character wearing a turtleneck turns her face up and down. My system, on the other hand, did not exhibit this problem. The character is Aiba Uiha (© ANYCOLOR, Inc.).

In conclusion, my system was better at rotating an anime character's head and body than PIRenderer. It better preserved the head's shape, produced fewer artifacts, and could hallucinate disoccluded neck skin. On the other hand, PIRenderer would drag pixels just below the chin around when the face moved.

4   Improving Efficiency

In the last section, I proposed a new body rotator network that can animate the upper body of anime characters. However, recall from Figure 3.2.1 that it is a part of a larger system that can also modify facial expression. While the system as a whole became a little smaller (517 MB instead of 600 MB) because the editor is smaller than the combiner, it is still quite large. It also takes about 35 ms to process an image end to end on my Titan RTX graphics card. The large size makes the system impractical to deploy on mobile devices, and the processing speed would only worsen on less powerful GPUs. It is thus crucial to make the system smaller and faster in order to improve its versatility. I discuss my attempt to do so in this section.

4.1   Techniques Used

To improve my system's efficiency, I experimented with two techniques.

Depthwise separable convolutions. The technique was introduced by Sifre in 2014

[Sifre 2014], but I was particularly inspired by Howard et al.'s MobileNets paper
[Howard et al. 2017].

All networks in my system are convolutional neural networks (CNNs), meaning that their main building blocks are convolution layers. Such a layer typically takes in a tensor of size C1×H×W and convolves it with a kernel of size C1×C2×K×K in order to produce a new tensor of size C2×H×W. The time complexity of this operation is O(C1·C2·H·W·K²), and O(C1·C2·K²) space is required to store the layer's parameters.

To improve the networks' efficiency, I replaced all convolution layers in their main bodies (i.e., the encoder-decoder networks and the U-Nets) with two convolution layers that are applied in succession: a depthwise convolution, which convolves each of the C1 input channels with its own K×K kernel, and a pointwise (1×1) convolution, which mixes the resulting C1 channels into C2 output channels.
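As an illustration, the following is a minimal PyTorch sketch of such a replacement. It is not the actual module from my codebase; the class name and the channel counts in the example are arbitrary.

```python
import torch
from torch import nn

class DepthwiseSeparableConv2d(nn.Module):
    """A K×K depthwise convolution followed by a 1×1 pointwise convolution."""

    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 3):
        super().__init__()
        # Depthwise: each input channel is convolved with its own K×K kernel.
        self.depthwise = nn.Conv2d(
            in_channels, in_channels,
            kernel_size=kernel_size,
            padding=kernel_size // 2,
            groups=in_channels)
        # Pointwise: a 1×1 convolution that mixes the channels.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

# A standard 3×3 convolution with C1 = 64 and C2 = 128 has 64*128*3*3 = 73,728 weights,
# while the separable version has 64*3*3 + 64*128 = 8,768, roughly an 8.4x reduction.
standard = nn.Conv2d(64, 128, kernel_size=3, padding=1)
separable = DepthwiseSeparableConv2d(64, 128, kernel_size=3)
```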

The effect of separating a standard convolution layer into two is that the time complexity reduces to O(C1·H·W·K² + C1·C2·H·W) = O(C1·H·W·(K² + C2)) and the space complexity reduces to O(C1·K² + C1·C2) = O(C1·(K² + C2)). Thus, by employing the technique, both the time and space complexities are reduced by a factor of

    [C1·H·W·(K² + C2)] / [C1·C2·H·W·K²] = (K² + C2) / (C2·K²) = 1/C2 + 1/K².

In my networks, K is 3 most of the time, and C2 is typically at least 32. Thus, in theory, the system could become about 9 times faster and smaller.

Note that, while depthwise separable convolutions can reduce the amount of memory needed to store model parameters, they do not reduce the amount of memory needed to store data tensors while inference is taking place. More specifically, if b is the number of bytes used to represent an element of a data tensor, the input tensor needs b·C1·H·W bytes and the output tensor b·C2·H·W bytes. So, assuming that a standard convolution does not produce any intermediate tensors, it needs b·(C1 + C2)·H·W bytes to store both the input and the output. Depthwise separable convolutions, on the other hand, have to produce an intermediate tensor of size C1×H×W. As a result, the first convolution needs 2b·C1·H·W bytes, and the second b·(C1 + C2)·H·W bytes. Assuming the input is discarded once the output has been obtained, the peak memory requirement is b·[C1 + max(C1, C2)]·H·W bytes, which is clearly not smaller than b·(C1 + C2)·H·W. To conclude, the technique can reduce the disk space needed to store a model, but it may not reduce the amount of RAM used during inference.
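As a concrete (hypothetical) illustration, take b = 4 bytes, C1 = C2 = 64, and H = W = 256. A standard convolution then needs 4·(64 + 64)·256·256 ≈ 33.6 MB for its input and output, while the separable replacement also peaks at 4·[64 + max(64, 64)]·256·256 ≈ 33.6 MB, so no data-tensor memory is saved; if C2 were smaller than C1, the separable version would actually need more.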

Using "half." I implemented my network with PyTorch, which uses the 32-bit floating point type (aka float) to represent almost all data and parameters. Nevertheless, the numerical precision afforded by float might not be necessary, and so PyTorch features an option to use the 16-bit floating type (aka half) instead. While I was not sure a priori how much using half can improve processing speed, there's the obvious benefit that both the system's size and RAM usage would become two times smaller.

4.2   Results

The whole system consists of 5 subnetworks (3 from the 2021 system and 2 from Section 3.4). For each subnetwork, I created four variants based on the combination of techniques used. A variant that uses depthwise separable convolutions is designated with the word "separable;" otherwise, it is designated with the word "standard." A variant is also designated with the floating point type it uses. As a result, the variants are referred to as "standard-float," "separable-float," "standard-half," and "separable-half."

To create the variants, I trained the standard-float and separable-float models from scratch. I then created the standard-half and separable-half models from the corresponding "float" models by converting all parameters to half. Note that standard-float is the variant that receives no efficiency improvements, serving as the control group.

4.2.1   System Size

The clearest benefit of the techniques is size reduction. As predicted, using depthwise separable convolutions reduced the size by a factor of about 9, and using half cut it further in half.

Size in MB (and improvement over standard-float)

Networks                  standard-float    separable-float   standard-half     separable-half
Eyebrow segmenter         120.11 (1.00x)     12.70 (9.46x)     60.07 (2.00x)     6.36 (18.87x)
Eyebrow warper            120.32 (1.00x)     12.72 (9.46x)     60.17 (2.00x)     6.37 (18.88x)
Eye & mouth morpher       120.59 (1.00x)     12.75 (9.45x)     60.31 (2.00x)     6.39 (18.86x)
Half-resolution rotator   124.62 (1.00x)     13.69 (9.10x)     62.32 (2.00x)     6.86 (18.16x)
Editor                     31.92 (1.00x)      3.63 (8.80x)     15.97 (2.00x)     1.83 (17.45x)
Whole system              517.56 (1.00x)     55.48 (9.33x)    258.84 (2.00x)    27.82 (18.60x)
Table 4.2.1.1 The effect of the efficiency improvement techniques on the size in MB of the networks and the whole system.

4.2.2   RAM Usage

As noted earlier, making the networks smaller does not always mean that they use less memory overall. To see the techniques' impact on memory requirements, I measured the GPU RAM usage of each variant.
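One common way to take such a measurement in PyTorch is to query the CUDA allocator's peak statistics. The sketch below only illustrates this idea and is not necessarily the exact procedure I used; the function name and the input shape are placeholders.

```python
import torch

def peak_gpu_ram_mb(network: torch.nn.Module, input_shape) -> float:
    """Return the peak GPU memory (in MB) that PyTorch allocated during one inference pass."""
    network = network.to("cuda").eval()
    x = torch.zeros(*input_shape, device="cuda")
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        network(x)
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / (1024 ** 2)
```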

I conducted experiments on three computers:

  1. Computer A is a desktop PC with an Nvidia Titan RTX GPU (driver version 511.79, CUDA version 10.2.89), a 3.60 GHz Intel Core i9-9900KF CPU, and 64 GB of RAM. It represents a high-end gaming PC.
  2. Computer B is a desktop PC with an Nvidia GeForce GTX 1080 Ti GPU (driver version 456.71, CUDA version 10.2.89), a 3.70 GHz Intel Core i7-8700K CPU, and 32 GB of RAM. It represents a typical (yet somewhat outdated) gaming PC.
  3. Computer C is a laptop with an Nvidia GeForce MX250 GPU (driver version 511.65, CUDA version 11.6.2), a 1.19 GHz Intel Core i5-1035G1 CPU, and 8 GB of RAM. It represents a general PC with low processing power.

The measurement values are given in the tables below.

RAM usage in MB on Computer A (and improvement over standard-float)

Networks                  standard-float    separable-float   standard-half     separable-half
Eyebrow segmenter         158.34 (1.00x)     38.51 (4.11x)     94.68 (1.67x)    25.32 (6.25x)
Eyebrow warper            159.10 (1.00x)     40.27 (3.95x)     95.17 (1.67x)    25.71 (6.19x)
Eye & mouth morpher       152.72 (1.00x)     71.94 (2.12x)     94.10 (1.62x)    49.37 (3.09x)
Half-resolution rotator   209.98 (1.00x)    113.03 (1.86x)    121.32 (1.73x)    80.72 (2.60x)
Editor                    312.22 (1.00x)    349.10 (0.89x)    207.46 (1.50x)   240.81 (1.30x)
Whole system              816.91 (1.00x)    417.49 (1.96x)    467.39 (1.75x)   274.80 (2.97x)
Table 4.2.2.1 RAM usage in MB of the whole system and the five constituent networks. The experiments were conducted on Computer A (Nvidia Titan RTX GPU, a 3.60 GHz Intel Core i9-9900KF CPU, and 64 GB of RAM).

RAM usage in MB on Computer B (and improvement over standard-float)

Networks                  standard-float    separable-float   standard-half     separable-half
Eyebrow segmenter         158.34 (1.00x)     38.51 (4.11x)     86.17 (1.84x)    19.31 (8.20x)
Eyebrow warper            159.10 (1.00x)     40.27 (3.95x)     86.92 (1.83x)    19.70 (8.08x)
Eye & mouth morpher       152.72 (1.00x)     71.94 (2.12x)     83.75 (1.82x)    35.58 (4.29x)
Half-resolution rotator   209.98 (1.00x)    113.03 (1.86x)    104.80 (2.00x)    56.71 (3.70x)
Editor                    312.22 (1.00x)    349.10 (0.89x)    155.96 (2.00x)   176.81 (1.77x)
Whole system              816.91 (1.00x)    417.49 (1.96x)    417.39 (1.96x)   210.80 (3.88x)
Table 4.2.2.2 RAM usage in MB of the whole system and the five constituent networks. The experiments were conducted on Computer B (Nvidia GeForce GTX 1080 Ti GPU, a 3.70 GHz Intel Core i7-8700K CPU, and 32 GB of RAM).

RAM usage in MB on Computer C (and improvement over standard-float)

Networks                  standard-float    separable-float   standard-half     separable-half
Eyebrow segmenter         142.84 (1.00x)     38.54 (3.71x)     71.42 (2.00x)    19.32 (7.39x)
Eyebrow warper            143.61 (1.00x)     40.30 (3.56x)     71.81 (2.00x)    19.72 (7.28x)
Eye & mouth morpher       144.22 (1.00x)     71.98 (2.00x)     71.86 (2.01x)    35.59 (4.05x)
Half-resolution rotator   209.98 (1.00x)    113.07 (1.86x)    108.80 (1.93x)    56.33 (3.73x)
Editor                    312.22 (1.00x)    349.11 (0.89x)    156.96 (1.99x)   176.81 (1.77x)
Whole system              813.91 (1.00x)    414.08 (1.97x)    415.89 (1.96x)   209.30 (3.89x)
Table 4.2.2.3 RAM usage in MB of the whole system and the five constituent networks. The experiments were conducted on Computer C (Nvidia GeForce MX250 GPU, a 1.19 GHz Intel Core i5-1035G1 CPU, and 8 GB of RAM).

From the data, we can spot a number of trends.

Firstly, while the GPU RAM usage of a given network did not differ much between the computers, the computers with more memory tended to use slightly more RAM. This might be because PyTorch collects garbage more aggressively on computers with less available memory.

Secondly, the amount of RAM used by a network was always larger than the network's size. This is reassuring because it means that my measurement method properly took the model parameters into account.

Thirdly, using half reduced memory usage by factors close to 2 on all machines. This is consistent with the expectation that the half type should halve the amount of space needed to store everything.

Fourthly, depthwise separable convolutions reduced the RAM usage of all networks except the editor. This can be explained by the observation that the technique decreases the memory used to store model parameters by about 9 times, but it cannot decrease the memory used to store data tensors at all. Because the editor (about 32 MB) is much smaller than the other networks (about 120 MB), the reduction in its parameter size was outweighed by the extra space needed for intermediate data tensors.

To summarize, using half reduced the memory requirement under all circumstances. However, using depthwise separable convolutions was only beneficial to the large networks (i.e., all networks except the editor). All in all, using both techniques reduced the whole system's RAM usage by a factor of about 3 to 4.

4.2.3   Speed

Another metric we care about is how fast the system is. To measure processing speed, I ran the whole system and each individual network 100 times with artificial inputs (again, tensors whose values are all zeros) of batch size 1, measured the wall-clock time of each run, and then computed the average processing time

[footnote]. The numbers are available in the tables below.
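The following is a rough sketch of this timing procedure; it is a placeholder function rather than my exact benchmarking script.

```python
import time
import torch

def average_time_ms(network: torch.nn.Module, input_shape, num_runs: int = 100,
                    device: str = "cuda") -> float:
    """Run the network num_runs times on an all-zeros input and return the average wall-clock time in ms."""
    network = network.to(device).eval()
    x = torch.zeros(*input_shape, device=device)
    total = 0.0
    with torch.no_grad():
        for _ in range(num_runs):
            start = time.perf_counter()
            network(x)
            if device == "cuda":
                torch.cuda.synchronize()  # make sure the GPU has finished before reading the clock
            total += time.perf_counter() - start
    return 1000.0 * total / num_runs
```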

Average processing time in milliseconds on Computer A (and improvement over standard-float)

Networks                  standard-float    separable-float   standard-half     separable-half
Eyebrow segmenter          4.159 (1.00x)     4.792 (0.87x)     4.996 (0.83x)     5.212 (0.80x)
Eyebrow warper             4.726 (1.00x)     5.772 (0.82x)     5.450 (0.87x)     5.689 (0.83x)
Eye & mouth morpher        7.163 (1.00x)     5.223 (1.37x)     5.495 (1.30x)     5.608 (1.28x)
Half-resolution rotator    8.865 (1.00x)     6.001 (1.48x)     5.094 (1.74x)     5.605 (1.58x)
Editor                    13.699 (1.00x)    11.185 (1.22x)     7.469 (1.83x)     7.986 (1.72x)
Whole system              34.105 (1.00x)    26.777 (1.27x)    23.803 (1.43x)    24.540 (1.39x)
Table 4.2.3.1 Average processing time in milliseconds of the whole system and the five constituent networks. The experiments were conducted on Computer A (Nvidia Titan RTX GPU, a 3.60 GHz Intel Core i9-9900KF CPU, and 64 GB of RAM).

Average processing time in milliseconds on Computer B (and improvement over standard-float)

Networks                  standard-float    separable-float   standard-half     separable-half
Eyebrow segmenter          5.125 (1.00x)     5.024 (1.02x)     5.168 (0.99x)     5.064 (1.01x)
Eyebrow warper             6.717 (1.00x)     5.441 (1.23x)     5.887 (1.14x)     5.356 (1.25x)
Eye & mouth morpher        8.522 (1.00x)     6.373 (1.34x)     8.068 (1.06x)     6.014 (1.42x)
Half-resolution rotator   11.259 (1.00x)     9.246 (1.22x)    10.592 (1.06x)     8.316 (1.35x)
Editor                    18.374 (1.00x)    24.277 (0.76x)    13.670 (1.34x)    19.273 (0.95x)
Whole system              43.841 (1.00x)    46.959 (0.93x)    38.019 (1.15x)    38.848 (1.13x)
Table 4.2.3.2 Average processing time in milliseconds of the whole system and the five constituent networks. The experiments were conducted on Computer B (Nvidia GeForce GTX 1080 Ti GPU, a 3.70 GHz Intel Core i7-8700K CPU, and 32 GB of RAM).

Average processing time in milliseconds on Computer C (and improvement over standard-float)

Networks                  standard-float     separable-float    standard-half       separable-half
Eyebrow segmenter         100.705 (1.00x)     31.590 (3.19x)     121.911 (0.83x)     31.578 (3.19x)
Eyebrow warper            106.348 (1.00x)     32.270 (3.30x)     131.114 (0.81x)     32.172 (3.31x)
Eye & mouth morpher       132.273 (1.00x)     70.078 (1.89x)     240.195 (0.55x)     73.969 (1.79x)
Half-resolution rotator   211.828 (1.00x)     72.645 (2.92x)     345.056 (0.61x)     91.863 (2.31x)
Editor                    269.462 (1.00x)    157.015 (1.72x)     412.179 (0.65x)    192.638 (1.40x)
Whole system              690.751 (1.00x)    335.364 (2.06x)    1125.345 (0.61x)    385.041 (1.79x)
Table 4.2.3.3 Average processing time in milliseconds of the whole system and the five constituent networks. The experiments were conducted on Computer C (Nvidia GeForce MX250 GPU, a 1.19 GHz Intel Core i5-1035G1 CPU, and 8 GB of RAM).

While there were clear and predictable patterns in network sizes and RAM usages, changes in processing time were much less predictable.

On all machines, however, employing both techniques greatly reduced the system's memory requirement and also had a positive (though not impressive) impact on speed. Thus, one should always apply them together.

4.2.4   Visual Quality

Each of the techniques reduces the number of bits used to represent the model parameters. As a result, it is expected that the smaller variants would be less accurate. This can be confirmed by computing the similarity metrics on the test set. We can also see that, because depthwise separable convolutions reduce network sizes by a factor of about 9 while half only halves them, the former has more impact on the metrics.

Variants          RMSE (lower is better)   SSIM (higher is better)   LPIPS (lower is better)
standard-float    0.15518600               0.90928100                0.04903900
separable-float   0.16107100               0.90425300                0.05354000
standard-half     0.15551100               0.90892600                0.05008000
separable-half    0.16141600               0.90390400                0.05458700
Table 4.2.4.1 Similarity metrics between the ground truth images and the outputs produced by the 4 system variants.
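For reference, metrics like these can be computed with common libraries. The sketch below, which uses scikit-image for SSIM and the lpips package for LPIPS and assumes the images are H×W×3 arrays with values in [0, 1], is only an illustration and not necessarily the implementation behind Table 4.2.4.1.

```python
import numpy as np
import torch
import lpips
from skimage.metrics import structural_similarity  # requires scikit-image >= 0.19 for channel_axis

lpips_model = lpips.LPIPS(net="alex")  # a learned perceptual similarity metric

def to_lpips_tensor(image: np.ndarray) -> torch.Tensor:
    # LPIPS expects an N×3×H×W tensor with values in [-1, 1].
    return torch.from_numpy(image).permute(2, 0, 1).unsqueeze(0).float() * 2.0 - 1.0

def similarity_metrics(ground_truth: np.ndarray, output: np.ndarray) -> dict:
    rmse = float(np.sqrt(np.mean((ground_truth - output) ** 2)))
    ssim = structural_similarity(ground_truth, output, channel_axis=2, data_range=1.0)
    with torch.no_grad():
        lpips_value = float(lpips_model(to_lpips_tensor(ground_truth), to_lpips_tensor(output)))
    return {"RMSE": rmse, "SSIM": ssim, "LPIPS": lpips_value}
```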

I also rendered animations using the 4 variants for qualitative comparisons.

Figure 4.2.4.2 Comparison between the ground truth 3D animations and videos generated by the 4 variants of my system. The characters are Kizuna AI (© Kizuna AI), Tokino Sora (© Tokino Sora Ch.), Akiyama Rentarou (© ひま食堂), Kitakami Futaba
(© Appland, Inc.), Kiso Azuki (© Appland, Inc.), Mokota Mememe (© Appland, Inc.), and Kagura Suzu (© Appland, Inc.).

Figure 4.2.4.3 The same animations as in Figure 4.2.4.2 but with the faces zoomed in.

Variants with the same type of convolution layer produced virtually the same animations because their parameters are essentially the same, just stored at different precisions. Variants with different convolution layers, however, produced different mouth shapes: mouths produced by separable convolutions were smaller than those produced by standard ones. For Kiso Azuki, the mouth did not move at all, showing that separable convolutions can yield inaccurate results in some cases. This might be the price to pay for the significant reduction in model size.

In conclusion, the techniques I experimented with in this section, when combined, were effective at reducing the system's size and RAM usage. Additionally, they had a positive but modest impact (no more than 2x) on speed. The techniques surely make the system easier to deploy on less powerful devices, but there is still much room for improvement in processing time.

5   Applications to Drawings

As with previous versions of the system, my end goal is to animate drawings, not 3D renderings. In this section, I demonstrate how the system performs on such inputs.

5.1   Simple Character Animations

I used the system to animate 72 images of VTubers and related characters. Sixteen resulting videos are available below, and the rest can be seen in Figure 5.1.2.

Figure 5.1.1 Videos created by applying the "standard-full" variant of the system to 16 drawn characters.

Figure 5.1.2 Videos created by applying the "standard-full" variant of the system to drawings of various VTubers and associated characters.

5.2   Breathing Motion

The animations in the previous subsection do include the breathing motion, but it is hard to notice because its effect is subtle compared to the other motion types. The figure below shows the pure breathing motion of the 16 characters in Figure 5.1.1.

Figure 5.2.1 Breathing motion of 16 drawn characters. The videos were created using the "standard-full" variant of the system.

We can see that the movement is plausible most of the time. However, there are cases where the network wrongly located the chest area. For example, the skirt of the bottom-right character also moves with the chest.

5.3   Direct Manipulation of Drawings through GUI

I created a tool that allows the user to control drawings by manipulating GUI elements.

5.4   Transferring Human Motion to Characters

I also created another tool that can transfer a human's movement, captured by an iOS application called iFacialMocap, to anime characters.

5.5   Failure Cases

The above demonstrations show that my system is capable of generating good-looking animations when applied to many different characters. However, it can yield implausible results when fed inputs that deviate significantly from the training set. Because I did not change the face morpher in any way, its problems, as discussed in the 2021 write-up, still remain. These include the inability to handle unnatural skin tones, heavy makeup, "maro" eyebrows, and rare occlusions of facial features. I also observed new problems specific to the newly designed body rotator.

First, the body rotator did not seem to handle large hats well. For example, it thought that Nui Sociere's face was a part of her hat, erased the ears and tail of Millie Parfait's cat, and moved Nina Kosaka's ears in a way that was inconsistent with her head movement. These errors might be because my dataset does not contain many models wearing large hats.

Nui Sociere (© ANYCOLOR, Inc.) Millie Parfait (© ANYCOLOR, Inc.) Nina Kosaka (© ANYCOLOR, Inc.)

Second, it also had a tendency to erase thin structures around the head. While this might be acceptable for thin hair strands, the result can be very noticeable for halos and rigid ornaments that should always be present.

Ninomae Ina'nis (© Cover corp.) Ouro Kronii (© Cover corp.)

Third, due to the lack of training data, the body rotator cannot correctly deal with props such as weapons and musical instruments.

Gawr Gura (© Cover corp.) Mori Calliope (© Cover corp.)
Kazama Iroha (© Cover corp.) Rikka (© Cover corp.)

6   Related Works

This project aims to solve a variant of the image animation problem. Here, we are given an image of a character and a description of the pose that the character is supposed to take, and we must generate a new image of the same character taking the described pose. The problem can then be classified into several variants based on the nature of the pose description; these include parameter-based posing and the motion transfer problem previously mentioned in Section 3.5.2.

In my 2021 write-up, I wrote an extensive survey of previous research on the two problems. Doing such a survey again would only make this article unreasonably long. So, what I would like to do instead is to discuss new research and development that came out after I published the write-up.

6.1   Research on Parameter-Based Posing

PIRenderer, the paper I compared my system against in Section 3.5.2, was published in ICCV 2021

[Ren et al. 2021]. It stood out to me because there have been far fewer papers on parameter-based posing than on motion transfer. As previously discussed in Section 3.5.2, though, my implementation of PIRenderer performed worse than my system, and it seemed to have problems animating the neck.

I became aware of PIRenderer through the AnimeCeleb paper by Kim et al.

[2021]. It documents the authors' attempt to create a dataset of posed anime characters by rendering MikuMikuDance models in a way similar to what I did in my 2021 project. The authors then used the dataset to train PIRenderer and demonstrated that the system also worked on anime characters.

Outside of academia, IRIAM, a streaming application where users can broadcast as anime characters, released a feature where a Live2D-like 2.5D model can be created from a single image. The enabling technology was the work of my friend, Yanghua Jin, while he was employed by Preferred Networks. From the promotion video, it seems that IRIAM supports facial expression manipulation and rotation of the body and the head around the z-axis. Rotation around the y-axis, nevertheless, is limited. One great advantage of their approach is that, once a 2.5D model has been created, it can be rendered with very low computational cost on mobile devices. On the other hand, even with the efficiency improvements in Section 4, it is still very hard to deploy my networks on a smart phone.

6.2   Research on Motion Transfer

There are a number of interesting new approaches to motion transfer, especially those that discover moving parts without explicit supervision.

The paper "Motion Representations for Articulated Animation" (MRAA)

[Siarohin et al. 2021] has the same first author as the famous "First Order Motion Model for Image Animation" (FOMM)
[Siarohin et al. 2019]. MRAA seeks to address the flaws of FOMM by representing movement through changes in the principal components of the area of each body part. It also models background movement explicitly so that network resources are not wasted on it. Moreover, it proposes a way to disentangle shape from motion in order to deal better with cross-identity motion transfer.

The paper "Thin-Plate Spline Motion Model for Image Animation" by Zhao and Zhang proposes another way to represent motion of body parts

[Zhao and Zhang 2022]. Here, each part has multiple keypoints (five are used in the paper). As a part moves, the keypoints change their positions, and the motion of the part is determined by the thin-plate spline warp that results from the position changes
[Eberly 2022].

Lastly, "Structure-Aware Motion Transfer with Deformable Anchor Model" by Tao et al. proposes the deformable anchor model (DAM) as a represention for the character's movement

[Tao et al. 2022]. Like FOMM, a DAM consists of a number of keypoints, which are used to represent the movement of body parts. However, they are now called "motion anchors." It also introduces a "latent root anchor" to represent the motion of the whole body. The movement of the motion anchors is regularized to be similar to that of the latent root anchor. The idea is that, if each keypoint corresponds to a body part, this regularization should better preserve the relative positions between the parts. The paper also introduces a hierarchical version of DAM in which the anchors form a tree with the motion anchors as leaves.

Unfortunately, I have not yet tried these approaches on my dataset to see how well they work.

7   Conclusion

In this article, I have discussed my attempt to improve my animation-from-a-single-image system. By replacing two constituent networks with newly designed ones, I enabled the system to rotate the body and generate breathing motion, making its features closer to those offered by professionally-made Live2D models. The system outputs plausible animations for characters with simple designs, but it struggles on those with props not sufficiently covered by the training dataset. These include large hats, weapons, musical instruments, and other thin ornamental structures.

I also explored making the system more efficient by using depthwise separable convolutions and the "half" type. Employing both, I made the system 18 times smaller, decreased its GPU RAM usage by about 3 to 4 times, and also slightly improved its speed. While this makes it easier to deploy the system on less powerful devices, more research is needed to make it significantly faster.

8   Disclaimer

While I am an employee of Google Japan, this project is a personal hobby that I did in my free time without using Google's resources. It has nothing to do with my work, as I am a normal software engineer writing Google Maps backends for a living. Moreover, I currently do not belong to any of Google's or Alphabet's research organizations. Opinions expressed in this article are my own and not the company's. Google, though, may claim rights to the article's technical inventions.

9   Special Thanks

I would like to thank Andrew Chen, Ekapol Chuangsuwanich, Yanghua Jin, Minjun Li, Panupong Pasupat, Yingtao Tian, and Pongsakorn U-chupala for their comments.

A   Pose Parameters

The system in this article takes a 45-dimensional pose vector as input. I show the semantics of each parameter below. The character is Souya Ichika (© 774 inc.).

A.1   Eyebrow Parameters (12)

Index Name Semantics
0 eyebrow_troubled_left
1 eyebrow_troubled_right
2 eyebrow_angry_left
3 eyebrow_angry_right
4 eyebrow_lowered_left
5 eyebrow_lowered_right
6 eyebrow_raised_left
7 eyebrow_raised_right
8 eyebrow_happy_left
9 eyebrow_happy_right
10 eyebrow_serious_left
11 eyebrow_serious_right

A.2   Eye Parameters (12)

Index Name Semantics
12 eye_wink_left
13 eye_wink_right
14 eye_happy_wink_left
15 eye_happy_wink_right
16 eye_surprised_left
17 eye_surprised_right
18 eye_relaxed_left
19 eye_relaxed_right
20 eye_unimpressed_left
21 eye_unimpressed_right
22 eye_raised_lower_eyelid_left
23 eye_raised_lower_eyelid_right

A.3   Iris Parameters (4)

Index Name Semantics
24 iris_small_left
25 iris_small_right
37 iris_rotation_x
38 iris_rotation_y

A.4   Mouth Parameters (11)

Index Name Semantics
26 mouth_aaa
27 mouth_iii
28 mouth_uuu
29 mouth_eee
30 mouth_ooo
31 mouth_delta
32 mouth_lowered_corner_left
33 mouth_lowered_corner_right
34 mouth_raised_corner_left
35 mouth_raised_corner_right
36 mouth_smirk

A.5   Head Rotation Parameters (3)

Index Name Semantics
39 head_x
40 head_y
41 neck_z

A.6   Body Parameters (3)

Index Name Semantics
42 body_y
43 body_z
44 breathing
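
As an example of how these parameters might be assembled in code, the sketch below builds a 45-dimensional pose tensor using the indices from the tables above. The constant names and the assumed value ranges are mine, not part of the system's published interface.

```python
import torch

NUM_POSE_PARAMETERS = 45

# Indices taken from the tables in this appendix.
EYE_WINK_LEFT = 12
EYE_WINK_RIGHT = 13
MOUTH_AAA = 26
HEAD_X = 39
BODY_Y = 42
BREATHING = 44

# A pose with half-closed eyes, an open mouth, a slight head tilt, a small body turn,
# and the breathing cycle at 70%. (Assumed ranges: roughly [0, 1] for expression and
# breathing parameters, [-1, 1] for rotation parameters.)
pose = torch.zeros(1, NUM_POSE_PARAMETERS)
pose[0, EYE_WINK_LEFT] = 0.5
pose[0, EYE_WINK_RIGHT] = 0.5
pose[0, MOUTH_AAA] = 1.0
pose[0, HEAD_X] = 0.3
pose[0, BODY_Y] = -0.2
pose[0, BREATHING] = 0.7
```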

Update History
