The virtual YouTubers above are mostly affiliated with ANYCOLOR, Inc., Cover Corp., 774 inc., and Noripro; the rest are independent. Copyrights of the images belong to their respective owners.
Abstract. I present the third iteration of a neural network system that can animate an anime character, given only a single image of it. While the previous iteration could only animate the head, this iteration can animate the upper body as well. In particular, I added the following three new types of movement.
Body rotation around the y-axis
Body rotation around the z-axis
Breathing
With the new system, I updated an existing tool of mine that can transfer human movement to anime characters. The following is what it looks like with the expanded capabilities.
I also experimented with making the system smaller and faster. I was able to significantly reduce its memory requirements (an 18x reduction in size and a 3-4x reduction in RAM usage) and make it slightly faster, while incurring little deterioration in image quality.
Since 2018, I have been a fan of virtual YouTubers (VTubers). In fact, I like them so much that, starting in 2019, I have been working on two personal AI projects whose aim is to make it much easier to become a VTuber. In the 2021 version of the project, I created a neural network system that can animate the face of any existing anime character, given only its head shot as input. My system lets users animate characters without having to create controllable models beforehand (either 3D models made with software such as 3ds Max, Maya, or Blender, or 2.5D ones made with software such as Live2D or Spine). It has the potential to greatly reduce the cost of avatar creation and character animation.
While my system can rotate the face and generate rich facial expressions, it is still far from practical as a streaming and/or content creation tool. One reason is that all movement is limited to the head. A typical VTuber model, however, can rotate the upper body to some extent. It also features a breathing motion in which the character's chest or the entire upper body would rhythmically wobble up and down even if the human performer is not actively controlling the character.
The system also has another major problem: it is resource intensive. It is about 600 MB in size and requires a powerful desktop GPU to run. In order to enable usage on less powerful computers, I must optimize the system's size, memory usage, and speed.
In this article, I report my attempt to address the above two problems.
For the problem of upper body movement, I extended my latest system by adding three types of movement: rotation of the hip around the y-axis, rotation of the hip around the z-axis, and breathing.
For the problem of high resource requirements, I experimented with two techniques to optimize my neural networks. The first is using depthwise separable convolutions, and the second is using the 16-bit floating point ("half") type.
I created a deep neural network system whose purpose was to animate the head of an anime character. The system takes as input (1) an image of the character's head in the front-facing configuration with its eyes wide open and (2) a 42-dimensional vector, called the pose vector, that specifies the character's desired pose. It then outputs another image of the same character, posed accordingly. The system can rotate the character's face by up to
The system is largely decomposed into two main subnetworks. The face morpher is tasked with changing the character's facial expression, and its design is documented in the write-up of the 2021 project. The face rotator is tasked with rotating the face, and its design is available in the write-up of the 2019 project. Figure 2.1 illustrates how the networks are put together.
Figure 2.1 An overview of my neural network system.
For this article, the face rotator is especially relevant because it is the network that I have to redesign in order to expand the system's capability. The network itself is made up of two subnetworks. The two-algorithm face rotator uses two image transformation techniques to generate images of the character's rotated face, and the combiner merges the two generated images to create the final output.
Figure 2.2 The face rotator.
The two image transformation techniques are:
When the face is rotated by a few degrees, most changes to the input image can be thought of as moving existing pixels to new locations. Warping can thus handle these changes very well, and the generated image is sharp because existing pixels are faithfully reproduced. Nevertheless, warping cannot generate new pixels, which are needed when unseen parts of the head become visible after rotation. Partial image change, on the other hand, can generate new pixels from scratch, but they tend to be blurry. By combining both approaches, the face rotator gets the best of both worlds: sharp warped pixels where they suffice, and newly generated pixels where they are needed.
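Conceptually, the merge can be pictured as alpha-blending the sharp-but-incomplete warped image with the blurry-but-complete generated one using a predicted mask. The sketch below only illustrates that idea; the tensor names and shapes are made up, and the actual combiner is a trained network rather than a fixed blend.

```python
import torch

def blend(warped: torch.Tensor, generated: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
    """Blend two candidate RGBA images with a mask.

    warped, generated: [B, 4, H, W] images in [0, 1].
    alpha:             [B, 1, H, W] mask; 1 keeps the warped pixel, 0 takes the generated one.
    """
    return alpha * warped + (1.0 - alpha) * generated

# Dummy tensors standing in for the two subnetworks' outputs.
output = blend(torch.rand(1, 4, 256, 256), torch.rand(1, 4, 256, 256), torch.rand(1, 1, 256, 256))
```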
In this large section, I discuss how I extended my 2021 system so that it can move the body as well. I will start by defining exactly the problem I would like to solve (Section 3.1). Then, I will give a brief overview of the whole system. In particular, I will discuss which networks from the previous projects I reused and which I created anew (Section 3.2). Next, I will elaborate on how I generated the datasets to train the new networks (Section 3.3). I will then describe the networks' architectures and training procedures (Section 3.4), and lastly I will evaluate the networks' performance (Section 3.5).
As with the previous version of the system, the new version in this article takes as input an image of a character and a pose vector. The image is now of resolution
Figure 3.1.1 A valid input to the new version of the system. The character is Kizuna AI (© Kizuna AI).
The character's eyes must be wide open, but the mouth can be either completely closed or wide open. However, while the character's head must be front facing in the old version, the new version relaxes this constraint: the head and the body can be rotated by a few degrees. The arms can be posed rather freely, but they should generally be below and far from the face. Allowing these variations makes the system more versatile because it is hard to find images of anime characters in the wild whose face is perfectly front facing and whose body is perfectly upright.
Figure 3.1.2 Examples of valid input images to the system.
The input image must have an alpha channel. Moreover, for every pixel that is not a part of the character, the RGBA value must be
The pose vector now has 45 dimensions, and you can see the semantics of each parameter in Appendix A. 42 parameters have mostly the same semantics as those in the last version of the system. The only changes from the old version are the ranges of the parameters for head rotation around the
There are three new parameters, and they control the body.
With the above three parameters, it becomes possible to move a character's upper body like how typical VTubers move theirs.
Note that I previously mentioned that I reduced the range of the head rotation around the
Lastly, let us recall the output's specification. After being given an image of a character and a pose vector, the system must produce a new image of the same character, posed according to the pose vector.
Figure 3.2.1 gives an overview of the new version of the system. It is similar to the old one (Figure 2.1), but it now deals with the upper body instead of only the face. It still has two steps, and the first step still modifies the character's facial expression. For this step, I reuse the face morpher network that is the centerpiece of the previous year's project. The second step must not only rotate both the face and the body but also make the character breathe, so the old face rotator from 2019 cannot be used. The network for the second step is now called the body rotator, and it must be designed and trained anew.
Figure 3.2.1 An overview of the new version of the system.
We must now prepare datasets to train the body rotator. Continuing the practice I adopted in previous projects, I created them by rendering 3D models made for the animation software MikuMikuDance (MMD), reusing a collection of around 8,000 MMD models that I manually downloaded and annotated. Details on how I created the collection can be found here and here.
A dataset's format must follow the specification of the body rotator's input and output. In particular, the input consists of two objects. One is a
One main difference between the body rotator and the face rotator from the 2021 project is the character's body pose in the input image. For the face rotator, the character must be in the "rest" pose. In other words, the face must be looking forward and must not be tilted or rotated sideways. Moreover, the body must be perfectly upright. The arms must stretch straight sideways and point diagonally downward. (See Figure 3.3.1.1.) On the other hand, as stated in Section 3.1, the new body rotator must be able to accept variations in the initial pose like in Figure 3.1.2.
Figure 3.3.1.1 The MMD model of Kizuna AI in the rest pose.
This requirement makes data generation harder. For the old version, I only had to render an MMD model without posing it because MMD modelers almost always create their models in the rest pose to begin with. The new version, in contrast, requires an MMD model to be posed twice: it must take a non-rest pose in the input image, and then that pose must be altered according to the pose vector in the output image.
One must then figure out what poses to use in the input images, and my answer is to use poses shared by the MMD community. I downloaded pose data in the VPD format, created specifically for MMD models, from websites such as Nico Nico and BowlRoll, and ended up collecting 4,731 poses in total. However, a pose may not be usable for several reasons.
I created a tool that allowed me to manually classify whether a pose is usable or not through visual inspection. With it, I identified about 832 usable poses (a yield of 19.1%). You can see the tool in action in the video below.
One way to specify the pose in the input image is to uniformly sample a usable pose from the collection above. However, I felt that using just 832 poses was not diverse enough, so I augmented the sampled pose further. After sampling a pose from the collection, I sample a "rest pose" by sampling the angle the arms should make with the
Figure 3.3.1.2 The process to sample a pose to be used in the input image.
Note that a pose of an MMD model is a collection of two types of values.
Blending two poses together thus involves interpolating the above values. More specifically, we perform linear interpolation on the blendshape weights, and spherical linear interpolation (slerp) on the quaternions.
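Below is a minimal sketch of this blending operation with NumPy; the dictionary-based pose layout is only an illustration, not the actual data structure I use.

```python
import numpy as np

def slerp(q0: np.ndarray, q1: np.ndarray, t: float) -> np.ndarray:
    """Spherical linear interpolation between two unit quaternions."""
    q0, q1 = q0 / np.linalg.norm(q0), q1 / np.linalg.norm(q1)
    dot = float(np.dot(q0, q1))
    if dot < 0.0:                       # take the shorter arc
        q1, dot = -q1, -dot
    if dot > 0.9995:                    # nearly identical: fall back to lerp
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)

def blend_poses(pose_a: dict, pose_b: dict, t: float) -> dict:
    """Lerp the blendshape weights and slerp the per-bone rotation quaternions."""
    return {
        "morphs": {name: (1 - t) * pose_a["morphs"][name] + t * pose_b["morphs"][name]
                   for name in pose_a["morphs"]},
        "bones": {name: slerp(pose_a["bones"][name], pose_b["bones"][name], t)
                  for name in pose_a["bones"]},
    }
```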
In order to generate training examples, one must determine what each component of the pose vector means in terms of MMD models. For example, when the breathing parameter is, say,
Let us start with the one that is the easiest to describe: the hip
For the semantics of the hip
The semantics of the breathing parameter is more involved because most MMD models do not have bones or morphs specifically for controlling breathing. As a result, I had to define what breathing means on my own, and I chose to modify the translation parameters of 5 bones in the chest area.
Figure 3.3.2.2.1 Bones modified to enact the breathing motion.
When we inhale, our lungs expand, pushing our chest both outward and upward. To simulate this movement, I set the translation parameter of the "upper body" bone to the vector
However, this also has the side effect of making the head and the shoulders move diagonally back and forth, whereas, when we breathe, our head and shoulders rarely move. To keep them stationary, I also set the translation parameters of the three remaining bones (i.e., the left shoulder, the right shoulder, and the neck) to
The maximum displacement
When the model is viewed from the front, we can see that the chest moves up and down while the head and the shoulders remain stationary. This movement gives the impression that the character is breathing, as we wanted.
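A sketch of how the breathing parameter could be mapped to bone translations in the spirit described above is given below. The displacement vector, its magnitude, and the exact counter-translations applied to the neck and shoulders are placeholders of my own; only the overall idea (move the chest, cancel the movement at the neck and shoulders) comes from the text.

```python
import numpy as np

def breathing_translations(breathing: float, max_disp: np.ndarray) -> dict:
    """Map the breathing parameter in [0, 1] to per-bone translation vectors.

    max_disp is the chest displacement at full inhalation; the value used in the
    actual dataset is not reproduced here.
    """
    d = breathing * max_disp
    return {
        "upper body": d,         # chest moves outward and upward as we inhale
        "neck": -d,              # counter-translations keep the head and
        "left shoulder": -d,     # shoulders (approximately) stationary
        "right shoulder": -d,
    }

# Placeholder displacement pointing upward and slightly forward.
print(breathing_translations(0.5, np.array([0.0, 0.4, -0.2])))
```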
There are three head rotation parameters:
There are no changes to the bones and the axes above. However, I changed how the parameters affect the model's shape.
Typically, when a bone of a 3D model is rotated, bones that are children of that bone and vertices that these bones influence also move. For example, when one rotates the neck bone, vertices on the neck, the whole head, and the whole hair mass rotate with it, as can be seen in the video below.
Figure 3.3.2.3.1 The typical result of rotating the neck bone around the
This behavior, however, makes it very hard for a neural network to animate characters with long hair. First, it must identify correctly which part of the input image is the hair and which part is the body and the arms that are in front of it. This is a hard segmentation problem that must be done 100% correctly. Otherwise, disturbing errors such as severed body parts or clothing might show up. Second, as the hair mass moves, parts that were occluded by the body can become visible, and the network must hallucinate these parts correctly. Note that these difficulties do not exist in the previous version of my system because it could only animate headshots. We cannot see long hair in the input to begin with!
Figure 3.3.2.3.2 To generate the video on the right, I used a model to animate the image on the left, but the model was trained on a dataset where the whole hair mass moves with the head, as in Figure 3.3.2.3.1. In the bottom half of the video, we can see that the details of the hair strands are lost. Moreover, the model seemed to think that the character's hands were a part of the hair, so it cut the fingers off when the hair moved. The character is Enomiya Milk (© Noripro).
While it may be possible to solve the above problems with current machine learning techniques, I realized that it was much easier to avoid them and still generate plausible and beautiful animations. The difficulty in our case comes from long-range dependency: a small movement of the head leads to large movements far away. The situation becomes much easier if hair pixels far from the head are kept stationary.
I thus modified the skinning algorithm for MMD models so that the neck and the head bones can only influence vertices that are not too far below the vertical position of the neck. The new algorithm's effect can be seen in the following video.
Figure 3.3.2.3.3 Hair movement after limiting the influence of the neck and head bones. We can see that the hair mass below the shoulders does not move at all, and this makes the system's job much easier.
To recap, the head rotation parameters still correspond to rotating the same bones around the same axes. However, the influence of these bones is limited to vertices that are not too far below the neck, so that head movement cannot cause large movements elsewhere. This greatly simplifies animating characters with very long hair, which are quite common in illustrations in the wild.
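The sketch below shows one way to express this restriction; the soft falloff band and its width are my own assumptions rather than the exact rule used for the dataset.

```python
import numpy as np

def limit_head_bone_weights(weights: np.ndarray, vertex_y: np.ndarray,
                            neck_y: float, falloff: float = 0.1) -> np.ndarray:
    """Attenuate a neck/head bone's skinning weights for vertices far below the neck.

    weights:  [N] skinning weights of the bone for N vertices.
    vertex_y: [N] vertical positions of the vertices.
    neck_y:   vertical position of the neck bone.
    In practice, the weight removed here would be reassigned (e.g., to a parent
    body bone) so that each vertex's weights still sum to one.
    """
    # 1 at and above the neck, ramping down to 0 over the falloff band below it.
    attenuation = np.clip((vertex_y - (neck_y - falloff)) / falloff, 0.0, 1.0)
    return weights * attenuation
```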
Character illustrations in the wild often depict shadows cast by the head on the neck. We can clearly see that the skin just below the chin is often much darker than the face.
Figure 3.3.3.1 Three drawn characters with neck shadows. The characters are Fushimi Gaku, Hayase Sou, and Honma Himawari. They are © ANYCOLOR, Inc.
However, in 3D models, the neck and face skin often have exactly the same tone. Thus, a neck shadow would be absent if a model is rendered without shadow mapping or other shadow-producing techniques. I chose not to implement such an algorithm because it would require much effort and would greatly complicate my data generation pipeline. As a result, my previous datasets do not have neck shadows and are quite different from real-world data.
When a character turns its face upward, the area of the neck previously occluded by the chin becomes visible, and the network must hallucinate the pixels there. Ideally, if a neck shadow is present, the hallucinated pixels should have the same color as the surrounding shadow. However, training a neural network with my previous datasets can lead to a problem where these pixels are brighter than the surrounding shadow, because using the face skin's color is perfectly fine during training. The figure below shows two such failure cases.
Figure 3.3.3.2 Failure cases in which hallucinated neck pixels are brighter than the surrounding neck shadows. The characters are Suzuhara Lulu (top) and Ex Albio (bottom). They are © ANYCOLOR, Inc.
To alleviate the problem without having to implement a full-blown shadow algorithm, I simulated neck shadows by simply rendering the neck under a different lighting configuration from the rest of the body. As in the previous versions of the project, two light sources are present in the scene. As such, when implementing the fragment shader of my renderer, I only had to condition their intensities on whether the fragment being shaded belongs to the neck or not. The result of this rendering method can be seen in the figure below.
Figure 3.3.3.3 An MMD model rendered (a) conventionally and (b) with simulated neck shadow. The character is Yamakaze from the game Kantai Collection. The 3D model was created by cham.
When generating training examples, we must then provide two sets of lighting intensities so that one can be used to render the body, and the other can be used to render the neck. In the dataset I generated, I sampled the intensities so that the following properties hold:
The sampling method above, I believe, would allow the network to deal with the variety of character illustrations in the wild.
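As an illustration, per-example lighting could be sampled along the following lines; the ranges and the exact constraints are assumptions of mine, not the actual properties used to generate the dataset.

```python
import random

def sample_lighting() -> dict:
    """Sample one lighting configuration for the body and one for the neck.

    Keeping the neck's lights no brighter than the body's allows a neck shadow
    to appear; when the two are nearly equal, the result mimics illustrations
    that have no neck shadow at all.
    """
    body = [random.uniform(0.5, 1.0) for _ in range(2)]      # two light sources
    neck = [b * random.uniform(0.4, 1.0) for b in body]      # never brighter than the body
    return {"body": body, "neck": neck}
```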
A dataset is a collection of training examples, and a training example in our case is a triple
Figure 3.3.4.1 A training example.
The process of generating the above data items is rather involved. It requires sampling an MMD model, an input pose as in Section 3.3.1, a 6-dimensional pose vector
Listing 3.3.4.2 Algorithm for generating a training example.
I followed the same dataset generation process as in the previous versions of the project. Before data generation, I divided the models I downloaded into three groups according to their source materials (i.e., which anime, manga, or games they came from) so that no two groups had models of the same character. I then used the three groups to generate the training, validation, and test datasets. The number of training examples and the number of models used to generate them are given in the table below.
 | Training set | Validation set | Test set
---|---|---|---
# models used | 7,827 | 79 | 68
# examples | 500,000 | 10,000 | 10,000
I have just described the datasets for training the body rotator. In this section, I turn to the network's architecture and training process.
In my first attempt to design the body rotator, I reused the face rotator's architecture. There would be two subnetworks. The first one would produce two outputs, and the second one would then combine them into the final output image.
Figure 3.4.1.1 The face rotator's architecture. (This figure is the same as Figure 2.2. It is reproduced here for the reader's convenience.)
The difficulty, however, is the input's size: a
My strategy, then, is to scale the input image down to
Note that the editor is the only network that operates on full-resolution images, but it can afford to have lower capacity per input pixel because its task is much easier than that of the half-resolution rotator. We will see later that this design keeps the body rotator speedy enough for real-time applications despite the fact that the input is now 4 times larger.
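The data flow can be sketched as follows; the function signatures of the two networks and the resampling settings are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def body_rotator_forward(image, pose, half_res_rotator, editor):
    """Two-stage forward pass: rotate at half resolution, then edit at full resolution.

    image: [B, 4, H, W] full-resolution RGBA input.
    pose:  [B, P] pose parameters for the body rotator.
    """
    # Stage 1: run the half-resolution rotator on a downscaled copy.
    half = F.interpolate(image, scale_factor=0.5, mode="bilinear", align_corners=False)
    warped_half, flow_half = half_res_rotator(half, pose)

    # Upsample its outputs back to full resolution.
    warped = F.interpolate(warped_half, scale_factor=2.0, mode="bilinear", align_corners=False)
    flow = F.interpolate(flow_half, scale_factor=2.0, mode="bilinear", align_corners=False)

    # Stage 2: the editor refines the result using the original full-resolution image.
    return editor(image, warped, flow, pose)

# Dummy callables standing in for the real networks, just to exercise the function.
rot = lambda x, p: (x, torch.zeros(x.shape[0], 2, x.shape[2], x.shape[3]))
ed = lambda img, w, f, p: w
out = body_rotator_forward(torch.rand(1, 4, 512, 512), torch.rand(1, 6), rot, ed)
```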
The two networks do not follow the old design exactly. Like the two-algorithm face rotator, the half-resolution rotator still uses two image transformation techniques to produce output images, but they are not the same as the old ones.
The editor is similar to the combiner. However, instead of taking in outputs from both image transformation techniques of the half-resolution rotator, it only takes those from the warping one. The image created by direct generation is always discarded.
Figure 3.4.1.2 The overall architecture of the body rotator.
While direct generation seems wasteful and superfluous, it serves as an auxiliary task at training time, and I found that it improved the whole pipeline's performance. This counterintuitive design is a result of evaluating many design alternatives and choosing the best one. (More on this later in Section 3.5.)
I will now discuss each of the subnetworks in more detail.
The half-resolution rotator's architecture is derived from that of the two-algorithm face rotator from my 2019 project. So, it is built with components that I previously used. These include the image features (alpha mask, image change, and appearance flow offset), the image transformation units (partial image change unit, combining unit, and warping unit), and various units such as
Figure 3.4.2.1 An overview of the half-resolution rotator's architecture.
From the above figure, the half-resolution rotator has an encoder-decoder main body, which takes in a
Figure 3.4.2.2 The direct generation unit.
The outputs of these units are treated as the outputs of the half-resolution rotator.
Recall that, in the two-algorithm face rotator of the 2019 project, the partial image change unit is used because, in the 2019 problem specification, the body does not move at all, so the network only has to change pixels belonging to the head. However, for the current problem specification, if any of the parameters that control the hip rotation is not zero, then every pixel would change. As a result, it becomes more economical to generate all output pixels directly, and so I replaced partial image change with direct generation.
The specification of the encoder-decoder network is given in the table below.
Table 3.4.2.3 Specification of the encoder-decoder network that is the main body of the half-resolution rotator.
The encoder-decoder network above is an upgraded version of the one used in my 2021 project. Differences from the old design include:
Augustus Odena, Vincent Dumoulin, and Chris Olah. Deconvolution and Checkerboard Artifacts. Distill. 2016. [WWW]
The units used to build the network are largely the same, but the semantics of some have slightly changed, and I also introduced a number of new ones.
The half-resolution rotator is about 128 MB in size.
Training procedure. I trained the half-resolution rotator using a process similar to that of the two-algorithm face rotator. Training is divided into two phases. In the first phase, the loss function was the sum of the L1 norms of the differences between the ground truth and the two generated images (the warped image and the direct image).
The reader may notice the imbalance between the weights. The L1 losses,
I trained the network with the Adam algorithm, setting
Recall from Figure 3.4.1.2 that the outputs of the half-resolution rotator are scaled up by a factor of 2. Then, they are fed to the editor along with the original
Unlike the half-resolution rotator, the editor's main body is a U-Net
After being fed the all the inputs, the main body produces a feature tensor, which is then fed to a number of image processing steps, leading to the final output. The steps are:
In other words, the editor further modifies the appearance flow offset created by the half-resolution rotator. Ideally, it should add high-frequency details that the rotator could not generate. The editor then "retouches" the warped image generated by the improved appearance flow offset through partial image changes. The whole process is summarized in the figure below.
Figure 3.4.3.1 An overview of the editor's architecture.
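As a sketch in code, these steps can be written as follows; the three output heads are passed in as callables because their exact form is not spelled out here, and the coordinate convention of the flow offset is an assumption.

```python
import torch
import torch.nn.functional as F

def editor_output(features, original, coarse_flow, delta_flow_head, change_head, alpha_head):
    """Illustrative final steps of the editor.

    original:    [B, 4, H, W] full-resolution input image.
    coarse_flow: [B, 2, H, W] upsampled appearance flow offset from the
                 half-resolution rotator, in normalized [-1, 1] coordinates.
    """
    # Refine the appearance flow offset with (ideally) high-frequency details.
    flow = coarse_flow + delta_flow_head(features)

    # Warp the original image with the refined offset (added to an identity grid).
    B, _, H, W = original.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).unsqueeze(0) + flow.permute(0, 2, 3, 1)
    warped = F.grid_sample(original, grid, mode="bilinear", align_corners=False)

    # "Retouch" the warped image through a partial image change.
    alpha = torch.sigmoid(alpha_head(features))   # where to keep warped pixels
    change = change_head(features)                # newly generated pixels
    return alpha * warped + (1.0 - alpha) * change
```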
The specification of the U-Net is given in the table below.
Table 3.4.3.2 Specification of the U-Net network that is the main body of the editor.
Let us note that the U-Net has lower capacity per input pixel than the encoder-decoder of the half-resolution rotator (Table 3.4.2.3). This can be seen from the number of channels of
Recall that the time complexity of a convolution on a
The editor is about 33 MB in size, which is about one fourth the size of the half-resolution rotator. This is because a convolution unit that operates on a
Training procedure. I used a loss function with 4 terms.
Here,
The
Figure 3.4.3.3 The neck subimage that is used to evaluate
Let us denote the neck subimage of
The coefficients of the terms were
Training the editor requires the half-resolution rotator because we have to use it to generate two of the editor's inputs. However, I froze its parameters so that only the editor's parameters were updated during training. Again, I used the Adam algorithm with
The design of the body rotator I presented in the last section is rather counterintuitive: the half-resolution rotator has an output that is always discarded. I came to this design by picking the best one out of many variations.
I evaluated 2 designs for the half-resolution rotator and 6 designs for the editor. I also considered a design where a standalone network performs the whole body rotation task. Because some half-resolution rotator designs are not compatible with certain editor designs, there were 10 valid variations in total.
Half-resolution rotator designs. Of course, one of the designs is the one I presented in Section 3.4.2. The other design is a variation of that design in which the direct generation branch is removed. Let us refer to the simpler design as "Rotator A," and to the design in Section 3.4.2 as "Rotator B."
Rotator A | Rotator B
Rotator B was trained with the process described in Section 3.4.2. Rotator A's training process was similar; the only changes were to the loss functions. The first phase's loss function was
Editor designs. I considered 6 designs for the editor. All of them have the same U-Net (Table 3.4.3.2) as their main body, but they differ in what inputs they take in and how they generate the final output image.
As for the inputs, there are 5 data items that the editor can take in.
So, an editor's inputs must form a subset of
In other words, all editors must take in
As for ways to generate the final output image
Note that Approach
Name | Inputs | How to Generate
---|---|---
Editor U | | Approach
Editor V | | Approach
Editor W | | Approach
Editor X | | Approach
Editor Y | | Approach
Editor Z | | Approach
Editor U | Editor V
Editor W | Editor X
Editor Y | Editor Z
Note that Editor Y is the one previously presented in Section 3.4.3.
To form a complete body rotator, we must connect a half-resolution rotator with an editor. We can see that Rotator A cannot work with any editor that takes
 | Editor U | Editor V | Editor W | Editor X | Editor Y | Editor Z
---|---|---|---|---|---|---
Rotator A | 〇 | ✕ | 〇 | ✕ | 〇 | ✕
Rotator B | 〇 | 〇 | 〇 | 〇 | 〇 | 〇
All editors were trained with the process in Section 3.4.3. Note that I had to train two copies each of Editors U, W, and Y because each of them belongs to two different body rotator designs: one copy was trained with Rotator A and the other with Rotator B.
Single network design. I also evaluated an architecture where a single network is responsible for performing the body rotation task end-to-end. The network has an encoder-decoder main body whose construction is similar to that in Table 3.4.2.3. However, because the input is
Figure 3.5.1.4 The single network rotator design.
The network was trained in two phases. In the first phase, the loss function was
Quantitative evaluation. I fed each design the test dataset and had it produce one output image per test example. I then computed the similarity between the output and the ground truth image with three similarity metrics: the root mean square error (RMSE), the Structural Similarity (SSIM) metric
Zhou Wang, Alan Conrad Bovik, Hamid Rahim Sheikh, and Eero P. Simoncelli. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Transactions on Image Processing, Vol. 13, No 4, April 2004. [Paper]
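For reference, the three metrics can be computed with standard libraries roughly as follows; whether these exact implementations and settings match the ones behind the numbers below is not guaranteed.

```python
import numpy as np
import torch
import lpips                                         # pip install lpips
from skimage.metrics import structural_similarity

lpips_fn = lpips.LPIPS(net="alex")                   # backbone choice is an assumption

def similarity_metrics(output: np.ndarray, truth: np.ndarray) -> dict:
    """Compare two H x W x 3 float images with values in [0, 1]."""
    rmse = float(np.sqrt(np.mean((output - truth) ** 2)))
    ssim = float(structural_similarity(output, truth, channel_axis=-1, data_range=1.0))
    to_tensor = lambda x: torch.from_numpy(x).permute(2, 0, 1).unsqueeze(0).float() * 2 - 1
    with torch.no_grad():                            # LPIPS expects [N, 3, H, W] in [-1, 1]
        lp = float(lpips_fn(to_tensor(output), to_tensor(truth)))
    return {"RMSE": rmse, "SSIM": ssim, "LPIPS": lp}
```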
Body Rotator Design | RMSE | SSIM | LPIPS
---|---|---|---
Rotator A + Editor U | 0.15529800 | 0.90527600 | 0.06305300 |
Rotator A + Editor W | 0.15511400 | 0.90652400 | 0.06088700 |
Rotator A + Editor Y | 0.16160900 | 0.90324800 | 0.05088500 |
Rotator B + Editor U | 0.15586100 | 0.90644100 | 0.05237600 |
Rotator B + Editor V | 0.15574700 | 0.90663700 | 0.05211400 |
Rotator B + Editor W | 0.15550700 | 0.90823300 | 0.05051800 |
Rotator B + Editor X | 0.15545500 | 0.90821200 | 0.05051200 |
Rotator B + Editor Y | 0.15437500 | 0.90950700 | 0.04874800 |
Rotator B + Editor Z | 0.15870300 | 0.90583800 | 0.05042200 |
Single network rotator | 0.17180100 | 0.89646500 | 0.05791900 |
First, we can see that the single network design performed worse than all two-network designs according to the RMSE and SSIM metrics. Moreover, its LPIPS score is also much higher than that of the best two-network design. This result tells us that two networks are better than one.
Next, one can see that "Rotator B + Editor Y" was the best design: it achieved the best score on all three metrics. Interestingly, "Rotator B + Editor Z" has slightly more capacity than "Rotator B + Editor Y," yet it performed worse according to all metrics. An explanation for this result might be that taking the direct image as input actually diverted the network's attention away from how to process the warped image and its appearance flow offset.
"Rotator B + Editor Y" is also better than "Rotator A + Editor Y" on all metrics. In other words, although we discard direct images generated by Rotator B, having the rotator generate them actually leads to better inputs for the editor. This is an example of improving a task's performance by training a network to also solve related auxiliary tasks
Qualitative evaluation. I created a sequence of pose vectors that contains all 6 types of movements controllable by the body rotator. I then used the designs to animate pictures of eight MMD models according to the pose vector sequence to render 10 videos. I also converted the pose sequence into ones that were applicable to the MMD models and animated them to create ground truth videos. The videos, arranged side by side for comparison, are available in Figure 3.5.1.6. Another version of the videos, where only the faces are shown, is available in Figure 3.5.1.7.
From the animations, we can see the reasons why the chosen design (Rotator B + Editor Y) was clearly better than some alternatives.
Comparison against single network design. The single network design can produce black contours that are too large.
Ground truth | Single network | Rotator B + Editor Y (Section 3.4) |
It can also erase thin structures too aggressively.
Ground truth | Single network | Rotator B + Editor Y (Section 3.4) |
Comparison against "Rotator A" designs. "Rotator A + Editor U" and "Rotator W" can produce aliasing artifacts and remove high frequency details from the outputs. (See the character's eyes in the figure below.)
Ground truth | Rotator A + Editor U | Rotator A + Editor W | Rotator B + Editor Y (Section 3.4) |
Moreover, I also observed that designs with "Rotator A" could produce very incorrect distortions.
Ground truth | Rotator B + Editor Y (Section 3.4) |
Rotator A + Editor U | Rotator A + Editor W | Rotator A + Editor Y |
There were, however, no major differences between the outputs of the chosen design (Rotator B + Editor Y) and other designs with "Rotator B." So, the choice of Editor Y was mainly informed by the quantitative comparison.
In conclusion, I chose the "Rotator B + Editor Y" design because its performance, as indicated by the similarity metrics, was the best. Moreover, its outputs also looked better than ones produced by the single network design and those with "Rotator A."
There are many previous works that can generate animation from a single image, and I have surveyed them in detail in the write-up of my 2021 project. All of my VTuber-related projects solve the problem of parameter-based posing, where the input consists of a single image of the character and a pose vector, and the task is to pose the character accordingly. There is a related problem of motion transfer, where we are given an image or a video of a subject (the source), and we have to make another subject (the target) imitate the pose of the source.
In the 2021 project, I compared my system to those proposed by Averbuch-Elor et al.
There were previous works on parameter-based posing, but they were either already a part of my system (such as Pumarola et al.'s work) or not convenient to compare against (such as Ververas and Zafeiriou's
PIRenderer's overall design is similar to that of my body rotator. There is a network that produces a low-resolution appearance flow, followed by a network that tries to refine the resulting warped image. However, the authors seem to be inspired by StyleGAN
While the system's source code is publicly available, I did not use it directly because it was easier for me to reimplement and adapt it to my coding framework. I also introduced the following changes.
First, to make PIRenderer compatible with my system, I made the mapping network take a single pose vector as input instead of a window of pose vectors inside an animation sequence. This change simplifies its architecture because the "center crop" unit is no longer needed. The mapping network thus became a multi-layer perceptron (MLP) that turns a 6-dimensional pose vector into a 256-dimensional latent vector. Each of its hidden layers has 256 units.
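A minimal stand-in for this modified mapping network is shown below; the number of hidden layers and the activation function are assumptions, since only the layer width is restated here.

```python
import torch.nn as nn

def make_mapping_network(pose_dim=6, latent_dim=256, hidden_dim=256, num_hidden=3):
    """An MLP that turns a pose vector into a latent vector."""
    layers, in_dim = [], pose_dim
    for _ in range(num_hidden):
        layers += [nn.Linear(in_dim, hidden_dim), nn.LeakyReLU(0.2)]
        in_dim = hidden_dim
    layers.append(nn.Linear(in_dim, latent_dim))
    return nn.Sequential(*layers)
```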
Second, to make a fair comparison between PIRenderer and my system, I adjusted PIRenderer's hyperparameters to make its networks roughly the same size as my networks. In particular, I raised the maximum number of channels in each layer from 256 to 512 and then made the following adjustments.
Third, I changed PIRenderer's training process to be similar to the way I trained my networks. Training has two phases. In the first phase, the mapper and the warper were trained together for 12 epochs (6,000,000 examples). In the second phase, the mapper and the warper were frozen, and the editor was trained for 6 epochs (3,000,000 examples). Both phases used a batch size of 8 and a learning rate of
Quantitative evaluation. I evaluated my system against PIRenderer with the three image similarity metrics used in the last subsection. The scores, computed with the test dataset, are given in the table below.
System | RMSE | SSIM | LPIPS
---|---|---|---
PIRenderer | 0.16772200 | 0.89525700 | 0.05257600 |
My system (Section 3.4) | 0.15446600 | 0.90975400 | 0.04807400 |
Qualitative evaluation. I used PIRenderer and my system to generate animations of the eight 3D models used to quantitatively compare my system against alternative architectures. The results are given in Figure 3.5.2.2 (full body) and Figure 3.5.2.3 (face zoom).
Figure 3.5.2.4 lists some differences I could observe from the animations. In general, my system produced blurrier images than PIRenderer did, and it could also erase thin structures such as hair strands and ribbons while PIRenderer preserved them better. However, I preferred my system over PIRenderer because the latter could deform faces in undesirable ways while my system preserved their shapes better.
Ground truth | PIRenderer | My System
A major problem with PIRenderer is its overuse of warping. This becomes most noticeable when a character wears clothing with a turtleneck that extends upward to just below the chin when seen from the front. When the character turns its face up, PIRenderer would use warping to lift the chin, but the warping would also drag the turtleneck up with it. My system, on the other hand, can use partial image change to hallucinate the neck skin that is supposed to become visible, resulting in a much more plausible output.
In conclusion, my system was better at rotating an anime character's head and body than PIRenderer. It better preserved the head's shape, produced fewer artifacts, and could hallucinate disoccluded neck skin. On the other hand, PIRenderer would drag pixels just below the chin around when the face moved.
In the last section, I proposed a new body rotator network that can animate the upper body of anime characters. However, recall from Figure 3.2.1 that it is a part of a larger system that can also modify facial expression. While the system as a whole became a little smaller (517 MB instead of 600 MB) due to the editor being smaller than the combiner, it is still quite large. It also takes about 35 ms to process an image end to end using my Titan RTX graphics card. The large size makes it impractical to deploy it on mobile devices, and processing speed would only worsen when using less powerful GPUs. It is thus crucial to make the system smaller and faster in order to improve its versatility. I discuss my attempt to do so in this section.
To improve my system's efficiency, I experimented with two techniques.
Depthwise separable convolutions. The technique was introduced by Sifre in 2014
All networks in my system are convolutional neural networks (CNNs), meaning that their main building blocks are convolution layers. Such a layer typically takes in a tensor of size
To improve the networks' efficiency, I replaced all convolution layers in their main bodies (i.e., the encoder-decoder networks and the U-Nets) with two convolution layers that are applied in succession.
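Concretely, each standard convolution is replaced by a depthwise convolution (one spatial filter per input channel, i.e., groups equal to the number of input channels) followed by a 1x1 pointwise convolution. The PyTorch sketch below illustrates the replacement; the kernel size, padding, and the absence of normalization and nonlinearity are illustrative, not a transcription of my actual units.

```python
import torch.nn as nn

def separable_conv(in_channels: int, out_channels: int, kernel_size: int = 3) -> nn.Sequential:
    """A depthwise separable replacement for nn.Conv2d(in_channels, out_channels, kernel_size)."""
    return nn.Sequential(
        # Depthwise: each input channel is convolved with its own spatial filter.
        nn.Conv2d(in_channels, in_channels, kernel_size,
                  padding=kernel_size // 2, groups=in_channels, bias=False),
        # Pointwise: a 1x1 convolution mixes the channels.
        nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=True),
    )
```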
The effect of separating a standard convolution layer into two is that the time complexity reduces to
Note that, while depthwise separable convolutions can reduce the amount of memory needed to store model parameters, it does not reduce the amount of memory needed to store data tensors while inference is taking place. More specifically, if
Using "half." I implemented my network with PyTorch, which uses the 32-bit floating point type (aka float) to represent almost all data and parameters. Nevertheless, the numerical precision afforded by float might not be necessary, and so PyTorch features an option to use the 16-bit floating type (aka half) instead. While I was not sure a priori how much using half can improve processing speed, there's the obvious benefit that both the system's size and RAM usage would become two times smaller.
The whole system consists of 5 subnetworks (3 from the 2021 system and 2 from Section 3.4). For each subnetwork, I created four variants based on the combination of techniques used. If a variant uses depthwise separable convolutions, it is designated with the word "separable;" otherwise, it is designated with the word "standard." A variant is also designated by the floating point type it uses. As a result, the variants are referred to as "standard-float," "separable-float," "standard-half," and "separable-half."
To create the variants, I trained the standard-float and separable-float models from scratch. I then created the standard-half and separable-half models from the corresponding "float" models by converting all parameters to "half." Note that standard-float is the variant that receives no efficiency improvements, serving as the control group.
The clearest benefit of the techniques is size reduction. As predicted, using depthwise separable convolutions reduced the size by a factor of about 9, and using half cut it further in half.
Size in MB (and improvement over standard-float)

Networks | standard-float | separable-float | standard-half | separable-half
---|---|---|---|---
Eyebrow segmenter | 120.11 (1.00x) | 12.70 (9.46x) | 60.07 (2.00x) | 6.36 (18.87x)
Eyebrow warper | 120.32 (1.00x) | 12.72 (9.46x) | 60.17 (2.00x) | 6.37 (18.88x)
Eye & mouth morpher | 120.59 (1.00x) | 12.75 (9.45x) | 60.31 (2.00x) | 6.39 (18.86x)
Half-resolution rotator | 124.62 (1.00x) | 13.69 (9.10x) | 62.32 (2.00x) | 6.86 (18.16x)
Editor | 31.92 (1.00x) | 3.63 (8.80x) | 15.97 (2.00x) | 1.83 (17.45x)
Whole system | 517.56 (1.00x) | 55.48 (9.33x) | 258.84 (2.00x) | 27.82 (18.60x)
As noted earlier, making the networks smaller does not always mean that they use less memory overall. To see the techniques' impact on memory requirements, I measured each variant's GPU RAM usage during inference.
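As a rough illustration (not necessarily the exact procedure behind the numbers below), the peak GPU memory allocated by a single forward pass can be read out in PyTorch like this:

```python
import torch

def peak_gpu_mb(model: torch.nn.Module, example_input: torch.Tensor) -> float:
    """Peak GPU memory (in MB) allocated during one forward pass.

    Both the model and the example input are assumed to already be on the GPU.
    """
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        model(example_input)
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / (1024 ** 2)
```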
I conducted experiments on three computers:
The measurement values are given in the tables below.
RAM usage in MB on Computer A (and improvement over standard-float)

Networks | standard-float | separable-float | standard-half | separable-half
---|---|---|---|---
Eyebrow segmenter | 158.34 (1.00x) | 38.51 (4.11x) | 94.68 (1.67x) | 25.32 (6.25x)
Eyebrow warper | 159.10 (1.00x) | 40.27 (3.95x) | 95.17 (1.67x) | 25.71 (6.19x)
Eye & mouth morpher | 152.72 (1.00x) | 71.94 (2.12x) | 94.10 (1.62x) | 49.37 (3.09x)
Half-resolution rotator | 209.98 (1.00x) | 113.03 (1.86x) | 121.32 (1.73x) | 80.72 (2.60x)
Editor | 312.22 (1.00x) | 349.10 (0.89x) | 207.46 (1.50x) | 240.81 (1.30x)
Whole system | 816.91 (1.00x) | 417.49 (1.96x) | 467.39 (1.75x) | 274.80 (2.97x)
RAM usage in MB on Computer B (and improvement over standard-float)

Networks | standard-float | separable-float | standard-half | separable-half
---|---|---|---|---
Eyebrow segmenter | 158.34 (1.00x) | 38.51 (4.11x) | 86.17 (1.84x) | 19.31 (8.20x)
Eyebrow warper | 159.10 (1.00x) | 40.27 (3.95x) | 86.92 (1.83x) | 19.70 (8.08x)
Eye & mouth morpher | 152.72 (1.00x) | 71.94 (2.12x) | 83.75 (1.82x) | 35.58 (4.29x)
Half-resolution rotator | 209.98 (1.00x) | 113.03 (1.86x) | 104.80 (2.00x) | 56.71 (3.70x)
Editor | 312.22 (1.00x) | 349.10 (0.89x) | 155.96 (2.00x) | 176.81 (1.77x)
Whole system | 816.91 (1.00x) | 417.49 (1.96x) | 417.39 (1.96x) | 210.80 (3.88x)
RAM usage in MB on Computer C (and improvement over standard-float)

Networks | standard-float | separable-float | standard-half | separable-half
---|---|---|---|---
Eyebrow segmenter | 142.84 (1.00x) | 38.54 (3.71x) | 71.42 (2.00x) | 19.32 (7.39x)
Eyebrow warper | 143.61 (1.00x) | 40.30 (3.56x) | 71.81 (2.00x) | 19.72 (7.28x)
Eye & mouth morpher | 144.22 (1.00x) | 71.98 (2.00x) | 71.86 (2.01x) | 35.59 (4.05x)
Half-resolution rotator | 209.98 (1.00x) | 113.07 (1.86x) | 108.80 (1.93x) | 56.33 (3.73x)
Editor | 312.22 (1.00x) | 349.11 (0.89x) | 156.96 (1.99x) | 176.81 (1.77x)
Whole system | 813.91 (1.00x) | 414.08 (1.97x) | 415.89 (1.96x) | 209.30 (3.89x)
From the data, we can spot a number of trends.
Firstly, while the GPU RAM usage of a given network did not differ much between the computers, machines with more GPU memory tended to use slightly more RAM. This might be because PyTorch collected garbage more aggressively on computers with less available memory.
Secondly, the amount of RAM used by a network was always more than the network's size. This is reassuring because it means that my measurement method properly took the model parameters into account.
Thirdly, using half reduced memory usage by factors close to 2 on all machines. This is consistent with the expectation that the half type should halve the amount of space needed to store everything.
Fourthly, depthwise separable convolutions reduced RAM usage of all networks except the editor. This can be explained by the observation that the technique can decrease memory used to store model parameters by about 9 times, but it cannot decrease memory used to store data tensors at all. The editor (
To summarize, using half reduced memory requirement under all circumstances. However, using depthwise separable convolutions was only beneficial to large networks (i.e., all networks except the editor). All in all, using both techniques resulted in about 3-4 times reduction of RAM usage for the whole system.
Another metric we care about is how fast the system is. To measure processing speed, I ran the whole system and the individual networks 100 times with artificial inputs (again, tensors whose values are all zeros) of batch size 1, measured the wall clock time of each run, and then computed the average processing time
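A minimal sketch of this kind of measurement is given below; the warm-up run and the synchronization points are my own additions, not necessarily part of the original procedure.

```python
import time
import torch

def average_time_ms(model: torch.nn.Module, example_input: torch.Tensor, runs: int = 100) -> float:
    """Average wall-clock time of `runs` forward passes, in milliseconds."""
    with torch.no_grad():
        model(example_input)                      # warm-up run (not timed)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            model(example_input)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000.0 / runs

# Example: a dummy network fed an all-zero input of batch size 1.
print(average_time_ms(torch.nn.Conv2d(4, 4, 3, padding=1), torch.zeros(1, 4, 512, 512)))
```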
Average processing time in milliseconds on Computer A (and improvement over standard-float)

Networks | standard-float | separable-float | standard-half | separable-half
---|---|---|---|---
Eyebrow segmenter | 4.159 (1.00x) | 4.792 (0.87x) | 4.996 (0.83x) | 5.212 (0.80x)
Eyebrow warper | 4.726 (1.00x) | 5.772 (0.82x) | 5.450 (0.87x) | 5.689 (0.83x)
Eye & mouth morpher | 7.163 (1.00x) | 5.223 (1.37x) | 5.495 (1.30x) | 5.608 (1.28x)
Half-resolution rotator | 8.865 (1.00x) | 6.001 (1.48x) | 5.094 (1.74x) | 5.605 (1.58x)
Editor | 13.699 (1.00x) | 11.185 (1.22x) | 7.469 (1.83x) | 7.986 (1.72x)
Whole system | 34.105 (1.00x) | 26.777 (1.27x) | 23.803 (1.43x) | 24.540 (1.39x)
Average processing time in milliseconds on Computer B (and improvement over standard-float)

Networks | standard-float | separable-float | standard-half | separable-half
---|---|---|---|---
Eyebrow segmenter | 5.125 (1.00x) | 5.024 (1.02x) | 5.168 (0.99x) | 5.064 (1.01x)
Eyebrow warper | 6.717 (1.00x) | 5.441 (1.23x) | 5.887 (1.14x) | 5.356 (1.25x)
Eye & mouth morpher | 8.522 (1.00x) | 6.373 (1.34x) | 8.068 (1.06x) | 6.014 (1.42x)
Half-resolution rotator | 11.259 (1.00x) | 9.246 (1.22x) | 10.592 (1.06x) | 8.316 (1.35x)
Editor | 18.374 (1.00x) | 24.277 (0.76x) | 13.670 (1.34x) | 19.273 (0.95x)
Whole system | 43.841 (1.00x) | 46.959 (0.93x) | 38.019 (1.15x) | 38.848 (1.13x)
Average processing time in milliseconds on Computer C (and improvement over standard-float)

Networks | standard-float | separable-float | standard-half | separable-half
---|---|---|---|---
Eyebrow segmenter | 100.705 (1.00x) | 31.590 (3.19x) | 121.911 (0.83x) | 31.578 (3.19x)
Eyebrow warper | 106.348 (1.00x) | 32.270 (3.30x) | 131.114 (0.81x) | 32.172 (3.31x)
Eye & mouth morpher | 132.273 (1.00x) | 70.078 (1.89x) | 240.195 (0.55x) | 73.969 (1.79x)
Half-resolution rotator | 211.828 (1.00x) | 72.645 (2.92x) | 345.056 (0.61x) | 91.863 (2.31x)
Editor | 269.462 (1.00x) | 157.015 (1.72x) | 412.179 (0.65x) | 192.638 (1.40x)
Whole system | 690.751 (1.00x) | 335.364 (2.06x) | 1125.345 (0.61x) | 385.041 (1.79x)
While there were clear and predictable patterns in network sizes and RAM usages, changes in processing time were much less predictable.
On all machines, however, employing both techniques greatly reduced the system's memory requirement and also had a positive (though not impressive) impact on speed. Thus, one should always apply them together.
Each of the techniques results in fewer bits being used to represent model parameters. As a result, it is expected that smaller variants would be less accurate. This can be confirmed by computing the similarity metrics on the test set. We can see that, because depthwise separable convolutions reduce network sizes by a factor of about 9, they have a larger impact on the metrics than using half does.
Variants | RMSE (lower is better) | SSIM (higher is better) | LPIPS (lower is better)
---|---|---|---
standard-float | 0.15518600 | 0.90928100 | 0.04903900
separable-float | 0.16107100 | 0.90425300 | 0.05354000
standard-half | 0.15551100 | 0.90892600 | 0.05008000
separable-half | 0.16141600 | 0.90390400 | 0.05458700
I also rendered animations using the 4 variants for qualitative comparisons.
The variants with the same type of convolution layers produced virtually the same animations because the parameters were essentially the same but with different precisions. Variants with different convolution layers, however, produced different mouth shapes. Mouths produced by separable convolutions were smaller than those produced by standard ones. For Kiso Azuki, the mouth did not move at all, showing that separable convolutions can yield inaccurate results in some cases. This might be the price to pay for a significant reduction in model size.
In conclusion, the techniques I experimented with in this section, when combined, were effective at reducing the system's size and RAM usage. Additionally, they had a positive but not significant impact (no more than 2x) on speed. The techniques surely make the system easier to deploy on less powerful devices, but there is still much room for improvement in processing time.
As with previous versions of the system, my end goal is to animate drawings, not 3D renderings. In this section, I demonstrate how the system performs on such inputs.
I used the system to animate 72 images of VTubers and related characters. Sixteen resulting videos are available below, and the rest can be seen in Figure 5.1.2.
Figure 5.1.1 Videos created by applying the "standard-float" variant of the system to 16 drawn characters.
Figure 5.2.1 Breathing motion of 16 drawn characters. The videos were created using the "standard-float" variant of the system.
We can see that the movement is plausible most of the time. However, there are cases where the network wrongly located the chest area. For example, the skirt of the bottom-right character also moves with the chest.
I created a tool that allows the user to control drawings by manipulating GUI elements.
I also created another tool that can transfer a human's movement, captured by an iOS application called iFacialMocap, to anime characters.
The above demonstrations show that my system is capable of generating good-looking animations when applied to many different characters. However, it can yield implausible results when fed inputs that deviate significantly from the training set. Because I did not change the face morpher in any way, its problems, as discussed in the 2021 write-up, still remain. These include the inability to handle unnatural skin tones, heavy makeup, "maro" eyebrows, and rare occlusions of facial features. I also observed new problems specific to the newly designed body rotator.
First, the body rotator did not seem to handle large hats well. For example, it treated Nui Sociere's face as a part of her hat, erased the ears and tail of Millie Parfait's cat, and moved Nina Kosaka's ears in a way that was inconsistent with her head movement. These errors might be because my dataset does not have many models wearing large hats.
Nui Sociere (© ANYCOLOR, Inc.) | Millie Parfait (© ANYCOLOR, Inc.) | Nina Kosaka (© ANYCOLOR, Inc.) |
Second, it also had a tendency to erase thin structures around the head. While this might be acceptable for thin hair filaments, the result can be very noticeable for halos and rigid ornaments that should always be present.
Ninomae Ina'nis (© Cover corp.) | Ouro Kronii (© Cover corp.) |
Third, due to the lack of training data, the body rotator cannot correctly deal with props such as weapons and musical instruments.
Gawr Gura (© Cover corp.) | Mori Calliope (© Cover corp.) |
Kazama Iroha (© Cover corp.) | Rikka (© Cover corp.) |
This project aims to solve a variant of the image animation problem. Here, we are given an image of a character and a description of the pose that the character is supposed to take, and we must generate a new image of the same character taking the described pose. The problem can then be classified into several variants based on the nature of the pose description. These include parameter-based posing and the motion transfer problem previously mentioned in Section 3.5.2.
In my 2021 write-up, I wrote an extensive survey of previous research on the two problems. Doing such a survey again would only make this article unreasonably long. So, what I would like to do instead is to discuss new research and development that came out after I published the write-up.
PIRenderer, the paper I compared my system against in Section 3.5.2, was published in ICCV 2021
Justin Johnson, Alexandre Alahi, Li Fei-Fei. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. ECCV 2016. [Project] [arXiv]
I became aware of PIRenderer through the AnimeCeleb paper by Kim et al.
Outside of academia, IRIAM, a streaming application where users can broadcast as anime characters, released a feature where a Live2D-like 2.5D model can be created from a single image. The enabling technology was the work of my friend, Yanghua Jin, while he was employed by Preferred Networks. From the promotion video, it seems that IRIAM supports facial expression manipulation and rotation of the body and the head around the
There are a number of interesting new approaches to motion transfer, especially those that discover moving parts without explicit supervision.
The paper "Motion Representations for Articulated Animation" (MRAA)
The paper "Thin-Plate Spline Motion Model for Image Animation" by Zhao and Zhang proposes another way to represent motion of body parts
Lastly, "Structure-Aware Motion Transfer with Deformable Anchor Model" by Tao et al. proposes the deformable anchor model (DAM) as a represention for the character's movement
Unfortunately, I have not yet tried these approaches on my dataset to see how well they work.
In this article, I have discussed my attempt to improve my animation-from-a-single-image system. By replacing two constituent networks with newly designed ones, I enabled the system to rotate the body and generate breathing motion, making its features closer to those offered by professionally-made Live2D models. The system outputs plausible animations for characters with simple designs, but it struggles on those with props not sufficiently covered by the training dataset. These include large hats, weapons, musical instruments, and other thin ornamental structures.
I also explored making the system more efficient by using depthwise separable convolutions and the "half" type. Employing both, I made the system 18 times smaller, decreased its GPU RAM usage by about 3 to 4 times, and slightly improved its speed. While this makes it easier to deploy the system on less powerful devices, more research is needed to make it significantly faster.
While I am an employee of Google Japan, this project is a personal hobby that I did in my free time without using Google's resources. It has nothing to do with my work, as I am a normal software engineer writing Google Maps backends for a living. Moreover, I currently do not belong to any of Google's or Alphabet's research organizations. Opinions expressed in this article are my own and not the company's. Google, though, may claim rights to the article's technical inventions.
I would like to thank Andrew Chen, Ekapol Chuangsuwanich, Yanghua Jin, Minjun Li, Panupong Pasupat, Yingtao Tian, and Pongsakorn U-chupala for their comments.
The system in this article takes a 45-dimensional pose vector as input. I show the semantics of each parameter below. The character is Souya Ichika (© 774 inc.).
Index | Name | Semantics |
0 | eyebrow_troubled_left | |
1 | eyebrow_troubled_right | |
2 | eyebrow_angry_left | |
3 | eyebrow_angry_right | |
4 | eyebrow_lowered_left | |
5 | eyebrow_lowered_right | |
6 | eyebrow_raised_left | |
7 | eyebrow_raised_right | |
8 | eyebrow_happy_left | |
9 | eyebrow_happy_right | |
10 | eyebrow_serious_left | |
11 | eyebrow_serious_right |
Index | Name | Semantics |
12 | eye_wink_left | |
13 | eye_wink_right | |
14 | eye_happy_wink_left | |
15 | eye_happy_wink_right | |
16 | eye_surprised_left | |
17 | eye_surprised_right | |
18 | eye_relaxed_left | |
19 | eye_relaxed_right | |
20 | eye_unimpressed_left | |
21 | eye_unimpressed_right | |
22 | eye_raised_lower_eyelid_left | |
23 | eye_raised_lower_eyelid_right |
Index | Name | Semantics |
24 | iris_small_left | |
25 | iris_small_right | |
37 | iris_rotation_x | |
38 | iris_rotation_y |
Index | Name | Semantics |
26 | mouth_aaa |
27 | mouth_iii | |
28 | mouth_uuu | |
29 | mouth_eee | |
30 | mouth_ooo | |
31 | mouth_delta | |
32 | mouth_lowered_corner_left | |
33 | mouth_lowered_corner_right | |
34 | mouth_raised_corner_left | |
35 | mouth_raised_corner_right | |
36 | mouth_smirk |
Index | Name | Semantics |
39 | head_x | |
40 | head_y | |
41 | neck_z |
Index | Name | Semantics |
42 | body_y | |
43 | body_z | |
44 | breathing |