Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks
Jun-Yan Zhu* Taesung Park* Phillip Isola Alexei A. Efros
UC Berkeley
[Paper] [Code (Torch)] [Code (PyTorch)]
Abstract
Image-to-image translation is a class of vision and graphics problems where the goal is to learn the mapping between an input image and an output image using a training set of aligned image pairs. However, for many tasks, paired training data will not be available. We present an approach for learning to translate an image from a source domain X to a target domain Y in the absence of paired examples. Our goal is to learn a mapping G: X → Y such that the distribution of images from G(X) is indistinguishable from the distribution Y using an adversarial loss. Because this mapping is highly under-constrained, we couple it with an inverse mapping F: Y → X and introduce a cycle consistency loss to push F(G(X)) ≈ X (and vice versa). Qualitative results are presented on several tasks where paired training data does not exist, including collection style transfer, object transfiguration, season transfer, photo enhancement, etc. Quantitative comparisons against several prior methods demonstrate the superiority of our approach.
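As a minimal sketch of the cycle consistency term described above, the snippet below measures how far F(G(x)) is from x and how far G(F(y)) is from y. It assumes an L1 (mean absolute difference) penalty and uses toy "images" represented as flat lists of floats. The mappings `G`, `F_exact`, and `F_biased` are hypothetical stand-ins for illustration, not the trained networks, and the adversarial loss is omitted entirely.

```python
from typing import Callable, List, Sequence

Image = List[float]  # toy stand-in: a flattened image as a list of pixel values
Mapping = Callable[[Image], Image]


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equally sized images."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def cycle_consistency_loss(G: Mapping, F: Mapping,
                           xs: Sequence[Image], ys: Sequence[Image]) -> float:
    """Average of |F(G(x)) - x| over xs plus average of |G(F(y)) - y| over ys."""
    forward = sum(l1_distance(F(G(x)), x) for x in xs) / len(xs)
    backward = sum(l1_distance(G(F(y)), y) for y in ys) / len(ys)
    return forward + backward


# Hypothetical toy mappings (assumptions for illustration only).
def G(x: Image) -> Image:
    return [2.0 * p + 1.0 for p in x]


def F_exact(y: Image) -> Image:
    # Exact inverse of G, so every cycle returns to its starting image.
    return [(p - 1.0) / 2.0 for p in y]


def F_biased(y: Image) -> Image:
    # Inverse of G plus a constant offset, so cycles drift away from the input.
    return [(p - 1.0) / 2.0 + 0.5 for p in y]


xs = [[0.0, 1.0], [2.0, 3.0]]  # unpaired samples from domain X
ys = [[1.0, 3.0], [5.0, 7.0]]  # unpaired samples from domain Y

loss_exact = cycle_consistency_loss(G, F_exact, xs, ys)
loss_biased = cycle_consistency_loss(G, F_biased, xs, ys)
print(loss_exact, loss_biased)
```

The loss is zero only when the two mappings undo each other on the sampled images, which is the constraint used to narrow down the otherwise under-constrained mapping G.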
Paper
arXiv 1703.10593, 2017.
Citation
Jun-Yan Zhu*, Taesung Park*, Phillip Isola, and Alexei A. Efros. "Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks", arXiv, 2017.
(* indicates equal contributions)
Code and Data: [Torch] [PyTorch]
Applications
Monet Paintings → Photos: mapping Monet paintings to landscape photographs from Flickr.
Collection Style Transfer: transferring input images into artistic styles of Monet, Van Gogh, Ukiyo-e, and Cezanne.
Object Transfiguration: object transfiguration between horses and zebras.
Horse Video to Zebra Video
Season Transfer: transferring seasons of Yosemite in the Flickr photos.
Photo Enhancement: iPhone photos → DSLR photos, generating photos with shallower depth of field.
Experiments and comparisons
- Comparison on Cityscapes: different methods for mapping labels ↔ photos trained on Cityscapes.
- Comparison on Maps: different methods for mapping aerial photos ↔ maps on Google Maps.
- Facade results: CycleGAN for mapping labels ↔ facades on the CMP Facades dataset.
- Ablation studies: different variants of our method for mapping labels ↔ photos trained on Cityscapes.
- Image reconstruction results: the reconstructed images F(G(x)) and G(F(y)) from various experiments.
- Style transfer comparison: we compare our method with neural style transfer [Gatys et al. '15].
- Identity mapping loss: the effect of the identity mapping loss on Monet to Photo.
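For readers unfamiliar with the identity mapping loss mentioned in the last item, here is a minimal sketch under stated assumptions: the loss penalizes a generator for changing an image that already belongs to its target domain, using an L1 (mean absolute difference) penalty. The toy "images" are flat lists of floats, and `tint_shift` and `identity` are hypothetical mappings for illustration, not the trained networks.

```python
from typing import Callable, List, Sequence

Image = List[float]  # toy stand-in: a flattened image as a list of pixel values
Mapping = Callable[[Image], Image]


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equally sized images."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def identity_mapping_loss(G: Mapping, F: Mapping,
                          xs: Sequence[Image], ys: Sequence[Image]) -> float:
    """Average of |G(y) - y| over ys plus average of |F(x) - x| over xs.

    G maps X -> Y, so feeding it a real y should leave y nearly unchanged;
    likewise for F (Y -> X) when it is fed a real x.
    """
    g_term = sum(mean_abs_diff(G(y), y) for y in ys) / len(ys)
    f_term = sum(mean_abs_diff(F(x), x) for x in xs) / len(xs)
    return g_term + f_term


# Hypothetical toy mappings (assumptions for illustration only).
def identity(img: Image) -> Image:
    return list(img)


def tint_shift(img: Image) -> Image:
    # Brightens every pixel, even when the input is already in the target domain.
    return [p + 0.25 for p in img]


xs = [[0.0, 0.5], [1.0, 0.25]]  # samples from domain X
ys = [[0.5, 0.5], [0.75, 0.0]]  # samples from domain Y

loss_preserving = identity_mapping_loss(identity, identity, xs, ys)
loss_shifting = identity_mapping_loss(tint_shift, identity, xs, ys)
print(loss_preserving, loss_shifting)
```

In this toy setting, a generator that shifts the tint of images already in its target domain incurs a positive loss, while one that leaves them untouched incurs none.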
Various Applications
- Renoir Style: the 2017 movie Beauty and the Beast, rendered in the style of the Impressionist painter Renoir.
Failure Cases
Our model does not work well when a test image looks unusual compared to training images, as shown in the left figure.
See more typical failure cases [here]. On translation tasks that involve color and texture changes, like many of those reported above, the method often succeeds. We have also explored tasks that require geometric changes, with little success. For example, on the task of dog ↔ cat transfiguration, the learned translation degenerates to making minimal changes to the input. Handling more varied and extreme transformations, especially geometric changes, is an important problem for future work. We also observe a lingering gap between the results achievable with paired training data and those achieved by our unpaired method. In some cases, this gap may be very hard, or even impossible, to close: for example, our method sometimes permutes the labels for tree and building in the output of the Cityscapes photos → labels task. Resolving this ambiguity may require some form of weak semantic supervision. Integrating weak or semi-supervised data may lead to substantially more powerful translators, still at a fraction of the annotation cost of fully supervised systems.
Related Work
Acknowledgement
We thank Aaron Hertzmann, Shiry Ginosar, Deepak Pathak, Bryan Russell, Eli Shechtman, Richard Zhang, and Tinghui Zhou for many helpful comments. This work was supported in part by NSF SMA-1514512, NSF IIS-1633310, a Google Research Award, Intel Corp, and hardware donations from NVIDIA. JYZ is supported by the Facebook Graduate Fellowship and TP is supported by the Samsung Scholarship. The photographs used in style transfer were taken by AE, mostly in France.