As is possibly suggested by the mood-setting graphic above, the goal of Frankenimage is to reconstruct input (target) images with pieces of images from a large image database (the database images). Hence the racks of computers flanking the portrait of Frankenstein's monster.
Frankenimage is deliberately in contrast with traditional photomosaics. In traditional photomosaics, more often than not, the database images that are composed together to make up the target image are so small as to be little more than glorified pixels. Frankenimage aims instead for component database images to be as large as possible in the final composition, taking advantage of structure in each database image, instead of just its average color. In this way, database images retain their own meaning, allowing for real artistic juxtaposition to be achieved between target and component images.
The following image shows a series of 20 oil-on-canvas paintings by contemporary artist Lewis Lavoie. Each painting depicts an individual, and together they form the head of Adam from Michelangelo's The Creation of Adam.
To reconstruct an arbitrary target image out of pieces of other arbitrary database images, you need a lot of database images to have a good chance of finding matches. I created my database by querying Flickr's search API with each of the unique terms from the complete works of William Shakespeare, downloading up to 10 images per query. This resulted in about 158,000 images on my hard drive, at roughly 1 megapixel each.
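For concreteness, here is a minimal sketch of this kind of collection script, assuming a Flickr API key and the public flickr.photos.search REST method (the helper names and the url_m extra are illustrative choices, not necessarily what my original script used):

import requests

# Illustrative sketch, not the exact script I ran: query flickr.photos.search
# for one term and save up to 10 results to disk.
API_KEY = "YOUR_FLICKR_API_KEY"   # assumption: a registered Flickr API key
REST_URL = "https://api.flickr.com/services/rest/"

def download_for_term(term, out_dir, max_images=10):
    params = {
        "method": "flickr.photos.search",
        "api_key": API_KEY,
        "text": term,
        "per_page": max_images,
        "extras": "url_m",          # ask for a medium-size image URL in the response
        "format": "json",
        "nojsoncallback": 1,
    }
    photos = requests.get(REST_URL, params=params).json()["photos"]["photo"]
    for i, photo in enumerate(photos):
        url = photo.get("url_m")
        if url:
            with open(f"{out_dir}/{term}_{i}.jpg", "wb") as f:
                f.write(requests.get(url).content)

# for term in unique_shakespeare_terms:   # the query list from the complete works
#     download_for_term(term, "database")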
My first matching attempt was as follows: for each database image (left in the figure above), crop it to the aspect ratio of the target image (right in the figure above) and compare their gist descriptors [Oliva, Torralba]. The database images with the closest gist descriptors win, and pieces of the 10 or 20 closest matches would be used to reconstruct the target image. This strategy was inspired by Scene Completion Using Millions of Photographs [Hays, Efros].
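In rough Python terms, the whole-image search looked something like the following sketch; compute_gist stands in for whatever gist implementation is available (the descriptor computation itself isn't shown), and the center crop is a simplification:

import numpy as np

def center_crop_to_aspect(img, target_aspect):
    # Center-crop an H x W x 3 image to the target width/height ratio.
    h, w = img.shape[:2]
    if w / h > target_aspect:                 # too wide: trim the sides
        new_w = int(h * target_aspect)
        x0 = (w - new_w) // 2
        return img[:, x0:x0 + new_w]
    new_h = int(w / target_aspect)            # too tall: trim top and bottom
    y0 = (h - new_h) // 2
    return img[y0:y0 + new_h, :]

def rank_by_gist(target, database_images, compute_gist):
    # Return database indices sorted by distance between gist descriptors.
    target_gist = compute_gist(target)
    aspect = target.shape[1] / target.shape[0]
    dists = [np.linalg.norm(compute_gist(center_crop_to_aspect(img, aspect)) - target_gist)
             for img in database_images]
    return np.argsort(dists)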
The top matches from my gist-based search were unconvincing:
I have three theories about why my gist-based search fell short.
First of all, my database contains 158,000 images, whereas Hays and Efros's database contained millions of images. Hays and Efros observed that increasing the size of their database from ~10K to ~2 million resulted in a "qualitative leap in performance."
Second, Hays and Efros's goal was to find content to put inside a hole in the target image. In their search, they omitted the parts of the gist descriptor that pertained to the hole. I could not do this, because I wanted to replace parts of my target image with something similar from the database, whereas Hays and Efros only cared that the rest of the image matched. In addition, Hays and Efros compared images directly (outside the hole region) and gave their direct difference 1/2 the weight of the gist difference.
Finally, it's possible my database collection method (Shakespeare terms) simply biased my database to not have many matches for my target image (headshot of myself).
Direct Cell Matching
Reacting against my failed attempt at using gist on whole images, I decided to keep it simple and match images directly, and to match against fixed sub-regions of the target image. Ideally, the database images comprising a Frankenimage could be composed at arbitrary positions, scales, and orientations, for maximum Frankensteinian haphazardness. For a first-pass implementation, however, it simplifies things a lot to restrict the geometry of the composition. So in this implementation, much in the fashion of Lewis Lavoie's work, the target image is partitioned into a regular grid of squares, or target cells. The "best" match for each target cell is then found in the database. "Best" is in quotes because the algorithm I use is iterative, in anticipation of an image database that is too large to search exhaustively, or of an "on-the-fly" database, seeded perhaps by a user's search query, that streams in live from the depths of the Internet and cannot necessarily be preprocessed into an insta-match data structure.
Direct cell matching pseudocode:
break target image into grid of squares
loop until satisfyingly converged:
   grab a database image
   crops = 10 or 20 random square crops of database image
   for each crop in crops:
      for each cell in target image:
         red_diff   = (crop.r - cell.r)^2
         green_diff = (crop.g - cell.g)^2
         blue_diff  = (crop.b - cell.b)^2
         diff = sum(red_diff) + sum(green_diff) + sum(blue_diff)
         if diff < smallest diff for cell:
            set smallest diff for cell to diff
            set best crop for cell to crop
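For a more concrete flavor, here is a rough NumPy translation of the pseudocode above (a sketch, assuming cells and crops are compared as 16 by 16 float arrays and that a resize function is supplied from PIL or OpenCV; the names are illustrative):

import numpy as np

rng = np.random.default_rng()

def random_square_crop(img, min_frac=0.5):
    # One random square crop covering at least min_frac of the image's short side.
    h, w = img.shape[:2]
    side = int(rng.uniform(min_frac, 1.0) * min(h, w))
    y = rng.integers(0, h - side + 1)
    x = rng.integers(0, w - side + 1)
    return img[y:y + side, x:x + side]

def consume_database_image(db_image, cells, best_diff, best_crop, resize, n_crops=20):
    # cells:     list of 16 x 16 x 3 float arrays (the target grid)
    # best_diff: per-cell smallest SSD seen so far
    # best_crop: per-cell best-matching crop seen so far
    # resize:    function that downsamples an image to 16 x 16 (e.g. via PIL/OpenCV)
    for _ in range(n_crops):
        crop = resize(random_square_crop(db_image))
        for i, cell in enumerate(cells):
            diff = np.sum((crop - cell) ** 2)     # SSD over all three channels
            if diff < best_diff[i]:
                best_diff[i] = diff
                best_crop[i] = crop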
This matching strategy was more successful, generally converging after about 24 hours. Ultimately, for the reconstruction of my headshot, I performed about 1 billion cell-to-crop comparisons, iterating through all 158,000 images in my database multiple times (see the Future section below for commentary on this). In the video below, you can see the iteration in action as the matching converges (there is a jump at the beginning because I was not originally saving these images regularly).
After settling on a set of grid matches, one final step is performed: seam optimization. Each winning database image crop potentially has pixels outside of the grid cell where it has been placed. These extra pixels, which were not used in the winning subtraction, may nonetheless match the target image well, given that the crop and target cell are such a good match. So, to let these better matches shine through, another iterative process is performed:
target image T[][]
franken image F[][] = best matches pasted into grid
mark all seam pixel pairs hot
(each pixel is paired with its N, S, E, and W neighbors)
loop until satisfyingly converged:
   for each hot pair of pixels (p1, p2),
   from different database images (d1[][], d2[][]),
   subscripted in the target image's coordinate system:
      which of the following is smallest?
       case (T[p1.x][p1.y] - d1[p1.x][p1.y])^2 + (T[p2.x][p2.y] - d2[p2.x][p2.y])^2:
             // status quo
             mark this pixel pair cold
       case (T[p1.x][p1.y] - d1[p1.x][p1.y])^2 + (T[p2.x][p2.y] - d1[p2.x][p2.y])^2:
             // d1 is better in both pixels
             set F[p2.x][p2.y] = d1[p2.x][p2.y]
             mark p2's 3 other pairings hot
       case (T[p1.x][p1.y] - d2[p1.x][p1.y])^2 + (T[p2.x][p2.y] - d2[p2.x][p2.y])^2:
             // d2 is better in both pixels
             set F[p1.x][p1.y] = d2[p1.x][p1.y]
             mark p1's 3 other pairings hot
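In Python-ish terms, one way to sketch this relaxation (assuming each winning crop has already been resampled into a full-size canvas in the target's coordinate system, and that an owner map records which source supplies each pixel; helper names are illustrative):

import numpy as np
from collections import deque

def neighbors(p, H, W):
    y, x = p
    return [(y + dy, x + dx) for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1))
            if 0 <= y + dy < H and 0 <= x + dx < W]

def seam_optimize(T, sources, owner, max_steps=10_000_000):
    # T:       target image, H x W x 3 float array
    # sources: dict of source id -> H x W x 3 array, each already placed in the
    #          target's coordinate system (a sketch; regions a source doesn't
    #          cover would need a large sentinel cost upstream)
    # owner:   H x W int array saying which source currently supplies each pixel;
    #          the franken image is sources[owner[y, x]][y, x]
    H, W = owner.shape

    def cost(src_id, p):
        y, x = p
        return np.sum((T[y, x] - sources[src_id][y, x]) ** 2)

    # every neighboring pair that straddles a seam starts out hot
    hot = deque(((y, x), q) for y in range(H) for x in range(W)
                for q in ((y + 1, x), (y, x + 1))
                if q[0] < H and q[1] < W and owner[y, x] != owner[q])
    for _ in range(max_steps):
        if not hot:
            break
        p1, p2 = hot.popleft()
        d1, d2 = owner[p1], owner[p2]
        if d1 == d2:
            continue                              # pair went cold along the way
        keep = cost(d1, p1) + cost(d2, p2)        # status quo
        all1 = cost(d1, p1) + cost(d1, p2)        # d1 takes both pixels
        all2 = cost(d2, p1) + cost(d2, p2)        # d2 takes both pixels
        if all1 < keep and all1 <= all2:
            owner[p2] = d1
            hot.extend((p2, q) for q in neighbors(p2, H, W) if q != p1)
        elif all2 < keep:
            owner[p1] = d2
            hot.extend((p1, q) for q in neighbors(p1, H, W) if q != p2)
    return owner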
I couldn't get this process to actually converge to 0 hot pairs. The number of hot pairs increased at first, then slowly decreased to well below its original value, but eventually hovered at around 5% of the original hot pair count. I'm not sure why this happened, but it doesn't really matter for the result. The following videos show this process in action:
Results
Future
I am happy with the final (seam-optimized) results above; it is a nice effect. Optimizing the seams in the way I did, however, washes out the component database images and squishes them together, removing much of their individual meaning. In future work, I'd like to perturb the component images less while still depicting the target image sufficiently. There are a number of improvements to my algorithms and their implementations that could help achieve this goal.
Database
A larger database with more varied, more uniformly distributed content would likely present the opportunity to find better matches. In an end-user application, the user should be able to seed a database of their choice with arbitrary search terms.
Matching
I devolved to a direct pixel-wise SSD matching approach after failing to use gist effectively, but something a bit higher level, specifically something more edge-aware, could prove quite fruitful. Direct color could still play a role, or not, and the result could be a really great Frankensteinian mashup where structure is preserved but color is arbitrary.
- Normalize images before comparing them with SSD (e.g. factor out per image illumination variations)
- Compare edge images from e.g. an edge detector
- Compare gradient images, which don't make a harsh binary decision about edges the way an edge detector does (see the sketch after this list)
- Use a higher level image descriptor, like HoG, which can encode spatial edge responses with selectable dimensionality, and could fit into a tree data structure nicely (see Speedups section below).
- Use gist!
- Use a self-similarity descriptor, whose results [Detecting and Sketching the Common, Bagon et al.] seem very much in line with the goals of Frankenimage.
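As one example of the edge-aware direction, here is a sketch of the gradient-image comparison from the list above (assuming grayscale float patches and plain finite differences rather than any particular edge detector):

import numpy as np

def gradient_magnitude(gray):
    # Finite-difference gradient magnitude of a grayscale float image.
    gy, gx = np.gradient(gray)
    return np.hypot(gx, gy)

def gradient_ssd(crop_gray, cell_gray):
    # Compare two same-size patches by the SSD of their gradient magnitudes,
    # which keeps edge structure in play while ignoring absolute color.
    return np.sum((gradient_magnitude(crop_gray) - gradient_magnitude(cell_gray)) ** 2)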
Speedups
Speedups don't necessarily have any effect on the result. In this work so far, speed was not a significant issue, even though I let the matching code iterate for days. However, a key matching improvement can likely be derived from using a significantly larger database, in which case speedups could start to mean the difference between success and intractability.
- Database iteration: Originally I had assumed that the matching would benefit most from trying every image in the database, rather than trying every possible crop of each database image. So the matcher used 10 or 20 random crops from each database image and then moved on to the next one. It turns out the matching code walked through the entire database multiple times over the course of a couple of days. Furthermore, my CPU was badly underutilized because disk I/O was the bottleneck. So instead of trying only 10 or 20 random crops per database image, I should have tried 100 or 200, or robustly tried all possible crops (rotations included), to saturate the CPU, likely without affecting the database image consumption rate at all. In fact, since direct matching was performed at 16 by 16 pixel resolution, I had 158,000 images, and I limited crops to a minimum of 50% of the original image's size (in order to retain solo meaning), the size of all the values in question was just 450 MB, which easily fits in modern RAM. Caching all these values in RAM could be a very significant (and easy to implement) speedup (see the sketch after this list).
- Tree descent instead of brute force: Obviously, descending a tree would be much, much faster than the naive brute-force matching I used (although for purposes of a FrankenInstagram, a 24-hour wait for "film processing" would have immense retro appeal). Matching at multiple scales would fit well with a tree descent, as would a HoG matching scheme or, for that matter, any image descriptor whose dimensionality decreases with image resolution. Nearest neighbors, kd-trees, etc.
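Here is a sketch combining the two ideas above: cache every crop as a flattened 16 by 16 thumbnail in one in-RAM array and query it with a kd-tree (scipy's cKDTree here; note that a kd-tree over 768 raw pixel dimensions is far from ideal, and a lower-dimensional descriptor would suit a tree descent better):

import numpy as np
from scipy.spatial import cKDTree

def build_crop_index(crop_thumbnails):
    # crop_thumbnails: N x (16*16*3) float32 array of flattened crop thumbnails.
    # At ~3 KB per crop (float32), a few hundred thousand crops sit comfortably in RAM.
    return cKDTree(crop_thumbnails)

def best_match(tree, cell):
    # Nearest cached crop to one 16 x 16 x 3 target cell; returns (distance, index).
    return tree.query(cell.astype(np.float32).ravel())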
Video
- Component images should persist for as long as possible from frame to frame, and they should move with the motion in the video (resulting in an effect similar to, but more deliberate than, the artifacts seen when MPEG video is improperly decoded).
- The creator of a Frankenvideo video could set keyframes for database search queries, which would factor into a big fat cost function.
- Optimize a cost function including a term that wants the component images to persist from frame to frame, a term that wants each target frame to be represented well, and a term that wants to satisfy the creator's database seed query keyframes.