Image Retrieval Overview (from Traditional Local Features to Recent Deep Learning Approaches)

1. 画像検索今昔物語株式会社ディー・エヌ・エー内田祐介

2. 特定物体認識 1 • 類似画像検索 • 一般物体認識（クラス分類） • 特定物体認識同じ物体（インスタンス）が写っている画像を検出 ResultQuery Query 空、雲 ResultQuery 大規模特定物体認識の最新動向 https://sites.google.com/site/yu4uchida/uchida_ieice2013.pdf

3. 大域特徴ベース vs 局所特徴ベース 2 • 大域特徴 (global feature) ベース – 画像から1つの特徴を抽出（e.g. カラーヒストグラム） – 類似画像検索ではうまくいくが特定物体認識ではうまくいかない • 局所特徴 (local feature) ベース – 画像から多数の局所特徴を抽出（e.g. SIFT） – それらのマッチング結果により類似度を定義 – SIFT等の強力な特徴量により deep learningに最後まで抵抗（最近やられた模様）

4. 大域特徴ベース vs 局所特徴ベース 3 • 大域特徴で検索 • 局所特徴で検索 • 局所特徴をaggregateして大域特徴にして検索 – FV, VLAD

5. 局所特徴ベース特定物体認識 4 • Detection：局所特徴領域の検出 • Description：局所特徴領域の記述 • Indexing＆Search：（近似）最近傍探索 • Post process – Geometric verification – Query expansion セットになることが多いが本来は独立して選択できる

6. 局所特徴を用いた特定物体認識 4/16/2018 5 ①Extract local regions (patches) from images ②Describe the patches by d-dimensional vectors ③Make correspondences between similar patches ④Calculate similarity between the images Similarity: 3 Position (x, y) Orientation θ Scale σ Feature vector f (e.g., 128-dim SIFT) Local feature

7. 局所特徴領域の検出手法 6 • Blobタイプとコーナータイプ • 回転不変、スケール不変、アフィン不変とタイプ分けされる • 基本的なアイディア＝畳み込みフィルタの応答の極大値により検出

8. マルチスケール検出の直感的理解 7 • Blobの中心とカーネルの中心が一致するときが一番responseが大きくなる信号（画像）畳み込みカーネル（e.g. LoG）

9. マルチスケール検出の直感的理解 8 • カーネルサイズとblobのスケールが一致するときが一番responseが大きくなる • スケールスペースでのフィルタ応答が極大となる＝局所特徴

10. 局所特徴領域の検出手法 9 Hessian Beaudet’78 Harris Harris’88 LoG Lindeberg’98 DoG Lowe’99 SURF Bay’06 Harris-Laplace Mikolajczyk’01 Hessian-Affine Mikolajczyk’04 Harris-Affine Mikolajczyk’02 FAST Rosten’05 Affine-invariant Scale-invariant Rotation-invariant LoG scale seletion Affine adaptation Multi-scale + Box filter acceleration LoG approximation Hessian-Laplace Mikolajczyk’01 Oriented FAST Rublee’11 SUSAN Smith’97 Simplification + tree acceleration Orientation Corner-like Blob-like (SIFT) (ORB)

11. 局所特徴領域の記述手法 10 • 実数値タイプとバイナリタイプがある SIFT Lowe’99 SURF Bay’06 BRIEF Calonder’10 ORB Rublee’11 GLOH Mikolajczyk’05 FREAK Alahi’12 A-KAZE Alcantarilla’13 LDB Yang’12 LATCH Levi’16 BRISK Leutenegger’11 Real-valued Binary (0.56, 0.22, -0.10, …, 0.96) (1, 0, 0, …, 1) RootSIFT Arandjelovic’12

12. どれを使えば良いの？ 11 • 精度重視 – SIFT or Hessian Affine detector + RootSIFT descriptor • 速度重視 – ORB detector + ORB descriptor • Local Feature Detectors, Descriptors, and Image Representations: A Survey https://arxiv.org/abs/1607.08368

13. RootSIFT [Arandjelovic+, CVPR’12] 4/16/2018 12 • Hellinger kernel works better than Euclidean distance in comparing histograms such as SIFT • Hellinger kernel (Bhattacharyya’s coefficient) for L1 normalized histograms x and y: • Explicit feature map of x into x’ : – L1 normalize x – element-wise square root x to give x’ – then x’ is L2 normalized • Computing Euclidean distance in the feature map space is equivalent to Hellinger distance in the original space: RootSIFT RootSIFT

14. Large-scale Object Recognition 4/16/2018 13 ・・・ Distance calculation Query image Reference images Explicit feature matching requires high computational cost and memory footprint Match Bag-of-visual words!

15. Bag-of-Visual Words [Sivic+, ICCV’03] 4/16/2018 14 • Offline – Collect a large number of training vectors – Perform clustering algorithm (e.g., k-means) – Centroids of clusters = visual words (VWs) • Online: – All features are assigned to their nearest visual words – An image is represented by the frequency histogram of VWs – (Dis)similarity is defined by the distance between histograms Visual words (VW) VW1 VWn VW2 … Visual words －－・・・－－－・・・－－－・・・－－－・・・－－－・・・－ Frequency }1|{ Nii  vV

16. Bag-of-Visual Words [Sivic+, ICCV’03] 4/16/2018 1515 VW1 VW2 VWk VWn ・・・・・・ Indexing step (quantization) Search step (quantization) Match Match Matching can be performed in O(1) with an inverted index Query image Reference images Nearest VW

17. 1 2 w N Inverted index Image ID 1 2 3 4 5 6 7 8 9 10 11 12 ... Image ID Accumulated scores VW ID Obtain image IDs Query image Reference image Image ID ...(x, y) σ θ (1) Feature detection (2) Feature description (3) Quantization (1) Feature detection (2) Feature description (3) Quantization (4) Voting ... ... ... ... Visual word v1 ... Visual word vw ... Visual word vN Visual words 1 4 5 7 10 16 19 Offline step Visual word v1 ... Visual word vw ... Visual word vN Visual words Get images with the top-K scores Results inlier outlier (5) Geometric verification 全体処理 Geometric verification

18. Geometric (Spatial) Verification 17 • マッチングした結果には誤検出が含まれる – 正解のマッチング（inlier）はある幾何的な変換モデルに対して整合性が取れているはずなので、モデルの推定とinlierの同定を同時に行う→RANSAC – inlierのみを用いて画像間の類似度とすると精度が向上 outlier inlier

19. モデル; p’ = Mp 18 rotation scalingtranslation similarity trans. affine trans. perspective trans. 1DoF 2DoF 1DoF 4DoF 5DoF 6DoF 7DoF Fundamental Matrix

20. RANSAC 19 1. モデルパラメータを計算できる対応点をランダムサンプリング 2. モデルパラメータを算出 3. 全ての点対応で、上記のモデルパラメータと整合する点対応をinlierとみなす 4. 上記を一定回数繰り返し、一番inlierが多かったモデルパラメータを採用

21. Weak Geometric Consistency [Jegou+, ECCV’08] 4/16/2018 20 • スケール比、角度差はconsistentなので角度差、スケール比空間にハフ変換的に投票する – 正解はスコアが下がらないが不正解ペアのスコアが大きく下がる

22. どのモデルを使えばよいの？ 21 • とりあえず相似変換かアファイン変換 ←対象から離れていれば大体相似変換で近似可能 • スケールと角度がある特徴領域だと 1つの対応点から相似変換が求まる！ →全ペアに対してモデル推定＋inlier算出をする • その後、より自由度の大きいモデルをフィッティングしても良い J. Philbin et al., “Object retrieval with large vocabularies and fast spatial matching,” CVPR’17.

23. Query Expansion 22 • 最初の検索結果を元に、新たな検索クエリを人工的に作成し、「芋づる式」に検索結果を改善することを狙うクエリ検索結果拡張クエリ新たな検索結果

24. Average Query Expansion [Chum+, ICCV’07] 4/16/2018 23 • Obtain top (m < 50) verified results of original query • Construct new query using average of these results Without geometric verification, QE degrades accuracy! Query image Verified results New query

25. Multiple Image Resolution Expansion [Chum+, ICCV’07] 4/16/2018 24 ROI Query image ROI ROIROI ROI ROI ROI First verified results ROI ROI ROI ROI ROI ROI • Calculate relative change in resolution • Construct average query for each resolution New query1 New query2 New query3

26. Query Expansion Results 4/16/2018 25 • ori = original query • qeb = query expansion baseline • trc = transitive closure expansion • avg = average query expansion • rec = recursive average query expansion • sca = multiple image resolution expansion

27. Discriminative Query Expansion [Arandjelovic+, CVPR’12] 4/16/2018 26 • Train a linear SVM classifier – Use verified results as positive training data – Use low ranked images as negative training data – Rank images on their signed distance from the decision boundary – Reranking can be efficient with an inverted index!

28. Aggregation Methods 27 • 局所特徴は1画像から1000前後抽出される • 画像が多いとインデックスが肥大化 • 特に画像認識では1つのベクトルとして扱いたい – Fisher Vector (FV) – VLAD • 精度を求める場合は使わない

29. 最近傍探索 (Nearest Neighbor Search, NNS) 28 • 距離空間 M における点の集合 S とクエリ点 q∈M が与えられた際に S の中で q に最も近い点を探す – k近傍 / range search • ユークリッド空間での最近傍探索を扱うことがほとんど • kd-tree, SR-tree等のindexingにより高速化（高次元（数十？）で次元の呪いにかかる） + + + ++ + + + + + + + o q Input + + + ++ + + + + + + + o q Output S

30. 近似最近傍探索 29 • エラーを許す代わりに高速化、エラー率とトレードオフ – 速度、精度、メモリ使用量がトレードオフになる • 木構造＋priority search – kd-tree, randomized kd-trees, hierarchical kd-tree – メモリを気にしなければ無難で良い • Locality Sensitive Hashing (LSH) 系 – ***LSHがいっぱい。個人的には嫌い • 直積量子化系 – サーベイ → https://www.jstage.jst.go.jp/article/mta/6/1/6_2/_article/-char/ja/ – データを圧縮し、圧縮したまま検索 • バイナリ圧縮系 – いっぱいある https://www.slideshare.net/ren4yu/k-means-hashing-up (Heさんだよ） – バイナリ符号にするのでpopcnt命令で距離計算できる（がそのままだとlinear search）

31. え？でもCNNのほうが良いんでしょ？ 30

32. CNN系 (global feature) 31 • CNN Features off-the-shelf: an Astounding Baseline for Recognition https://arxiv.org/abs/1403.6382 – クラス分類用のCNN (OverFeat) のFCをそのまま使っても結構良い • Neural Codes for Image Retrieval https://arxiv.org/pdf/1404.1777.pdf – 最終層前のFCを使ったほうが良いとか、検索対象のドメインで finetuneしたほうが良いとか • CNN Image Retrieval Learns from BoW: Unsupervised Fine- Tuning with Hard Examples https://arxiv.org/abs/1604.02426 – Siamese Networkで学習 • Global featureでもかなり良い（vs. FV/VLAD) • 基本的に回転・スケール不変ではないことに注意

33. CNN系 (local feature) 32 • LIFT: Learned Invariant Feature Transform https://arxiv.org/abs/1603.09114 – 検出、角度推定、記述をend-to-endで学習 – 遅いし検索では精度出ていない • Large-Scale Image Retrieval with Attentive Deep Local Features https://arxiv.org/abs/1612.06321 – FCN＋アテンション（マルチスケールでやる）で局所特徴を定義 – 良さげ https://github.com/tensorflow/models/tree/master/researc h/delf – 回転不変性は担保されない

34. DELF 33

35. Comparative Study 34 • Revisiting Oxford and Paris: Large-Scale Image Retrieval Benchmarking https://arxiv.org/abs/1803.11285 – Local, global, CNN/非CNNが網羅的に比較されている（が、著者らのチームにバイアスがかかっているかも） Local Global 非CNN CNN

36. ベストプラクティス① 35 • Global → https://arxiv.org/abs/1711.02512 – 性能の良いベースネットワークを利用（ResNet以上）し、finetune（Siamere?）する – generalized mean-pooling (Lp, p=3) を利用 – 複数スケール (region) を利用 – RegionレベルでDiffusionベースのquery expansion https://arxiv.org/abs/1611.05113

37. ベストプラクティス② 36 • Local → https://hal.inria.fr/hal-01131898/document – 特徴量としてはDELFを利用 – Indexing, matching, scoringがややこしい（ASMK – Geometric verificationは必須 – Query expansionもやる

Image Retrieval Overview (from Traditional Local Features to Recent Deep Learning Approaches)

Yusuke Uchida

Image Retrieval Overview (from Traditional Local Features to Recent Deep Learning Approaches)