(cache) ISPRS 2017

Semantic Segmentation Road/Lane Detection
	Road Scene Segmentation from a Single Image [pdf] [slide] Jose Manuel Alvarez and Theo Gevers and Yann LeCun and Antonio M. Lopez	ECCV 2012 Alvarez2012ECCV

Recovering the 3D structure of the road scenes
Convolutional neural network to learn features from noisy labels to recover the 3D scene layout
Generating training labels by applying an algorithm trained on a general image dataset
Train network using the generated labels to classify on-board images (offline)
Online learning of patterns in stochastic random textures (i.e. road texture)
Texture descriptor based on a learned color plane fusion to obtain maximal uniformity in road areas
Offline and online information are combined to detect road areas in single images
Evaluation on a self-recorded dataset and CamVid

Semantic Segmentation Road/Lane Detection
	Road Detection Based on Illuminant Invariance [pdf] [slide] Jose Manuel Alvarez and Antonio M. Lopez	TITS 2011 Alvarez2011TITS

Identifying road pixels is a major challenge due to the intraclass variability caused by lighting conditions. A particularly difficult scenario appears when the road surface has both shadowed and nonshadowed areas
Proposes a novel approach to vision-based road detection that is robust to shadows

Contributions:
- Uses a shadow-invariant feature space combined with a model-based classifier
- Proposes to use the illuminant-invariant image as the feature space
- This invariant image is derived from the physics behind color formation in the presence of a Planckian light source, Lambertian surfaces, and narrowband imaging sensors.
- Sunlight is approximately Planckian, road surfaces are mainly Lambertian, and regular color cameras are near narrowband
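A minimal sketch of how such an invariant image could be computed, assuming the common log-chromaticity formulation: each pixel's (log R/G, log B/G) pair is projected onto a camera-dependent direction. The angle `theta` and the function names are hypothetical; the notes above do not give the calibration procedure.

```python
import math

def invariant_value(r, g, b, theta):
    """Project log-chromaticity (log R/G, log B/G) onto direction theta (radians)."""
    eps = 1e-6  # assumption: small offset to avoid log(0) on dark pixels
    x = math.log((r + eps) / (g + eps))
    y = math.log((b + eps) / (g + eps))
    return x * math.cos(theta) + y * math.sin(theta)

def invariant_image(image, theta):
    """image: rows of (R, G, B) tuples; returns a grayscale image of the same shape."""
    return [[invariant_value(r, g, b, theta) for (r, g, b) in row] for row in image]
```

A uniform change in brightness cancels in the channel ratios for any `theta`; invariance to the illuminant colour (and hence to shadows) additionally depends on `theta` being calibrated for the camera.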

Evaluates on self-recorded data

Reconstruction Stereo
	Stereo Matching by Training a Convolutional Neural Network to Compare Image Patches [pdf] [slide] Zbontar, Jure and LeCun, Yann	JMLR 2016 Zbontar2016JMLR

Matching cost computation by learning a similarity measure on patches using a CNN
- Siamese network with normalization and cosine similarity at the end
- Fast architecture and accurate architecture (+fully connected layers)
Binary classification of similar and dissimilar pairs
- Sampling negatives in the neighbourhood of the positive
- Margin loss
A series of post-processing steps:
- cross-based cost aggregation, semiglobal matching, a left-right consistency check, subpixel enhancement, a median filter, and a bilateral filter
Best-performing method on the KITTI 2012 and 2015 datasets at the time of publication
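A minimal sketch of the similarity score and margin loss named above, assuming the two patches have already been mapped to feature vectors by the shared (siamese) network. The margin value 0.2 and the function names are assumptions for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def margin_loss(s_pos, s_neg, margin=0.2):
    """Hinge loss: zero once the matching pair scores at least `margin` above the non-matching pair."""
    return max(0.0, margin + s_neg - s_pos)
```

The loss only pushes the positive pair above the negative one until the margin is satisfied, which is why sampling hard negatives near the true match matters.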

Semantic Segmentation Road/Lane Detection
	3D Scene Priors for Road Detection [pdf] [slide] Jose M. Alvarez and Theo Gevers and Antonio M. Lopez	CVPR 2010 Alvarez2010CVPR

Vision-based road detection
Current methods:
- Based on low-level features only
- Assuming structured roads, road homogeneity, and uniform lighting conditions
Information at scene, image and pixel level by exploiting sequential nature of the data
Low-level, contextual and temporal cues combined in a Bayesian framework
Contextual cues as horizon lines, vanishing points, 3D scene layout and 3D road stages
Robust to varying imaging conditions, road types, and scenarios (tunnels, urban and highway)
The combination of cues outperforms every individual cue.

Tracking Person Tracking
	Monocular 3D Pose Estimation and Tracking by Detection [pdf] [slide] Mykhaylo Andriluka and Stefan Roth and Schiele, Bernt	CVPR 2010 Andriluka2010CVPR

3D pose estimation from image sequences using tracking by detection
Previous methods work well in controlled environments but struggle with real-world scenarios
Three-stage approach
- Initial estimate of 2D articulation and viewpoint of the person using an extended 2D person detector
- Data association and accumulation into robust estimates of 2D limb positions using an HMM-based tracking approach
- Estimates used as robust image observation to reliably recover 3D pose in a Bayesian framework using hGPLVM as temporal prior
Evaluation on HumanEva II and a novel real world dataset TUD Stadtmitte for qualitative results

Tracking Person Tracking
	People-Tracking-by-Detection and People-Detection-by-Tracking [pdf] [slide] M. Andriluka and S. Roth and B. Schiele	CVPR 2008 Andriluka2008CVPR

Combining detection and tracking in a single framework
Motivation:
- People detection in complex street scenes, but with frequent false positives
- Tracking for a particular individual, but challenged by crowded street scenes
Extension of a state-of-the-art people detector with a limb-based structure model
Hierarchical Gaussian process latent variable model (hGPLVM) to model dynamics of the individual limbs
- Prior knowledge on possible articulations
- Temporal coherency within a walking cycle
HMM to extend the people-tracklets to possibly longer sequences
Improved hypotheses for position and articulation of each person in several frames
Detection and tracking of multiple people in cluttered scenes with reoccurring occlusions
Evaluated on TUD-Campus dataset

Tracking
	Multi-target tracking by continuous energy minimization [pdf] [slide] Andriyenko, Anton and Schindler, Konrad	CVPR 2011 Andriyenko2011CVPR

Existing methods limit the state space, either by per-frame non-maxima suppression or by discretizing locations to a coarse grid

Contributions:
- Target locations are not bound to discrete object detections or grid positions, so they remain defined when the detector fails and there is no grid aliasing
- Proposes that convexity is not the primary requirement for a good cost function in the case of tracking.
- New minimization procedure is capable of exploring a much larger portion of the search space than standard gradient methods

Evaluates on sequences from terrace1, terrace2, VS-PETS2009, TUD-Stadtmitte datasets

Tracking
	Discrete-continuous optimization for multi-target tracking [pdf] [slide] Andriyenko, Anton and Schindler, Konrad and Roth, Stefan	CVPR 2012 Andriyenko2012CVPR

Multi-target tracking consists of the discrete problem of data association and the continuous problem of trajectory estimation
Previous approaches tackled both problems separately, using precomputed trajectories for data association
Discrete-continuous optimization that jointly addresses data association and trajectory estimation
Continuous trajectory model using cubic B-splines
Discrete association using an MRF that assigns each observation to a trajectory or identifies it as an outlier
Combined formulation with label costs to avoid too many trajectories
Evaluation on the TUD datasets
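A minimal sketch of the continuous trajectory model, assuming a uniform cubic B-spline evaluated segment by segment from four consecutive control points. The function names and sampling density are hypothetical; how the control points are fitted to the observations is not shown.

```python
def cubic_bspline_point(p0, p1, p2, p3, t):
    """Evaluate one uniform cubic B-spline segment at t in [0, 1]."""
    b0 = (1 - t) ** 3 / 6.0
    b1 = (3 * t ** 3 - 6 * t ** 2 + 4) / 6.0
    b2 = (-3 * t ** 3 + 3 * t ** 2 + 3 * t + 1) / 6.0
    b3 = t ** 3 / 6.0
    return tuple(b0 * a + b1 * b + b2 * c + b3 * d
                 for a, b, c, d in zip(p0, p1, p2, p3))

def sample_trajectory(control_points, samples_per_segment=10):
    """Sample a smooth trajectory from a list of 2D control points."""
    points = []
    for i in range(len(control_points) - 3):
        segment = control_points[i:i + 4]
        for k in range(samples_per_segment):
            points.append(cubic_bspline_point(*segment, k / samples_per_segment))
    return points
```

Because each point depends on only four control points, changing one assignment in the discrete step affects the trajectory locally.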

Motion & Pose Estimation Simultaneous Localization and Mapping
	Google Street View: Capturing the World at Street Level [pdf] [slide] Dragomir Anguelov and Carole Dulong and Daniel Filip and Christian Frueh and Stephane Lafon and Richard Lyon and Abhijit S. Ogale and Luc Vincent and Josh Weaver	COMPUTER 2010 Anguelov2010COMPUTER

Google Street View captures panoramic imagery of streets in hundreds of cities in 20 countries
Technical challenges in capturing, processing, and serving street-level imagery
Developed sophisticated hardware, software and operational processes
Pose estimation from GPS, wheel encoder, and inertial sensor data with an online Kalman-filter-based algorithm
Camera system consisting of 15 small cameras using 5 MP CMOS
Laser range data is aggregated and simplified by fitting a coarse mesh
Supports 3D navigation

Semantic Segmentation Semantic Segmentation of Aerial Images
	Semantic Segmentation of Earth Observation Data Using Multimodal and Multi-scale Deep Networks [pdf] [slide] Nicolas Audebert and Bertrand Le Saux and Sebastien Lefevre	ARXIV 2016 Audebert2016ARXIV

Investigates the use of deep fully convolutional neural networks for pixel-wise scene labeling of Earth Observation images

Contributions:
- Efficiently transfers deep fully convolutional neural networks from generic everyday images to remote sensing images
- Introduces a multi-kernel convolutional layer for fast aggregation of predictions at multiple scales
- Performs data fusion from heterogeneous sensors (optical and laser) using residual correction

Evaluates on ISPRS Vaihingen 2D Semantic Labeling dataset

Semantic Segmentation Road/Lane Detection
	Free Space Computation Using Stochastic Occupancy Grids and Dynamic Programming [pdf] [slide] H. Badino and U. Franke and R. Mester	ICCVWORK 2007 Badino2007ICCVWORK

Free space is the set of world regions where navigation without collision is guaranteed

Contributions:
- Presents a method for the computation of free space with stochastic occupancy grids
- Stereo measurements are integrated over time reducing disparity uncertainty.
- These integrated measurements are entered into an occupancy grid, taking into account the noise properties of the measurements
- Defines three types of grids and discusses their benefits and drawbacks
- Applies dynamic programming to a polar occupancy grid, to find the optimal segmentation between free and occupied regions
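A minimal sketch of the dynamic-programming step, assuming the polar grid has been turned into a cost table `cost[column][row]` (one column per viewing direction, one row per range bin) where low cost marks a likely free/occupied boundary. The cost definition, the linear smoothness penalty, and the function name are assumptions for illustration.

```python
def free_space_boundary(cost, smooth=1.0):
    """Pick one row per column minimizing data cost plus smooth * |row jump| between neighbours."""
    n_rows = len(cost[0])
    total = [list(cost[0])]   # best accumulated cost ending at each row
    back = []                 # back-pointers for path recovery
    for column in cost[1:]:
        col_total, col_back = [], []
        for r in range(n_rows):
            prev = min(range(n_rows),
                       key=lambda p: total[-1][p] + smooth * abs(r - p))
            col_total.append(column[r] + total[-1][prev] + smooth * abs(r - prev))
            col_back.append(prev)
        total.append(col_total)
        back.append(col_back)
    # Trace the optimal path from the last column back to the first.
    r = min(range(n_rows), key=lambda p: total[-1][p])
    path = [r]
    for col_back in reversed(back):
        r = col_back[r]
        path.append(r)
    path.reverse()
    return path
```

A larger `smooth` value trades fidelity to individual columns for a boundary that changes less between neighbouring directions.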

Evaluates on stereo sequences introduced in the paper

Representations Stixels
	The Stixel World - A Compact Medium Level Representation of the 3D-World [pdf] [slide] Badino, Hernan and Franke, Uwe and Pfeiffer, David	DAGM 2009 Badino2009DAGM

Motivation: Develop a compact, flexible representation of the 3D traffic situation that can be used for the scene understanding tasks of driver assistance and autonomous systems

Contributions:
- Introduces a new primitive, a set of rectangular sticks called stixel for modeling 3D scenes
- Each stixel is defined by its 3D position relative to the camera and stands vertically on the ground, having a certain height
- Each stixel limits the free space and approximates the object boundaries

Stochastic occupancy grids are computed from dense stereo information
Free space is computed from a polar representation of the occupancy grid
The height of the stixels is obtained by segmenting the disparity image in foreground and background disparities

Motion & Pose Estimation Localization
	Real-Time Topometric Localization [pdf] [slide] Hernan Badino and Daniel Huber and Takeo Kanade	ICRA 2012 Badino2012ICRA

Autonomous vehicles must be capable of localizing in GPS denied situations
Topometric localization which combines topological with metric localization
Build compact database of simple visual and 3D features with GPS equipped vehicle
Whole image SURF descriptor, a vector containing gradient information of entire image
Range mean and standard deviation descriptor
Localization using a Bayesian filter to match visual and range measurements to the database
Algorithm is reliable across wide environmental changes, including lighting differences and seasonal variations
Evaluation using a vehicle with mounted video cameras and LIDAR
Achieving an average localization accuracy of 1 m on an 8 km route
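A minimal sketch of one step of a discrete Bayes filter over database locations, assuming a known motion model between locations and a per-location measurement likelihood. The data layout and function name are hypothetical; the descriptors and matching scores described above are abstracted into `likelihood`.

```python
def bayes_filter_step(belief, transition, likelihood):
    """belief[i]: P(at node i); transition[i][j]: P(move i -> j); likelihood[j]: P(measurement | node j)."""
    n = len(belief)
    # Prediction: propagate the belief through the motion model.
    predicted = [sum(transition[i][j] * belief[i] for i in range(n)) for j in range(n)]
    # Update: weight by how well the current measurement matches each node.
    posterior = [likelihood[j] * predicted[j] for j in range(n)]
    total = sum(posterior)
    return [p / total for p in posterior]
```

Repeating this step over time lets ambiguous single-frame matches be resolved by the sequence of measurements.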

Semantic Segmentation Road/Lane Detection
	Stereo-based Free Space Computation in Complex Traffic Scenarios [pdf] [slide] Badino, H. and Mester, R. and Vaudrey, T. and Franke, U.	IAI 2008 Badino2008IAI

Computation of free space in complex traffic scenarios including moving objects
By integrating stereo measurements over time
- Pixel-wise disparity and disparity speed with Kalman filters
- Stochastic occupancy grids using stereo information
- Dynamic programming on a polar-like occupancy grid
Runs at 20 Hz on VGA-resolution test images
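A minimal sketch of a per-pixel Kalman filter with state (disparity, disparity speed), assuming a constant-velocity motion model and a direct disparity measurement. The noise values `q` and `r`, the time step, and the function name are assumptions for illustration.

```python
def kalman_step(d, v, P, z, dt=1.0, q=1e-3, r=0.25):
    """One predict/update cycle. d: disparity, v: disparity speed, P: 2x2 covariance, z: measured disparity."""
    # Predict with the constant-velocity model F = [[1, dt], [0, 1]].
    d_pred = d + dt * v
    v_pred = v
    p00 = P[0][0] + dt * (P[1][0] + P[0][1]) + dt * dt * P[1][1] + q
    p01 = P[0][1] + dt * P[1][1]
    p10 = P[1][0] + dt * P[1][1]
    p11 = P[1][1] + q
    # Update with measurement model H = [1, 0].
    s = p00 + r                 # innovation variance
    k0, k1 = p00 / s, p10 / s   # Kalman gain
    residual = z - d_pred
    d_new = d_pred + k0 * residual
    v_new = v_pred + k1 * residual
    P_new = [[(1 - k0) * p00, (1 - k0) * p01],
             [p10 - k1 * p00, p11 - k1 * p01]]
    return d_new, v_new, P_new
```

Integrating measurements this way reduces disparity noise over time and yields a speed estimate that a single stereo pair cannot provide.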

Semantic Segmentation Label Propagation
	Mixture of trees probabilistic graphical model for video segmentation [pdf] [slide] Badrinarayanan, Vijay and Budvytis, Ignas and Cipolla, Roberto	IJCV 2014 Badrinarayanan2014IJCV

Mixture of trees probabilistic graphical model for semi-supervised video segmentation
Each component represents a tree structured temporal linkage between super-pixels from first to last frame
Variational inference scheme for this model to estimate super-pixel labels and the confidence
- Structured variational inference without unaries to estimate super-pixel marginal posteriors
- Training a soft label Random Forest classifier with pixel marginal posteriors
- Predictions are injected back as unaries in the second iteration of label inference
Inference over full video volume which helps to avoid erroneous label propagation
Very efficient in terms of computational speed and memory usage and can be used in real time
Evaluation using the challenging SegTrack dataset (binary segmentation) and the CamVid driving video dataset (multi-class segmentation)

Semantic Segmentation Label Propagation
	Label Propagation in Video Sequences [pdf] [slide] Vijay Badrinarayanan and Fabio Galasso and Roberto Cipolla	CVPR 2010 Badrinarayanan2010CVPR

Labelling of video sequences is expensive
Hidden Markov Model for label propagation in video sequences
Using a limited amount of hand labelled pixels
Optic Flow based, image patches based, semantic regions based label propagation
For short sequences, naive optic flow based propagation is sufficient; otherwise, more sophisticated models are necessary
Evaluation by training Random forest classifier for video segmentation with ground truth and data from label propagation

Semantic Segmentation Semantic Segmentation
	SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation [pdf] [slide] Vijay Badrinarayanan and Alex Kendall and Roberto Cipolla	ARXIV 2015 Badrinarayanan2015ARXIV

Motivation: Need to design an architecture for scene understanding which is efficient in terms of memory & computational time

Contributions:
- Proposes novel manner in which the decoder upsamples its lower resolution input feature map
- The decoder uses pooling indices computed in the max-pooling step of the encoder to perform non-linear upsampling
- This eliminates the need to learn upsampling and significantly reduces the number of trainable parameters
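A minimal 1D sketch of upsampling with stored pooling indices, assuming non-overlapping windows: the encoder remembers where each maximum came from, and the decoder writes values back to those positions and leaves the rest at zero. Function names are hypothetical, and the convolutions that would densify the sparse map afterwards are not shown.

```python
def max_pool_with_indices(x, size=2):
    """Non-overlapping max pooling that also returns the position of each maximum."""
    pooled, indices = [], []
    for start in range(0, len(x) - size + 1, size):
        window = x[start:start + size]
        offset = max(range(size), key=lambda i: window[i])
        pooled.append(window[offset])
        indices.append(start + offset)
    return pooled, indices

def max_unpool(values, indices, length):
    """Place each value at its remembered position; all other positions stay zero."""
    out = [0.0] * length
    for value, index in zip(values, indices):
        out[index] = value
    return out
```

No weights are involved in `max_unpool`, which is where the saving in trainable parameters comes from.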

Evaluates on CamVid road scene segmentation & SUN RGB-D indoor scene segmentation

Motion & Pose Estimation 2D Motion Estimation -- Optical Flow
	Exploiting Semantic Information and Deep Matching for Optical Flow [pdf] [slide] Min Bai and Wenjie Luo and Kaustav Kundu and Raquel Urtasun	ECCV 2016 Bai2016ECCV

Optical flow for autonomous driving
Assumptions
- Static background
- Small number of rigidly moving objects
Foreground/background segmentation using semantic segmentation network in combination with 3D object detection
Propose a siamese network with product layer that learns flow matching with uncertainty
Restricts flow matches to lie on the epipolar line
Slanted plane model for background flow estimation
Evaluation on KITTI 2015

Datasets & Benchmarks Real Data
	A Database and Evaluation Methodology for Optical Flow [pdf] [slide] Baker, Simon and Scharstein, Daniel and Lewis, J. and Roth, Stefan and Black, Michael and Szeliski, Richard	IJCV 2011 Baker2011IJCV

Presents a collection of datasets for the evaluation of optical flow algorithms
Contributes four types of data to test different aspects of optical flow algorithms:
- Sequences with nonrigid motion where the ground-truth flow is determined by tracking hidden fluorescent texture
- Realistic synthetic sequences - addresses the limitations of previous dataset sequences by rendering more complex scenes with significant motion discontinuities and textureless regions
- High frame-rate video used to study interpolation error
- Modified stereo sequences of static scenes for optical flow
Evaluates a number of well-known flow algorithms to characterize the current state of the art
Extends the set of evaluation measures and improves the evaluation methodology

Motion & Pose Estimation Localization
	Geo-localization of street views with aerial image databases [pdf] [slide] Mayank Bansal and Harpreet S. Sawhney and Hui Cheng and Kostas Daniilidis	ICM 2011 Bansal2011ICM

Aerial image databases are widely available, while ground-level imagery of urban areas is limited
Localization of ground level images in urban areas using a database of satellite and oblique aerial images
Method for estimating building facades by extracting line segments from satellite and aerial images
Correspondence of building facades between aerial and ground images using statistical self-similarity with respect to other patches on a facade
Position and orientation estimation of ground images
Qualitative results on a region around Rideau St. in Ottawa, Canada with BEV, Panoramio imagery and Google Street-view screen-shots

Reconstruction Reconstruction & Recognition
	Dense Object Reconstruction with Semantic Priors [pdf] [slide] Bao, S.Y. and Chandraker, M. and Yuanqing Lin and Savarese, S.	CVPR 2013 Bao2013CVPR

Dense reconstruction incorporating semantic information to overcome drawbacks of traditional multiview stereo
Learning a prior comprised of a mean shape and a set of weighted anchor points
Training from 3D scans and images of objects from various viewpoints
Robust algorithm to match anchor points across instances enables learning a mean shape for the category
Shape of an object modelled as warped version of the category mean with instance-specific details
Qualitative and quantitative results on a small dataset of model cars using leave-one-out

Object Detection Person Detection
	Pedestrian detection at 100 frames per second [pdf] [slide] Rodrigo Benenson and Markus Mathias and Radu Timofte and Luc J. Van Gool	CVPR 2012 Benenson2012CVPR

Fast and high quality pedestrian detection
Two new algorithmic speed-ups:
- Exploiting geometric context extracted from stereo images
- Efficiently handling different scales
Object detection without image resizing using stixels
Similar to Viola and Jones: scale the features not the images, applied to HOG-like features
Detections at 50 fps (135 fps on CPU+GPU)
Evaluated on INRIA Persons and Bahnhof sequence

Object Detection Person Detection
	Ten Years of Pedestrian Detection, What Have We Learned? [pdf] [slide] Rodrigo Benenson and Mohamed Omran and Jan Hendrik Hosang and Bernt Schiele	ECCV 2014 Benenson2014ECCV

Aim is to review progress over the last decade of pedestrian detection, & try to quantify which ideas had the most impact on final detection quality
Evaluates on Caltech-USA, INRIA and KITTI datasets for comparing methods

Conclusions:
- There is no conclusive empirical evidence that non-linear kernels provide meaningful gains over linear kernels
- Although the 3 families of pedestrian detectors (DPMs, decision forests, deep networks) are based on different learning techniques, their results are surprisingly close
- Multi-scale models provide a simple and generic extension to existing detectors. Despite consistent improvements, their contribution to the final quality is minor
- Most of the progress can be attributed to the improvement in features alone
- Combining the detector ingredients found to work well (better features, optical flow, and context) shows that these ingredients are mostly complementary

History of Autonomous Driving Autonomous Driving Projects
	VIAC: An out of ordinary experiment [pdf] [slide] Massimo Bertozzi and Luca Bombini and Alberto Broggi and Michele Buzzoni and Elena Cardarelli and Stefano Cattani and Pietro Cerri and Alessandro Coati and Stefano Debattisti and Andrea Falzoni and Rean Isabella Fedriga and Mirko Felisa and Luca Gatti and Alessandro Giacomazzo and Paolo Grisleri and Maria Chiara Laghi and Luca Mazzei and Paolo Medici and Matteo Panciroli and Pier Paolo Porta and Paolo Zani and Pietro Versari	IV 2011 Bertozzi2011IV

Presents the details and preliminary results of VIAC, the VisLab Intercontinental Autonomous Challenge, a test of autonomous driving along an unknown route from Italy to China
The onboard perception systems can detect obstacles, lane markings, ditches, berms and identify the presence and position of a preceding vehicle
The information on the environment produced by the sensing suite is used to perform different tasks, such as leader-following, stop & go, and waypoint following
All data have been logged, including all data generated by the sensors, vehicle data, and GPS info
This data is available for a deep analysis of the various systems performance, with the aim of virtually running the whole trip multiple times with improved versions of the software
This paper discusses some preliminary results and figures obtained by the analysis of the data collected during the test

History of Autonomous Driving Autonomous Driving Projects
	Vision-based intelligent vehicles: State of the art and perspectives [pdf] [slide] Massimo Bertozzi and Alberto Broggi and Alessandra Fascioli	RAS 2000 Bertozzi2000RAS

Survey on the most common approaches to the challenging task of Autonomous Road Following
Computing power not a problem any more
Data acquisition still problematic with difficulties like light reflections, wet road, direct sunshine, tunnels, shadows.
Enhancement of sensor's capabilities and performance need to be addressed
Full automation of traffic is technically feasible
Legal aspects related to the responsibility and the impact of automatic driving on human passengers need to be carefully considered
Automation will be restricted to special infrastructure for now and will be gradually extended to other key transportation areas such as shipping

Reconstruction Reconstruction & Recognition
	Large-Scale Semantic 3D Reconstruction: An Adaptive Multi-Resolution Model for Multi-Class Volumetric Labeling [pdf] [slide] Blaha, Maros and Vogel, Christoph and Richard, Audrey and Wegner, Jan D. and Pock, Thomas and Schindler, Konrad	CVPR 2016 Blaha2016CVPR

Joint formulation of semantic segmentation and 3D reconstruction enables the use of class-specific shape priors
State-of-the-art could not scale to large scenes because of run time and memory
Extension of an expensive volumetric approach
- Hierarchical scheme using an Octree structure
- Refines only in regions containing surfaces
- Coarse-to-fine converges faster because of improved initial guesses
Saves up to 95% computation time and 98% memory usage
Evaluation on real world data set from the city of Enschede

Reconstruction Multi-view 3D Reconstruction
	Efficient Volumetric Fusion of Airborne and Street-Side Data for Urban Reconstruction [pdf] [slide] András Bódis-Szomorú and Hayko Riemenschneider and Luc Van Gool	ICPR 2016 Bodis-Szomoru2016ICPR

Introduces an approach that unifies a detailed street-side MVS point cloud & a coarser but more complete point cloud from airborne acquisition in a joint surface mesh
Airborne acquisition and on-road mobile mapping provide complementary 3D information of an urban landscape
The former acquires roof structures, ground, and vegetation at a large scale, but lacks the facade and street-side details, while the latter is incomplete for higher floors and often totally misses out on pedestrian-only areas or undriven districts

Proposes a point cloud blending & volumetric fusion based on ray casting across a 3D tetrahedralization, extended with data reduction techniques to handle large datasets
First to adopt a 3DT approach for airborne/street-side data fusion
Pipeline exploits typical characteristics of airborne and ground data, and produces a seamless, watertight mesh that is both complete and detailed

Evaluates on self-recorded 3D urban data

Object Detection Human Pose Estimation
	Keep It SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image [pdf] [slide] Federica Bogo and Angjoo Kanazawa and Christoph Lassner and Peter V. Gehler and Javier Romero and Michael J. Black	ECCV 2016 Bogo2016ECCV

Describes the first method to automatically estimate the 3D pose of the human body as well as its 3D shape from a single unconstrained image
Estimates a full 3D mesh and shows that 2D joints alone carry a surprising amount of information about body shape

First uses a CNN-based method, DeepCut, to predict the 2D body joint locations
Then fits a body shape model, called SMPL, to the 2D joints by minimizing an objective function that penalizes the error between the projected 3D model joints and detected 2D joints
Because SMPL captures correlations in human shape across the population, robust fitting is possible with very little data
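A minimal sketch of the joint reprojection term, assuming a pinhole camera and a plain sum of squared pixel errors. The function names, camera parameters, and the choice of squared error are assumptions for illustration; the full objective may contain further terms (for example priors or a robust penalty) that are not covered here.

```python
def project(point, focal, center):
    """Pinhole projection of a 3D point (x, y, z) with z > 0 to pixel coordinates."""
    x, y, z = point
    return (focal * x / z + center[0], focal * y / z + center[1])

def joint_reprojection_error(joints_3d, joints_2d, focal, center):
    """Sum of squared distances between projected model joints and detected 2D joints."""
    error = 0.0
    for joint_3d, (u, v) in zip(joints_3d, joints_2d):
        pu, pv = project(joint_3d, focal, center)
        error += (pu - u) ** 2 + (pv - v) ** 2
    return error
```

In a fitting loop, `joints_3d` would be regenerated from the body model's pose and shape parameters at each iteration and this error minimized with respect to them.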

Evaluates on Leeds Sports, HumanEva, and Human3.6M datasets

History of Autonomous Driving Autonomous Driving Projects
	End to End Learning for Self-Driving Cars [pdf] [slide] Mariusz Bojarski and Davide Del Testa and Daniel Dworakowski and	ARXIV 2016 Bojarski2016ARXIV

Convolutional Neural Network that learns vehicle control using images
Left and right images are used for data augmentation to simulate specific off-center shifts while adapting the steering command
Approximated viewpoint transformations assuming points below horizon lie on a plane and above are infinitely far away
The final network outputs steering commands for the center camera only
Tested with simulations and with the NVIDIA DRIVE PX self-driving car

Tracking
	Online Multiperson Tracking-by-Detection from a Single, Uncalibrated Camera [pdf] [slide] Michael D. Breitenstein and Fabian Reichlin and Bastian Leibe and Esther Koller-Meier and Luc J. Van Gool	PAMI 2011 Breitenstein2011PAMI

Automatic detection and tracking of a variable number of persons in complex scenes using a monocular, potentially moving, uncalibrated camera
Multi-person tracking-by-detection in a particle filtering framework using unreliable information from
- final high-confidence detections
- continuous confidence of pedestrian detectors
- online-trained, instance-specific classifiers as a graded observation model
Good performance on typical surveillance videos, webcam footage, or sports sequences
Datasets: ETHZ Central, TUD Campus and TUD Crossing, i-Lids AB, UBC Hockey, PETS09 S2.L1-S2.L3, ETHZ Standing, and a new Soccer dataset

Object Detection Person Detection
	Shape-based Pedestrian Detection [pdf] [slide] A. Broggi and M. Bertozzi and A. Fascioli and M. Sechi	IV 2000 Broggi2000IV

Detecting pedestrians on an experimental autonomous vehicle (the ARGO project)
Exploiting morphological characteristics (size, ratio, and shape) and vertical symmetry of human shape
A first coarse detection from a monocular image
Distance refinement using a stereo vision technique
Temporal correlation using the results from the previous frame to correct and validate the current ones
Integrated in the ARGO vehicle and tested in urban environments
Successful detections of whole pedestrians present in the image at a distance ranging from 10 to 40 meters

History of Autonomous Driving Autonomous Driving Projects
	PROUD - Public Road Urban Driverless-Car Test [pdf] [slide] Alberto Broggi and Pietro Cerri and Stefano Debattisti and Maria Chiara Laghi and Paolo Medici and Daniele Molinari and Matteo Panciroli and Antonio Prioletti	TITS 2015 Broggi2015TITS

An autonomous driving test on urban roads and freeways open to regular traffic
Moving in a mapped and familiar scenario with the addition of the position of pedestrian crossings, traffic lights, and guard rails
Real-time perception of the world for static and dynamic obstacles
No need for precise 3D maps or world reconstruction
Details about the vehicle, and main layers: perception, planning, and control
Complex driving scenarios including roundabouts, junctions, pedestrian crossings, freeway junctions, and traffic lights

Motion & Pose Estimation 2D Motion Estimation -- Optical Flow
	Large Displacement Optical Flow: Descriptor Matching in Variational Motion Estimation [pdf] [slide] Brox, T. and Malik, J.	PAMI 2011 Brox2011PAMI

Coarse-to-fine warping for optical flow estimation
- can handle large displacements
- small objects moving fast are problematic
Integration of rich descriptors into a variational formulation
- Simple nearest neighbor search in coarse grid
- Feature matches used as soft constraint in continuous approach
- Continuation method: coarse-to-fine while reducing the importance of descriptor matches
Quantitative results only on Middlebury but real world qualitative results

Motion & Pose Estimation Localization
	Map-Based Probabilistic Visual Self-Localization [pdf] [slide] Marcus A. Brubaker and Andreas Geiger and Raquel Urtasun	PAMI 2016 Brubaker2016PAMI

Describes an affordable solution to vehicle self-localization which uses odometry computed from two video cameras & road maps as the sole inputs

Contributions:
- Proposes a probabilistic model for which an efficient approximate inference algorithm is derived
- The inference algorithm is able to utilize distributed computation in order to meet the real-time requirements of autonomous systems
- Exploits freely available maps & visual odometry measurements, and is able to localize a vehicle to 4m on average after 52 seconds of driving

Evaluates on KITTI visual odometry dataset

Motion & Pose Estimation Ego-Motion Estimation
	Flow-Decoupled Normalized Reprojection Error for Visual Odometry [pdf] [slide] Martin Buczko and Volker Willert	ITSC 2016 Buczko2016ITSC

Frame-to-frame feature-based ego-motion estimation using stereo cameras
Current approach: Rotation and translation of the ego-motion in two separate processes
An analysis of the characteristics of the optical flows and reprojection errors that are independently induced by each of the decoupled six degrees of freedom motion
A reprojection error that depends on the coordinates of the features
Decoupling the translation flow from the overall flow
- Using an initial rotation estimate
- Transforming the correspondences into a pure translation scenario
Evaluated on KITTI, achieving the best translation error of all camera-based methods at the time of publication

Semantic Segmentation Label Propagation
	Label propagation in complex video sequences using semi-supervised learning [pdf] [slide] Budvytis, Ignas and Badrinarayanan, Vijay and Cipolla, Roberto	BMVC 2010 Budvytis2010BMVC

Directed graphical model for label propagation in long and complex video sequences
Given hand-labelled (semantic labels) start and end frames of a video sequence
Hybrid of generative label propagation and discriminative classification
EM based inference used for initial propagation and training of a multi-class classifier
Labels estimated by classifier are injected back into Bayesian network for another iteration
Iterative scheme has the ability to handle occlusions
Time-symmetric label propagation by appending the time-reversed sequence
Show advantage of learning from propagated labels
Quantitative and qualitative results on CamVid

History of Autonomous Driving Autonomous Driving Competitions
	The DARPA Urban Challenge [pdf] [slide] Martin Buehler and Karl Iagnemma and Sanjiv Singh	DARPA Challenge 2009 Buehler2009DARPAChallenge

History of Autonomous Driving Autonomous Driving Competitions
	The 2005 DARPA Grand Challenge: The Great Robot Race [pdf] [slide] Buehler, M. and Iagnemma, K. and Singh, S.	Book 2007 Buehler2007

Datasets & Benchmarks Synthetic Data
	A naturalistic open source movie for optical flow evaluation [pdf] [slide] Butler, D. J. and Wulff, J. and Stanley, G. B. and Black, M. J.	ECCV 2012 Butler2012ECCV

Introduction of MPI-Sintel, a new data set based on an open source animated film

Contributions:
- This data set has important features not present in the Middlebury flow evaluation: long sequences, large motions, specular reflections, motion blur, defocus blur, atmospheric effects.
- Analysis of the statistical properties of the data suggesting it is sufficiently representative of natural movies to be useful
- Introduction of new evaluation measures
- Comparison of public-domain flow algorithms
- Evaluation website that maintains the current ranking and analysis of methods

Object Detection 2D Object Detection
	A Unified Multi-scale Deep Convolutional Neural Network for Fast Object Detection [pdf] [slide] Zhaowei Cai and Quanfu Fan and Rogerio Schmidt Feris and Nuno Vasconcelos	ECCV 2016 Cai2016ECCV

Multi-scale CNN for fast multi-scale object detection
Proposal sub-network performs detection at multiple output layers to match objects at different scales
Complementary scale-specific detectors are combined to produce a strong multi-scale object detector
Unified network is learned end-to-end by optimizing a multi-task loss
Feature upsampling by deconvolution reduces the memory and computation costs in contrast to input upsampling
Evaluation on KITTI and Caltech

Motion & Pose Estimation Localization
	Keyframe-based recognition and localization during video-rate parallel tracking and mapping [pdf] [slide] Robert Oliver Castle and David W. Murray	IVC 2011 Castle2011IVC

Object recognition, reconstruction and localization for augmented reality
3D map and keyframe poses are recovered at video-rate by bundle adjustment of FAST image features in the parallel tracking and mapping approach
Objects are detected and recognized using SIFT descriptors computed in keyframes and located by triangulation
Detected objects are automatically labelled on the user's display using predefined annotations
Demonstration using laboratory scenes and in more realistic applications, e.g. a guide to an art gallery
Limitation: Scaling to larger databases requires the adoption of hierarchical methods because of the computation time of SIFT features

End-to-End Learning of Sensorimotor Control
	DeepDriving: Learning Affordance for Direct Perception in Autonomous Driving [pdf] [slide] Chenyi Chen and Ari Seff and Alain L. Kornhauser and Jianxiong Xiao	ICCV 2015 Chen2015ICCVa

Existing methods can be categorized into two major paradigms:
- Mediated perception approaches that parse an entire scene to make a driving decision
- Behavior reflex approaches that directly map an input image to a driving action by a regressor

Contributions:
- Proposes to map input image to a small number of perception indicators
- These indicators directly relate to the affordance of a road/traffic state for driving
- This representation provides a set of compact descriptions of the scene to enable a controller to drive autonomously

Semantic Segmentation Label Propagation
	Beat the MTurkers: Automatic Image Labeling from Weak 3D Supervision [pdf] [slide] Chen, Liang-Chieh and Fidler, Sanja and Yuille, Alan L. and Urtasun, Raquel	CVPR 2014 Chen2014CVPRb

Automatic segmentation of objects given annotated 3D bounding boxes
Inference in a binary MRF using appearance models, stereo and/or noisy point clouds, 3D CAD models, and topological constraints
10 to 20 labeled objects to train the system
Evaluated using 3D boxes available on KITTI
86% IoU score on segmenting cars (on par with the performance of MTurkers)
It can be used to de-noise MTurk annotations.
Segmenting big cars is easier than smaller ones.
Each potential increases performance (CAD model most).
Same performance with stereo or LIDAR (highest using both)
Fast: 2 min for training and 44 seconds for full test set
Robust to low-resolution, saturation, noise, sparse point clouds, depth estimation errors and occlusions
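
The IoU score quoted above compares a predicted segmentation mask with a reference mask. A minimal sketch of that measure, assuming binary masks stored as NumPy arrays; the function name `mask_iou` and the toy masks are illustrative and not taken from the paper:

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection over union of two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as perfect agreement (a convention)
    return float(np.logical_and(pred, gt).sum() / union)

# Toy example: 2 pixels overlap, 4 pixels are covered by at least one mask.
pred = np.array([[1, 1, 0], [0, 1, 0]])
gt = np.array([[1, 0, 0], [0, 1, 1]])
iou = mask_iou(pred, gt)
```

Benchmarks usually report this value as a percentage, averaged over objects or classes.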

Semantic Segmentation Semantic Segmentation
	Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs [pdf] [slide] Liang-Chieh Chen and George Papandreou and Iasonas Kokkinos and Kevin Murphy and Alan L. Yuille	ICLR 2015 Chen2015ICLR

Final layer of CNNs not sufficiently localized for accurate pixel-level object segmentation
Overcome poor localization by combining final CNN layer with fully connected Conditional Random Field ¹
Using a fully convolutional VGG-16 network
Modified convolutional filters by applying the 'atrous' algorithm from wavelet community instead of subsampling
Significantly advanced the state-of-the-art in PASCAL VOC 2012 in semantic segmentation

¹ Krahenbuhl, P. and Koltun, V. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS, 2011.
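
The 'atrous' idea can be illustrated in one dimension: the filter taps are spread apart by a rate factor, so the receptive field grows without adding weights and without subsampling the signal. The sketch below is a simplified NumPy illustration under that assumption, not the network's actual implementation; `atrous_conv1d` is a hypothetical helper name.

```python
import numpy as np

def atrous_conv1d(signal, kernel, rate):
    """'Valid' 1D correlation with kernel taps spaced `rate` samples apart."""
    signal = np.asarray(signal, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    span = (len(kernel) - 1) * rate + 1  # receptive field of the dilated kernel
    out_len = len(signal) - span + 1
    if out_len <= 0:
        raise ValueError("signal shorter than dilated kernel")
    out = np.zeros(out_len)
    for k, w in enumerate(kernel):
        # Tap k reads the input at offset k * rate.
        out += w * signal[k * rate : k * rate + out_len]
    return out

x = np.arange(8, dtype=float)
dense = atrous_conv1d(x, [1, 1, 1], rate=1)    # receptive field of 3 samples
dilated = atrous_conv1d(x, [1, 1, 1], rate=2)  # receptive field of 5, same 3 weights
```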

Motion & Pose Estimation 2D Motion Estimation -- Optical Flow
	Full Flow: Optical Flow Estimation By Global Optimization over Regular Grids [pdf] [slide] Qifeng Chen and Vladlen Koltun	CVPR 2016 Chen2016CVPR

Discrete optimization over the full space of mappings for optical flow
Using a classical formulation with a normalized cross-correlation data term
Effective optimization over large label space with TRW-S
Min-convolution reduces the complexity of message passing from squared to linear
Reducing the space of mappings using a smaller resolution and max displacements
EpicFlow interpolation to fill inconsistent pixels and post-processing for subpixel precision
State-of-the-art results on Sintel and KITTI 2015
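
The reduction from quadratic to linear message passing can be sketched for a one-dimensional label space. The example assumes a linear (L1) pairwise cost, which is an assumption made here for illustration; the two-sweep scheme is the classic distance-transform trick and is checked against a brute-force minimum. Function names are hypothetical.

```python
import numpy as np

def min_convolution_linear(h, slope):
    """m[q] = min_p (h[p] + slope * |p - q|), computed in O(n) with two sweeps."""
    m = np.array(h, dtype=float)
    for q in range(1, len(m)):            # forward sweep: best label to the left
        m[q] = min(m[q], m[q - 1] + slope)
    for q in range(len(m) - 2, -1, -1):   # backward sweep: best label to the right
        m[q] = min(m[q], m[q + 1] + slope)
    return m

def min_convolution_brute(h, slope):
    """Same quantity in O(n^2), used only as a reference."""
    h = np.asarray(h, dtype=float)
    idx = np.arange(len(h))
    pairwise = slope * np.abs(idx[None, :] - idx[:, None])  # rows: q, columns: p
    return np.min(h[None, :] + pairwise, axis=1)

h = [5.0, 1.0, 4.0, 7.0, 0.5, 6.0]
fast = min_convolution_linear(h, slope=1.0)
slow = min_convolution_brute(h, slope=1.0)
```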

Object Detection 2D Object Detection
	Monocular 3D Object Detection for Autonomous Driving [pdf] [slide] Xiaozhi Chen and Kaustav Kundu and Ziyu Zhang and Huimin Ma and Sanja Fidler and Raquel Urtasun	CVPR 2016 Chen2016CVPRa

3D object detection from a single monocular image in the domain of autonomous driving
Generates a set of candidate class-specific object proposals, which are then run through a standard CNN pipeline to obtain high-quality object detections
Focus of this paper is on proposal generation

Contributions:
- Proposes an energy minimization approach that places object candidates in 3D using the fact that objects should be on the ground-plane
- Scores each candidate box projected to the image plane via several intuitive potentials encoding semantic segmentation, contextual information, size and location priors and typical object shape

Evaluates on KITTI benchmark

Object Detection 2D Object Detection
	3D Object Proposals using Stereo Imagery for Accurate Object Class Detection [pdf] [slide] Xiaozhi Chen and Kaustav Kundu and Yukun Zhu and Huimin Ma and Sanja Fidler and Raquel Urtasun	ARXIV 2016 Chen2016ARXIV

High-accuracy 3D object detection in autonomous driving scenario
Sensory-fusion framework that predicts oriented 3D bboxes using LIDAR point cloud and RGB images
Encode the sparse 3D point cloud with a compact multi-view representation
Proposal network generates 3D candidate boxes from bird's eye view representation of the point cloud
Deep fusion scheme combines region-wise multi-view features and enables interactions between intermediate layers
Evaluation on the KITTI benchmark outperforming state-of-the-art in 3D localization and 3D detection

Object Detection 3D Object Detection from 3D Point Clouds
	Multi-View 3D Object Detection Network for Autonomous Driving [pdf] [slide] Xiaozhi Chen and Huimin Ma and Ji Wan and Bo Li and Tian Xia	ARXIV 2016 Chen2016ARXIVa

3D object detection using both LIDAR point cloud and RGB images (predicting oriented 3D bounding boxes)
LIDAR point cloud for more accurate 3D locations, image-based methods for higher accuracy in terms of 2D box evaluation
Multi-View 3D networks (MV3D) using two sub-networks
3D proposal generation network utilizing a bird's eye view representation of point cloud
Multi-view feature fusion network:
- Region-wise features from multiple views
- Interactions between intermediate layers of different paths
- Superior performance over the early/late fusion scheme
Using drop-path training and auxiliary loss
Evaluated on KITTI benchmark, outperforming state-of-the-art by large margins
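
A bird's eye view representation discretizes the point cloud on a ground-plane grid. The sketch below is a simplified illustration that keeps only a maximum-height map and a point-count map per cell; the grid ranges, cell size, and function name are assumptions, and the paper's actual encoding may use further channels.

```python
import numpy as np

def birds_eye_view(points, x_range, y_range, cell):
    """Return (height_map, density_map) on a regular ground-plane grid."""
    points = np.asarray(points, dtype=float)
    nx = int(round((x_range[1] - x_range[0]) / cell))
    ny = int(round((y_range[1] - y_range[0]) / cell))
    height = np.full((nx, ny), -np.inf)
    density = np.zeros((nx, ny), dtype=int)
    ix = np.floor((points[:, 0] - x_range[0]) / cell).astype(int)
    iy = np.floor((points[:, 1] - y_range[0]) / cell).astype(int)
    inside = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)  # drop points off the grid
    for i, j, z in zip(ix[inside], iy[inside], points[inside, 2]):
        height[i, j] = max(height[i, j], z)
        density[i, j] += 1
    height[density == 0] = 0.0  # empty cells get height 0 (a convention)
    return height, density

# Four (x, y, z) points; the last one lies outside the grid and is ignored.
cloud = [[0.2, 0.3, 1.0], [0.4, 0.1, 2.5], [1.5, 0.5, 0.7], [9.0, 9.0, 3.0]]
height, density = birds_eye_view(cloud, x_range=(0, 2), y_range=(0, 1), cell=1.0)
```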

Tracking State-of-the-Art on KITTI
	Near-Online Multi-target Tracking with Aggregated Local Flow Descriptor [pdf] [slide] Wongun Choi	ICCV 2015 Choi2015ICCV

Near-Online Multi-target Tracking (NOMT) algorithm formulated as global data association between targets and detections in a temporal window
Designing an accurate affinity measure to associate detections and estimate the likelihood of matching
Aggregated Local Flow Descriptor (ALFD) encodes the relative motion pattern using long term interest point trajectories
Integration of multiple cues including ALFD metric, target dynamics, appearance similarity and long term trajectory regularization
Solves the association problem with a parallelized junction tree algorithm
Best accuracy with significant margins on the KITTI and MOT datasets

Scene Understanding Indoor 3D Scene Understanding
	Understanding Indoor Scenes Using 3D Geometric Phrases [pdf] [slide] W. Choi and Y. -W. Chao and C. Pantofaru and S. Savarese	CVPR 2013 Choi2013CVPR

Proposes a novel unified framework that can reason about the semantic class of an indoor scene, its spatial layout, and the identity and layout of objects within the space

Contributions:
- Presents a hierarchical model for learning & reasoning about indoor scenes
- Proposes a 3D Geometric Phrase Model to capture semantic & geometric relationships between objects which co-occur in the same 3D configuration
- Shows that this model effectively explains scene semantics, geometry & object groupings while also improving individual object detections

Evaluates on a new indoor-scene-object dataset introduced in the paper

Tracking Tracking with two cameras
	A General Framework for Tracking Multiple People from a Moving Camera [pdf] [slide] W. Choi and C. Pantofaru and S. Savarese	PAMI 2013 Choi2013PAMI

Tracking multiple, possibly interacting, people from a mobile vision platform
Joint estimation of camera's ego-motion and the people's trajectory in 3D
Tracking problem formulated as finding a MAP solution and solved using Reversible Jump Markov Chain Monte Carlo Particle Filtering
Combination of multiple observation cues: face, skin color, depth-based shape, motion, and target-specific appearance-based detector
Modelling interaction with two modes: repulsion and group movement
Automatic detection of static features for camera estimation
Evaluation on the challenging ETH dataset and a Kinect RGB-D dataset containing dynamic in- and outdoor scenes

Datasets & Benchmarks Real Data
	The Cityscapes Dataset for Semantic Urban Scene Understanding [pdf] [slide] Cordts, Marius and Omran, Mohamed and Ramos, Sebastian and Rehfeld, Timo and Enzweiler, Markus and Benenson, Rodrigo and Franke, Uwe and Roth, Stefan and Schiele, Bernt	CVPR 2016 Cordts2016CVPR

A benchmark suite and large-scale dataset to train and test approaches for pixel-level and instance-level semantic labeling
Specially tailored for autonomous driving in an urban environment
Cityscapes comprises a large, diverse set of stereo video sequences recorded in streets from 50 different cities
- 5000 of these images have high quality pixel-level annotations
- 20000 additional images have coarse annotations to enable methods that leverage large volumes of weakly-labeled data

Develops a sound evaluation methodology for semantic labeling by introducing a novel evaluation measure
Evaluates several state-of-the-art approaches on the benchmark

Representations Stixels
	Object-Level Priors for Stixel Generation [pdf] [slide] Marius Cordts and Lukas Schneider and Markus Enzweiler and Uwe Franke and Stefan Roth	GCPR 2014 Cordts2014GCPR

Existing Stixel representations are solely based on dense stereo and a strongly simplifying world model with a nearly planar road surface and perpendicular obstacles
Whenever depth measurements are noisy or the world model is violated, Stixels are prone to error

Contributions:
- Shows a principled way to incorporate top-down prior knowledge from object detectors into the Stixel generation
- The additional information not only improves the representation of the detected object classes, but also of other parts in the scene, e.g. the freespace

Evaluates on stereo sequence introduced in the paper

Reconstruction Multi-view 3D Reconstruction
	3D Urban Scene Modeling Integrating Recognition and Reconstruction [pdf] [slide] Cornelis, N. and Leibe, B. and Cornelis, K. and Van Gool, L. J.	IJCV 2008 Cornelis2008IJCV

Fast and memory efficient 3D city modelling
Application: a pre-visualization of a required traffic manoeuvre for navigation systems
Simplified geometry assumptions while still having compact models
- Adapted dense stereo algorithm with ruled-surface approximation
Integrating object recognition for detecting cars in video and then localizing them in 3D (not real-time yet)
3D reconstruction and localization benefit from each other.
Tested on three stereo sequences annotated with GPS/INS measurements

Motion & Pose Estimation Simultaneous Localization and Mapping
	FAB-MAP: Probabilistic Localization and Mapping in the Space of Appearance [pdf] [slide] Cummins, Mark and Newman, Paul	IJRR 2008 Cummins2008IJRR

Probabilistic approach to recognize places based on their appearance (loop closure detection)
Topological SLAM by learning a generative model of place appearances using bag-of-words
Combinations of appearance words co-occur because they are generated by common objects
Approximation of a discrete distribution using Chow Liu algorithm
Robust in visually repetitive environments
Complexity linear in number of places and the algorithm is suitable for online loop closure detection in mobile robotics
Demonstration by detecting loop closures over 2km path in an initially unknown outdoor environment
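
The Chow Liu algorithm approximates a joint discrete distribution by a tree whose edges maximize the total pairwise mutual information. A minimal sketch for binary word-occurrence data follows; the toy data, the function names, and the use of Prim's algorithm for the spanning tree are illustrative choices, not details taken from the paper.

```python
import numpy as np

def mutual_information(a, b):
    """Empirical mutual information (in nats) between two binary arrays."""
    mi = 0.0
    for va in (0, 1):
        for vb in (0, 1):
            p_ab = np.mean((a == va) & (b == vb))
            p_a, p_b = np.mean(a == va), np.mean(b == vb)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def chow_liu_tree(data):
    """Edges of the maximum mutual-information spanning tree over the columns."""
    data = np.asarray(data)
    n = data.shape[1]
    mi = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            mi[i, j] = mi[j, i] = mutual_information(data[:, i], data[:, j])
    in_tree, edges = {0}, []
    while len(in_tree) < n:  # Prim's algorithm with MI as the edge weight
        best = max(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
                   key=lambda e: mi[e])
        edges.append(best)
        in_tree.add(best[1])
    return edges

# Toy observations (rows) of four visual words (columns):
# words 0 and 1 always co-occur, as do words 2 and 3.
w0 = [0, 0, 0, 0, 1, 1, 1, 1]
w2 = [0, 1, 0, 1, 0, 1, 0, 1]
data = np.array([w0, w0, w2, w2]).T
edges = chow_liu_tree(data)
```

In this toy data the tree links the two co-occurring word pairs directly, and the remaining edge carries no information.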

Motion & Pose Estimation Ego-Motion Estimation
	Stereo odometry based on careful feature selection and tracking [pdf] [slide] Igor Cvisic and Ivan Petrovic	ECMR 2015 Cvisic2015ECMR

Stereo visual odometry based on feature selection and tracking (SOFT); the introduction provides a good taxonomy
Careful selection of a subset of stable features and their tracking through the frames
Separate estimation of rotation (five-point method) and translation (three-point method)
Evaluated on KITTI, outperforming all other methods on the benchmark at the time of publication
Pose error of 1.03% with processing speed above 10 Hz
A modified IMU-aided version of the algorithm
- An IMU for outlier rejection and Kalman filter for rotation refinement
- Fast and suitable for embedded systems at 20 Hz on an ODROID U3 ARM-based embedded computer

Semantic Segmentation Semantic Segmentation
	Instance-Aware Semantic Segmentation via Multi-Task Network Cascades [pdf] [slide] Dai, Jifeng and He, Kaiming and Sun, Jian	CVPR 2016 Dai2016CVPR

Limitations of existing methods for instance segmentation using CNNs
- Slow at inference time because they require mask proposal methods
- Don't take advantage of deep features and large amount of training data

End-to-end training of Multi-task Network Cascades for 3 tasks of differentiating instances, estimating masks & categorizing objects

Two orders of magnitude faster than previous systems
State-of-the-art on PASCAL VOC & MS COCO 2015

Reconstruction Reconstruction & Recognition
	Dense Reconstruction Using 3D Object Shape Priors [pdf] [slide] Dame, A. and Prisacariu, V.A. and Ren, C.Y. and Reid, I.	CVPR 2013 Dame2013CVPR

Incorporation of object-specific knowledge into SLAM
Limitations of current approaches:
- Limited to the reconstruction of visible surfaces
- Photo-consistency error, sensitive to specularities
Initial dense representation using photo-consistency
Detection using a standard 2D sliding-window object-class detector
A novel energy to find the 6D pose and shape of the object
- Shape-prior represented using GP-LVM
Efficient fusion of the dense reconstruction with the reconstructed object shape
Better reconstruction in terms of clarity, accuracy and completeness
Faster and more reliable convergence of the segmentation with 3D data
Evaluated using dense reconstruction from KinectFusion

Motion & Pose Estimation Ego-Motion Estimation
	Stereo Visual Odometry Without Temporal Filtering [pdf] [slide] Joerg Deigmoeller and Julian Eggert	GCPR 2016 Deigmoeller2016GCPR

Ego-motion estimation from stereo avoiding temporal filtering and relying exclusively on pure measurements
Stereo camera set-up is the easiest and leads currently to the most accurate results
Two parts
- Scene flow estimation with a combination of disparity and optical flow on Harris corners
- Pose estimation with a P6P method (perspective from 6 points) encapsulated in a RANSAC framework
Careful selection of precise measurements purely by varying geometric constraints on the optical flow measurements
Slim method within the top ranks of KITTI without filtering like bundle adjustment or Kalman filtering

Motion & Pose Estimation Localization
	Monte Carlo Localization for Mobile Robots [pdf] [slide] Frank Dellaert and Dieter Fox and Wolfram Burgard and Sebastian Thrun	ICRA 1999 Dellaert1999ICRA

Presents the Monte Carlo method for localization for mobile robots
Represents uncertainty by maintaining a set of samples that are randomly drawn from it instead of describing the probability density function itself

Contributions:
- In contrast to Kalman filtering based techniques, it is able to represent multi-modal distributions and thus can globally localize a robot
- Reduces the amount of memory required compared to grid-based Markov localization
- More accurate than Markov localization with a fixed cell size, as the state represented in the samples is not discretized

Evaluates on datasets introduced in the paper
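
The sample-based representation can be sketched with a one-dimensional particle filter. This is a toy illustration under strong assumptions (a robot on a line, a noisy direct position reading, Gaussian noise models, hypothetical parameter values), not the setup used in the paper.

```python
import numpy as np

def mcl_step(particles, control, measurement, rng,
             motion_sigma=0.2, sensor_sigma=0.5):
    """One predict / weight / resample cycle for a 1D position."""
    # Predict: apply the control with sampled motion noise.
    particles = particles + control + rng.normal(0.0, motion_sigma, particles.size)
    # Weight: Gaussian likelihood of the measurement given each particle.
    weights = np.exp(-0.5 * ((measurement - particles) / sensor_sigma) ** 2)
    weights = (weights + 1e-300) / np.sum(weights + 1e-300)
    # Resample: draw particles in proportion to their weights.
    idx = rng.choice(particles.size, size=particles.size, p=weights)
    return particles[idx]

rng = np.random.default_rng(0)
particles = rng.uniform(0.0, 100.0, 1000)  # global uncertainty: position unknown
true_x = 20.0
for _ in range(20):
    true_x += 1.0                          # robot moves one unit per step
    z = true_x + rng.normal(0.0, 0.5)      # noisy position reading
    particles = mcl_step(particles, 1.0, z, rng)
estimate = particles.mean()
```

Because the belief is just a set of samples, the initial uniform spread over the whole corridor needs no special handling, which is what makes global localization possible.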

Object Detection Person Detection
	Pedestrian Detection: An Evaluation of the State of the Art [pdf] [slide] P. Dollar and C. Wojek and B. Schiele and P. Perona	PAMI 2011 Dollar2011PAMI

Pedestrian detection methods are hard to compare because of multiple datasets and varying evaluation protocols
Extensive evaluation of the state of the art in a unified framework
Large, well-annotated and realistic monocular pedestrian detection dataset
Refined per-frame evaluation methodology
Evaluation of sixteen pre-trained state-of-the-art detectors across six datasets
Performance of state-of-the-art is disappointing at low resolutions (far distant pedestrians) and in case of partial occlusions

Motion & Pose Estimation 2D Motion Estimation -- Optical Flow
	FlowNet: Learning Optical Flow with Convolutional Networks [pdf] [slide] A. Dosovitskiy and P. Fischer and E. Ilg and P. Haeusser and C. Hazirbas and V. Golkov and P. v.d. Smagt and D. Cremers and T. Brox	ICCV 2015 Dosovitskiy2015ICCV

Network is trained end-to-end
The contracting part of the network extracts rich feature representation
Simple architecture: Process 2 stacked images jointly
Alternative architecture: Process images separately, then correlate their features at different locations
Expanding part of network produces high resolution flow
Train networks on large "Flying chairs" dataset with 2D motion of rendered chairs
Evaluated on Sintel and KITTI. Beats the state of the art among real-time methods
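
Correlating features at different locations can be sketched as a dot product between the feature vector at each pixel of the first map and the feature vectors at displaced pixels of the second map. The version below is a simplified NumPy illustration (single-pixel patches, no strides, zero padding, hypothetical function name), not the network layer itself.

```python
import numpy as np

def correlation_volume(f1, f2, max_disp):
    """corr[i, j, y, x] = dot(f1[:, y, x], f2[:, y + dy, x + dx]),
    with dy = i - max_disp and dx = j - max_disp."""
    c, h, w = f1.shape
    d = max_disp
    padded = np.pad(f2, ((0, 0), (d, d), (d, d)))  # zeros outside the image
    corr = np.zeros((2 * d + 1, 2 * d + 1, h, w))
    for i in range(2 * d + 1):
        for j in range(2 * d + 1):
            corr[i, j] = np.sum(f1 * padded[:, i:i + h, j:j + w], axis=0)
    return corr

rng = np.random.default_rng(0)
f1 = rng.normal(size=(16, 12, 12))                 # (channels, height, width)
f2 = np.roll(f1, shift=(1, 2), axis=(1, 2))        # content moved by (dy, dx) = (1, 2)
corr = correlation_volume(f1, f2, max_disp=3)
mean_corr = corr.mean(axis=(2, 3))                 # average response per displacement
best = np.unravel_index(np.argmax(mean_corr), mean_corr.shape)
dy, dx = int(best[0]) - 3, int(best[1]) - 3
```

The strongest average response appears at the displacement that was applied to the second feature map.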

Semantic Segmentation Road/Lane Detection
	Ground Plane Estimation Using a Hidden Markov Model [pdf] [slide] Ralf Dragon and Luc J. Van Gool	CVPR 2014 Dragon2014CVPR

Estimation of ground plane orientation and location in monocular video sequences from a moving observer
State-continuous HMM with the ego motion and ground plane normal as hidden states
Sample different hypotheses for pairs of frames using homography decomposition
Refined estimates are sampled with blocked Gibbs sampling
Works robustly in large variety of sequences including tilted cameras or heavily blurred and wobbling images
Applied to visual odometry, the approach achieves state-of-the-art distance errors and angular errors that are one order of magnitude lower

Reconstruction Stereo
	Semi-Global Matching: A Principled Derivation in Terms of Message Passing [pdf] [slide] Amnon Drory and Carsten Haubold and Shai Avidan and Fred A. Hamprecht	GCPR 2014 Drory2014GCPR

First principled explanation of SGM
- trivial to implement, extremely fast, and high ranking on benchmarks
- still a successful heuristic with no theoretical characterization
Its exact relation to belief propagation and tree-reweighted message passing
- SGM's 8-direction scan-line scheme is an approximation to the optimal labelling of the entire graph.
- SGM amounts to the first iteration of TRW-T on a MRF with pairwise energies that have been scaled by a constant and known factor.
Outcome: an uncertainty measure for the MAP labeling of an MRF
Qualitative results on Middlebury Benchmark
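
For reference, the heuristic being analysed aggregates matching costs along scan-lines with a small penalty for disparity changes of one and a larger penalty for bigger jumps. The sketch below shows the commonly cited single-direction SGM recurrence in NumPy; it is not the message-passing derivation from the paper, and the penalties `p1`, `p2` and the toy costs are assumptions.

```python
import numpy as np

def sgm_scanline(cost, p1, p2):
    """Aggregate an (n_pixels, n_disparities) cost array along one direction."""
    cost = np.asarray(cost, dtype=float)
    agg = np.empty_like(cost)
    agg[0] = cost[0]
    for x in range(1, cost.shape[0]):
        prev = agg[x - 1]
        prev_min = prev.min()
        # Costs of neighbouring disparities d-1 and d+1 (+inf at the borders).
        from_below = np.concatenate(([np.inf], prev[:-1]))
        from_above = np.concatenate((prev[1:], [np.inf]))
        best = np.minimum.reduce([
            prev,                               # same disparity, no penalty
            from_below + p1,                    # disparity change of one
            from_above + p1,
            np.full_like(prev, prev_min + p2),  # larger jump
        ])
        # Subtracting prev_min keeps the aggregated values bounded.
        agg[x] = cost[x] + best - prev_min
    return agg

cost = [[0, 5, 5], [5, 0, 5], [5, 5, 0]]  # 3 pixels, 3 disparity hypotheses
agg = sgm_scanline(cost, p1=1, p2=3)
```

A full SGM implementation sums such aggregated costs over several directions before taking the per-pixel minimum.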

Reconstruction Multi-view 3D Reconstruction
	Towards Large-Scale City Reconstruction from Satellites [pdf] [slide] Liuyun Duan and Florent Lafarge	ECCV 2016 Duan2016ECCV

3D city models from stereo pair of satellite images in a few minutes
Geometric atomic convex polygons
A joint classification and reconstruction for the semantic class (ground, roof, and facade) and the elevation of each polygon
Experiments on QuickBird2, WorldView2, and Pleiades satellite images
Not as accurate as Lidar, but produces fast, compact, and semantic-aware models robust to low resolution and occlusion problems
Better compared to Digital Surface Models

Motion & Pose Estimation Simultaneous Localization and Mapping
	SegMatch: Segment based loop-closure for 3D point clouds [pdf] [slide] Renaud Dube and Daniel Dugas and Elena Stumm and Juan I. Nieto and Roland Siegwart and Cesar Cadena	ARXIV 2016 Dube2016ARXIV

Loop-closure detection on 3D data
Existing methods based on local features lack robustness to environment changes, while methods based on global features are viewpoint dependent
Proposes SegMatch, a loop-closure detection algorithm based on the matching of 3D segments

Method:
- extracts and describes segments from a 3D point cloud
- matches them to segments from already visited places
- uses a geometric verification step to propose loop-closure candidates

Advantage of this segment-based technique is its ability to compress the point cloud into a set of distinct and discriminative elements for loop-closure detection
First paper to present a real-time algorithm for performing loop-closure detection and localization in 3D laser data on the basis of segments
Evaluates on KITTI odometry dataset

Motion & Pose Estimation Ego-Motion Estimation
	Direct Sparse Odometry [pdf] [slide] Jakob Engel and Vladlen Koltun and Daniel Cremers	ARXIV 2016 Engel2016ARXIV

The direct and sparse formulation for monocular visual odometry
A fully direct probabilistic model with joint optimization of all model parameters, including camera poses, camera intrinsics, and geometry parameters (inverse depth)
Evaluating the photometric error for each point over a small neighbourhood of pixels
Real-time by omitting the smoothness prior and sampling pixels evenly throughout the images instead
No keypoint detectors or descriptors
Integrating a full photometric calibration accounting for exposure time, lens vignetting, and non-linear response functions
Evaluated on three different datasets comprising several hours of video
Comparison of direct to indirect approach: less robust to geometric noise, but superior accuracy on well-calibrated data

Motion & Pose Estimation Simultaneous Localization and Mapping
	LSD-SLAM: Large-Scale Direct Monocular SLAM [pdf] [slide] Jakob Engel and Thomas Schops and Daniel Cremers	ECCV 2014 Engel2014ECCV

Feature-less monocular SLAM algorithm which allows building large-scale maps
Novel direct tracking method that detects loop closures and scale-drift using similarity transform in 3D
Direct image alignment with 3D reconstruction in real-time
Pose-graph of keyframes with associated probabilistic semi-dense depth maps
Semi-dense depth maps are obtained by filtering over a large number of pixelwise small-baseline stereo comparisons
Probabilistic solution to include the effect of noisy depth values into tracking
Evaluation on TUM RGB-D benchmark

Motion & Pose Estimation Ego-Motion Estimation
	Large-scale direct SLAM with stereo cameras [pdf] [slide] Jakob Engel and Jorg Stuckler and Daniel Cremers	IROS 2015 Engel2015IROS

Large-Scale Direct SLAM algorithm for stereo cameras (Stereo LSD-SLAM) that runs in real-time
Direct alignment of the images based on photoconsistency of all high-contrast pixels, in contrast to sparse interest-point based methods
Couple temporal multi-view stereo from monocular LSD-SLAM with static stereo from a fixed-baseline stereo camera setup
Incorporating both disparity sources allows estimating the depth of pixels that are under-constrained in fixed-baseline stereo
Fixed baseline avoids scale-drift that occurs in monocular SLAM
Robust approach to enforce illumination invariance
State-of-the-art results in KITTI and EuRoC Challenge 3 for micro aerial vehicles

Motion & Pose Estimation Ego-Motion Estimation
	Semi-Dense Visual Odometry for a Monocular Camera [pdf] [slide] J. Engel and J. Sturm and D. Cremers	ICCV 2013 Engel2013ICCV

Real-time visual odometry method for a monocular camera
Continuously estimate a semi-dense inverse depth map which is used to track the motion of the camera
Depth estimation for pixels with non-negligible gradients using multi-view stereo
Each estimate is represented as a Gaussian probability distribution over the inverse depth (corresponds to update step of Kalman filter)
Reference frame is selected such that the observation angle is small
Propagate depth maps from frame to frame (corresponding to prediction step of Kalman filter) and refine with new stereo depth measurements
Whole image alignment using depth estimates for tracking
Comparable tracking performance with fully dense methods without requiring a depth sensor
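
The update step mentioned above amounts to fusing two Gaussian estimates of the same inverse depth. A minimal sketch of the standard scalar Kalman update, with hypothetical variable names and values:

```python
def fuse_inverse_depth(mu_prior, var_prior, mu_obs, var_obs):
    """Fuse a prior and a new observation of inverse depth (scalar Kalman update)."""
    gain = var_prior / (var_prior + var_obs)
    mu = mu_prior + gain * (mu_obs - mu_prior)
    var = (1.0 - gain) * var_prior
    return mu, var

# Two equally uncertain estimates: the mean lands halfway, the variance halves.
mu, var = fuse_inverse_depth(1.0, 0.04, 2.0, 0.04)
```

The fused variance is never larger than either input variance, so each new stereo measurement can only tighten the estimate.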

Object Detection 3D Object Detection from 3D Point Clouds
	Vote3Deep: Fast Object Detection in 3D Point Clouds Using Efficient Convolutional Neural Networks [pdf] [slide] Martin Engelcke and Dushyant Rao and Dominic Zeng Wang and Chi Hay Tong and Ingmar Posner	ARXIV 2016 Engelcke2016ARXIV

Proposes a computationally efficient approach to detecting objects natively in 3D point clouds using convolutional neural networks

Contributions:
- Construction of efficient convolutional layers as basic building blocks for CNN-based point cloud processing by leveraging a voting mechanism exploiting the inherent sparsity in the input data
- The use of rectified linear units and an L1 sparsity penalty to specifically encourage data sparsity in intermediate representations in order to exploit sparse convolution layers throughout the entire CNN stack
- First work to propose sparse convolutional layers and L1 regularisation for efficient large-scale processing of 3D data

Evaluates on KITTI object detection benchmark

Object Detection Person Detection
	A mixed generative-discriminative framework for pedestrian classification [pdf] [slide] Enzweiler, M. and Gavrila, D.M.	CVPR 2008 Enzweiler2008CVPR

Pedestrian classification utilizing synthesized virtual samples of a learned generative model to enhance a discriminative model
Address bottleneck caused by the scarcity of samples of the target class
Generative model captures prior knowledge about pedestrian class in terms of probabilistic shape and texture models
Selective sampling, by means of probabilistic active learning, guides the training process towards the most informative samples
Virtual samples can be considered as a regularization term to the real data
Significant improvement in classification performance on large-scale real-world datasets

Object Detection Person Detection
	Monocular Pedestrian Detection: Survey and Experiments [pdf] [slide] M. Enzweiler and D. M. Gavrila	PAMI 2009 Enzweiler2009PAMI

Overview of the current state of the art in person detection from both methodological and experimental perspectives
Survey: main components of a pedestrian detection system and the underlying model: hypothesis generation (ROI selection), classification (model matching), and tracking
Experimental study: comparing state-of-the-art systems
Experiments on a dataset captured onboard a vehicle driving through urban environment
Results:
- HOG/linSVM performs best at higher image resolutions and lower processing speeds
- Wavelet-based AdaBoost cascade approach performs best at lower image resolutions and (near) real-time processing speeds
Better performance for all by incorporating temporal integration and/or restrictions of the search space based on scene knowledge

Representations Stixels
	From stixels to objects - A conditional random field based approach [pdf] [slide] Friedrich Erbs and Beate Schwarz and Uwe Franke	IV 2013 Erbs2013IV

Detection and tracking of moving traffic participants from a mobile platform using a stereo camera system
Bayesian segmentation approach based on the Dynamic Stixel World
In real-time using alpha-expansion multi-class graph cut optimization scheme
Integrating 3D and motion features, spatio-temporal prior knowledge, and radar sensor in a CRF
Evaluated quantitatively in various challenging traffic scenes

Representations Stixels
	Stixmentation - Probabilistic Stixel based Traffic Scene Labeling [pdf] [slide] Friedrich Erbs and Beate Schwarz and Uwe Franke	BMVC 2012 Erbs2012BMVC

Detection of moving objects from a mobile platform
Multi-class (street, obstacle, sky) traffic scene segmentation approach based on Dynamic Stixel World, an efficient super-pixel object representation
Each stixel assigned to a quantized maneuver motion or to static background
Using dense stereo depth maps obtained by SGM
Conditional Random Field using 3D and motion features and spatio-temporal prior
Real-time performance and evaluated in various challenging urban traffic scenes

Tracking Tracking with two cameras
	Robust multi-person tracking from a mobile platform [pdf] [slide] A. Ess and B. Leibe and K. Schindler and L. Van Gool	PAMI 2009 Ess2009PAMI

Multi-person tracking in busy pedestrian zones using a stereo rig on a mobile platform
Joint estimation of camera position, stereo depth, object detection, and tracking
Object-object interactions and temporal links to past frames on a graphical model
Two-step approach for intractable inference (approximate):
- First solve a simplified version to estimate the scene geometry and object detections per frame (without interactions and temporal continuity)
- Conditioned on these results, object interactions, tracking, and prediction
Combining Belief Propagation and Quadratic Pseudo-Boolean Optimization
Automatic failure detection and correction mechanisms
Evaluated on challenging real-world data (over 5,000 video frame pairs)
Robust multi-object tracking performance in very complex scenes

Scene Understanding
	Segmentation-Based Urban Traffic Scene Understanding [pdf] [slide] Ess, A. and Mueller, T. and Grabner, H. and L. van Gool	BMVC 2009 Ess2009BMVC

Proposes a method to recognise the traffic scene in front of a moving vehicle with respect to the road topology and the existence of objects

Contributions:
- Uses a two-stage system, where the first stage abstracts the image by a rough super-pixel segmentation of the scene
- Uses this meta representation in a second stage to construct a feature set for a classifier that is able to distinguish between different road types as well as detect the existence of commonly encountered objects
- Shows that by relying on an intermediate stage, the method can effectively abstract from peculiarities of the underlying image data

Evaluates on two urban data sets, covering day light and dusk conditions

Datasets & Benchmarks Real Data
	The Pascal Visual Object Classes (VOC) Challenge [pdf] [slide] Everingham, M. and Van Gool, L. and Williams, C. K. I. and Winn, J. and Zisserman, A.	IJCV 2010 Everingham2010IJCV

A benchmark with a standard dataset of images and annotation, and standard evaluation procedures
- Two principal challenges: classification and detection
- Two subsidiary challenges: pixel-level segmentation and person layout estimation
Dataset: challenging images and high quality annotation, with a standard evaluation methodology
- Variability in object size, orientation, pose, illumination, position and occlusion
- No systematic bias for centred objects or good illumination
- Consistent, accurate, and exhaustive annotations for class, bounding box, viewpoint, truncation, and difficult
Competition: measure the state of the art each year

Tracking State-of-the-Art on KITTI
	Improving Multi-frame Data Association with Sparse Representations [pdf] [slide] Loc Fagot-Bouquet and Romaric Audigier and Yoann Dhome and Frederic Lerasle	ECCV 2016 Fagot-Bouquet2016ECCV

Multiple object tracking still difficult due to appearance variations, occlusions and detection failures
Sparse representations-based models successful in single object tracking
Combining a sparse representation-based appearance model with a sliding window tracking method
Formulate the multi-frame data association step as an energy minimization problem
Efficiently exploits sparse representations of all detections
Structured sparsity-inducing norm is used to compute representations more suited to the tracking context
Evaluation on MOTChallenge benchmarks

Semantic Segmentation Semantic Segmentation with Multiple Frames
	Joint 2D-3D temporally consistent semantic segmentation of street scenes [pdf] [slide] Floros, G. and Leibe, B.	CVPR 2012 Floros2012CVPR

Proposes a novel Conditional Random Field (CRF) formulation for the semantic scene labeling problem which is able to enforce temporal consistency between consecutive video frames and take advantage of the 3D scene geometry to improve segmentation quality
Uses 3D scene reconstruction as a means to temporally couple the individual image segmentations, allowing information flow from 3D geometry to the 2D image space

Details:
- Optimizes the semantic labels in a temporal window around the image we are interested in
- Augments the higher-order cliques of the CRF with the sets of pixels that are projections of the same 3D point in different images
- Since these new higher-order cliques contain different projections of the same 3D point, the labels of the pixels inside the clique should be consistent
- Forms a grouping constraint on these pixels

Evaluates on Leuven and City stereo dataset

Reconstruction Multi-view 3D Reconstruction
	Data Processing Algorithms for Generating Textured 3D Building Facade Meshes from Laser Scans and Camera Images [pdf] [slide] Christian Fruh and Siddharth Jain and Avideh Zakhor	IJCV 2005 Frueh2005IJCV

Generating textured facade meshes of cities from a series of vertical 2D surface scans and camera images
Set of data processing algorithms that cope with imperfections and non-idealities
Data is divided into easy-to-handle quasi linear segments and sequential topological order of scans
Depth images are obtained by transforming the divided segments and used to detect dominant building structures
Large holes are filled by planar, horizontal interpolation for the background and horizontal, vertical interpolation or by copy-paste methods for foreground objects
Demonstrated on a large set of data of downtown Berkeley

Motion & Pose Estimation Simultaneous Localization and Mapping
	Building Rome on a Cloudless Day [pdf] [slide] Frahm, Jan-Michael and Fite-Georgel, Pierre and Gallup, David and Johnson, Tim and Raguram, Rahul and Wu, Changchang and Jen, Yi-Hung and Dunn, Enrique and Clipp, Brian and Lazebnik, Svetlana and Pollefeys, Marc	ECCV 2010 Frahm2010ECCV

Dense 3D reconstruction from unregistered Internet-scale photo collections
3 million images within a day on a single PC
Geometric and appearance constraints to obtain a highly parallel implementation
Extension of appearance-based clustering ¹ and stereo fusion ²
Geometric cluster verification using a fast RANSAC method
Local iconic scene graph reconstruction and dense model computation using views obtained from iconic scene graph
Two orders of magnitude higher performance on an order of magnitude larger dataset than state-of-the-art

¹ Li, X., Wu, C., Zach, C., Lazebnik, S., Frahm, J.M.: Modeling and recognition of landmark image collections using iconic scene graphs. In: ECCV. (2008)
² Gallup, D., Pollefeys, M., Frahm, J.M.: 3d reconstruction using an n-layer heightmap. In: DAGM (2010)

History of Autonomous Driving Autonomous Driving Projects
	Autonomous Driving Goes Downtown [pdf] [slide] Uwe Franke and Dariu Gavrila and Steffen Gorzig and Frank Lindner and Frank Paetzold and Christian Wohler	IS 1998 Franke1998IS

Reconstruction Stereo
	Real-Time Stereo Vision for Urban Traffic Scene Understanding [pdf] [slide] U. Franke and A. Joos	IV 2000 Franke2000IV

Presents a precise correlation-based stereo vision approach that allows real-time interpretation of traffic scenes and autonomous Stop & Go on a standard PC
The high speed is achieved by means of a multi-resolution analysis
It delivers the stereo disparities with sub-pixel accuracy and allows precise distance estimates

Develops two different stereo approaches:
- Real-Time Stereo Analysis based on Local Features
- Real-Time Stereo Analysis based on Correlation

Shows applications of stereo approaches to obstacle detection and tracking and analysis of free space in front of the car
Evaluates on self-recorded dataset

Motion & Pose Estimation 3D Motion Estimation -- Scene Flow
	6D-Vision: Fusion of Stereo and Motion for Robust Environment Perception [pdf] [slide] Franke, Uwe and Rabe, Clemens and Badino, Hernan and Gehrig, Stefan	DAGM 2005 Franke2005DAGM

Obstacle avoidance in mobile robotics needs a robust perception of the environment
Simultaneous estimation of depth and motion for image sequences
3D position and 3D motion are estimated with Kalman-Filters
Ego-motion is assumed to be known (they use inertial sensors)
2000 points are tracked with KLT tracker
Multiple filters with different initializations improve the convergence rate
Only qualitative results
Runs in real-time

Datasets & Benchmarks Real Data
	A New Performance Measure and Evaluation Benchmark for Road Detection Algorithms [pdf] [slide] Jannik Fritsch and Tobias Kuehnl and Andreas Geiger	ITSC 2013 Fritsch2013ITSC

Open-access dataset and benchmark for road area and ego-lane detection
Motivation: finding the boundaries of unmarked or weakly marked roads and lanes as they appear in inner-city and rural environments
600 annotated training and test images of high variability from three challenging real-world city road types derived from the KITTI dataset
Evaluation using 2D Birds Eye View (BEV) space
Behavior-based metric by fitting a driving corridor to road detection results in the BEV
Comparison of state-of-the-art road detection algorithms using classical pixel-level metrics in perspective and BEV space as well as the novel behavior-based performance measure

History of Autonomous Driving Autonomous Driving Projects
	Toward automated driving in cities using close-to-market sensors: An overview of the V-Charge Project [pdf] [slide] Paul Timothy Furgale and Ulrich Schwesinger and Martin Rufli and Wojciech Derendarz and Hugo Grimmett and Peter Muhlfellner and Stefan Wonneberger and Julian Timpner and Stephan Rottmann and Bo Li and Bastian Schmidt and Thien-Nghia Nguyen and Elena Cardarelli and Stefano Cattani and Stefan Bruning and Sven Horstmann and Martin Stellmacher and Holger Mielenz and Kevin Koser and Markus Beermann and Christian Hane and Lionel Heng and Gim Hee Lee and Friedrich Fraundorfer and Rene Iser and Rudolph Triebel and Ingmar Posner and Paul Newman and Lars C. Wolf and Marc Pollefeys and Stefan Brosig and Jan Effertz and Cedric Pradalier and Roland Siegwart	IV 2013 Furgale2013IV

Electric automated car outfitted with close-to-market sensors
Fully operational system including automated navigation and parking
Dense map obtained from motion stereo and a volumetric grid
Sparse map is built from state-of-the-art SLAM
Road network represented by RoadGraph, a directed graph of connected lanes, parking lots and other semantic annotations
Localization by extensive data association between sparse map and observed frame
Situational awareness with a robust and accurate scene reconstruction using dense stereo, object detection and tracking, and map fusion
Path planning and motion control with a hierarchical approach consisting of a mission planner, specific processors for on-lane driving and parking maneuvers and a motion control module

Representations Stixels
	Stixels Motion Estimation without Optical Flow Computation [pdf] [slide] Bertan Gunyel and Rodrigo Benenson and Radu Timofte and Luc J. Van Gool	ECCV 2012 Guenyel2012ECCV

Traditionally, motion estimation between two frames is done using optical flow methods, which are computationally expensive

Contributions:
- Proposes the first algorithm for stixels motion estimation without requiring the computation of optical flow. This enables much faster computation while keeping good quality
- The stixel motion can be viewed as a matching problem between stixels in 2 frames
- Computes matching cost matrix. Optimal motion assignment for each stixel can be solved via dynamic programming
- Presents the first evaluation of the stixels motion quality by comparing against two baselines

Evaluates on the "Bahnhof" sequence

Semantic Segmentation Semantic Segmentation
	Superpixel Convolutional Networks Using Bilateral Inceptions [pdf] [slide] Raghudeep Gadde and Varun Jampani and Martin Kiefel and Daniel Kappler and Peter V. Gehler	ECCV 2016 Gadde2016ECCV

Adding bilateral filtering to CNNs for semantic segmentation: "Bilateral Inception" (BI)
Idea: Pixels that are spatially and photometrically similar are more likely to have the same label.
End-to-end learning of feature spaces for bilateral filtering and other parameters
Standard bilateral filters with Gaussian kernels, at different feature scales
Information propagation between (super) pixels while respecting image edges
Full resolution segmentation result from the lower resolution solution of a CNN
Inserting BI into several existing CNN architectures before/after the last 1×1 convolution (FC) layers
Improved results on Pascal VOC12, Materials in Context, and Cityscapes datasets
Better and faster than DenseCRF
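
The idea that spatially and photometrically similar pixels should share information can be illustrated with a classic bilateral filter. The following is a minimal sketch on a 1D signal with fixed Gaussian kernels; it is not the paper's learned Bilateral Inception module, and the function name and the parameters `sigma_s`, `sigma_r` and `radius` are assumptions made for illustration.

```python
import math


def bilateral_filter_1d(values, sigma_s=2.0, sigma_r=0.1, radius=4):
    """Edge-preserving smoothing of a 1D signal with two Gaussian kernels."""
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        weight_sum = 0.0
        for j in range(max(0, i - radius), min(n, i + radius + 1)):
            # Spatial kernel: samples close to i get a larger weight.
            w_s = math.exp(-((i - j) ** 2) / (2.0 * sigma_s ** 2))
            # Range kernel: samples with a similar value get a larger weight.
            w_r = math.exp(-((values[i] - values[j]) ** 2) / (2.0 * sigma_r ** 2))
            acc += w_s * w_r * values[j]
            weight_sum += w_s * w_r
        # weight_sum >= 1 because the sample itself always has weight 1.
        out.append(acc / weight_sum)
    return out


# A noisy step edge: noise is averaged out on each side, the edge is kept.
signal = [0.0, 0.02, -0.01, 0.01, 0.0, 1.0, 0.98, 1.01, 1.0, 0.99]
smoothed = bilateral_filter_1d(signal)
```

Because the range kernel gives almost zero weight to samples across the step, values on either side of the edge are smoothed independently.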

Semantic Segmentation Semantic Segmentation of Facades
	Efficient 2D and 3D Facade Segmentation using Auto-Context [pdf] [slide] Raghudeep Gadde and Varun Jampani and Renaud Marlet and Peter V. Gehler	ARXIV 2016 Gadde2016ARXIV

Datasets & Benchmarks Synthetic Data
	Virtual Worlds as Proxy for Multi-Object Tracking Analysis [pdf] [slide] Gaidon, Adrien and Wang, Qiao and Cabon, Yohann and Vig, Eleonora	CVPR 2016 Gaidon2016CVPR

Modern CV algorithms rely on expensive data acquisition and manual labeling
Generation of fully labeled, dynamic and photo-realistic proxy virtual worlds
Allows changing the conditions of the proxy world to study rare events or difficult-to-observe conditions that might occur in practice (what-if analysis)
Efficient real-to-virtual world cloning method validated by creating a dataset called Virtual KITTI
Accurate ground truth for object detection, tracking, scene and instance segmentation, depth and optical flow
Gap in performance between learning from real and virtual KITTI is small
Pre-training with Virtual KITTI and final training with KITTI gave best results (virtual data augmentation)

Reconstruction Stereo
	Variable baseline/resolution stereo [pdf] [slide] Gallup, D. and Frahm, J. M. and Mordohai, P. and Pollefeys, M.	CVPR 2008 Gallup2008CVPR

Presents a novel multi-baseline, multi-resolution stereo method, which varies the baseline and resolution proportionally to depth to obtain a reconstruction in which the depth error is constant
In traditional stereo, by contrast, the error grows quadratically with depth, so the accuracy in the near range far exceeds that of the far range

By selecting an appropriate baseline and resolution (image pyramid), the algorithm computes a depthmap which has these properties:
- the depth accuracy is constant over the reconstructed volume, by increasing the baseline to increase accuracy in the far range
- the computational effort is spread evenly over the volume by reducing the resolution in the near range
- the angle of triangulation is held constant w.r.t. depth
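
The constant-error property can be checked numerically with the standard first-order error model of a rectified stereo pair, dz = z^2 / (b f) * dd. The sketch below is only an illustration under that model; the function name, the 0.5 pixel disparity error and the scale factors are assumed values, not taken from the paper.

```python
def depth_error(z, baseline, focal_px, disparity_err=0.5):
    """First-order depth error of rectified stereo: dz = z^2 / (b * f) * dd."""
    return (z ** 2) / (baseline * focal_px) * disparity_err


depths = (5.0, 10.0, 20.0)

# Fixed rig: the error quadruples whenever the depth doubles.
fixed_rig = [depth_error(z, baseline=0.5, focal_px=1000.0) for z in depths]

# Baseline and resolution (focal length in pixels) both scaled
# proportionally to depth: the z^2 term cancels and the error is constant.
scaled_rig = [depth_error(z, baseline=0.05 * z, focal_px=100.0 * z) for z in depths]
```

With these assumed numbers the fixed rig goes from 0.025 m error at 5 m to 0.4 m at 20 m, while the scaled configuration stays at 0.1 m throughout.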

Evaluates on self-recorded dataset

Reconstruction Stereo
	Piecewise planar and non-planar stereo for urban scene reconstruction [pdf] [slide] Gallup, David and Frahm, Jan-Michael and Pollefeys, Marc	CVPR 2010 Gallup2010CVPR

Depth estimation in indoor and urban outdoor scenes
Planarity assumptions are problematic in presence of non-planar objects
Stereo method capable of handling more general scenes containing planar and non-planar regions
Segmentation by multi-view photoconsistency and color-/texture-based classifier into piecewise planar and non-planar regions
Standard multi-view stereo used to model non-planar regions
Fusion of plane hypotheses across multiple overlapping views ensures consistent 3D reconstruction
Tested with street-side sequences captured by two vehicle-mounted color-cameras

Tracking
	Multi-cue pedestrian detection and tracking from a moving vehicle [pdf] [slide] Gavrila, D. M. and Munder, S.	IJCV 2007 Gavrila2007IJCV

Multi-cue system for real-time detection and tracking of pedestrians from a moving vehicle
Cascade of modules utilizing complementary visual criteria to narrow down the search space
Integration of sparse stereo-based ROI generation, shape-based detection, texture-based classification and dense stereo-based verification
Mixture-of-experts involving texture-based component classifiers weighted by the outcome of shape matching
alpha-beta tracker using the Hungarian method for data association
Analysis of the performance and interaction of the individual modules
Evaluation in difficult urban traffic conditions
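
For reference, an alpha-beta tracker is a fixed-gain predictor-corrector for position and velocity. The following minimal 1D sketch shows only the filter update; data association (the Hungarian method in the paper) is omitted, and the function name and gain values are assumptions chosen for illustration.

```python
def alpha_beta_track(measurements, dt=1.0, alpha=0.5, beta=0.1):
    """Track a 1D position with fixed gains alpha (position) and beta (velocity)."""
    x, v = measurements[0], 0.0  # start at the first measurement, at rest
    estimates = [x]
    for z in measurements[1:]:
        # Predict the position with a constant-velocity model.
        x_pred = x + dt * v
        # Residual between the new measurement and the prediction.
        r = z - x_pred
        # Correct position and velocity with fixed gains.
        x = x_pred + alpha * r
        v = v + (beta / dt) * r
        estimates.append(x)
    return estimates


# A target moving at one unit per frame, observed without noise.
positions = [float(k) for k in range(30)]
estimates = alpha_beta_track(positions)
```

For a constant-velocity target the estimate lags at first and then converges to the measurements as the velocity estimate settles.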

Reconstruction Stereo
	A Real-Time Low-Power Stereo Vision Engine Using Semi-Global Matching. [pdf] [slide] Gehrig, Stefan K. and Eberli, Felix and Meyer, Thomas	ICVS 2009 Gehrig2009ICVS

Low-power implementations of real-time stereo vision systems not available in existing literature

Contributions:
- Introduces a real-time low-power global stereo engine based on semi-global matching (SGM)
- Achieves real time performance by parallelization of the path calculator block and subsampling of the images

Evaluates on Middlebury database

Motion & Pose Estimation Simultaneous Localization and Mapping
	Monocular road mosaicing for urban environments [pdf] [slide] Andreas Geiger	IV 2009 Geiger2009IV

Marking-based lane recognition requires an unobstructed view of the road, which is usually not possible due to traffic
Multi-stage registration procedure for road mosaicing in dynamic environments
Approximating the road surface by a plane allows using homographies to map from one image to another
Picking a subset as keyframes to reduce error accumulation and save computational power
Road segmentation using optical flow on Harris corners
Combine road images using multi-band blending to remove artificial edges
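
Mapping a pixel between two views of a planar road with a homography amounts to a 3x3 matrix product in homogeneous coordinates followed by a division. This is a minimal sketch; the function name and the example matrices are made up for illustration and are not values from the paper.

```python
def apply_homography(H, x, y):
    """Map pixel (x, y) through the 3x3 homography H (list of three rows)."""
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    # Divide by the homogeneous coordinate to return to pixel coordinates.
    return xh / w, yh / w


# Hypothetical homographies: a pure image translation and a uniform scaling.
H_shift = [[1.0, 0.0, 5.0], [0.0, 1.0, -2.0], [0.0, 0.0, 1.0]]
H_scale = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 2.0]]

shifted = apply_homography(H_shift, 1.0, 1.0)
scaled = apply_homography(H_scale, 4.0, 2.0)
```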

History of Autonomous Driving Autonomous Driving Competitions
	Team AnnieWAY's entry to the Grand Cooperative Driving Challenge 2011 [pdf] [slide] Andreas Geiger and Martin Lauer and Frank Moosmann and Benjamin Ranft and Holger Rapp and Christoph Stiller and Julius Ziegler	TITS 2012 Geiger2012TITS

Presents the concepts and methods developed for the autonomous vehicle AnnieWAY, winning entry to the Grand Cooperative Driving Challenge of 2011
Goal of cooperative driving is to improve traffic homogeneity using vehicle-to-vehicle communication to provide the vehicle with information about the current traffic situation

Contributions:
- Describes algorithms used for sensor fusion, vehicle-to-vehicle communication and cooperative control
- Analyzes the performance of the proposed methods and compares them to those of competing teams

Scene Understanding
	3D Traffic Scene Understanding from Movable Platforms [pdf] [slide] Andreas Geiger and Martin Lauer and Christian Wojek and Christoph Stiller and Raquel Urtasun	PAMI 2014 Geiger2014PAMI

Presents a probabilistic generative model for multi-object traffic scene understanding from movable platforms
Reasons jointly about the 3D scene layout as well as the location and orientation of objects in the scene

Contributions:
- Estimates the layout of urban intersections based on onboard stereo imagery alone
- Does not rely on strong prior knowledge such as intersection maps
- Infers all information from different types of visual features that describe the static environment of the crossroads & the motions of objects in the scene

Evaluates on dataset of 113 video sequences of real traffic

Datasets & Benchmarks Real Data
	Vision meets Robotics: The KITTI Dataset [pdf] [slide] Andreas Geiger and Philip Lenz and Christoph Stiller and Raquel Urtasun	IJRR 2013 Geiger2013IJRR

Datasets & Benchmarks Real Data
	Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite [pdf] [slide] Andreas Geiger and Philip Lenz and Raquel Urtasun	CVPR 2012 Geiger2012CVPR

Autonomous driving platform equipped with video cameras, Velodyne scanner & GPS
Goal: provide novel benchmarks for several tasks
- Stereo & optical flow: 389 image pairs
- Stereo visual odometry: sequences of 39.2 km total length
- 2D & 3D object detection: vehicles, pedestrians, cyclists (>200k annotations)
Online evaluation server (held-out test ground truth)
Conclusions: novel challenges and ranking compared to lab conditions (e.g., Middlebury)

Cameras Models & Calibration Calibration
	Automatic Calibration of Range and Camera Sensors using a single Shot [pdf] [slide] Andreas Geiger and Frank Moosmann and Oemer Car and Bernhard Schuster	ICRA 2012 Geiger2012ICRA

Setup of calibrated systems heavily delays robotic research
Toolbox with web interface for fully automatic camera-to-camera and camera-to-range calibration using planar checkerboard patterns
Recovers intrinsic and extrinsic camera parameters as well as transformation between cameras and range sensors within one minute
Checkerboard corner detector significantly outperforms state-of-the-art
Validation using a variety of sensors such as cameras, Kinect, and Velodyne laser scanner

Reconstruction Stereo
	Efficient Large-Scale Stereo Matching [pdf] [slide] Geiger, Andreas and Roser, Martin and Urtasun, Raquel	ACCV 2010 Geiger2010ACCV

Fast stereo matching for high-resolution images
Efficient, parallel algorithm in a reduced search space
Building a prior on the disparities
- Robustly matched points used to form a triangulation (support points)
- Reducing the matching ambiguities of the remaining points
- Piecewise linear: robust to poorly-textured and slanted surfaces
Automatic detection of disparity range
Significantly lower matching entropy compared to using a uniform prior
1 sec for a 1 Megapixel image pair on a single CPU
State-of-the-art with significant speed-ups on large-scale Middlebury benchmark
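
A piecewise linear prior of this kind can be sketched as interpolating the disparity at a pixel from the three support points of its enclosing triangle. The code below uses plain barycentric interpolation and returns only the interpolated value; the function name and the example support points are assumptions, and the triangulation itself is taken as given.

```python
def interpolate_disparity(p, tri):
    """Disparity at pixel p = (u, v) on the plane through three support points.

    tri holds three (u, v, d) tuples. Returns None if p lies outside the
    triangle or the triangle is degenerate.
    """
    (u1, v1, d1), (u2, v2, d2), (u3, v3, d3) = tri
    det = (v2 - v3) * (u1 - u3) + (u3 - u2) * (v1 - v3)
    if det == 0:
        return None
    # Barycentric coordinates of p with respect to the triangle.
    l1 = ((v2 - v3) * (p[0] - u3) + (u3 - u2) * (p[1] - v3)) / det
    l2 = ((v3 - v1) * (p[0] - u3) + (u1 - u3) * (p[1] - v3)) / det
    l3 = 1.0 - l1 - l2
    if min(l1, l2, l3) < 0:
        return None
    return l1 * d1 + l2 * d2 + l3 * d3


# Hypothetical support points on the slanted plane d = 10 + u + 2 v.
triangle = [(0.0, 0.0, 10.0), (10.0, 0.0, 20.0), (0.0, 10.0, 30.0)]
inside = interpolate_disparity((2.0, 2.0), triangle)
outside = interpolate_disparity((20.0, 20.0), triangle)
```

The interpolated value follows slanted surfaces exactly, which is why such a prior copes better with them than a constant-disparity assumption.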

Scene Understanding Indoor 3D Scene Understanding
	Joint 3D Object and Layout Inference from a single RGB-D Image [pdf] [slide] Andreas Geiger and Chaohui Wang	GCPR 2015 Geiger2015GCPR

Inferring 3D objects and the layout of indoor scenes from a single RGB-D image captured with a Kinect camera
A high-order graphical model to jointly reason about the layout, objects and superpixels in the image
Sampling accurate 3D CAD proposals directly from the unary distribution
Explicitly modelling visibility and occlusion constraints
Improvements with respect to two custom baselines and state-of-the-art

Reconstruction Multi-view 3D Reconstruction
	StereoScan: Dense 3D Reconstruction in Real-time [pdf] [slide] Andreas Geiger and Julius Ziegler and Christoph Stiller	IV 2011 Geiger2011IV

Real-time 3D reconstruction from high-resolution stereo sequences using visual odometry
Sparse feature matching using blob, corner detector and descriptors
Egomotion estimation by minimizing the reprojection error and refining with Kalman filter
Dense 3D reconstruction by projecting image points into 3D and associating the projected points
Visual odometry runs at 25fps and 3D reconstruction at 3-4fps
Evaluation on the Karlsruhe dataset against GPS+IMU data and a freely available visual odometry library

Object Detection Person Detection
	Survey on Pedestrian Detection for Advanced Driver Assistance Systems [pdf] [slide] David Geronimo and Antonio M. Lopez and Angel D. Sappa and Thorsten Graf	PAMI 2010 Geronimo2010PAMI

In this paper, the focus is on a particular type of ADAS, pedestrian protection systems (PPSs).
The objective of a PPS is to detect the presence of both stationary and moving people in a specific area of interest around the moving host vehicle in order to warn the driver

Presents a general module-based architecture that simplifies the comparison of specific detection tasks
Provides a comprehensive up-to-date review of state-of-the-art sensors and benchmarking
Reviews different approaches according to the specific tasks defined in the aforementioned architecture
Major progress has been made in pedestrian classification, mainly due to synergy with generic object detection and applications such as face detection and surveillance

Cameras Models & Calibration Omnidirectional Cameras
	A unifying theory for central panoramic systems and practical implications [pdf] [slide] Christopher Geyer and Kostas Daniilidis	ECCV 2000 Geyer2000ECCV

Provides a unifying theory for all central catadioptric systems, i.e. all catadioptric systems with a unique effective viewpoint
Shows that all of them are isomorphic to projective mappings from the sphere to a plane with a projection center on the perpendicular to the plane
This unification is novel & has significant impact on the 3D interpretation of images
Presents new invariances inherent in parabolic projections and a unifying calibration scheme from one view
Describes the advantages of catadioptric systems & explains why images arising in central catadioptric systems contain more information than images from conventional cameras
One example is that intrinsic calibration from a single view is possible for parabolic catadioptric systems given only three lines
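
The sphere-to-plane mapping can be sketched in its commonly used form: a 3D point is projected onto the unit sphere and then perspectively projected from a center shifted by a distance xi along the axis perpendicular to the image plane. The symbol `xi`, the function name and the use of normalized image coordinates (no intrinsics) are assumptions of this sketch, not notation from the paper.

```python
import math


def unified_project(X, Y, Z, xi):
    """Project a 3D point via the unit sphere; returns normalized image coords."""
    norm = math.sqrt(X * X + Y * Y + Z * Z)
    # Step 1: central projection onto the unit sphere.
    xs, ys, zs = X / norm, Y / norm, Z / norm
    # Step 2: perspective projection from a center at distance xi behind the
    # sphere center (undefined when zs == -xi).
    denom = zs + xi
    return xs / denom, ys / denom


# xi = 0 reduces to the pinhole model (X / Z, Y / Z).
pinhole = unified_project(1.0, 2.0, 4.0, 0.0)
# xi = 1 (parabolic mirror case) still images a ray 90 degrees off-axis.
sideways = unified_project(1.0, 0.0, 0.0, 1.0)
```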

Semantic Segmentation Semantic Segmentation
	Laplacian Pyramid Reconstruction and Refinement for Semantic Segmentation [pdf] [slide] Golnaz Ghiasi and Charless C. Fowlkes	ECCV 2016 Ghiasi2016ECCV

A multi-resolution reconstruction (from low to full resolution) architecture for semantic segmentation
Significant sub-pixel localization information in high-dimensional features
- Sub-pixel up-sampling using a class-specific reconstruction basis
- Substantially improves over common up-sampling schemes
Laplacian pyramid using skip connections from higher resolution feature maps
Reducing the effect of shallow, high-resolution layers by using them only to correct residual errors in the low-resolution prediction (like ResNets)
Multiplicative gating to avoid integrating noisy high-resolution outputs
State-of-the-art results on the PASCAL VOC and Cityscapes benchmarks

Tracking
	A Bayesian Framework for Multi-cue 3D Object Tracking [pdf] [slide] J. Giebel and D.M. Gavrila and C. Schnorr	ECCV 2004 Giebel2004ECCV

Multi-cue 3D deformable object tracking from a moving vehicle
Spatio-temporal shape representation by a set of distinct linear subspace models, Dynamic Point Distribution Models (DPDMs)
- Continuous and discontinuous appearance changes
- Learned fully automatically from training data
Texture information by means of intensity histograms, compared using the Bhattacharyya coefficient
Direct 3D measurement by a stereo system
State propagation by a particle filter combining shape, texture and depth in its observation density function
Measurements from an independent object detection by means of importance sampling
Evaluated in urban, rural, and synthetic environments
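
The Bhattacharyya coefficient used for the texture cue measures the overlap of two normalized histograms: it is 1 for identical distributions and 0 for disjoint ones. A minimal sketch, assuming histograms given as lists of non-negative bin counts of equal length (the function name and the example bins are made up for illustration):

```python
import math


def bhattacharyya_coefficient(hist_a, hist_b):
    """Overlap of two histograms after normalizing each to sum to 1."""
    total_a, total_b = sum(hist_a), sum(hist_b)
    return sum(
        math.sqrt((a / total_a) * (b / total_b))
        for a, b in zip(hist_a, hist_b)
    )


same = bhattacharyya_coefficient([2, 2], [1, 1])            # same shape
disjoint = bhattacharyya_coefficient([1, 1, 0, 0], [0, 0, 1, 1])
partial = bhattacharyya_coefficient([1, 1, 0], [0, 1, 1])   # one shared bin
```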

Motion & Pose Estimation Simultaneous Localization and Mapping
	Integrating metric and semantic maps for vision-only automated parking [pdf] [slide] Hugo Grimmett and Mathias Burki and Lina María Paz and Pedro Pinies and Paul Timothy Furgale and Ingmar Posner and Paul Newman	ICRA 2015 Grimmett2015ICRA

Creating metric maps and semantic maps
Missing in the literature: how to update the semantic layer as the metric map evolves
Unsupervised evolution of both maps as the environment is revisited by the robot
Distinguishing between static and dynamic maps
Using vision-only sensors and reduced human labelling of semantic maps in case of safety-critical situations
Automatically generating road network graphs
Evaluated on two different car parks with a fully automated car, performing repeated automated parking manoeuvres (V-Charge project)

Motion & Pose Estimation 2D Motion Estimation -- Optical Flow
	Deep Discrete Flow [pdf] [slide] Fatma Güney and Andreas Geiger	ACCV 2016 Guney2016ACCV

Learning features for optical flow by training a CNN for feature matching on image patches
Large receptive field size via dilated convolutions
A context network (dilated convolutions) trained on the output of a local network (regular convolutions)
Fast exact matching on GPU
Discrete flow framework
Regular BP with 300 proposals
Evaluated on Sintel and KITTI benchmarks

Reconstruction Stereo
	Displets: Resolving Stereo Ambiguities using Object Knowledge [pdf] [slide] Fatma Güney and Andreas Geiger	CVPR 2015 Guney2015CVPR

Using object-category specific disparity proposals (displets) to compensate for the weak data term on the reflecting and textureless surfaces
Displets as non-local regularizer for the challenging object class 'car' in a superpixel based CRF framework
Sampling displets using inverse graphics techniques based on a sparse disparity estimate and a semantic segmentation of the image
Representative set of 3D CAD models of cars from Google Warehouse (8 models)
Mesh simplification of 3D CAD models for preserving the hull of the object
The best performing method on KITTI stereo benchmark, but slow

Cameras Models & Calibration Omnidirectional Cameras
	Real-Time Direct Dense Matching on Fisheye Images Using Plane-Sweeping [pdf] [slide] Christian Hane and Lionel Heng and Gim Hee Lee and Alexey Sizov and Marc Pollefeys	THREEDV 2014 Haene2014THREEDV

An adaptation of camera projection models for fisheye cameras into the plane-sweeping stereo matching algorithm
Depth maps computed directly from the fisheye images to cover a larger part of the scene with fewer images
Plane-sweeping approach over rectification:
- Suitable for more than two images
- Well-suited to GPUs for real-time performance
Requirement: Efficient projection and unprojection
Two different camera models: the unified projection and the field-of-view (FOV)
Unified projection model also works for other non-pinhole cameras such as omnidirectional and catadioptric cameras.
Simple, real-time approach for full, good quality and high resolution depth maps

Semantic Segmentation Road/Lane Detection
	Obstacle detection for self-driving cars using only monocular cameras and wheel odometry [pdf] [slide] Christian Hane and Torsten Sattler and Marc Pollefeys	IROS 2015 Haene2015IROS

Extracting static obstacles from depth maps computed from monocular fisheye cameras: parked cars and signposts, the amount of free space, distance between obstacles, the size of an empty parking spot
Motivation: Affordable, reliable, accurate, and real-time detection of obstacles
Two approaches: Active methods using sensors such as laser scanners, time-of-flight, structured light or ultrasound and passive methods using camera images
No need for accurate visual inertial odometry estimation, only available wheel odometry
Steps:
- Depth estimation for each camera using multi-view stereo matching
- Obstacle detection in 2D
- Fusing the obstacle detections over several camera frames to handle uncertainty
Accurate enough for navigation purposes of self-driving cars

Datasets & Benchmarks Real Data
	Fast semantic segmentation of 3d point clouds with strongly varying density [pdf] [slide] Timo Hackel and Jan D. Wegner and Konrad Schindler	APRS 2016 Hackel2016APRS

Semantic segmentation of 3D point clouds
Unstructured and inhomogeneous point clouds (LiDAR, photogrammetric reconstruction)
Features from neighbourhood relations
- A multi-scale pyramid with decreasing point density
- A separate search structure per scale level
Random Forest classifier to predict class-conditional probabilities
Point clouds with many millions of points in a matter of minutes (< 4 minutes per 10 million points)
Evaluated on
- benchmark data from a mobile mapping platform (Paris-Rue-Cassette and Paris-Rue-Madame)
- a variety of large, terrestrial laser scans with greatly varying point density

Reconstruction Reconstruction & Recognition
	Class Specific 3D Object Shape Priors Using Surface Normals [pdf] [slide] Haene, Christian and Savinov, Nikolay and Pollefeys, Marc	CVPR 2014 Haene2014CVPR

Dense 3D reconstruction of real world objects
General smoothness priors such as surface area regularization can lead to defects
Exploit the object class specific local surface orientation to solve this problem
Object class specific shape prior in form of spatially varying anisotropic smoothness term
Discrete Wulff shapes allow general enough parametrization for anisotropic smoothness
Parameters are extracted from training data
Directly fits into volumetric multi-label reconstruction approaches
Allows a segmentation between the object and its supporting grounds
Evaluated on synthetic data and real world sequences

Reconstruction Reconstruction & Recognition
	Joint 3D Scene Reconstruction and Class Segmentation [pdf] [slide] Christian Haene and Christopher Zach and Andrea Cohen and Roland Angst and Marc Pollefeys	CVPR 2013 Haene2013CVPR

Proposes a rigorous mathematical framework to formulate and solve a joint segmentation and dense reconstruction problem

Contributions:
- Demonstrates that joint image segmentation and dense 3D reconstruction is beneficial for both tasks
- Introduces a rigorous mathematical framework to formulate and solve this joint optimization task.
- Extends volumetric scene reconstruction methods to a multi-label volumetric segmentation framework

Evaluates on castle P-30 dataset

Reconstruction Stereo
	A Patch Prior for Dense 3D Reconstruction in Man-Made Environments [pdf] [slide] Christian Haene and Christopher Zach and Bernhard Zeisl and Marc Pollefeys	THREEDIMPVT 2012 Haene2012THREEDIMPVT

Dense 3D reconstructions suffer from weak and ambiguous observations in man-made environments that can be solved with strong, domain-specific priors
Powerful prior directly modeling the expected local surface-structure without the need for higher-order MRFs
Using a small patch dictionary, as in the patch-based representations used in image processing
Energy can be optimized using an efficient first-order primal dual algorithm
The patch dictionary and priors on dictionary coefficients are known
Demonstrate the prior for dense reconstruction of 3D models using stereo and fusion of multiple depth maps on synthetic data and real data

Semantic Segmentation Semantic Instance Segmentation
	Shape-aware Instance Segmentation [pdf] [slide] Zeeshan Hayder and Xuming He and Mathieu Salzmann	ARXIV 2016 Hayder2016ARXIV

Instance-level semantic segmentation
Methods typically propose candidate objects and directly predict a binary mask
Cannot recover from errors in candidate generation like too small or shifted boxes
Novel object segment representation based on the distance transform of the object masks
Object mask network with a new residual-deconvolution architecture infers such representation and decodes it into the final binary mask
Integration into a Multitask Network Cascade framework and training end-to-end of the shape-aware instance segmentation network
Outperforms state-of-the-art in object proposal generation and instance segmentation in PASCAL VOC 2012 and CityScapes

Cameras Models & Calibration Calibration
	Leveraging Image-based Localization for Infrastructure-based Calibration of a Multi-camera Rig [pdf] [slide] Lionel Heng and Paul Timothy Furgale and Marc Pollefeys	JFR 2015 Heng2015JFR

Efficient, robust, completely unsupervised infrastructure-based calibration method for calibration of a multi-camera rig
- Efficient, near real-time
- No modification of the infrastructure (or calibration area)
- By using natural features instead of known fiducial markings
- Completely unsupervised
- No initial guesses for the extrinsic parameters
- Without assuming overlapping fields of view
Using a map of a chosen calibration area via SLAM-based self-calibration (one-time run)
Leveraging image-based localization
Significantly improved version of Heng2013IROS. Differences to the earlier version:
- Robust 6D pose graph optimization
- Improved feature matching
- More improvements related to joint optimization
Extensive experiments to quantify the accuracy and repeatability of the extrinsics
Evaluation of the accuracy of the map

Cameras Models & Calibration Calibration
	CamOdoCal: Automatic intrinsic and extrinsic calibration of a rig with multiple generic cameras and odometry [pdf] [slide] Lionel Heng and Bo Li and Marc Pollefeys	IROS 2013 Heng2013IROS

A full automatic pipeline for both intrinsic calibration for a generic camera and extrinsic calibration for a rig with multiple generic cameras and odometry
- Without the need for GPS/INS and the Vicon motion capture system
Intrinsic calibration for each generic camera using a chessboard
Extrinsic calibration to find all camera-odometry transforms
- Monocular VO for each camera using five-point algorithm and linear triangulation
- Initial estimate of the camera-odometry transform that is robust to feature-poor areas
- 3D point triangulation
- Finding local inter-camera feature point correspondences for consistency
- Loop closure detection using a vocabulary tree
- Full bundle adjustment which optimizes all intrinsics, extrinsics, odometry poses, and 3D scene points
Globally-consistent sparse map of landmarks which can be used for visual localization
Highly accurate, automated, adaptable calibration for arbitrary, large-scale environments

Reconstruction Stereo
	Stereo Processing by Semiglobal Matching and Mutual Information [pdf] [slide] Hirschmüller, Heiko	PAMI 2008 Hirschmueller2008PAMI

A pixel-wise, Mutual Information (MI)-based matching cost
Cost aggregation as approximation of a global, 2D smoothness constraint by combining many 1D constraints
- Two penalty terms: a lower penalty for small disparity changes and a higher penalty for larger ones
Disparity computation by winner-takes-all (WTA), followed by disparity refinement through consistency checking and sub-pixel interpolation
- Propagating valid disparities along paths from eight directions
Multi-baseline matching by fusion of disparities
Further disparity refinements: peak filtering, intensity consistent disparity selection, and gap interpolation
Matching almost arbitrarily large images
Fusion of several disparity images using orthographic projection
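A minimal sketch of the cost aggregation along a single 1D path, followed by winner-takes-all disparity selection. The toy cost values and the penalties `p1` and `p2` are assumptions chosen for illustration; the full method sums such path costs over many directions and uses a mutual-information matching cost, neither of which is shown here.

```python
def aggregate_path(costs, p1, p2):
    """Aggregate matching costs along one left-to-right path.
    costs[x][d] is the pixel-wise cost of pixel x at disparity d.
    p1 penalizes disparity changes of one level, p2 larger jumps."""
    n_disp = len(costs[0])
    aggregated = [list(costs[0])]
    for x in range(1, len(costs)):
        prev = aggregated[-1]
        prev_min = min(prev)
        row = []
        for d in range(n_disp):
            candidates = [prev[d], prev_min + p2]
            if d > 0:
                candidates.append(prev[d - 1] + p1)
            if d + 1 < n_disp:
                candidates.append(prev[d + 1] + p1)
            # Subtracting prev_min keeps the values from growing along the path.
            row.append(costs[x][d] + min(candidates) - prev_min)
        aggregated.append(row)
    return aggregated

def winner_takes_all(cost_rows):
    """Pick the disparity with the lowest cost for every pixel."""
    return [min(range(len(row)), key=row.__getitem__) for row in cost_rows]

# Hypothetical scanline: pixel 2 is ambiguous and slightly prefers disparity 2.
costs = [[0, 5, 5], [0, 5, 5], [3, 3, 2], [0, 5, 5]]
aggregated = aggregate_path(costs, p1=2, p2=4)
raw_disparities = winner_takes_all(costs)
smooth_disparities = winner_takes_all(aggregated)
```

In this toy scanline the raw costs produce an isolated disparity jump at the ambiguous pixel, while the aggregated costs keep the disparity consistent with its neighbours.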

Reconstruction Stereo
	Evaluation of Cost Functions for Stereo Matching [pdf] [slide] H. Hirschmüller and D. Scharstein	CVPR 2007 Hirschmueller2007CVPR

Evaluation of the insensitivity of different matching costs with respect to radiometric variations for stereo correspondence methods
Pixel-based and window-based variants are considered
Sampling-insensitive absolute differences, three filter-based costs, hierarchical mutual information and normalized cross-correlation
Measure the performance in the presence of global intensity changes, local intensity changes, and noise
Different costs are evaluated with local, semi-global and global stereo methods
Using Middlebury stereo dataset with ground-truth disparities and six new datasets taken under controlled changes of exposure and lighting
Filter-based costs performed best with local radiometric variations but have blurry edges whereas HMI has sharp edges
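To illustrate what insensitivity to radiometric variations means, the following sketch compares a plain sum of absolute differences with normalized cross-correlation under a global gain and offset change. The patch values and the gain/offset are made-up numbers, and the two functions are generic textbook formulations rather than the exact variants evaluated in the paper.

```python
import math

def sum_abs_diff(a, b):
    """Sum of absolute differences between two patches (flattened)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def ncc(a, b):
    """Normalized cross-correlation in [-1, 1]; 1 means a perfect match
    up to gain and offset."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    den = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return sum(x * y for x, y in zip(da, db)) / den if den > 0 else 0.0

patch = [10.0, 20.0, 30.0, 50.0]
# Same patch seen with a different exposure (hypothetical gain 1.5, offset 20).
brighter = [1.5 * v + 20.0 for v in patch]

sad_cost = sum_abs_diff(patch, brighter)
ncc_score = ncc(patch, brighter)
```

The absolute-difference cost reports a large mismatch for the same surface patch, while NCC still reports a perfect correlation.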

Scene Understanding
	Recovering Surface Layout from an Image [pdf] [slide] Hoiem, Derek and Efros, Alexei A. and Hebert, Martial	IJCV 2007 Hoiem2007IJCV

Constructing the surface layout via a labelling of the image into geometric classes
- main classes (support, vertical, sky) and subclasses of vertical (left, center, right, porous, solid)
Appearance-based models for each class through multiple segmentations
- A wide variety of image cues including position, color, texture, and perspective
- Multiple segmentations for the spatial support, useful especially for subclasses
Applicable to a wide variety of outdoor scenes and generalizable to indoor scenes

Semantic Segmentation Semantic Segmentation of 3D Data
	Point Cloud Labeling using 3D Convolutional Neural Network [pdf] [slide] Jing Huang and Suya You	ICPR 2016 Huang2016ICPR

Labelling 3D point clouds using a 3D CNN
Motivation:
- Projecting 3D to 2D: loss of important 3D structural information
- No segmentation step or hand-crafted features
An end-to-end segmentation method based on voxelized data
- Voxelization to generate occupancy voxel grids centered at a set of keypoints
- 3D CNN: two 3D convolutional layers, two 3D max-pooling layers, a fully connected layer and a logistic regression layer
Experiments on a large Lidar point cloud dataset of the urban area of Ottawa with 7 categories
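A minimal sketch of the voxelization step: points around a keypoint are binned into a cubic binary occupancy grid. The grid size, voxel size and sample points are assumptions for illustration, not the values used in the paper.

```python
def occupancy_grid(points, center, grid_size, voxel_size):
    """Binary occupancy grid with grid_size^3 voxels centered at `center`.
    A voxel is 1 if at least one point falls inside it."""
    half = grid_size * voxel_size / 2.0
    grid = [[[0] * grid_size for _ in range(grid_size)] for _ in range(grid_size)]
    for p in points:
        idx = [int((p[k] - center[k] + half) // voxel_size) for k in range(3)]
        # Points outside the cube around the keypoint are ignored.
        if all(0 <= i < grid_size for i in idx):
            grid[idx[0]][idx[1]][idx[2]] = 1
    return grid

# Hypothetical points; the last one lies far outside the grid.
points = [(0.1, 0.1, 0.1), (0.9, -0.9, 0.0), (5.0, 0.0, 0.0)]
grid = occupancy_grid(points, center=(0.0, 0.0, 0.0), grid_size=4, voxel_size=0.5)
occupied = sum(v for plane in grid for row in plane for v in row)
```

Each such grid would then serve as one input sample to the 3D CNN, with the label taken from the keypoint.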

Motion & Pose Estimation 2D Motion Estimation -- Optical Flow
	FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks [pdf] [slide] Eddy Ilg and Nikolaus Mayer and Tonmoy Saikia and Margret Keuper and Alexey Dosovitskiy and Thomas Brox	ARXIV 2016 Ilg2016ARXIV

Improving end-to-end optical flow estimation with a CNN
A learning schedule consisting of multiple datasets
- Training on Chairs first and fine-tuning on Things3D
- FlowNetC outperforms FlowNetS
A stacked architecture by warping of the second image with intermediate optical flow
Different variants of the network (trade-off between accuracy and speed)
A sub-network specializing on small motions trained on a special dataset
Adding another network that learns to fuse the stacked network with the small displacement network
Better than FlowNet and on par with state-of-the-art on Sintel and KITTI
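The warping step used for stacking can be sketched as follows: the second image is sampled at positions displaced by the intermediate flow, so that the next network only has to estimate the remaining motion. This sketch uses nearest-neighbour sampling on toy integer images as a simplifying assumption; a differentiable implementation would typically use bilinear interpolation.

```python
def warp_backward(image2, flow):
    """Warp image2 towards image1: warped(x, y) = image2(x + u, y + v).
    flow[y][x] is the (u, v) displacement of pixel (x, y) in image1.
    Samples that fall outside image2 are set to 0."""
    h, w = len(image2), len(image2[0])
    warped = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            u, v = flow[y][x]
            sx, sy = int(round(x + u)), int(round(y + v))
            if 0 <= sx < w and 0 <= sy < h:
                warped[y][x] = image2[sy][sx]
    return warped

# Hypothetical data: image2 is image1 shifted one pixel to the right.
image1 = [[1, 2, 3, 0], [4, 5, 6, 0]]
image2 = [[0, 1, 2, 3], [0, 4, 5, 6]]
flow = [[(1.0, 0.0)] * 4 for _ in range(2)]
warped = warp_backward(image2, flow)
```

With a correct flow estimate the warped second image matches the first one, so any remaining difference indicates the flow error left for the next network in the stack.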

Semantic Segmentation Semantic Segmentation of Facades
	Efficient Facade Segmentation Using Auto-context [pdf] [slide] Varun Jampani and Raghudeep Gadde and Peter V. Gehler	WACV 2015 Jampani2015WACV

Problem: Segmentation for 2D images and 3D point clouds of building facades

Existing methods make use of domain-specific knowledge as strong prior information

Contributions:
- Introduces a generic segmentation method that ignores domain knowledge
- Shows good segmentation results using pixel classification methods that combine basic image features with auto-context features
- Proposes system of a sequence of boosted decision tree classifiers stacked using auto-context features

Scene Understanding Indoor 3D Scene Understanding
	3D-Based Reasoning with Blocks, Support, and Stability [pdf] [slide] Jia, Zhaoyin and Gallagher, Andrew and Saxena, Ashutosh and Chen, Tsuhan	CVPR 2013 Jia2013CVPR

3D volumetric reasoning from RGB-D images using 3D block units
Fit image segments with 3D blocks
Iteratively evaluate the scene based on block interaction properties: intersections, supportive relations and the stability of the scene given the boxes
Joint optimization over segmentations, block fitting, supporting relations and object stability
Evaluation on several RGB-D datasets ¹ including controlled and real indoor scenarios

¹ N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.

Motion & Pose Estimation Simultaneous Localization and Mapping
	iSAM2: Incremental Smoothing and Mapping Using the Bayes Tree [pdf] [slide] Michael Kaess and Hordur Johannsson and Richard Roberts and Viorela Ila and John J. Leonard and Frank Dellaert	IJRR 2012 Kaess2012IJRR

Presents a novel data structure, the Bayes tree, that provides an algorithmic foundation enabling a better understanding of existing graphical model inference algorithms and their connection to sparse matrix factorization methods

Contributions:
- Bayes tree encodes a factored probability density, but unlike the clique tree it is directed and maps more naturally to the information matrix of the simultaneous localization and mapping problem
- Shows how the fairly abstract updates to a matrix factorization translate to a simple editing of the Bayes tree and its conditional densities
- Applies the Bayes tree to obtain a novel algorithm for sparse nonlinear incremental optimization, which achieves improvements in efficiency through incremental variable re-ordering & relinearization

Evaluates on a range of real and simulated datasets like Manhattan, Killian Court and City20000

Motion & Pose Estimation Simultaneous Localization and Mapping
	iSAM: Incremental Smoothing and Mapping [pdf] [slide] Michael Kaess and Ananth Ranganathan and Frank Dellaert	TR 2008 Kaess2008TR

Simultaneous localization and mapping
Requirements for SLAM: incremental, real-time, applicable to large-scale environments, and online data association
An incremental smoothing and mapping approach based on fast incremental matrix factorization
Efficient and exact solution by updating a QR factorization of the naturally sparse smoothing information matrix
Recalculating only the matrix entries that actually change
Periodic variable reordering to avoid unnecessary fill-in (trajectories with many loops)
Estimation of relevant uncertainties for online data association
Evaluation on various simulated and real-world datasets for both landmark and pose-only settings
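A minimal sketch of an incremental QR update: a new measurement row is folded into an existing upper-triangular factor with Givens rotations, touching only the entries that change. The dense list-of-lists representation, the function names and the two-variable toy problem are assumptions for illustration; the actual system works on large sparse matrices with variable reordering.

```python
import math

def givens_update(R, d, new_row, new_rhs):
    """Fold one measurement row (new_row, new_rhs) into the upper-triangular
    factor R and right-hand side d. Returns updated copies."""
    n = len(R)
    R = [row[:] for row in R]
    d = d[:]
    row, rhs = new_row[:], new_rhs
    for k in range(n):
        if row[k] == 0.0:
            continue  # nothing to eliminate in this column
        r = math.hypot(R[k][k], row[k])
        c, s = R[k][k] / r, row[k] / r
        for j in range(k, n):
            upper, lower = R[k][j], row[j]
            R[k][j] = c * upper + s * lower
            row[j] = -s * upper + c * lower
        d[k], rhs = c * d[k] + s * rhs, -s * d[k] + c * rhs
    return R, d

def back_substitute(R, d):
    """Solve R x = d for upper-triangular R."""
    n = len(R)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (d[i] - sum(R[i][j] * x[j] for j in range(i + 1, n))) / R[i][i]
    return x

# Toy problem: priors x0 = 1 and x1 = 2, then a new measurement x0 + x1 = 5.
R = [[1.0, 0.0], [0.0, 1.0]]
d = [1.0, 2.0]
R_new, d_new = givens_update(R, d, new_row=[1.0, 1.0], new_rhs=5.0)
solution = back_substitute(R_new, d_new)
```

The result equals the batch least-squares solution of all three measurements, which is what makes the incremental update exact rather than approximate.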

Motion & Pose Estimation Localization
	Alignment of 3D point clouds to overhead images [pdf] [slide] R. S. Kaminsky and Noah Snavely and Steven M. Seitz and Richard Szeliski	CVPRWORK 2009 Kaminsky2009CVPRWORK

Addresses the problem of automatically aligning structure-from-motion reconstructions to overhead images, such as satellite images, maps and floor plans, generated from an orthographic camera

Contributions:
- Computes the optimal alignment using an objective function that matches 3D points to image edges
- Imposes free space constraints based on the visibility of points in each camera

Evaluates on several outdoor and indoor scenes using both satellite and floor plan images

Motion & Pose Estimation Localization
	PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization [pdf] [slide] Alex Kendall and Matthew Grimes and Roberto Cipolla	ICCV 2015 Kendall2015ICCV

Robust and real-time monocular relocalization system
23-layer deep convnet to regress the 6-DOF camera pose from an RGB image in an end-to-end manner
Transfer learning from large scale classification data (training a pose regressor, pre-trained as a classifier on immense recognition datasets)
Using SfM to automatically generate camera poses from a video of the scene
Mapping feature vectors to pose which generalizes to unseen scenes with a few additional training samples
Evaluated both indoors (7 Scenes dataset) and outdoors, running in real time (5 ms per frame)
An outdoor urban localization dataset with 5 scenes: Cambridge Landmarks
Robust to difficult lighting, motion blur and different camera intrinsics where point based SIFT registration fails

Semantic Segmentation Semantic Instance Segmentation
	InstanceCut: from Edges to Instances with MultiCut [pdf] [slide] Alexander Kirillov and Evgeny Levinkov and Bjoern Andres and Bogdan Savchynskyy and Carsten Rother	ARXIV 2016 Kirillov2016ARXIV

Instance-aware semantic segmentation
Challenges:
- Meaningless labels (car number 5)
- Varying number of objects
- Set of pixels vs. bounding boxes
- Need for a more refined labelling of the training data (rare classes)
An instance-agnostic semantic segmentation using CNNs
All instance-boundaries using a new instance-aware edge detection model
Combined into a novel MultiCut formulation
Evaluated on CityScapes: particularly well for rare object classes
Not handled: instances that are formed by disconnected regions in the image

Motion & Pose Estimation Ego-Motion Estimation
	Visual Odometry based on Stereo Image Sequences with RANSAC-based Outlier Rejection Scheme [pdf] [slide] Bernd Kitt and Andreas Geiger and Henning Lategahn	IV 2010 Kitt2010IV

Well distributed corner-like feature matches due to bucketing
Using trifocal geometry the egomotion is estimated
Iterated Sigma Point Kalman Filter yields robust frame-to-frame motion estimation
Outliers are rejected with a RANSAC-based approach
Outperforms other filtering techniques in accuracy and run time
Evaluated on simulated and real world data with INS trajectories
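Bucketing can be sketched as follows: the image is divided into a grid and only a limited number of features is kept per cell, so that matches are spread over the whole image. The bucket size, the per-bucket limit and the selection by detector score are assumptions for illustration.

```python
def bucket_features(features, bucket_size, max_per_bucket):
    """features: list of (x, y, score). Keeps the strongest max_per_bucket
    features in every bucket_size x bucket_size image cell."""
    buckets = {}
    for x, y, score in features:
        key = (int(x // bucket_size), int(y // bucket_size))
        buckets.setdefault(key, []).append((x, y, score))
    kept = []
    for key in sorted(buckets):
        strongest = sorted(buckets[key], key=lambda f: f[2], reverse=True)
        kept.extend(strongest[:max_per_bucket])
    return kept

# Hypothetical detections: three clustered corners and one isolated weak corner.
features = [(5, 5, 0.9), (6, 7, 0.8), (8, 3, 0.7), (120, 40, 0.2)]
kept = bucket_features(features, bucket_size=50, max_per_bucket=2)
```

The isolated weak corner survives while the crowded cell is thinned out, which avoids motion estimates that are dominated by one textured image region.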

Datasets & Benchmarks Real Data
	The HCI Benchmark Suite: Stereo and Flow Ground Truth With Uncertainties for Urban Autonomous Driving [pdf] [slide] Kondermann, Daniel and Nair, Rahul and Honauer, Katrin and Krispin, Karsten and Andrulis, Jonas and Brock, Alexander and Gussefeld, Burkhard and Rahimimoghaddam, Mohsen and Hofmann, Sabine and Brenner, Claus and Jahne, Bernd	CVPRWORK 2016 Kondermann2016CVPRWORK

Stereo and optical flow dataset to complement existing benchmarks
Representative for urban autonomous driving, including realistic systematically varied radiometric and geometric challenges
Evaluation of the ground truth accuracy with Monte Carlo simulations
Interquartile ranges are used as uncertainty measure
Binary masks for dynamically moving regions are supplied with estimated stereo and flow
Initial benchmark consists of 55 manually selected sequences between 19 and 100 frames
Interactive tools for database search, visualization, comparison and benchmarking
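As a small illustration of the uncertainty measure, the interquartile range of a set of Monte Carlo samples can be computed as below. The sample values are invented, and the quantile method is an assumption; the benchmark's own sampling procedure is not reproduced here.

```python
import statistics

def interquartile_range(samples):
    """Difference between the third and first quartile of the samples."""
    q1, _, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    return q3 - q1

# Hypothetical Monte Carlo samples of one ground-truth value with one outlier.
samples = [10.0, 10.2, 9.9, 10.1, 10.0, 14.0, 9.8, 10.1]
iqr = interquartile_range(samples)
value_range = max(samples) - min(samples)
```

Unlike the full range, the interquartile range is hardly affected by the single outlier, which makes it a robust spread measure.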

Motion & Pose Estimation 2D Motion Estimation -- Optical Flow
	Fast Optical Flow using Dense Inverse Search [pdf] [slide] Till Kroeger and Radu Timofte and Dengxin Dai and Luc Van Gool	ARXIV 2016 Kroeger2016ARXIV

Very low time complexity for dense optical flow
Inverse search for a uniform grid of patch correspondences
- Inverse Lucas-Kanade algorithm proposed before
Dense displacement field creation through patch aggregation along multiple scales
- Coarse-to-fine scheme
- Densification as weighted averaging of displacement estimates
Variational refinement
300Hz up to 600Hz on a single CPU core (human-level temporal resolution)
Evaluated on Sintel and KITTI benchmarks
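The densification step can be sketched in 1D: each pixel receives a weighted average of the displacements of all patches covering it, with lower weight for patches that match poorly. Using one residual per patch and the weight 1 / max(1, |residual|) are simplifying assumptions for this sketch, as are the patch values.

```python
def densify(width, patches):
    """patches: list of (start, size, displacement, residual) on a 1D signal.
    Returns one displacement per pixel as a weighted average over all
    patches that overlap the pixel."""
    flow = []
    for x in range(width):
        num, den = 0.0, 0.0
        for start, size, displacement, residual in patches:
            if start <= x < start + size:
                weight = 1.0 / max(1.0, abs(residual))
                num += weight * displacement
                den += weight
        flow.append(num / den if den > 0 else 0.0)
    return flow

# Two overlapping hypothetical patches; the second one matches worse.
patches = [(0, 4, 2.0, 0.5), (2, 4, 4.0, 3.0)]
flow = densify(6, patches)
```

In the overlap region the result is pulled towards the better-matching patch instead of being a plain mean of the two displacements.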

Semantic Segmentation Road Segmentation
	Spatial Ray Features for Real-Time Ego-Lane Extraction [pdf] [slide] Kuehnl, T. and Kummert, F. and Fritsch, J.	IV 2012 Kuehnl2012IV

Road classification in unconstrained environments
Extending local appearance-based road classification with a spatial feature generation and classification
Local properties from base classifiers on patches from monocular camera images
Output of classifiers represented in a metric confidence map
Spatial ray features (SPRAY) from these confidence maps
Final road-terrain classification based on local visual properties and their spatial layout
No explicit lane model
In real-time with approximately 25 Hz on a GPU

Scene Understanding
	What's going on?: Discovering Spatio-Temporal Dependencies in Dynamic Scenes [pdf] [slide] Kuettel, Daniel and Breitenstein, Michael D. and Gool, Luc Van and Ferrari, Vittorio	CVPR 2010 Kuettel2010CVPR

Learning spatio-temporal dependencies of moving agents in complex dynamic scenes: What are the typical actions in the scene? How do they relate to each other? What are the rules governing the scene?
Motivation: modelling
- correlated behaviours of multiple agents rather than independent agents
- spatial and temporal dependencies jointly
Local temporal rules: learning sequences of activities using Hierarchical Dirichlet Processes (HDP)
Global temporal rules: jointly learning co-occurring activities and their time dependencies using an arbitrary number of HMMs in HDP
Datasets: two videos of three hours in Zurich and two shorter videos of London

Reconstruction Reconstruction & Recognition
	Joint Semantic Segmentation and 3D Reconstruction from Monocular Video [pdf] [slide] Kundu, Abhijit and Li, Yin and Dellaert, Frank and Li, Fuxin and Rehg, James M.	ECCV 2014 Kundu2014ECCV

Presents a method for joint inference of both semantic segmentation and 3D reconstruction

Contributions:
- Introduces a novel higher order CRF model for joint inference of 3D structure and semantics in a 3D volumetric model
- The framework does not require dense depth measurements and utilizes semantic cues and 3D priors to enhance both depth estimation and scene parsing
- Presents a data-driven category-specific process for dynamically instantiating potentials in the CRF

Evaluates on monocular sequences such as CamVid and Leuven

Semantic Segmentation Semantic Segmentation with Multiple Frames
	Feature Space Optimization for Semantic Video Segmentation [pdf] [slide] Abhijit Kundu and Vibhav Vineet and Vladlen Koltun	CVPR 2016 Kundu2016CVPR

Long-range spatio-temporal regularization in semantic video segmentation
Temporal regularization is challenging because of camera and scene motion
Optimize the position of pixels in a Euclidean feature space to minimize the distances between corresponding points
Structured prediction is performed by a dense CRF operating on the optimized features
Evaluation on CamVid and Cityscapes dataset and achieving state-of-the-art accuracy for semantic video segmentation

Reconstruction Stereo
	Fast and Accurate Large-scale Stereo Reconstruction using Variational Methods [pdf] [slide] Kuschk, Georg and Cremers, Daniel	ICCVWORK 2013 Kuschk2013ICCVWORK

Presents a fast algorithm for high-accuracy large-scale outdoor dense stereo reconstruction of man-made environments

Contributions:
- Proposes a structure-adaptive second-order Total Generalized Variation (TGV) regularization which facilitates the emergence of planar structures by enhancing the discontinuities along building facades
- Uses cost functions as data term which are robust to illumination changes arising in real world scenarios
- Instead of solving the optimization problem by a coarse-to-fine approach, proposes a quadratic relaxation which is solved by an augmented Lagrangian method
- This technique allows for capturing large displacements and fine structures simultaneously
- Experiments show that the proposed augmented Lagrangian formulation leads to a speedup by about a factor of 2

Evaluates on Middlebury, KITTI stereo datasets

Semantic Segmentation Road Segmentation
	Map-Supervised Road Detection [pdf] [slide] Ankit Laddha and Mehmet Kemal Kocamaz and Luis E. Navarro-Serment and Martial Hebert	IV 2016 AnkitLaddha2016IV

Proposes an approach to detect drivable road area in monocular images
Self-supervised approach which does not require any human road annotations on images to train the road detection algorithm

First, they automatically generate training drivable road area annotations for images using noisy OpenStreetMap data, vehicle pose estimation sensors (GPS and IMU) on the vehicle, and camera parameters
Next, they train a Convolutional Neural Network using these noisy labels for road detection
Outperforms all the methods which do not require human effort for image labeling

Evaluates on KITTI dataset

Reconstruction Multi-view 3D Reconstruction
	Structural Approach for Building Reconstruction from a Single DSM [pdf] [slide] Florent Lafarge and Xavier Descombes and Josiane Zerubia and Marc Pierrot Deseilligny	PAMI 2010 Lafarge2010PAMI

3D reconstruction of complex buildings and dense urban areas from a single Digital Surface Model (DSM)
Buildings as an assemblage of simple urban structures extracted from a library of 3D parametric blocks (like Lego pieces)
Steps:
- Extraction of 2D-supports of the urban structures (interactively or automatically)
- 3D-blocks are positioned on the 2D-supports using a Gibbs model
- MCMC sampler to find the optimal configuration of 3D-blocks associated with original proposition kernels
Validated in a wide resolution interval such as 0.7 m satellite and 0.1 m aerial DSMs

Reconstruction Reconstruction & Recognition
	A Hybrid Multiview Stereo Algorithm for Modeling Urban Scenes. [pdf] [slide] Lafarge, Florent and Keriven, Renaud and Bredif, Mathieu and Vu, Hoang-Hiep	PAMI 2013 Lafarge2013PAMI

Presents an original multi-view stereo reconstruction algorithm which allows the 3D-modeling of urban scenes as a combination of meshes and geometric primitives

Contributions:
- Hybrid modeling by generating meshes where primitives are then inserted or by detecting primitives and then meshing the unfitted parts of the scene
- The lack of information contained in the images is compensated by the introduction of urban knowledge in the stochastic model
- Efficient global optimization by performing the sampling of both 3D-primitives and meshes by a Jump-Diffusion based algorithm

Evaluates on Entry-P10, Herz-Jesu-P25 and Church datasets

Reconstruction Multi-view 3D Reconstruction
	Creating Large-Scale City Models from 3D-Point Clouds: A Robust Approach with Hybrid Representation [pdf] [slide] Florent Lafarge and Clement Mallet	IJCV 2012 Lafarge2012IJCV

Simultaneous 3D reconstruction of buildings, trees and topologically complex ground from point clouds
Classification of points into building, vegetation, ground or clutter
Geometric 3D primitives used for reconstruction of different classes
Arrangement scheme for parametric 3D-shapes allowing to impose structural constraints
Non-convex energy minimization with a parallelization scheme to reduce computational time
Tested on the Toronto Lidar scan samples and large DSM-based point clouds provided by the French Mapping Agency

Motion & Pose Estimation Simultaneous Localization and Mapping
	Visual SLAM for Autonomous Ground Vehicles [pdf] [slide] Henning Lategahn and Andreas Geiger and Bernd Kitt	ICRA 2011 Lategahn2011ICRA

Proposes a dense stereo V-SLAM algorithm that estimates a dense 3D map representation which is more accurate than raw stereo measurements
Runs a sparse V-SLAM system and takes the resulting pose estimates to compute a locally dense representation from dense stereo correspondences
Expresses this dense representation in local coordinate systems which are tracked as part of the SLAM estimate
The sparse part of the SLAM system uses sub mapping techniques to achieve constant runtime complexity most of the time

Evaluates on outdoor experiments with a car-like robot

Tracking State-of-the-Art on KITTI
	Multi-class Multi-object Tracking Using Changing Point Detection [pdf] [slide] Byungjae Lee and Enkhbayar Erdenee and SongGuo Jin and Mi Young Nam and Young Giu Jung and Phill-Kyu Rhee	ECCV 2016 Lee2016ECCV

Presents a robust multi-class multi-object tracking (MCMOT) formulated by a Bayesian filtering framework

Contributions:
- Instead of estimating likelihoods only for a limited set of object types, a CNN-based object detector is used to compute the likelihoods of multiple object classes
- Changing point detection is proposed for a tracking failure assessment by exploiting static observations as well as dynamic ones

Evaluates on video sequences from ImageNet VID and MOT benchmarks

Motion & Pose Estimation Ego-Motion Estimation
	Motion Estimation for Self-Driving Cars with a Generalized Camera [pdf] [slide] Gim Hee Lee and Friedrich Fraundorfer and Marc Pollefeys	CVPR 2013 Lee2013CVPR

Visual ego-motion estimation algorithm for self-driving car
Modeling multi-camera system as a generalized camera
Applying the non-holonomic motion constraint of a car (Ackermann motion model)
Novel 2-point minimal solution for the generalized essential matrix
General case with at least one inter-camera correspondence and special case with only intra-camera correspondences
Efficient implementation within RANSAC for robust estimation
Comparison on a large real-world dataset with minimally overlapping fields of view against GPS/INS ground truth

Motion & Pose Estimation Simultaneous Localization and Mapping
	Structureless pose-graph loop-closure with a multi-camera system on a self-driving car [pdf] [slide] Gim Hee Lee and Friedrich Fraundorfer and Marc Pollefeys	IROS 2013 Lee2013IROS

Proposes a method to compute the pose-graph loop-closure constraints using multiple overlapping field-of-views cameras mounted on a self-driving car

Contributions:
- Shows that the relative pose for the loop-closure constraint can be computed directly from the epipolar geometry of a multi-camera system
- Avoids the additional time complexities from the reconstruction of 3D scene points
- Provides greater flexibility in choosing a configuration for the multi-camera system to cover a wider field-of-view to avoid missing out any loop-closure opportunities

Evaluates on ParkingGarage01, ParkingGarage02 and Campus01 datasets

Motion & Pose Estimation Ego-Motion Estimation
	Relative Pose Estimation for a Multi-camera System with Known Vertical Direction [pdf] [slide] Gim Hee Lee and Marc Pollefeys and Friedrich Fraundorfer	CVPR 2014 Lee2014CVPR

Relative pose estimation of a multi-camera system with known vertical directions (known absolute roll and pitch angles)
Problems with the previous approaches:
- The high number of correspondences needed
- Identifying the correct solution from many solutions
- Strict assumption on the planarity of ground
Minimal 4-point and linear 8-point algorithms within RANSAC
4-point algorithm
- Hidden variable resultant method
- 8-degree univariate polynomial that gives up to 8 real solutions
Linear 8-point algorithm: an alternative solution for a degenerate case of SVD
Four fish-eye cameras fixed onto a car for ego-motion estimation
Evaluated on simulations and real-world datasets

Tracking Tracking with two cameras
	Dynamic 3D Scene Analysis from a Moving Vehicle [pdf] [slide] B. Leibe and N. Cornelis and K. Cornelis and L. Van Gool	CVPR 2007 Leibe2007CVPR

Presents an integrated system for dynamic scene analysis on a mobile platform

Contributions:
- Presents a multi-view/multi-category object detection module that can detect objects
- Shows how knowledge about the scene geometry can be used to improve recognition performance and to fuse the outputs of multiple detectors
- Demonstrates how 2D detections can be integrated over time to arrive at accurate 3D localization of static objects
- In order to deal with moving objects, proposes a tracking approach which formulates the tracking problem as space-time trajectory analysis followed by hypothesis selection.

Evaluates on 2 video sequence datasets introduced in the paper

Object Detection 2D Object Detection
	Robust Object Detection with Interleaved Categorization and Segmentation [pdf] [slide] B. Leibe and A. Leonardis and B. Schiele	IJCV 2008 Leibe2008IJCV

Proposes a method for learning the appearance and spatial structure of a visual object category in order to recognize novel objects of that category, localize them in cluttered real-world scenes, and automatically segment them from the background
Addresses object detection and segmentation not as separate entities, but as two closely collaborating processes
Presents a local-feature based approach that combines both capabilities into a common probabilistic framework

Initial recognition phase initializes the top-down segmentation process with a possible object location
Segmentation permits the recognition stage to focus its effort on object pixels and discard misleading influences from the background
Uses segmentation in turn to improve recognition

Evaluates on UIUC Cars, CalTech Cars, TUD Motorbikes, VOC05 Motorbikes, Leeds Cows, TUD Pedestrians datasets

Tracking
	Coupled Detection and Tracking from Static Cameras and Moving Vehicles [pdf] [slide] B. Leibe and K. Schindler and N. Cornelis and L. Van Gool	PAMI 2008 Leibe2008PAMI

Builds an integrated system for dynamic 3D scene analysis from a moving platform
Presents a novel approach for multi-object tracking integrating recognition, reconstruction & tracking in a collaborative framework

Contributions:
- Uses SfM to estimate scene geometry at each time step
- Uses recognition to pick out objects of interest & separate them from the dynamically changing background
- Uses the output of multiple single-view object detectors & integrates continuously reestimated scene geometry constraints
- Uses tracking for temporal context to individual object detections

Evaluates on 2 video sequence datasets introduced in the paper

Tracking State-of-the-Art on KITTI
	FollowMe: Efficient Online Min-Cost Flow Tracking with Bounded Memory and Computation [pdf] [slide] Philip Lenz and Andreas Geiger and Raquel Urtasun	ICCV 2015 Lenz2015ICCV

Limitations of min-cost flow formulations for tracking-by-detection (e.g., Nevatia):
- Require whole video as batch (no online computation)
- Scale badly in memory and computation
Contributions:
- Dynamic successive shortest path algorithm & extension to online processing
- Approximate solver with bounded memory and computation
Evaluation on KITTI 2012 and PETS 2009 benchmarks

Motion & Pose Estimation Simultaneous Localization and Mapping
	Keyframe-Based Visual-Inertial SLAM using Nonlinear Optimization [pdf] [slide] Stefan Leutenegger and Paul Timothy Furgale and Vincent Rabaud and Margarita Chli and Kurt Konolige and Roland Siegwart	RSS 2013 Leutenegger2013RSS

A joint non-linear cost function to optimize an IMU error + landmark reprojection error in a fully probabilistic manner
Non-linear optimization approaches vs. filtering schemes
Tightly coupled vs. loosely coupled approaches for visual-inertial fusion
Marginalization of old states to maintain a bounded-sized optimization window for real-time performance
A fully probabilistic derivation of IMU error terms, including the respective information matrix
Building a pose graph without expressing global pose uncertainty
Both the hardware and the algorithms for accurate real-time SLAM, including robust keypoint matching and outlier rejection using inertial cues
Evaluated using a stereo-camera/IMU setup

Representations Stixels
	StixelNet: A Deep Convolutional Network for Obstacle Detection and Road Segmentation [pdf] [slide] Dan Levi and Noa Garnett and Ethan Fetaya	BMVC 2015 Levi2015BMVC

Obstacle avoidance for mobile robotics and autonomous driving
Detection of the closest obstacle in each direction from a driving vehicle using single color camera
Reduction of the problem to a column-wise regression problem solved with a deep CNN
- Divide the image into columns
- For each column the network estimates the pixel location of the bottom point of the closest obstacle
Loss function based on a semi-discrete representation of the obstacle position probability
Trained with ground truth generated from laser-scanner point cloud
Outperforms existing camera-based methods including ones using stereo on KITTI
Achieving among the best results for road segmentation on KITTI
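One plausible reading of the semi-discrete representation is sketched below: the network outputs a probability per discrete row bin of an image column, and the probability of the continuous ground-truth position is obtained by linear interpolation between the two neighbouring bin centers. The bin centers, probabilities and the exact form of the loss are assumptions for illustration and may differ from the loss defined in the paper.

```python
import math

def piecewise_linear_loss(bin_probs, bin_centers, y):
    """Negative log of the linearly interpolated probability at position y.
    bin_probs[i] is the predicted probability of bin_centers[i]."""
    for i in range(len(bin_centers) - 1):
        lo, hi = bin_centers[i], bin_centers[i + 1]
        if lo <= y <= hi:
            t = (y - lo) / (hi - lo)
            p = (1.0 - t) * bin_probs[i] + t * bin_probs[i + 1]
            return -math.log(max(p, 1e-12))  # guard against log(0)
    raise ValueError("y lies outside the range of the bin centers")

# Hypothetical prediction for one image column with three row bins.
bin_probs = [0.1, 0.6, 0.3]
bin_centers = [10.0, 20.0, 30.0]
loss_between = piecewise_linear_loss(bin_probs, bin_centers, y=25.0)
loss_on_center = piecewise_linear_loss(bin_probs, bin_centers, y=20.0)
```

Such a loss keeps the classification-style output over bins while still rewarding predictions that are close to, but not exactly in, the ground-truth bin.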

Tracking State-of-the-Art on KITTI
	Joint Graph Decomposition and Node Labeling by Local Search [pdf] [slide] Evgeny Levinkov and Siyu Tang and Eldar Insafutdinov and Bjoern Andres	ARXIV 2016 Levinkov2016ARXIV

States the minimum cost node labeling lifted multicut problem, NL-LMP, an NP-hard combinatorial optimization problem whose feasible solutions define both a decomposition and a node labeling of a given graph.
Defines & implements two local search algorithms that converge monotonically to a local optimum, offering a feasible solution at any time.
Shows applications of these algorithms to the task of articulated human body pose estimation & to the task of multiple object tracking

Evaluates on the MPII Multi-Person benchmark for pose estimation and the MOT16 benchmark for multi-object tracking

Object Detection 3D Object Detection from 3D Point Clouds
	Vehicle Detection from 3D Lidar Using Fully Convolutional Network [pdf] [slide] Bo Li and Tianlei Zhang and Tian Xia	RSS 2016 Li2016RSS

Transferring fully convolutional network techniques to the vehicle detection task from the range data of Velodyne Lidar
Representing the data in a 2D point map
Using single 2D end-to-end fully convolutional network to predict the objectness confidence and bounding box simultaneously
Bounding box encoding allows to predict full 3D bounding boxes even with 2D CNN
State-of-the-art performance on KITTI dataset
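A minimal sketch of how a Lidar point could be placed in a 2D point map: the azimuth and elevation angles of the point are discretized into row and column indices. The angular resolutions, the index convention (indices may be negative here) and the sample points are assumptions for illustration, not the exact projection settings of the paper.

```python
import math

def point_to_map_index(x, y, z, d_theta, d_phi):
    """Map a 3D point to (row, col) of a cylindrical 2D point map.
    d_theta and d_phi are the angular bin sizes in radians."""
    theta = math.atan2(y, x)                               # azimuth
    phi = math.asin(z / math.sqrt(x * x + y * y + z * z))  # elevation
    return math.floor(theta / d_theta), math.floor(phi / d_phi)

# Hypothetical resolutions: 2 degrees in azimuth, 1 degree in elevation.
d_theta, d_phi = math.radians(2.0), math.radians(1.0)
index_front = point_to_map_index(10.0, 0.0, 0.0, d_theta, d_phi)
index_side = point_to_map_index(10.0, 1.0, -1.0, d_theta, d_phi)
```

Each cell of the resulting map can then store measurements such as range and height, which lets a 2D fully convolutional network process the range scan like an image.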

Motion & Pose Estimation Localization
	Worldwide Pose Estimation using 3D Point Clouds [pdf] [slide] Yunpeng Li and Noah Snavely and Dan Huttenlocher and Pascal Fua	ECCV 2012 Li2012ECCV

Addresses the problem of determining where a photo was taken by estimating a full 6-DOF-plus-intrinsics camera pose with respect to a large geo-registered 3D point cloud

Contributions:
- Observes that 3D points produced by SfM methods often have strong co-occurrence relationships
- Finds such statistical co-occurrences by analyzing the large numbers of images in 3D SfM models
- Presents a bidirectional matching scheme aimed at boosting the recovery of true correspondences between image features and model points

Evaluates on Landmarks, San Francisco, Quad datasets

Scene Understanding Indoor 3D Scene Understanding
	Holistic Scene Understanding for 3D Object Detection with RGB-D Cameras [pdf] [slide] Dahua Lin and Sanja Fidler and Raquel Urtasun	ICCV 2013 Lin2013ICCV

Indoor scene understanding using RGBD data (category level 3D object detection)
By exploiting 2D segmentation, 3D geometry, and contextual relations
Modelling both appearance and depth rather than monocular setting
3D cuboids as hypotheses in point clouds
Ranking according to objectness in appearance
Inference in a CRF to model the contextual relationships between objects, and scenes and objects
Improves upon state-of-the-art on NYU v2 dataset

Semantic Segmentation Semantic Segmentation of Aerial Images
	Efficient Piecewise Training of Deep Structured Models for Semantic [pdf] [slide] Guosheng Lin and Chunhua Shen and Ian D. Reid and Anton van den Hengel	CVPR 2016 Lin2016CVPR

Semantic segmentation using contextual information
Patch-patch context: Piecewise training of CRFs with CNN-based unary and pairwise potentials (connecting every patch with surrounding, above/below relations)
Patch-background context: multi-scale image input, sliding pyramid pooling
Prediction: first coarse-level (CRF), and then refinement (Dense CRF)
New state-of-the-art on NYUDv2, PASCAL VOC 2012, PASCAL-Context, and SIFT-flow datasets

Motion & Pose Estimation Localization
	Cross-View Image Geolocalization [pdf] [slide] Tsung-Yi Lin and Serge J. Belongie and James Hays	CVPR 2013 Lin2013CVPR

Current approach to image geolocalization problem:
- By matching the query image to a database of georeferenced photographs
- Only works for famous landmarks, but not for the unremarkable scenes
Relationship between aerial view and ground-level data
Overhead appearance and land cover survey data
- Densely available for nearly all of the Earth
- Rich enough for unambiguous matching
A cross-view feature translation approach
A new dataset with ground-level, aerial, and land cover attribute images for training
An aerial image classifier based on ground level scene matches
Output of a query: a probability density over the region of interest
Experiments over a 1600 km^2 region containing a variety of scenes and land cover types

Motion & Pose Estimation Localization
	Learning deep representations for ground-to-aerial geolocalization [pdf] [slide] Tsung-Yi Lin and Yin Cui and Serge J. Belongie and James Hays	CVPR 2015 Lin2015CVPR

Presents the first general technique for the challenging problem of matching street-level and aerial view images and evaluates it on the task of image geolocalization.

Contributions:
- Localizes a photo without using ground-level reference imagery by matching to aerial imagery
- Presents a novel method to create a large-scale cross-view training dataset from public data sources
- Examines traditional computer vision features and several recent deep learning strategies in a novel cross-domain learning task

Evaluates on a newly introduced dataset of pairs of Google street-view images and their corresponding aerial images

Datasets & Benchmarks Real Data
	Microsoft COCO: Common Objects in Context [pdf] [slide] Tsung-Yi Lin and Michael Maire and Serge Belongie and James Hays and Pietro Perona and Deva Ramanan and Piotr Dollar and C. Lawrence Zitnick	ECCV 2014 Lin2014ECCV

New dataset to advance state-of-the-art in object recognition, segmentation and captioning
Collection of images of complex everyday scenes containing common objects in their natural context
Objects are labeled using per-instance segmentations
Dataset contains photos of 91 object types with a total of 2.5 million labeled instances in 328k images
Extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation
Detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet and SUN
Baseline performance analysis for bounding box and segmentation detection using Deformable Parts Model

Scene Understanding
	Single-View 3D Scene Parsing by Attributed Grammar [pdf] [slide] Liu, Xiaobai and Zhao, Yibiao and Zhu, Song-Chun	CVPR 2014 Liu2014CVPR

Single image depth estimation by using a pool of images for which the depth is known
- A non-parametric approach to retrieve similar images
Formulated as discrete-continuous optimization problem
- Continuous: depth of the superpixels
- Discrete: relationships between neighboring superpixels (junction potentials to encode occlusions, and smoothness constraints)
Inference in a higher order graphical model using particle belief propagation
- Unary computed by making use of the images with known depth
- Already state-of-the-art with only unary
Experiments on both the indoor (NYU v2) and outdoor (Make3D) scenarios

Motion & Pose Estimation Localization
	Visual Place Recognition: A Survey [pdf] [slide] Stephanie M. Lowry and Niko Sunderhauf and Paul Newman and John J. Leonard and David D. Cox and Peter I. Corke and Michael J. Milford	TR 2016 Lowry2016TR

A comprehensive review of the current state of place recognition research, including its relationship with SLAM, localization, mapping, and recognition
Introducing the concepts behind place recognition
- The role of place recognition in the animal kingdom
- How a "place" is defined in a robotics context
- The major components of a place recognition system
Discussing how place recognition solutions can implicitly or explicitly account for appearance change within the environment
A discussion on the future of visual place recognition with respect to advances in deep learning, semantic scene understanding, and video description

Reconstruction Stereo
	Efficient Deep Learning for Stereo Matching [pdf] [slide] Luo, W. and Schwing, A. and Urtasun, R.	CVPR 2016 Luo2016CVPR

Siamese networks for stereo perform well but are slow
They propose a very fast matching network
- Product layer between the siamese networks instead of concatenation
- Consider multi-class classification problem with the possible disparities as classes
- Calibrated scores allow outperforming existing approaches
- Consider several MRFs for smoothing the matching results (cost aggregation, semi global block matching and slanted plane)
Evaluation on KITTI 2012 and 2015 benchmarks
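
The product layer and the multi-class view of disparities can be sketched as follows. This is a minimal illustration, not the authors' network: the feature vectors are hypothetical stand-ins for the outputs of the two siamese branches, and `matching_distribution` is a name invented here.

```python
import math

def matching_distribution(left_feat, right_feats):
    """Softmax over inner-product scores, one score per candidate disparity.

    left_feat:   descriptor of the left patch (hypothetical branch output)
    right_feats: descriptors of the right patches, one per candidate disparity
    """
    # Product layer: a plain inner product instead of concatenation + MLP.
    scores = [sum(l * r for l, r in zip(left_feat, rf)) for rf in right_feats]
    # Numerically stable softmax turns scores into class probabilities.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

left = [1.0, 0.0, 2.0]
candidates = [[0.0, 1.0, 0.0], [1.0, 0.0, 2.0], [0.5, 0.5, 0.5]]
probs = matching_distribution(left, candidates)
best_disparity = max(range(len(probs)), key=probs.__getitem__)
```

Because the right-image features can be computed once and reused for every disparity, the per-disparity cost reduces to one inner product, which is where the speed-up over concatenation-based siamese networks comes from.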

Motion & Pose Estimation 3D Motion Estimation -- Scene Flow
	A Continuous Optimization Approach for Efficient and Accurate Scene Flow [pdf] [slide] Lv, Zhaoyang and Beall, Chris and Alcantarilla, Pablo and Li, Fuxin and Kira, Zsolt and Dellaert, Frank	ECCV 2016 Lv2016ECCV

Motion & Pose Estimation Localization
	Get Out of My Lab: Large-scale, Real-Time Visual-Inertial Localization [pdf] [slide] Simon Lynen and Torsten Sattler and Michael Bosse and Joel A. Hesch and Marc Pollefeys and Roland Siegwart	RSS 2015 Lynen2015RSS

Demonstrates that large-scale, real-time pose estimation and tracking can be performed on mobile platforms with limited resources without the use of an external server

Contributions:
- Proposes a large-scale system that entirely runs on devices with limited computational & memory resources while offering accurate, real-time localization
- Proposes a direct inclusion of 2D-3D matches from global localization into the local visual-inertial estimator
- Leads to smoother trajectories & faster run-times compared to sliding window Bundle Adjustment

Evaluates on dataset introduced in the paper

Motion & Pose Estimation 2D Motion Estimation -- Optical Flow
	Learning a Confidence Measure for Optical Flow [pdf] [slide] Oisin Mac Aodha and Ahmad Humayun and Marc Pollefeys and Gabriel J. Brostow	PAMI 2013 MacAodha2013PAMI

Presents a supervised learning based method to estimate a per-pixel confidence for optical flow vectors

Contributions:
- Evaluates the proposed optical flow confidence measure on new flow algorithms & several new sequences
- Compares to other confidence measures
- Proposes separate confidence in X and Y directions
- Improves accuracy for optical flow by automatically combining known constituent algorithms

Evaluates on Middlebury sequences and synthetic sequences introduced in the paper

Datasets & Benchmarks Real Data
	1 Year, 1000km: The Oxford RobotCar Dataset [pdf] [slide] Will Maddern and Geoff Pascoe and Chris Linegar and Paul Newman	IJRR 2016 Maddern2016IJRR

Semantic Segmentation Semantic Segmentation of Aerial Images
	High-Resolution Semantic Labeling with Convolutional Neural Networks [pdf] [slide] Emmanuel Maggiori and Yuliya Tarabalka and Guillaume Charpiat and Pierre Alliez	ARXIV 2016 Maggiori2016ARXIV

Dense semantic labeling, assigning a semantic label to every pixel, using CNN
High spatial accuracy cannot directly be achieved with categorization CNNs
In-depth analysis of categorization networks for semantic labeling
Establishes the desired properties of an ideal semantic labeling CNN and assesses how those methods measure up to these properties
Derivation of a CNN framework specifically adapted to the semantic labeling problem
Learning features at different resolutions and efficiently combining local and global information
Evaluated on Vaihingen and Potsdam, provided by Commission III of the ISPRS

Semantic Segmentation Road/Lane Detection
	Approximate Bayesian Image Interpretation using Generative Probabilistic Graphics Programs [pdf] [slide] Vikash Mansinghka and Tejas Kulkarni and Yura Perov and Josh Tenenbaum	NIPS 2013 Mansinghka2013NIPS

Computer vision as Bayesian inverse problem to computer graphics has proved difficult to directly implement
Short, simple probabilistic graphics programs that define flexible generative models and automatically invert them to interpret real-world images
Generative probabilistic graphics programs consist of a stochastic scene generator, a renderer based on graphics software and a stochastic likelihood model
Stochastic likelihood model links the renderer's output and the data
Latent variables adjust the fidelity of the renderer and the tolerance of the likelihood
Automatic Metropolis-Hastings transition operators are used to invert the probabilistic graphics programs
Demonstration on reading sequence of degraded and adversarially obscured characters and inferring 3D road models (KITTI dataset)

Semantic Segmentation Semantic Segmentation of Aerial Images
	Classification With an Edge: Improving Semantic Image Segmentation with Boundary Detection [pdf] [slide] Dimitrios Marmanis and Konrad Schindler and Jan Dirk Wegner and Silvano Galliani and Mihai Datcu and Uwe Stilla	ARXIV 2016 Marmanis2016ARXIV

Semantic segmentation of high-resolution aerial images using boundaries
DCNNs: Contextual information over very large windows
Problem: Loss of spatial resolution, blurry object boundaries
Adding boundary detection to SEGNET encoder-decoder architecture, FCN-type models
- Boundary likelihoods as an additional channel
Ensemble prediction with SEGNET, VGG and FCN
Boundary detection improves semantic segmentation with CNNs.
>90% overall accuracy on the ISPRS Vaihingen benchmark

Semantic Segmentation Semantic Segmentation of Aerial Images
	Semantic Segmentation of Aerial Images with An Ensemble of CNNs [pdf] [slide] Marmanis, D. and Wegner, J. D. and Galliani, S. and Schindler, K. and Datcu, M. and Stilla, U.	APRS 2016 Marmanis2016APRS

Presents an end-to-end semantic segmentation deep learning approach of very high resolution aerial images

Contributions:
- Designs a Fully Convolutional Network which takes as input intensity and range data
- Converts early network layers into a pixelwise classification at full resolution with the help of aggressive deconvolution
- Demonstrates that an ensemble of several networks achieves better results

Evaluates on ISPRS semantic labeling benchmark

Semantic Segmentation Semantic Segmentation of Facades
	3D All The Way: Semantic Segmentation of Urban Scenes from Start to End in 3D [pdf] [slide] Anjelo Martinovic and Jan Knopp and Hayko Riemenschneider and Luc Van Gool	CVPR 2015 Martinovic2015CVPR

Semantic segmentation of 3D city models
Starting from an SfM reconstruction, classification and facade modelling purely in 3D
No need for slow image-based semantic segmentation methods
High quality labellings, with significant speed benefits (20 times faster, entire streets in a matter of minutes)
Combining a state-of-the-art 2D classifier: further boosting the performance (slower)
A novel facade separation based on the results of semantic facade analysis
3D-specific principles like alignment, symmetry in a framework optimized using integer quadratic programming formulation
Evaluated on Rue-Monge2014

Semantic Segmentation Semantic Segmentation of Facades
	ATLAS: A Three-Layered Approach to Facade Parsing [pdf] [slide] Markus Mathias and Andelo Martinovic and Luc Van Gool	IJCV 2016 Mathias2016IJCV

Semantic segmentation of building facades
Three distinct layers representing different levels of abstraction:
- Segmentation into regions with probability distribution over semantic classes
- Detect objects to improve initial labeling with object detector
- Combination of segmentation and object detection with a CRF
- Incorporate additional meta-knowledge in form of weak architectural principles which enforces architectural plausibility
Outperforms state-of-the-art on the ECP and eTRIMS datasets
Output of highest layer used for procedural building reconstruction

Datasets & Benchmarks Real Data
	HD Maps: Fine-Grained Road Segmentation by Parsing Ground and Aerial Images [pdf] [slide] Mattyus, Gellert and Wang, Shenlong and Fidler, Sanja and Urtasun, Raquel	CVPR 2016 Mattyus2016CVPR

Fine-grained segmentation for fully autonomous systems: parking spots, sidewalk, background, and the number and location of road lanes
Alternatives:
- Many man-hours of laborious and tedious labelling
- Imagery/LIDAR from millions of cars
Using monocular aerial imagery, topology of the road network from OpenStreetMap, and stereo images taken from a camera on top of a car
Accurate alignment between two types of imagery
A set of potentials exploiting semantic cues, road constraints, relationships between parallel roads, and smoothness assumptions
Enhancing KITTI with aerial images: Air-Ground-KITTI
Significantly reduced alignment error compared to a GPS+IMU system

Semantic Segmentation Semantic Segmentation of Aerial Images
	Enhancing Road Maps by Parsing Aerial Images Around the World [pdf] [slide] Gellert Mattyus and Shenlong Wang and Sanja Fidler and Raquel Urtasun	ICCV 2015 Mattyus2015ICCV

Exploits aerial images in order to enhance freely available world maps (e.g., with road geometry)
Formulation as inference in a Markov random field
Parametrized in terms of the location of road-segment centerlines and width
Parametrization allows efficient inference and returns only topologically correct roads
Energy encodes the appearance of roads, edge information, car detection, contextual features, relations between nearby roads as well as smoothness between the line segments
All OpenStreetMap roads in the whole world can be segmented in a single day using a small cluster of 10 computers
Good generalization: can be trained using only 1.5 km^2 of aerial imagery and produce very accurate results in any location across the world
Outperforming state-of-the-art on two novel benchmarks

Datasets & Benchmarks Synthetic Data
	A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation [pdf] [slide] N. Mayer and E. Ilg and P. Haeusser and P. Fischer and D. Cremers and A. Dosovitskiy and T. Brox	CVPR 2016 Mayer2016CVPR

Introduces a synthetic dataset containing over 35000 stereo image pairs with ground truth disparity, optical flow, and scene flow
Synthetic dataset suite consists of three subsets
- FlyingThings3D is 25000 stereo frames with ground truth data of everyday objects flying along randomized 3D trajectories
- Monkaa contains nonrigid and softly articulated motion as well as visually challenging fur, made from the open source Blender assets of the animated short film Monkaa
- The Driving dataset comprises naturalistic, dynamic street scenes from the viewpoint of a driving car, made to resemble the KITTI datasets
Demonstrates that the dataset can indeed be used to successfully train large convolutional networks

Semantic Segmentation Semantic Segmentation of 3D Data
	SemanticFusion: Dense 3D Semantic Mapping with Convolutional Neural Networks [pdf] [slide] John McCormac and Ankur Handa and Andrew J. Davison and Stefan Leutenegger	ARXIV 2016 McCormac2016ARXIV

Pipeline is composed of three separate units:
- A real-time SLAM system ElasticFusion to provide correspondences between frames, and a globally consistent map of fused surfels
- A Convolutional Neural Network receives a 2D image (RGBD), and returns a set of per pixel class probabilities
- Bayesian update scheme to update the class probability distribution for each surfel, obtained from the CNN's predictions using the correspondences provided by the SLAM system

Evaluates on NYUv2 dataset
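
The Bayesian update step of the pipeline can be illustrated with a small sketch. It assumes a recursive update in which a surfel's class distribution is multiplied element-wise by the CNN's per-pixel class probabilities for the corresponding pixel and then renormalized; `update_surfel` and the three-class example are hypothetical.

```python
def update_surfel(prior, cnn_probs):
    """Fuse one CNN prediction into a surfel's class distribution.

    prior:     current class probabilities stored with the surfel
    cnn_probs: class probabilities predicted for the pixel that the SLAM
               correspondence maps to this surfel
    """
    posterior = [p * q for p, q in zip(prior, cnn_probs)]
    z = sum(posterior)
    if z == 0.0:
        # Degenerate case: keep the previous belief rather than divide by zero.
        return list(prior)
    return [p / z for p in posterior]

# Start from a uniform belief and fuse predictions from two frames.
surfel = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]
for frame_probs in ([0.6, 0.3, 0.1], [0.7, 0.2, 0.1]):
    surfel = update_surfel(surfel, frame_probs)
```

Repeated consistent observations sharpen the distribution, which is why fusing predictions over many views can be more reliable than any single-frame prediction.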

Cameras Models & Calibration Omnidirectional Cameras
	Single View Point Omnidirectional Camera Calibration from Planar Grids [pdf] [slide] C. Mei and P. Rives	ICRA 2007 Mei2007ICRA

Flexible approach for calibrating omnidirectional single viewpoint sensors from planar grids
Based on exact theoretical projection function with added well identified parameters to model real-world errors
Reduce large number of parameters necessary for Gonzalez-Barbosa method using the assumption that the errors are small due to the assembly of the system
Using the unified model of Barreto-Geyer to obtain a calibration valid for all central catadioptric systems
Selection of only four points necessary for the initialization of each calibration grid
Validation with calibration of parabolic, hyperbolic, folded mirror, wide-angle and spherical sensors

Datasets & Benchmarks Real Data
	Object Scene Flow for Autonomous Vehicles [pdf] [slide] Moritz Menze and Andreas Geiger	CVPR 2015 Menze2015CVPR

Existing methods don't exploit fact that outdoor scenes can be decomposed into small number of independently moving 3D objects
Absence of realistic benchmarks with scene flow ground truth

Contributions:
- Exploits the decomposition of the scene as collection of rigid objects
- Reasoning jointly about this decomposition as well as the geometry and motion of objects in the scene
- Introduces the first realistic and large-scale scene flow dataset

Evaluates on stereo and flow KITTI benchmarks

Motion & Pose Estimation 2D Motion Estimation -- Optical Flow
	Discrete Optimization for Optical Flow [pdf] [slide] Moritz Menze and Christian Heipke and Andreas Geiger	GCPR 2015 Menze2015GCPR

Optical flow as a discrete inference problem in a CRF, followed by sub-pixel refinement
Diverse (500) flow proposals by approximate nearest neighbour search based on appearance (Daisy), and by respecting NMS constraints
Pre-computation of truncated pairwise potentials, further accelerated via hashing
BCD by iteratively updating alternating image rows and columns
Post-processing as forward backward consistency check and removing small segments
Epic Flow for interpolation
Evaluated on Sintel and KITTI benchmarks
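
The forward-backward consistency check used in post-processing can be sketched as below. This is a generic version under stated assumptions, not the paper's exact procedure: flow fields are nested lists of `(u, v)` tuples, the target pixel is found by rounding, and the tolerance `tol` is a hypothetical parameter.

```python
import math

def forward_backward_mask(fwd, bwd, tol=1.0):
    """Mark pixels whose forward flow is undone by the backward flow.

    fwd[y][x] = (u, v): flow from frame 1 to frame 2
    bwd[y][x] = (u, v): flow from frame 2 to frame 1
    """
    h, w = len(fwd), len(fwd[0])
    mask = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            u, v = fwd[y][x]
            xt, yt = int(round(x + u)), int(round(y + v))
            if not (0 <= xt < w and 0 <= yt < h):
                continue  # flow leaves the image: treated as inconsistent
            ub, vb = bwd[yt][xt]
            # A consistent pair satisfies fwd + bwd ~ 0.
            mask[y][x] = math.hypot(u + ub, v + vb) <= tol
    return mask

# 1 x 4 toy image: everything moves one pixel to the right,
# but the backward flow at x = 2 is corrupted.
fwd = [[(1.0, 0.0)] * 4]
bwd = [[(-1.0, 0.0), (-1.0, 0.0), (2.0, 0.0), (-1.0, 0.0)]]
mask = forward_backward_mask(fwd, bwd)
```

Pixels that fail the check are typically occluded or mismatched and are the ones left to the interpolation stage.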

Motion & Pose Estimation 3D Motion Estimation -- Scene Flow
	Joint 3D Estimation of Vehicles and Scene Flow [pdf] [slide] Moritz Menze and Christian Heipke and Andreas Geiger	ISA 2015 Menze2015ISA

Existing slanted plane models for scene flow estimation only reason about segmentation and the motion of the vehicles in the scene

Contributions:
- Jointly reasons about 3D scene flow as well as the pose, shape and motion of vehicles in the scene
- Incorporates a deformable CAD model into a slanted-plane CRF for scene flow estimation
- Enforces shape consistency between the rendered 3D models and the superpixels in the image

Evaluates on scene flow benchmark on KITTI

Reconstruction Reconstruction & Recognition
	Piecewise planar city 3D modeling from street view panoramic sequences. [pdf] [slide] Micusik, Branislav and Kosecka, Jana	CVPR 2009 Micusik2009CVPR

Unified framework for creating 3D city models
Exploiting image segmentation cues, dominant scene orientations and piecewise planar structures
Pose estimation with a modified SURF-based matching approach to exploit properties of the panoramic camera
Multi-view stereo method that operates directly on panoramas while enforcing the piecewise planarity constraint in the sweeping stage
Depth fusion method that exploits the constraints of urban environments and combines advantages from volumetric- and viewpoint-based fusion
Avoids expensive voxelization of space and operates directly on 3D reconstructed points through effective kd-tree
Final surface by tessellation of backprojections of the points into the reference image
Demonstration on two street-view sequences, only qualitative results

Datasets & Benchmarks Real Data
	MOT16: A Benchmark for Multi-Object Tracking [pdf] [slide] Anton Milan and Laura Leal-Taixe and Ian D. Reid and Stefan Roth and Konrad Schindler	ARXIV 2016 Milan2016ARXIV

Standardized benchmark for Multi-Object tracking
New releases of MOTChallenge
Unlike the initial release
- Carefully annotated by researchers following a consistent protocol
- Significant increase in the number of labeled boxes, 3 times more targets
- Multiple object classes besides pedestrians
- Visibility for every single object of interest

Tracking
	Continuous Energy Minimization for Multitarget Tracking [pdf] [slide] Milan, A. and Roth, S. and Schindler, K.	PAMI 2014 Milan2014PAMI

Contributions:
- Proposes an energy that corresponds to a more complete representation of the problem, rather than one that is amenable to global optimization
- Besides the image evidence, the energy function takes into account physical constraints, such as target dynamics, mutual exclusion, and track persistence
- Constructs an optimization scheme that alternates between continuous conjugate gradient descent and discrete trans-dimensional jump moves

Evaluates on sequences from VS-PETS 2009/2010, TUD-Stadtmitte benchmarks

Tracking
	Detection- and Trajectory-Level Exclusion in Multiple Object Tracking [pdf] [slide] Anton Milan and Konrad Schindler and Stefan Roth	CVPR 2013 Milan2013CVPR

Tracking multiple targets in crowded scenarios
Modelling mutual exclusion between distinct targets both at the data association and at the trajectory level
Using a mixed discrete-continuous CRF
- Exclusion between conflicting observations with supermodular pairwise terms
- Exclusion between trajectories with pairwise global label costs
A statistical analysis of ground-truth trajectories for modelling data fidelity, target dynamics, and inter-target occlusion
An expansion move-based optimization scheme
Evaluated on PETS S2.L1 and four more sequences from the PETS benchmark, TUD-Stadtmitte, and the Bahnhof and Sunny Day sequences from the ETH Mobile Scene dataset

Motion & Pose Estimation Ego-Motion Estimation
	Fast Techniques for Monocular Visual Odometry [pdf] [slide] Mohammad Hossein Mirabdollah and Barbel Mertsching	GCPR 2015 Mirabdollah2015GCPR

Real-time and robust monocular visual odometry
Iterative 5-point method to estimate initial camera motion parameters within RANSAC
Landmark localization with uncertainties using a probabilistic triangulation method
Robust tracking of low quality features on ground planes to estimate scale of motion
Minimization of a cost function:
- Epipolar geometry constraints for far landmarks
- Projective constraints for close landmarks
Real-time due to iterative estimation of only the last camera pose (landmark positions from probabilistic triangulation method)
Evaluated on KITTI visual odometry dataset

Motion & Pose Estimation Ego-Motion Estimation
	On the Second Order Statistics of Essential Matrix Elements [pdf] [slide] Mohammad Hossein Mirabdollah and Barbel Mertsching	GCPR 2014 Mirabdollah2014GCPR

Relative monocular camera motion estimation based on the coplanarity constraint
8-point methods have poor performance in the presence of noise
Investigation of the second order statistics of essential matrix elements
Using a Taylor expansion of the rotation matrix up to second order terms, a covariance matrix is obtained
Covariance matrix is utilized along with the coplanarity equations and acts as regularization term
Considerable improvements in the recovery of the camera motion
Evaluation based on simulation and on the KITTI dataset for visual odometry

Tracking
	Taking Mobile Multi-object Tracking to the Next Level: People, Unknown Objects, and Carried Items [pdf] [slide] Dennis Mitzel and Bastian Leibe	ECCV 2012 Mitzel2012ECCV

Mobile multi-object tracking in challenging street scenes
Tracking-by-detection is limited to the object categories of pre-trained detector models
Tracking-before-detection approach that can track known and unknown object categories
Noisy stereo depth data used to segment and track objects in 3D
Novel, compact 3D representation allows robust tracking of a large variety of objects while building up models of their 3D shape online
Comparison of the representation with a learned statistical shape template allows detecting anomalous shapes such as carried items
Evaluation on several challenging video sequences of busy pedestrian zones, the BAHNHOF and SUNNY DAY datasets ¹

^{1. Ess, A., Leibe, B., Schindler, K., Van Gool, L.: Robust Multi-Person Tracking from a Mobile Platform. PAMI 31(10), 1831-1846 (2009)}

Semantic Segmentation Road Segmentation
	Deep Deconvolutional Networks for Scene Parsing [pdf] [slide] Rahul Mohan	ARXIV 2014 Moh2014ARXIV

Labeling each pixel in an image with the category it belongs to
Using raw pixels instead of superpixels
Combine deep deconvolutional neural networks with CNNs
Multi patch training makes it possible to effectively learn spatial priors from scenes
End-to-end training system without requiring post-processing
Evaluated on Stanford Background, SIFT Flow, CamVid, and KITTI

Semantic Segmentation Semantic Segmentation of Aerial Images
	Semantic segmentation of aerial images in urban areas with class-specific higher-order cliques [pdf] [slide] J. Montoya and J. D. Wegner and L. Ladicky and K. Schindler	CPIA 2015 Montoya2015CPIA

Semantic segmentation of urban areas in high-resolution aerial images
Highly heterogeneous object appearances and shape
Using high-level shape representations as class-specific object priors
- Buildings by sets of compact polygons
- Roads as a collection of long, narrow segments ¹
Pixel-wise classifier to learn local co-occurrence patterns
Hypotheses generation for possible road segments and segments of buildings in a data-driven manner
Inference in a CRF with higher-order potentials
Accuracies of > 80% on the Vaihingen dataset

^{1. Mind the Gap: Modeling Local and Global Context in (Road) Networks, GCPR 2014}

Semantic Segmentation Semantic Segmentation of Aerial Images
	Mind the Gap: Modeling Local and Global Context in (Road) Networks [pdf] [slide] Javier A. Montoya-Zegarra and Jan Dirk Wegner and Lubor Ladicky and Konrad Schindler	GCPR 2014 Montoya-Zegarra2014GCPR

Road labeling in aerial images and extraction of a topologically correct road network
Model rich and complicated contextual information at two levels
- Locally, the context and layout of roads is learned implicitly including multi-scale appearance information
- Globally, the network structure is enforced explicitly
Detect promising stretches of road via shortest-path search on per pixel evidence
Select pixels on an optimal subset of the paths by energy minimization in a high-order CRF
Outperforms several baselines on two challenging datasets, Graz and Vaihingen, in precision and topological correctness
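
The shortest-path search on per-pixel evidence can be illustrated with Dijkstra's algorithm on a 4-connected pixel grid. This is a sketch under assumptions not taken from the paper: the cost of entering a pixel is the negative log of a hypothetical road likelihood, and `min_cost_path` is a name invented here.

```python
import heapq
import math

def min_cost_path(cost, start, goal):
    """Dijkstra on a 4-connected grid; cost[r][c] is paid when visiting (r, c)."""
    h, w = len(cost), len(cost[0])
    dist = {start: cost[start[0]][start[1]]}
    prev = {}
    heap = [(dist[start], start)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if (r, c) == goal:
            break
        if d > dist[(r, c)]:
            continue  # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w:
                nd = d + cost[nr][nc]
                if nd < dist.get((nr, nc), math.inf):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = (r, c)
                    heapq.heappush(heap, (nd, (nr, nc)))
    # Walk back from the goal to recover the path.
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]

# Hypothetical per-pixel road likelihoods; high values trace an S-shaped road.
road_prob = [[0.9, 0.9, 0.1],
             [0.1, 0.9, 0.1],
             [0.1, 0.9, 0.9]]
cost = [[-math.log(p) for p in row] for row in road_prob]
path = min_cost_path(cost, (0, 0), (2, 2))
```

Paths found this way only serve as candidate road stretches; selecting a consistent subset of them is the job of the higher-order CRF.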

Cameras Models & Calibration Event Cameras
	Lifetime estimation of events from Dynamic Vision Sensors [pdf] [slide] Elias Mueggler and Christian Forster and Nathan Baumli and Guillermo Gallego and Davide Scaramuzza	ICRA 2015 Mueggler2015ICRA

Estimating the "life-time" of events from retinal cameras
Dynamic Vision Sensor (DVS)
- Transmitting only pixel-level brightness changes ("events") at the time they occur with micro-second resolution
- Low latency and sparse output: suitable for high-speed mobile robotic applications
Stream of augmented events from the event's velocity on the image plane
A continuous representation of events in time rather than the accumulation of events over fixed, artificially-chosen time intervals
An event-based, robust plane fitting algorithm with minimum latency (by considering only past events in the neighborhood of the current event) and optional regularization
Evaluated in controlled environments, urban settings, high-speed quadrotor flips
Compared to standard visualization methods for the rendering of sharp gradient images at any time instant
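
One way to picture the lifetime estimate: fit a plane t = a*x + b*y + c to the timestamps of recent neighboring events, then take the norm of the gradient (a, b), which is the time the edge needs to travel one pixel. The sketch below is a plain least-squares version under that assumption; it omits the robustness and regularization mentioned above, and `fit_time_plane` and `event_lifetime` are hypothetical names.

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def fit_time_plane(events):
    """Least-squares fit of t = a*x + b*y + c to (x, y, t) events."""
    n = float(len(events))
    sx = sum(x for x, y, t in events)
    sy = sum(y for x, y, t in events)
    st = sum(t for x, y, t in events)
    sxx = sum(x * x for x, y, t in events)
    syy = sum(y * y for x, y, t in events)
    sxy = sum(x * y for x, y, t in events)
    sxt = sum(x * t for x, y, t in events)
    syt = sum(y * t for x, y, t in events)
    A = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]]
    rhs = [sxt, syt, st]
    d = det3(A)
    if abs(d) < 1e-12:
        raise ValueError("events do not constrain a plane")
    coeffs = []
    for k in range(3):  # Cramer's rule on the normal equations
        Ak = [row[:] for row in A]
        for i in range(3):
            Ak[i][k] = rhs[i]
        coeffs.append(det3(Ak) / d)
    return tuple(coeffs)

def event_lifetime(events):
    """Seconds for the edge to move one pixel: norm of the time gradient."""
    a, b, _ = fit_time_plane(events)
    return math.hypot(a, b)

# Vertical edge sweeping in +x at 100 px/s: it reaches column x at t = 0.01 * x.
neighborhood = [(x, y, 0.01 * x) for x in range(3) for y in range(3)]
lifetime = event_lifetime(neighborhood)
```

A fast-moving edge yields a flat time surface and hence a short lifetime, so its events fade quickly from the rendered image, while slow edges persist longer.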

Cameras Models & Calibration Event Cameras
	Continuous-Time Trajectory Estimation for Event-based Vision Sensors [pdf] [slide] Elias Mueggler and Guillermo Gallego and Davide Scaramuzza	RSS 2015 Mueggler2015RSS

Ego-motion estimation for an event-based vision sensor using a continuous-time framework
Directly integrating the information conveyed by the sensor
Pose trajectory is approximated by a smooth curve using cubic splines in the space of rigid-body motions
Optimization according to a geometrically meaningful error measure in the image plane to the observed events
Evaluation on datasets acquired from sensor-in-the-loop simulations and onboard a quadrotor performing flips with ground truth

Semantic Segmentation Road Segmentation
	Stacked Hierarchical Labeling [pdf] [slide] Daniel Munoz and J. Andrew Bagnell and Martial	ECCV 2010 Munoz2010ECCV

Hierarchical approach for labeling semantic objects and regions in scenes
Using a decomposition of the image in order to encode relational and spatial information
Directly training a hierarchical inference procedure inspired by message passing
Breaking the complex inference problem into a hierarchical series of simple subproblems
Each subproblem is designed to capture the image and contextual statistics in the scene
Training in sequence to ensure robustness to likely errors earlier in the inference sequence
Evaluation on MSRC-21 and Stanford Background datasets

Motion & Pose Estimation Ego-Motion Estimation
	ORB-SLAM: A Versatile and Accurate Monocular SLAM System [pdf] [slide] Raul Mur-Artal and J. M. M. Montiel and Juan D. Tardos	TR 2015 Mur-Artal2015TR

Proposes a feature-based monocular SLAM system that operates in real time, in small and large, indoor and outdoor environments

Contributions:
- Uses same features for all tasks: tracking, mapping, relocalization and loop closing
- Real time operation in large environments
- Real time loop closing based on the optimization of a pose graph
- Real time camera relocalization with significant invariance to viewpoint and illumination
- New initialization procedure based on model selection
- A survival of the fittest approach to map point and keyframe selection

Evaluates on sequences from NewCollege, TUM RGB-D and KITTI datasets

Reconstruction Multi-view 3D Reconstruction
	A Survey of Urban Reconstruction [pdf] [slide] Przemyslaw Musialski and Peter Wonka and Daniel G. Aliaga and Michael Wimmer and Luc J. Van Gool and Werner Purgathofer	CGF 2013 Musialski2013CGF

Challenges - Full automation, Quality & scalability, data acquisition constraints
Point Clouds & Cameras - Introduces the fundamentals of stereo vision, the key concepts of image-based automatic Structure from Motion methodology, and Multi-View Stereo approaches
Buildings & Semantics - Approaches which aim at reconstructing whole buildings from various input sources, such as a set of photographs or laser-scanned points, typically by fitting some parametrised top-down building model
Facades & Images - Approaches aiming at the reconstruction and representation of facades
Blocks & Cities - The problem of measuring and documenting the world is the objective of the photogrammetry and remote sensing community

Motion & Pose Estimation Localization
	Map-based priors for localization [pdf] [slide] Sang Min Oh and Sarah Tariq and Bruce N. Walker and Frank Dellaert	IROS 2004 Oh2004IROS

Map-based priors for localization using the semantic information available in maps
Biases the motion model towards areas of higher probability
Easily incorporated in the particle filter by means of a pseudo likelihood under a particular assumption
Localization with noisy sensors results in far more stable local tracking
Experimental results on a GPS-based outdoor people tracker
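
The weighting step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the Gaussian measurement model and the toy two-region map are all assumptions made for the example. It only shows how a map prior can enter a particle filter as an extra multiplicative pseudo likelihood.

```python
import numpy as np

def reweight(particles, weights, z, sigma, map_prior):
    """One measurement update of a particle filter with a map-based prior.

    particles : (N, 2) array of hypothesised positions
    weights   : (N,) array of current particle weights
    z         : (2,) noisy position measurement (e.g. a GPS fix)
    sigma     : standard deviation of the (assumed Gaussian) measurement noise
    map_prior : callable returning the prior probability of a position
    """
    # Gaussian measurement likelihood p(z | x) for every particle.
    sq_dist = np.sum((particles - z) ** 2, axis=1)
    likelihood = np.exp(-0.5 * sq_dist / sigma ** 2)
    # The map prior acts as an additional multiplicative pseudo likelihood.
    prior = np.array([map_prior(p) for p in particles])
    new_weights = weights * likelihood * prior
    return new_weights / new_weights.sum()

def toy_map_prior(position):
    # Hypothetical map: x < 0 is a walkway (likely), x >= 0 is a building (unlikely).
    return 0.9 if position[0] < 0 else 0.1

# Two particles equally far from the measurement, one on each side of the wall.
particles = np.array([[-1.0, 0.0], [1.0, 0.0]])
weights = np.array([0.5, 0.5])
posterior = reweight(particles, weights, z=np.array([0.0, 0.0]),
                     sigma=2.0, map_prior=toy_map_prior)
print(posterior)  # -> [0.9 0.1]
```

Because both particles explain the measurement equally well, the posterior weights are decided by the map prior alone, which is the stabilising effect the summary refers to.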

Semantic Segmentation Road Segmentation
	Efficient Deep Methods for Monocular Road Segmentation [pdf] [slide] Gabriel Oliveira and Wolfram Burgard and Thomas Brox	IROS 2016 Oliveira2016IROS

An incremental 3D representation from 3D range measurements
Macro scale polygonal primitives vs. micro scale primitives (not compact)
Motivation:
- Processing large amounts of 3D data
- Large number of well defined geometric structures
Reconstruction of large scale scenarios
Update of geometric polygonal primitives over time with fresh sensor data
Accurate, compact, and efficient descriptions of the scene
Evaluated on a dataset from MIT, taken from their participation in the DARPA Urban Challenge

Scene Understanding
	Incremental scenario representations for autonomous driving using geometric polygonal primitives [pdf] [slide] Viviane M. de Oliveira and Vítor Santos and Angel Domingo Sappa and Paulo Dias and A. Paulo Moreira	RAS 2016 Oliveira2016RAS

Incremental 3D representation of a scene from continuous stream of 3D range sensor
Using Macro scale polygonal primitives to model the scene
Representation of the scene is a list of large scale polygons describing the geometric structure
Approach to update the geometric polygonal primitives over time using fresh sensor data
Produces accurate descriptions of the scene and is computationally very efficient compared to other reconstruction methods
Evaluation on a dataset from the MIT team taken in the DARPA Urban Challenge

Semantic Segmentation Semantic Segmentation of Aerial Images
	Effective semantic pixel labelling with convolutional networks and Conditional Random Fields [pdf] [slide] Sakrapee Paisitkriangkrai and Jamie Sherrah and Pranam Janney and Anton van den Hengel	CVPRWORK 2015 Paisitkriangkrai2015CVPRWORK

Effective semantic pixel labelling for aerial imagery
Using CNN features, hand-crafted features and a Conditional Random Field
CNN and hand-crafted features are applied to dense image patches to produce per pixel class probabilities
Pixel-level CRF infers a labelling that smooths regions while respecting the edges
Combination boosts the labelling accuracy
Evaluation on the ISPRS 2D semantic labelling challenge dataset

Motion & Pose Estimation Simultaneous Localization and Mapping
	FAB-MAP 3D: Topological mapping with spatial and visual appearance [pdf] [slide] Paul, Rohan and Newman, Paul	ICRA 2010 Paul2010ICRA

A probabilistic framework for appearance based navigation and mapping using spatial and visual appearance data
A bag-of-words approach in which positive or negative observations of visual words in a scene are used to discriminate between already visited and new places
Explicit modelling of the spatial distribution of visual words as a random graph in which nodes are visual words and edges are distributions over distances
Representing locations as random graphs and learning a generative model over word occurrences as well as their spatial distributions
Special care for multi-modal distributions of inter-word spacing and for sensor errors both in word detection and distances
Viewpoint invariant inter-word distances as strong place signatures
Evaluated on a dataset gathered within New College, Oxford
Increased precision-recall area compared to a state-of-the-art approach based on visual appearance only
Reduced false positive and false negative rates by capturing spatial information, particularly for loop closure decisions

Object Detection 3D Object Detection from 2D Images
	Multi-View and 3D Deformable Part Models [pdf] [slide] Bojan Pepik and Michael Stark and Peter V. Gehler and Bernt Schiele	PAMI 2015 Pepik2015PAMI

Joint object localization and viewpoint estimation
Motivation
- Limited expressiveness of 2D feature-based models
- 3D object representations which can be robustly matched to image evidence
Extension of DPM to include viewpoint information and part-level 3D geometry information
- DPM as a structured output prediction task
- Consistency between parts across viewpoints
- Modelling the parts positions and displacement distributions in 3D
- Continuous appearance model
Several different models with different levels of expressiveness
Leveraging 3D information from CAD data
Better than the state-of-the-art multi-view and 3D object detectors on KITTI, 3D Object Classes, Pascal3D+, Pascal VOC 2007, EPFL multi-view cars

Motion & Pose Estimation Ego-Motion Estimation
	Robust stereo visual odometry from monocular techniques [pdf] [slide] Mikael Persson and Tommaso Piccini and Michael Felsberg and Rudolf Mester	IV 2015 Persson2015IV

Presents a novel stereo visual odometry system for automotive applications based on advanced monocular techniques.

Contributions:
- Hypothesises that techniques developed for monocular visual odometry systems would be, in general, more refined and robust since they have to deal with an intrinsically more difficult problem
- Shows that the generalization of these techniques to the stereo case results in a significant improvement of the robustness and accuracy of stereo based visual odometry

Evaluates on KITTI dataset

Representations Stixels
	Towards a Global Optimal Multi-Layer Stixel Representation of Dense 3D Data [pdf] [slide] David Pfeiffer and Uwe Franke	BMVC 2011 Pfeiffer2011BMVC

Medium level representation: thin planar rectangles called Stixels
Motivation:
- Dominance of horizontal, vertical planar surfaces in man-made environments
- Structured access to the scene data
- Half a million disparity measurements to a few hundred Stixels only
Difference to BadinoDAGM2009¹:
- A unified global optimal scheme
- Objects at multiple depths in a column
Dynamic programming to incorporate real-world constraints (gravity, ordering)
An optimal segmentation with respect to free space and obstacle information
Results for stereo vision and laser data, but applicable to 3D data from other sensors

^{1. The stixel world - a compact medium level representation of the 3d-world. DAGM 2009}

Representations Stixels
	Efficient representation of traffic scenes by means of dynamic stixels [pdf] [slide] Pfeiffer, D. and Franke, U.	IV 2010 Pfeiffer2010IV

Pose and motion estimation of moving obstacles in traffic scenes
Stixel World is a compact and flexible representation but does not allow motion information to be inferred
Dense disparity images are used for the free space computation and extraction of the static stixel representation
Tracking of stixels using the 6D-Vision Kalman filter framework and dense optical flow
Lateral as well as longitudinal motion is estimated for each stixel
Simplifies grouping of stixels based on the motion as well as detection of moving obstacles
Demonstration on recorded data

Semantic Segmentation Road/Lane Detection
	High-performance long range obstacle detection using stereo vision [pdf] [slide] Peter Pinggera and Uwe Franke and Rudolf Mester	IROS 2015 Pinggera2015IROS

Existing methods designed for robust generic obstacle detection based on geometric criteria work best only in close to medium range applications

Contributions:
- Presents a novel method for the joint detection and localization of distant obstacles using a stereo vision system on a moving platform
- The proposed algorithm is based on statistical hypothesis tests using local geometric criteria and can implicitly handle non-flat ground surfaces
- Operates directly on image data instead of precomputed stereo disparity maps

Evaluates on stereo sequences introduced in Cordts et al., Object-level Priors for Stixel Generation

Reconstruction Stereo
	Know Your Limits: Accuracy of Long Range Stereoscopic Object Measurements in Practice [pdf] [slide] Peter Pinggera and David Pfeiffer and Uwe Franke and Rudolf Mester	ECCV 2014 Pinggera2014ECCV

Determining the location and velocity of potential obstacles for autonomous vehicles or advanced driver assistance systems
Middlebury and KITTI provide important reference values but do not sufficiently treat local sub-pixel matching accuracy
Comprehensive statistical evaluation of selected state-of-the-art stereo matching approaches on an extensive dataset
Establishing reference values for the precision limits actually achievable in practice
For a carefully calibrated camera setup under real-world imaging conditions a consistent error limit of 1/10 pixel is determined
Guidelines on algorithmic choices derived from theory to achieve this limit in practice
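
To see why a sub-pixel limit matters at long range, the standard first-order error propagation for rectified stereo can be sketched as below. The focal length and baseline are hypothetical values chosen only for illustration; they are not taken from the paper.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth of a point in a rectified stereo pair: Z = f * b / d."""
    return focal_px * baseline_m / disparity_px

def depth_error(depth_m, focal_px, baseline_m, disparity_error_px):
    """First-order depth uncertainty obtained by differentiating Z = f * b / d.

    |dZ| is approximately Z**2 / (f * b) * |dd|, so the error grows
    quadratically with distance.
    """
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_error_px

# Hypothetical camera setup (assumption): 1000 px focal length, 0.3 m baseline.
focal_px, baseline_m = 1000.0, 0.3
for depth_m in (20.0, 50.0, 100.0):
    err = depth_error(depth_m, focal_px, baseline_m, disparity_error_px=0.1)
    print(f"{depth_m:6.1f} m -> +/- {err:.2f} m")
```

With these assumed parameters, a 1/10 pixel disparity error corresponds to roughly 0.13 m at 20 m but about 3.3 m at 100 m.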

Semantic Segmentation Road/Lane Detection
	Lost and Found: detecting small road hazards for self-driving vehicles [pdf] [slide] Peter Pinggera and Sebastian Ramos and Stefan Gehrig and Uwe Franke and Carsten Rother and Rudolf Mester	IROS 2016 Pinggera2016IROS

Reliable detection of small obstacles from a moving vehicle using stereo vision
Statistical planar hypothesis tests in disparity space directly on stereo image data, assessing free-space and obstacle hypotheses
Introduces the mid-level obstacle representation Cluster-Stixels based on the original point-based output
Does not depend on a global road model and handles static and moving obstacles
Evaluation on a novel lost-cargo image sequence dataset comprising more than two thousand frames with pixel-wise annotations
Comparison to several stereo-based baseline methods and runs at 20Hz on 2 mega-pixel stereo imagery
Small obstacles down to the height of 5 cm can successfully be detected at 20 m

Object Detection Human Pose Estimation
	DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation [pdf] [slide] Pishchulin, Leonid and Insafutdinov, Eldar and Tang, Siyu and Andres, Bjoern and Andriluka, Mykhaylo and Gehler, Peter V. and Schiele, Bernt	CVPR 2016 Pishchulin2016CVPR

Existing methods for human pose estimation use two-stage strategies that separate the detection and pose estimation steps

Contributions:
- Proposes a new formulation as a joint subset partitioning and labeling problem (SPLP) of a set of body-part hypotheses generated with CNN-based part detectors
- SPLP model jointly infers the number of people, their poses, spatial proximity, and part level occlusions
- Results show that a joint formulation is crucial to disambiguate multiple and potentially overlapping persons

Evaluates on LSP and MPII single-person benchmarks and MPII and WAF multi-person benchmarks

Semantic Segmentation Semantic Segmentation
	Full-Resolution Residual Networks for Semantic Segmentation in Street Scenes [pdf] [slide] Tobias Pohlen and Alexander Hermans and Markus Mathias and Bastian Leibe	ARXIV 2016 Pohlen2016ARXIV

With existing methods, additional processing steps have to be performed in order to obtain pixel-accurate segmentation masks at the full image resolution

Contributions:
- Proposes a novel ResNet-like architecture that exhibits strong localization and recognition performance
- Combines multi-scale context with pixel-level accuracy by using two processing streams within the network
- One stream carries information at the full image resolution, enabling precise adherence to segment boundaries. The other stream undergoes a sequence of pooling operations to obtain robust features for recognition
- The two streams are coupled at the full image resolution using residuals

Evaluates on the Cityscapes dataset

Reconstruction Reconstruction & Recognition
	Detailed Real-Time Urban 3D Reconstruction from Video [pdf] [slide] Pollefeys, M.	IJCV 2008 Pollefeys2008IJCV

Large scale, real-time 3D reconstruction incorporating GPS and INS or traditional SfM
Motivation:
- The massive amounts of data
- Lack of public high-quality ground-based models
Real-time performance (30Hz) using graphics hardware and standard CPUs
Extending the state of the art for the robustness and variability necessary for outdoor operation:
- Large dynamic range: automatic gain adaptation for real-time stereo estimation
Fusion with GPS and inertial measurements using a Kalman filter
Two-step stereo reconstruction process exploiting the redundancy across frames
Real urban video sequences with hundreds of thousands of frames on GPU

Datasets & Benchmarks Synthetic Data
	UnrealCV: Connecting Computer Vision to Unreal Engine [pdf] [slide] Weichao Qiu and Alan L. Yuille	ARXIV 2016 Qiu2016ARXIV

Computer graphics can generate synthetic images and ground truth (object instance mask, depth, surface normal) while offering the possibility of constructing virtual worlds
Building on effort of game industry to create realistic 3D worlds, which a player can interact with
Access and modify the internal data structure of games to create virtual worlds, extracting groundtruth and controlling an agent
Created an open-source plugin, UnrealCV, for the popular game engine Unreal Engine 4
Linking Caffe with the virtual world to train/test deep networks
Diagnosing Faster-RCNN trained on PASCAL by testing it on the virtual world with varying rendering configurations

Motion & Pose Estimation 3D Motion Estimation -- Scene Flow
	Dense, Robust, and Accurate Motion Field Estimation from Stereo Image Sequences in Real-Time [pdf] [slide] Clemens Rabe and Thomas Mueller and Andreas Wedel and Uwe Franke	ECCV 2010 Rabe2010ECCV

Estimating the three-dimensional motion vector field from stereo image sequences
Combining variational optical flow with Kalman filtering for temporal smoothness
Real-time with parallel implementation on a GPU and an FPGA
Comparing
- Differential motion field estimation from optical flow (Horn & Schunck) and stereo (SGM)
- Variational scene flow from two frames
- Kalman filtered method, using dense optical flow and stereo (Dense6D)
- Filtered variational scene flow approach (Variational6D)
Dense6D and Variational6D perform similarly, the latter is computationally more complex.
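
The temporal filtering idea can be illustrated with a one-dimensional constant-velocity Kalman filter. This is a simplified sketch and not the Dense6D formulation: the paper filters 3D position and velocity per pixel, whereas the state here is a single coordinate and its velocity, and the noise parameters `q` and `r` are arbitrary assumptions.

```python
import numpy as np

def kalman_step(x, P, z, dt, q=1e-3, r=1e-2):
    """One predict/update cycle for the state x = [position, velocity]."""
    F = np.array([[1.0, dt], [0.0, 1.0]])   # constant-velocity motion model
    H = np.array([[1.0, 0.0]])              # only the position is measured
    Q = q * np.eye(2)                       # process noise (assumed)
    R = np.array([[r]])                     # measurement noise (assumed)
    # Prediction
    x = F @ x
    P = F @ P @ F.T + Q
    # Update with the new position measurement z (shape (1,))
    innovation = z - H @ x
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ innovation
    P = (np.eye(2) - K @ H) @ P
    return x, P

def track(measurements, dt):
    """Filter a sequence of position measurements, starting from rest."""
    x = np.zeros(2)
    P = np.diag([1.0, 10.0])  # uncertain initial velocity
    for z in measurements:
        x, P = kalman_step(x, P, np.array([z]), dt)
    return x

# A point moving at 1 unit/s, observed every 0.1 s.
dt = 0.1
state = track([1.0 * k * dt for k in range(1, 51)], dt)
print(state)
```

The velocity is never measured directly; it is recovered from the sequence of position measurements, which is what allows a filtered method to produce temporally smooth motion estimates.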

Motion & Pose Estimation 2D Motion Estimation -- Optical Flow
	Non-local Total Generalized Variation for Optical Flow Estimation [pdf] [slide] Rene Ranftl and Kristian Bredies and Thomas Pock	ECCV 2014 Ranftl2014ECCV

Total Generalized Variation
- Performs quite well favoring piecewise affine solutions
- Local nature can suffer from ambiguities in the data and cannot accurately locate discontinuities
Contribution
- Non-local TGV that allows prior information such as image gradients to be incorporated
- Scale invariant Census using a radial sampling strategy
Evaluation on Sintel and KITTI 2012

Reconstruction Stereo
	Minimizing TGV-based Variational Models with Non-Convex Data terms [pdf] [slide] Rene Ranftl and Thomas Pock and Horst Bischof	SSVM 2013 Ranftl2013SSVM

Approximate minimization of variational models with Total Generalized Variation regularization (TGV) and non-convex data terms
Motivation:
- TGV is arguably a better prior than TV (piecewise affine solutions)
- TGV is restricted to convex data terms
- Convex approximations to the non-convex problem (coarse-to-fine warping: loss of details)
Decomposition of the functional into two subproblems which can be solved globally
One is convex; the other is solved by lifting the functional to a higher-dimensional space, where it is convex
Significant improvement compared to coarse-to-fine warping on stereo
Evaluated on KITTI stereo and Middlebury high-resolution benchmarks

Motion & Pose Estimation 2D Motion Estimation -- Optical Flow
	Optical Flow Estimation using a Spatial Pyramid Network [pdf] [slide] Anurag Ranjan and Michael J. Black	ARXIV 2016 Ranjan2016ARXIV

Optical flow estimation with a coarse-to-fine deep learning approach
Image pyramid as in the standard variational formulations
At each pyramid level, a convolutional neural network estimates the flow update from the warped images
Small networks with 5 convolutional layers sufficient because of small motions
Each network is trained independently
Learned filters resemble spatio-temporal filters
96% smaller (in terms of model parameters) and faster than FlowNet
Attractive for embedded systems
Outperforms FlowNet on Sintel and Middlebury
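
The coarse-to-fine scheme can be sketched as the loop below. It is a schematic under stated assumptions rather than the published network: `residual_net` is a hypothetical callable standing in for the per-level CNN, and downsampling, upsampling and warping use the simplest possible NumPy operations (2x2 averaging, pixel repetition, nearest-neighbour lookup).

```python
import numpy as np

def downsample(img):
    """Halve the resolution by averaging 2x2 blocks (H and W must be even)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample_flow(flow):
    """Double the resolution of a flow field; displacements double as well."""
    return 2.0 * flow.repeat(2, axis=0).repeat(2, axis=1)

def warp(img, flow):
    """Nearest-neighbour backward warp of img by flow (x in channel 0, y in 1)."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return img[src_y, src_x]

def coarse_to_fine_flow(img1, img2, levels, residual_net):
    """Estimate flow from coarsest to finest level, adding a residual per level."""
    pyr1, pyr2 = [img1], [img2]
    for _ in range(levels - 1):
        pyr1.append(downsample(pyr1[-1]))
        pyr2.append(downsample(pyr2[-1]))
    flow = np.zeros(pyr1[-1].shape + (2,))
    for level in reversed(range(levels)):
        if level < levels - 1:
            flow = upsample_flow(flow)
        # Warp the second image with the current estimate so that only a
        # small residual motion remains for this level to explain.
        warped = warp(pyr2[level], flow)
        flow = flow + residual_net(level, pyr1[level], warped, flow)
    return flow

def constant_residual(level, img1, warped, flow):
    # Placeholder for a trained network: always predicts one pixel of update.
    return np.ones_like(flow)

img = np.zeros((16, 16))
flow = coarse_to_fine_flow(img, img, levels=3, residual_net=constant_residual)
```

With the placeholder that adds one pixel per level, three levels accumulate to 1, then 2*1+1 = 3, then 2*3+1 = 7 pixels, which shows how small per-level updates compose into large displacements at full resolution.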

Cameras Models & Calibration Event Cameras
	EVO: A Geometric Approach to Event-based 6-DOF Parallel Tracking and Mapping in Real-time [pdf] [slide] Henri Rebecq and Timo Horstschaefer and Guillermo Gallego and Davide Scaramuzza	RAL 2016 Rebecq2016RAL

Object Detection 2D Object Detection
	Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks [pdf] [slide] Shaoqing Ren and Kaiming He and Ross B. Girshick and Jian Sun	NIPS 2015 Ren2015NIPS

Region Proposal Network (RPN) for object detection
Simultaneous prediction of object bounds and objectness scores at each position
Region proposals are the computational bottleneck for state-of-the-art detectors.
End-to-end training to generate region proposals for Fast R-CNN
Nearly cost-free region proposals
RPNs: a kind of fully-convolutional network (FCN)
Alternating optimization to train RPN and Fast R-CNN with shared features
5 fps (including all steps) on a GPU
State-of-the-art object detection accuracy on PASCAL VOC 2007
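
The reference boxes (anchors) that the RPN scores at each position can be generated as sketched below. The function name and box layout are assumptions made for this example; the three scales, three aspect ratios and a feature stride of 16 are the commonly cited settings, used here as defaults only.

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (feat_h * feat_w * 9, 4) anchors as [x1, y1, x2, y2] in pixels.

    Each anchor has area scale**2 and height/width equal to ratio.
    """
    base = []
    for scale in scales:
        for ratio in ratios:
            w = scale / np.sqrt(ratio)
            h = scale * np.sqrt(ratio)
            base.append([-w / 2, -h / 2, w / 2, h / 2])
    base = np.array(base)                               # (9, 4), centred at 0
    # Centres of the feature-map cells in image coordinates.
    xs = (np.arange(feat_w) + 0.5) * stride
    ys = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    centres = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)
    return (centres + base[None]).reshape(-1, 4)

anchors = make_anchors(feat_h=2, feat_w=3)
print(anchors.shape)  # -> (54, 4)
```

The RPN then predicts, for every anchor, an objectness score and a box refinement, so proposals come from the same convolutional features the detector uses.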

Datasets & Benchmarks Synthetic Data
	Playing for Data: Ground Truth from Computer Games [pdf] [slide] Stephan R. Richter and Vibhav Vineet and Stefan Roth and Vladlen Koltun	ECCV 2016 Richter2016ECCV

Creating pixel-accurate semantic label maps for images extracted from computer games
A wrapper between the game and the graphics hardware
- Pixel-accurate object signatures across time and instances
- By hashing distinct rendering resources such as geometry, textures, and shaders
25 thousand images
Models trained with game data and just ¹⁄₃ of the CamVid training set outperform models trained on the complete CamVid training set
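
The resource-hashing idea can be sketched as follows. This is only an illustration of the principle: the byte strings stand in for real mesh, texture and shader data, SHA-1 is an arbitrary stand-in hash, and the function names are hypothetical.

```python
import hashlib

def resource_signature(mesh: bytes, texture: bytes, shader: bytes) -> str:
    """Hash the rendering resources of a draw call into a persistent signature."""
    h = hashlib.sha1()
    for blob in (mesh, texture, shader):
        # A length prefix keeps (b"ab", b"c") distinct from (b"a", b"bc").
        h.update(len(blob).to_bytes(8, "little"))
        h.update(blob)
    return h.hexdigest()

def propagate_labels(patches, annotated):
    """Copy labels to every patch whose signature has been annotated once.

    patches   : list of (frame_id, patch_id, signature)
    annotated : dict mapping signature -> semantic label
    """
    labels = {}
    for frame_id, patch_id, signature in patches:
        if signature in annotated:
            labels[(frame_id, patch_id)] = annotated[signature]
    return labels

car = resource_signature(b"car-mesh", b"car-paint", b"metal-shader")
tree = resource_signature(b"tree-mesh", b"bark", b"foliage-shader")
patches = [(0, 1, car), (7, 3, car), (7, 4, tree)]
labels = propagate_labels(patches, {car: "car"})
```

Labelling the car once in frame 0 also labels its occurrence in frame 7, since the same resources hash to the same signature across time and instances.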

Semantic Segmentation Semantic Segmentation of 3D Data
	OctNet: Learning Deep 3D Representations at High Resolutions [pdf] [slide] Gernot Riegler and Ali Osman Ulusoy and Andreas Geiger	ARXIV 2016 Riegler2016ARXIV

Deep and high resolution 3D convolutional networks for 3D tasks including 3D object classification, orientation estimation, and point cloud labelling
High activations only near the object boundaries
More memory and computation on relevant dense regions by exploiting sparsity
Hierarchical partitioning of the space using a set of unbalanced octrees where each leaf node stores a pooled feature representation
Deeper networks without compromising resolution
Convolution, pooling, unpooling directly defined on this structure
Higher input resolutions with significant speed-ups
- Particularly beneficial for orientation estimation and semantic point cloud labelling
Evaluated on ModelNet10, RueMonge2014

Semantic Segmentation Semantic Segmentation of Facades
	Learning Where to Classify in Multi-view Semantic Segmentation [pdf] [slide] Hayko Riemenschneider and Andras Bodis-Szomoru and Julien Weissenberg and Luc Van Gool	ECCV 2014 Riemenschneider2014ECCV

View overlap is ignored by existing work in semantic scene labelling, and features in all views for all surface parts are extracted redundantly and expensively

Contributions:
- Proposes an alternative approach for multi-view semantic labelling, efficiently combining the geometry of the 3D model and the appearance of a single, appropriately chosen view - denoted as reducing view redundancy
- Show the beneficial effect of reducing the initial labelling to a well-chosen subset of discriminative surface parts, and then using these labels to infer the labels of the remaining surface. This is denoted as scene coverage
- Accelerates the labelling by two orders of magnitude and makes a finer-grained labelling of large models (e.g. of cities) practically feasible
- Provides a new 3D dataset of densely labelled images

Datasets & Benchmarks Synthetic Data
	The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes [pdf] [slide] German Ros and Laura Sellart and Joanna Materzynska and David Vazquez and Antonio Lopez	CVPR 2016 Ros2016CVPR

Proposes to use a virtual world to automatically generate realistic synthetic images with pixel-level semantic segmentation annotation

Contributions:
- A new dataset, SYNTHIA, for semantic segmentation of driving scenes with more than 213,400 synthetic images, including both random snapshots and video sequences in a virtual city
- Images are generated simulating different seasons, weather and illumination conditions from multiple view-points
- Experiments showed that SYNTHIA is good enough to produce good segmentations by itself on real datasets, dramatically boosting accuracy in combination with real data.

Datasets & Benchmarks Real Data
	ISPRS Test Project on Urban Classification and 3D Building Reconstruction [pdf] [slide] Franz Rottensteiner and Gunho Sohn and Markus Gerke and Jan Dirk Wegner	Book 2013 Rottensteiner2013

Benchmark for Urban Object Detection & 3D Building Reconstruction
Urban Object Detection - In this context, the task is to determine the outlines of objects in the input airborne images. Training data are available for a variety of object classes, including buildings, roads, trees, and cars
3D Building Reconstruction: The task is to reconstruct detailed 3D roof structures for input test airborne images. Detailed 3D models of roofs are available as reference data

Existing datasets are outdated because they are based on scanned aerial images acquired by analog cameras
Makes use of the full benefits of modern airborne data, including multiple-overlap geometry, increased radiometric and spectral resolution

Datasets & Benchmarks Real Data
	Results of the ISPRS benchmark on urban object detection and 3D building reconstruction [pdf] [slide] Franz Rottensteiner and Gunho Sohn and Markus Gerke and Jan Dirk Wegner and Uwe Breitkopf and Jaewook Jung	JPRS 2014 Rottensteiner2014JPRS

Extraction of urban objects from data acquired by airborne sensors
Evaluation of methods for building detection, tree detection and 3D building reconstruction
Considering the ISPRS benchmark dataset, consisting of airborne images and laser scans
Comparison and analysis to identify promising strategies and common problems
Building detection can be satisfactorily solved for buildings larger than 50 m²
Tree detection is successful in detecting large trees under favorable conditions
Production of geometrically and topologically correct LoD2 building models still poses challenges

Motion & Pose Estimation Localization
	Hyperpoints and Fine Vocabularies for Large-Scale Location Recognition [pdf] [slide] Torsten Sattler and Michal Havlena and Filip Radenovic and Konrad Schindler and Marc Pollefeys	ICCV 2015 Sattler2015ICCV

Large-scale structure-based localization
Problem: ineffective descriptor matching due to large memory footprint and the strictness of the ratio test in 3D
Previous approaches:
- Smart compression of the 3D model
- Clever sampling strategies for geometric verification
Implicit feature matching by quantization into a fine vocabulary
Using all the 3D points and standard sampling
Locally unique 2D-3D point assignment by a simple voting strategy to enforce the co-visibility of the selected 3D points
Evaluation on SF-0, Landmarks datasets
State-of-the-art performance with reduced memory footprint by storing only visual word labels

Motion & Pose Estimation Localization
	Efficient Effective Prioritized Matching for Large-Scale Image-Based Localization [pdf] [slide] T. Sattler and B. Leibe and L. Kobbelt	PAMI 2016 Sattler2016PAMI

Accurately determining the position and orientation from which an image was taken using SfM point clouds
Direct matching strategy comparing descriptors of the 2D query features and the 3D points in the model
Vocabulary-based prioritized matching step is able to consider features more likely to yield 2D-to-3D matches
Terminating the correspondence search as soon as enough matches have been found
Visibility information from reconstruction process used to improve the efficiency
Efficiently handling large-scale 3D models
Evaluation on the Dubrovnik, Rome and Vienna datasets used as standard benchmarks for image-based localization
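
A simplified sketch of prioritized 2D-to-3D matching with early termination is given below. The prioritization rule (query features whose visual word contains the fewest 3D points are processed first), the ratio-test threshold and all names are assumptions made for illustration; features with fewer than two candidates are simply skipped here because the ratio test needs two neighbours.

```python
import numpy as np

def prioritized_matching(query_desc, query_words, point_desc, point_words,
                         max_matches=100, ratio=0.7):
    """Return (query_index, point_index) pairs, stopping at max_matches."""
    # Inverted index: visual word -> indices of 3D points assigned to it.
    word_to_points = {}
    for idx, word in enumerate(point_words):
        word_to_points.setdefault(word, []).append(idx)

    def cost(i):
        # Number of descriptor comparisons needed for query feature i.
        return len(word_to_points.get(query_words[i], []))

    matches = []
    for i in sorted(range(len(query_desc)), key=cost):
        candidates = word_to_points.get(query_words[i], [])
        if len(candidates) < 2:
            continue
        dists = np.linalg.norm(point_desc[candidates] - query_desc[i], axis=1)
        order = np.argsort(dists)
        best, second = dists[order[0]], dists[order[1]]
        if best < ratio * second:          # ratio test on the two nearest
            matches.append((i, candidates[order[0]]))
            if len(matches) >= max_matches:
                break                      # enough matches: stop searching
    return matches

point_desc = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
point_words = [0, 0, 1, 1]
query_desc = np.array([[0.1, 0.0], [5.0, 10.0]])
query_words = [0, 1]
matches = prioritized_matching(query_desc, query_words, point_desc, point_words)
```

The first query feature is unambiguous and is matched to point 0, while the second is equally close to two points and is rejected by the ratio test.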

Motion & Pose Estimation Ego-Motion Estimation
	Real-time monocular visual odometry for on-road vehicles with 1-point RANSAC [pdf] [slide] Davide Scaramuzza and Friedrich Fraundorfer and Roland Siegwart	ICRA 2009 Scaramuzza2009ICRA

Presents a system capable of recovering the trajectory of a vehicle from the video input of a single camera at a very high frame-rate

Contributions:
- The algorithm proposes a novel way of removing the outliers of the feature matching process
- Show that by exploiting the nonholonomic constraints of wheeled vehicles it is possible to use a restrictive motion model
- This allows the motion to be parameterized with only 1 feature correspondence

Evaluates on real traffic sequences in the city center of Zurich
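
The one-correspondence parameterization can be sketched as follows, under an explicitly assumed convention that may differ in sign and axis order from the paper: motion is planar in the x-y plane, points transform as X2 = R(theta) X1 + t, and the circular-motion constraint fixes the translation direction to t proportional to (cos(theta/2), sin(theta/2), 0). Substituting this into the epipolar constraint leaves theta as the only unknown, so every correspondence yields its own estimate and outliers can be rejected by a simple vote (a median here).

```python
import numpy as np

def theta_from_correspondences(p1, p2):
    """Rotation angle implied by each correspondence (rows of p1 and p2).

    Derived from p2^T [t]x R p1 = 0 under the assumed convention, giving
    tan(theta / 2) = (y2*z1 - z2*y1) / (x2*z1 + z2*x1).
    """
    num = p2[:, 1] * p1[:, 2] - p2[:, 2] * p1[:, 1]
    den = p2[:, 0] * p1[:, 2] + p2[:, 2] * p1[:, 0]
    # Equivalent to arctan(num / den) without dividing by zero.
    return 2.0 * np.arctan2(num * np.sign(den), np.abs(den))

def estimate_rotation(p1, p2, inlier_tol=1e-3):
    """Median vote over the per-correspondence angles, plus an inlier mask."""
    thetas = theta_from_correspondences(p1, p2)
    theta = np.median(thetas)
    return theta, np.abs(thetas - theta) < inlier_tol

# Synthetic check: 50 points, true rotation 0.1 rad, 10 corrupted matches.
rng = np.random.default_rng(0)
low, high = [2.0, -3.0, 0.5], [10.0, 3.0, 2.0]
X1 = rng.uniform(low, high, size=(50, 3))
true_theta, rho = 0.1, 0.5
c, s = np.cos(true_theta), np.sin(true_theta)
R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
t = rho * np.array([np.cos(true_theta / 2), np.sin(true_theta / 2), 0.0])
X2 = X1 @ R.T + t
X2[:10] = rng.uniform(low, high, size=(10, 3))  # simulated wrong matches
theta, inliers = estimate_rotation(X1, X2)
```

The formula is invariant to the scale of each vector, so bearing vectors can be used in place of 3D points.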

Cameras Models & Calibration Omnidirectional Cameras
	A Toolbox for Easily Calibrating Omnidirectional Cameras [pdf] [slide] Davide Scaramuzza and Agostino Martinelli	IROS 2006 Scaramuzza2006IROS

Fast and automatic calibration of central omnidirectional cameras, both dioptric and catadioptric
Requiring a few images of a checker board, and clicking on its corner points
No need for specific model of the omnidirectional sensor
Imaging function by a Taylor series expansion whose coefficients are estimated by
- solving a four-step least-squares linear minimization problem
- a non-linear refinement based on the maximum likelihood criterion
Evaluation on both simulated and real data
Showing calibration accuracy by projecting the color information of a calibrated camera onto real 3D points extracted by a SICK 3D laser range finder
A Matlab toolbox
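
The polynomial imaging function can be illustrated by back-projecting a pixel to a viewing ray. This is a sketch under assumptions: the coefficients are made-up values, the image centre is assumed known, and the affine correction for sensor misalignment is omitted.

```python
import numpy as np

def pixel_to_ray(u, v, coeffs, centre=(0.0, 0.0)):
    """Back-project pixel (u, v) to a unit viewing ray.

    coeffs = (a0, a1, a2, ...) define f(rho) = a0 + a1*rho + a2*rho**2 + ...,
    where rho is the distance of the pixel from the image centre. The ray
    direction is (x, y, f(rho)) in the sensor frame.
    """
    x, y = u - centre[0], v - centre[1]
    rho = np.hypot(x, y)
    z = np.polynomial.polynomial.polyval(rho, coeffs)
    ray = np.array([x, y, z], dtype=float)
    return ray / np.linalg.norm(ray)

# Hypothetical coefficients, chosen only to make the example concrete.
coeffs = (-200.0, 0.0, 0.001)
ray = pixel_to_ray(30.0, 40.0, coeffs)
```

Since only the coefficients are estimated, the same code covers different mirror and lens shapes, which is why no sensor-specific model is needed.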

Cameras Models & Calibration Omnidirectional Cameras
	Appearance-Guided Monocular Omnidirectional Visual Odometry for Outdoor Ground Vehicles [pdf] [slide] Scaramuzza, D. and Siegwart, R.	TR 2008 Scaramuzza2008TR

Describes a real-time algorithm for computing the ego-motion of a vehicle relative to the road
Uses as input only those images provided by a single omnidirectional camera mounted on the roof of the vehicle

The front ends of the system are two different trackers:
- The first one is a homography-based tracker that detects and matches robust scale-invariant features that most likely belong to the ground plane
- The second one uses an appearance-based approach and gives high-resolution estimates of the rotation of the vehicle

Camera trajectory estimated from omnidirectional images over a distance of 400m. For performance evaluation, the estimated path is superimposed onto a satellite image

Datasets & Benchmarks Real Data
	A taxonomy and evaluation of dense two-frame stereo correspondence algorithms [pdf] [slide] Scharstein, Daniel and Szeliski, Richard	IJCV 2002 Scharstein2002IJCV

Presents a taxonomy of dense, two-frame stereo methods designed to assess the different components of individual stereo algorithms
Uses this taxonomy to highlight the most important features of existing stereo algorithms and to study important algorithmic components in isolation
Provides a test bed for the quantitative evaluation of stereo algorithms with sample implementations along with test data
Produces new calibrated multi-view stereo data sets with hand-labeled ground truth
Performs an extensive experimental investigation in order to assess the impact of the different algorithmic components
Demonstrates the limitations of local methods and assesses the value of different global techniques and their sensitivity to key parameters

Representations Stixels
	Semantic Stixels: Depth is not enough [pdf] [slide] Lukas Schneider and Marius Cordts and Timo Rehfeld and David Pfeiffer and Markus Enzweiler and Uwe Franke and Marc Pollefeys and Stefan Roth	IV 2016 Schneider2016IV

Joint inference of geometric and semantic layout of a scene using stixels
Geometry as a dense disparity map (SGM)
Semantics as a pixel-level semantic scene labelling (CNNs)
Stixel representation with object class information
Better than original Stixel model in terms of geometric accuracy
Complexity (time): linear in the number of object classes (15 Hz on 2 MP images)
Evaluated on the subset of KITTI 2012 annotated semantically, KITTI 2015 (only disparity), Cityscapes (only semantics)

Cameras Models & Calibration Omnidirectional Cameras
	Omnidirectional 3D Reconstruction in Augmented Manhattan Worlds [pdf] [slide] Miriam Schönbein and Andreas Geiger	IROS 2014 Schoenbein2014IROS

High-quality omnidirectional 3D reconstruction from catadioptric stereo video sequences
Optimization of depth jointly in a unified omnidirectional space
Applying plane-based prior even though planes in 3D do not project to planes in the omnidirectional domain
Omnidirectional slanted-plane Markov random field model
Plane hypotheses are extracted using a novel voting scheme for 3D planes in omnidirectional space
Evaluation on novel dataset captured using autonomous driving platform AnnieWAY with Velodyne HDL-64E laser scanner for ground truth depth
Outperforms stereo matching techniques quantitatively and qualitatively

Cameras Models & Calibration Omnidirectional Cameras
	Calibrating and Centering Quasi-Central Catadioptric Cameras [pdf] [slide] Miriam Schönbein and Tobias Strauss and Andreas Geiger	ICRA 2014 Schoenbein2014ICRA

Omnidirectional 3D reconstruction of augmented Manhattan worlds from catadioptric stereo video sequences
Optimizing depth jointly in a unified omnidirectional space in contrast to constructing virtual perspective views
An omnidirectional slanted-plane MRF model based on superpixels
Plane-based prior models using a voting scheme for 3D planes in omnidirectional space
Loopy BP to find the best plane hypothesis for each superpixel as a discrete labelling problem
A new dataset captured using two horizontally aligned catadioptric cameras and a Velodyne HDL-64E laser scanner for ground truth depth (AnnieWAY)
Better than existing stereo methods thanks to the unified view, with reduced noise and a compact plane representation

Motion & Pose Estimation Localization
	LaneLoc: Lane marking based localization using highly accurate maps [pdf] [slide] Markus Schreiber and Carsten Knoppel and Uwe Franke	IV 2013 Schreiber2013IV

Precise localization relative to the given map in real-world traffic scenarios
Motivation:
- INS¹ combining IMU², GNSS³ cannot achieve precision required in typical traffic scenes (in the range of a few centimeters).
- A localization system that is independent of satellite systems
Using a stereo camera system, IMU data of the vehicle, and a highly accurate map with curbs and road markings
Beforehand creation of maps using an extended sensor setup
Initialization using GNSS position
Kalman Filter based localization achieving an accuracy in the range of 10 cm in real-time
Evaluation on a test track and approximately 50 km of rural roads

^{1. Inertial Navigation Systems}
^{2. Inertial Measurement Unit}
^{3. Global Navigation Satellite System}

Scene Understanding
	Learning from Maps: Visual Common Sense for Autonomous Driving [pdf] [slide] Ari Seff and Jianxiong Xiao	ARXIV 2016 Seff2016ARXIV

Road layout inference from a single RGB image, without high-definition maps
An automatically labelled, large-scale dataset
- By matching road vectors and meta-data from navigation maps with Google Street View images
- Ground truth road layout attributes
Training AlexNet to predict the road layout attributes (a separate network for each task)
Performs comparably to or better than the human baselines, except for estimating the number of lanes
Possibility to extend to recommending safety improvements (e.g., suggesting an alternative speed limit for a street)

Semantic Segmentation Semantic Segmentation of 3D Data
	Urban 3D Semantic Modelling Using Stereo Vision [pdf] [slide] Sengupta, Sunando and Greveson, Eric and Shahrokni, Ali and Torr, Philip HS	ICRA 2013 Sengupta2013ICRA

Efficient and accurate dense 3D reconstruction with associated semantic labellings from street level stereo image pairs
Using a robust visual odometry method with effective feature matching
Depth-maps, generated from stereo, are fused into a global 3D volume online
Labelling of street level images using a CRF exploiting stereo images
Label estimates are aggregated to annotate the 3D volume
Evaluation on KITTI odometry dataset with manual annotation for object class segmentation

Semantic Segmentation Semantic Segmentation of 3D Data
	Automatic dense visual semantic mapping from street-level imagery. [pdf] [slide] Sengupta, Sunando and Sturgess, Paul and Ladicky, Lubor and Torr, Philip H. S.	IROS 2012 Sengupta2012IROS

Describes a method for producing a semantic map from multi-view street-level imagery
Defines a semantic map as an overhead, or bird's-eye, view of a region with associated semantic object labels, such as car, road and pavement

Formulates the problem using two conditional random fields:
- The first is used to model the semantic image segmentation of the street view imagery treating each image independently
- The outputs of this stage are then aggregated over many images to form the input for the semantic map, which is a second random field defined over a ground plane
- Each image is related by a geometrical function that back projects a region from the street view image into the overhead ground plane map.
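
One simple form such a back-projection can take is a ray and ground-plane intersection. The sketch below assumes a pinhole camera with zero pitch and roll at a known height above a flat ground plane; these assumptions, the function name `pixel_to_ground`, and its parameters are illustrative and not taken from the paper.

```python
def pixel_to_ground(u, v, f, cx, cy, cam_height):
    """Map image pixel (u, v) to ground-plane coordinates (lateral, forward).

    Camera frame: x right, y down, z forward; the ground is the plane y = cam_height.
    Returns None for pixels at or above the horizon, whose rays never hit the ground.
    """
    dx = (u - cx) / f
    dy = (v - cy) / f
    if dy <= 0:
        return None
    t = cam_height / dy  # scale at which the viewing ray reaches the ground
    return (t * dx, t)


# Hypothetical intrinsics: a pixel 100 rows below the principal point.
print(pixel_to_ground(320, 340, 500.0, 320.0, 240.0, 1.5))
```

Label estimates from many images can then be accumulated in the ground-plane cells that their pixels map to.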

Introduces, makes publicly available, and evaluates on a new dataset created from real-world data

Motion & Pose Estimation Localization
	Accurate Geo-Registration by Ground-to-Aerial Image Matching [pdf] [slide] Qi Shan and Changchang Wu and Brian Curless and Yasutaka Furukawa and Carlos Hernandez and Steven M. Seitz	THREEDV 2014 Shan2014THREEDV

Geo-registering ground-based multi-view stereo models by ground-to-aerial image matching
Fully automated matching method that handles ground to aerial viewpoint variation
- Approximate ground-based MVS model by GPS-based geo-registration using EXIF tags
- Retrieve oblique aerial views from Google Maps based on estimated geo-location
- Feature matches between ground and aerial images for pixel-level accuracy
Large-scale experiments which consist of many popular outdoor landmarks in Rome using images from Flickr
Outperforms state-of-the-art significantly and yields geo-registration at pixel-level accuracy

End-to-End Learning of Sensorimotor Control
	Learning to Drive using Inverse Reinforcement Learning and Deep Q-Networks [pdf] [slide] S. Sharifzadeh and I. Chiotellis and R. Triebel and D. Cremers	NIPSWORK 2016 Sharifzadeh2016NIPSWORK

Contributions:
- Proposes use of Deep Q-Networks as the refinement step in Inverse Reinforcement Learning approaches
- This allows extraction of the rewards in scenarios with large state spaces such as driving
- Simulated agent generates collision-free motions and performs human-like lane change behaviour
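
For reference, the temporal-difference target that a Deep Q-Network regresses toward can be written in a few lines. How the paper couples this target with the reward estimated by inverse reinforcement learning is not reproduced here; `td_target` and its parameters are illustrative names.

```python
def td_target(reward, next_q_values, gamma=0.99, terminal=False):
    """Standard Q-learning target: r + gamma * max_a' Q(s', a')."""
    if terminal:
        return reward  # no future return after a terminal state
    return reward + gamma * max(next_q_values)


# Hypothetical values: two actions available in the next state.
target = td_target(1.0, [0.5, 2.0], gamma=0.9)
```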

Evaluates performance in a simulation-based autonomous driving scenario

Object Detection Person Detection
	Pedestrian detection for driving assistance systems: Single-frame classification and system level performance [pdf] [slide] A. Shashua and Y. Gdalyahu and G. Hayun	IV 2004 Shashua2004IV

Functional and architectural breakdown of a monocular pedestrian detection system targeting on-board driving assistance application
Single-frame classification based on a novel scheme of breaking down the class variability
Repeatedly training a set of relatively simple classifiers on clusters of training set
Integration of additional cues in a final system measured over time (dynamic gait, motion parallax, stability of re-detection)
Training and evaluation on recorded data

Semantic Segmentation Semantic Segmentation of Aerial Images
	Fully Convolutional Networks for Dense Semantic Labelling of High-Resolution Aerial Imagery [pdf] [slide] Jamie Sherrah	ARXIV 2016 Sherrah2016ARXIV

Full resolution semantic segmentation of high-resolution aerial imagery
No down-sampling: no need for deconvolution or interpolation
- During training the softmax layer outputs are upsampled to full resolution with bilinear interpolation.
A hybrid network that combines the pre-trained image features on ImageNet with DSM (Digital Surface Model) features that are trained from scratch
State-of-the-art accuracy on the ISPRS Vaihingen and Potsdam benchmarks
No-downsampling approach compared to downsampling FCN: faster training and higher accuracy

Motion & Pose Estimation Ego-Motion Estimation
	Robust Scale Estimation in Real-Time Monocular SFM for Autonomous Driving [pdf] [slide] Shiyu Song and Manmohan Chandraker	CVPR 2014 Song2014CVPR

Scale drift is a crucial challenge for monocular autonomous driving to emulate the performance of stereo
Presents a real-time monocular SFM system that corrects for scale drift using a novel cue combination framework for ground plane estimation

Contributions:
- A novel data-driven framework that combines multiple cues for ground plane estimation using learned models to adaptively weight per-frame observation covariances
- Highly accurate, robust, scale-corrected and real-time monocular SFM with performance comparable to stereo
- Novel use of detection cues for ground estimation, which boosts 3D object localization accuracy
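
The idea of weighting cues by their observation covariances can be illustrated with inverse-variance fusion of scalar estimates. This is a minimal sketch, assuming independent cues that each report a ground-plane height with a known variance; the paper's learned models for predicting those variances are not reproduced, and `fuse_cues` is a hypothetical name.

```python
def fuse_cues(estimates, variances):
    """Fuse independent scalar estimates, weighting each by its inverse variance."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    fused = sum(w * e for w, e in zip(weights, estimates)) / total
    return fused, 1.0 / total  # fused estimate and its variance


# Hypothetical per-frame values: a confident cue and a noisier one.
height, var = fuse_cues([1.7, 1.5], [0.01, 0.04])
```

The fused value lies closer to the cue with the smaller variance, which is the behaviour adaptive per-frame covariances are meant to produce.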

Evaluates on KITTI dataset

Semantic Segmentation Road/Lane Detection
	The path less taken: A fast variational approach for scene segmentation used for closed loop control [pdf] [slide] T. Suleymanov and L. M. Paz and P. Pinis and G. Hester and P. Newman	IROS 2016 Suleymanov2016IROS

Tracking State-of-the-Art on KITTI
	Multi-person Tracking by Multicut and Deep Matching [pdf] [slide] Siyu Tang and Bjoern Andres and Mykhaylo Andriluka and Bernt Schiele	ECCV 2016 Tang2016ECCV

Multi-person tracking by extending previous work¹: A graph-based formulation that links and clusters person hypotheses over time by solving a minimum cost subgraph multicut problem
Local pairwise feature based on local appearance matching that is robust to partial occlusion and camera motion (DeepMatching)
Comparison of different pairwise potentials
Analysis of the robustness of the tracking formulation
A plain multicut problem by removing outlying clusters
Applicable to long videos and many detections
No need for the intermediate tracklet representation
State-of-the-art performance on MOT16 benchmark

^{1. Subgraph decomposition for multi-target tracking. CVPR 2015}

Tracking Person Tracking
	Multi-person Tracking by Multicut and Deep Matching [pdf] [slide] Siyu Tang and Bjoern Andres and Mykhaylo Andriluka and Bernt Schiele	ECCVWORK 2016 Tang2016ECCVWORK

Reconstruction Multi-view 3D Reconstruction
	DENSER Cities: A System for Dense Efficient Reconstructions of Cities [pdf] [slide] Tanner, Michael and Pinies, Pedro and Paz, Lina Maria and Newman, Paul	ARXIV 2016 Tanner2016ARXIV

Semantic Segmentation Semantic Instance Segmentation
	Pixel-Level Encoding and Depth Layering for Instance-Level Semantic Labeling [pdf] [slide] Jonas Uhrig and Marius Cordts and Uwe Franke and Thomas Brox	GCPR 2016 Uhrig2016GCPR

Existing state-of-the-art methods have augmented convolutional neural networks (CNNs) with complex multitask architectures or computationally expensive graphical models

Contributions:
- Presents a fully convolutional network that predicts pixel-wise depth, semantics, and instance-level direction cues for holistic scene understanding
- Instead of complex architectures or graphical models, post-processing uses only standard computer vision techniques applied to the network's three output channels
- This approach does not depend on region proposals and scales for arbitrary numbers of object instances in an image

Evaluates on the KITTI and Cityscapes instance segmentation datasets

Semantic Segmentation Semantic Segmentation of 3D Data
	Mesh based semantic modelling for indoor and outdoor scenes [pdf] [slide] Valentin, Julien PC and Sengupta, Sunando and Warrell, Jonathan and Shahrokni, Ali and Torr, Philip HS	CVPR 2013 Valentin2013CVPR

Object labelling in 3D
A triangulated meshed representation of the scene from multiple depth estimates
- TSDF followed by surface reconstruction
CRF over the mesh combining information from
- Geometric properties (from the 3D mesh)
- Appearance properties (from images)
Local interactions by difference in colour and geometry of neighbouring faces
Evaluated in both indoor and outdoor scenes:
- Augmented version of the NYU indoor scene dataset
- Ground truth object labellings for the KITTI odometry dataset
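
A TSDF is commonly fused from multiple depth estimates with a running weighted average per voxel. The sketch below shows that common update for a single voxel; whether the paper uses exactly this weighting is an assumption, and `tsdf_update` and its parameters are hypothetical names.

```python
def tsdf_update(tsdf, weight, sdf_obs, trunc, w_obs=1.0):
    """Fold one signed-distance observation into a voxel's truncated running average."""
    d = max(-1.0, min(1.0, sdf_obs / trunc))  # truncate and normalise to [-1, 1]
    new_weight = weight + w_obs
    return (tsdf * weight + d * w_obs) / new_weight, new_weight


# Hypothetical voxel observed twice with a truncation distance of 0.1 m.
value, weight = tsdf_update(0.0, 0.0, sdf_obs=0.05, trunc=0.1)
value, weight = tsdf_update(value, weight, sdf_obs=0.5, trunc=0.1)
```

The surface is then extracted at the zero crossing of the fused values.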

Semantic Segmentation Semantic Segmentation of Aerial Images
	Detecting parametric objects in large scenes by Monte Carlo sampling [pdf] [slide] Verdie, Yannick and Lafarge, Florent	IJCV 2014 Verdie2014IJCV

Markov point processes are probabilistic models introduced to extend the traditional MRFs by using an object-based formalism
Markov point processes can address object recognition problems by directly manipulating parametric entities in dynamic graphs, whereas MRFs are restricted to labeling problems in static graphs

Contributions:
- Proposes a sampler that, contrary to the conventional MCMC sampler, which evolves the solution by successive perturbations, can perform a large number of perturbations simultaneously
- Proposes an efficient mechanism for modifications of objects by using spatial information extracted from the observed data
- Proposes an implementation on GPU which significantly reduces computation times with respect to existing algorithms
- To evaluate the performance of the sampler, proposes an original point process for detecting complex 3D objects in large-scale point clouds

Semantic Segmentation Label Propagation
	Active Frame Selection for Label Propagation in Videos [pdf] [slide] Sudheendra Vijayanarasimhan and Kristen Grauman	ECCV 2012 Vijayanarasimhan2012ECCV

Existing methods simply propagate annotations from arbitrarily selected frames and so may fail to best leverage the human effort invested
Defines an active frame selection problem: select k frames for manual labeling, such that automatic pixel-level label propagation can proceed with minimal expected error

Contributions:
- Proposes a solution that directly ties a joint frame selection criterion to the predicted errors of a flow-based random field propagation model
- Derives an efficient dynamic programming solution to optimize the criterion
- Shows how to automatically determine how many total frames k should be labeled in order to minimize the total manual effort spent labeling and correcting propagation errors
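
The dynamic programming idea can be sketched under a simplified cost model. The code assumes a given matrix `err[i][j]` holding the predicted error when frame j receives labels propagated from labeled frame i, and that each unlabeled frame takes its labels from the cheaper of its two neighbouring labeled frames. The paper's flow-based error prediction is not reproduced, and `select_frames` is a hypothetical name.

```python
def select_frames(err, k):
    """Choose k frames to label so that the summed predicted propagation error is minimal."""
    n = len(err)
    inf = float("inf")

    def between(i, j):
        # Frames strictly between two labeled frames use the cheaper source.
        return sum(min(err[i][t], err[j][t]) for t in range(i + 1, j))

    # best[m][j]: minimal error over frames 0..j with m labeled frames, the last being j.
    best = [[inf] * n for _ in range(k + 1)]
    back = [[None] * n for _ in range(k + 1)]
    for i in range(n):
        best[1][i] = sum(err[i][t] for t in range(i))
    for m in range(2, k + 1):
        for j in range(n):
            for i in range(j):
                cost = best[m - 1][i] + between(i, j)
                if cost < best[m][j]:
                    best[m][j], back[m][j] = cost, i

    # Add the frames after the last labeled frame and pick the best ending.
    total, last = min(
        (best[k][i] + sum(err[i][t] for t in range(i + 1, n)), i) for i in range(n)
    )
    frames, m = [], k
    while last is not None:
        frames.append(last)
        last, m = back[m][last], m - 1
    return total, frames[::-1]


# Hypothetical error model: error grows linearly with temporal distance.
err = [[abs(i - j) for j in range(6)] for i in range(6)]
total_error, labeled = select_frames(err, 2)
```

With this toy error model the selected frames spread out evenly over the sequence, which matches the intuition that arbitrary selection wastes annotation effort.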

Evaluates on the LabelMe, CamSeq, SegTrack, and CamVid datasets

Semantic Segmentation Semantic Segmentation of 3D Data
	Incremental Dense Semantic Stereo Fusion for Large-Scale Semantic Scene Reconstruction [pdf] [slide] Vibhav Vineet and Ondrej Miksik and Morten Lidegaard and Matthias Niessner and Stuart Golodetz and Victor A. Prisacariu and Olaf Kahler and David W. Murray and Shahram Izadi and Patrick Perez and Philip H. S. Torr	ICRA 2015 Vineet2015ICRA

Dense, large-scale, outdoor semantic reconstruction of a scene
Near real-time using GPUs (features not included)
Hash-based technique for large-scale fusion
More reliable visual odometry instead of ICP camera pose estimation
2D features and unaries based on random forest classifier for semantic segmentation and transferring them to 3D volume
An online volumetric mean-field inference algorithm for densely-connected CRFs
A semantic fusion approach to handle dynamic objects
Output: Per-voxel probability distribution instead of a single label
Evaluated on KITTI
Semantic fusion improves segmentation results, especially for cars.
Reconstruction improves upon initial depth estimation.
Sharp boundaries on sequences captured using a head-mounted stereo camera

Motion & Pose Estimation 2D Motion Estimation -- Optical Flow
	An Evaluation of Data Costs for Optical Flow [pdf] [slide] Christoph Vogel and Stefan Roth and Konrad Schindler	GCPR 2013 Vogel2013GCPR

Appropriate data cost functions necessary for outdoor challenges like shadows, reflections
Limitations of evaluations so far:
- Only certain types of data costs
- Data without outdoor challenges
Contributions:
- Systematic evaluation of pixel- and patch-based data costs (Brightness constancy, normalized cross correlation, mutual information, census transform)
- Approximation of census transform for gradient-based methods
- Unified state-of-the-art testbed
- Evaluation on realistic KITTI dataset
On real-world data, patch-based costs perform better than pixel-based costs
Census transform slightly outperforms all others
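
As an illustration of why a patch-based cost such as the census transform copes with illumination changes, the sketch below computes a census signature and a Hamming-distance cost on plain nested lists. It is a generic textbook version, not the paper's gradient-friendly approximation; `census` and `census_cost` are illustrative names.

```python
def census(img, y, x, radius=1):
    """Bit signature: 1 for each window neighbour darker than the centre pixel."""
    centre = img[y][x]
    bits = 0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue
            bits = (bits << 1) | (1 if img[y + dy][x + dx] < centre else 0)
    return bits


def census_cost(img_a, img_b, ya, xa, yb, xb, radius=1):
    """Data cost: Hamming distance between the two census signatures."""
    diff = census(img_a, ya, xa, radius) ^ census(img_b, yb, xb, radius)
    return bin(diff).count("1")


# Hypothetical 3x3 patch and a brighter copy (monotonic intensity change).
patch = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
brighter = [[2 * v + 5 for v in row] for row in patch]
cost = census_cost(patch, brighter, 1, 1, 1, 1)
```

Because the signature depends only on the intensity ordering within the window, any monotonic brightness change leaves the cost at zero, whereas a brightness constancy cost would penalise it.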

Motion & Pose Estimation 3D Motion Estimation -- Scene Flow
	3D scene flow estimation with a piecewise rigid scene model [pdf] [slide] Christoph Vogel and Konrad Schindler and Stefan Roth	IJCV 2015 Vogel2015IJCV

Limitations of existing methods:
- Conventional pixel-based representations require large number of parameters leading to challenging inference
- Parameterize w.r.t. a single viewpoint and therefore may ignore important evidence present in other views

Contributions:
- Represents dynamic scenes as a collection of planar regions, each undergoing a rigid motion
- Represents 3D shape and motion w.r.t. every image in a time interval while demanding consistency of the representations

Evaluates on stereo and flow KITTI benchmarks

Motion & Pose Estimation Localization
	Image-based Localization with Spatial LSTMs [pdf] [slide] Florian Walch and Caner Hazirbas and Laura Leal-Taixe and Torsten Sattler and Sebastian Hilsenbeck and Daniel Cremers	ARXIV 2016 Walch2016ARXIV

Proposes a new CNN+LSTM architecture for camera pose regression for indoor and outdoor scenes

Contributions:
- Provides an extensive quantitative comparison of CNN-based vs SIFT-based localization methods
- Introduces a new challenging large indoor benchmark with accurate ground truth pose information

Evaluates on outdoor scenes like Cambridge and indoor scenes such as 7Scenes, LSI Localization datasets

Object Detection 3D Object Detection from 3D Point Clouds
	Voting for Voting in Online Point Cloud Object Detection [pdf] [slide] Dominic Zeng Wang and Ingmar Posner	RSS 2015 Wang2015RSS

Sliding window approach for laser-based 3D object detection
A voting scheme by exploiting sparsity
- Enabling a search through all putative object locations at any orientation
- Mathematically equivalent to a convolution on a sparse feature grid (a linear classifier)
- Processing in full 3D, irrespective of the number of vantage points
Highly parallelisable (processing 100K points at eight orientations in less than 0.5s)
The best-in-class detection and timing for car, pedestrian and bicyclist on KITTI
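
The equivalence between voting and a sliding-window linear classifier on a sparse grid can be checked with a small sketch. It assumes a single orientation, a dictionary of occupied cells, and one weight vector per window offset; these data structures and the names `vote` and `dense_score` are illustrative, not the paper's implementation.

```python
from collections import defaultdict


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def vote(features, weights):
    """Each occupied cell casts one vote per window offset; empty cells cost nothing."""
    scores = defaultdict(float)
    for cell, feat in features.items():
        for offset, w in weights.items():
            origin = tuple(c - o for c, o in zip(cell, offset))
            scores[origin] += dot(w, feat)
    return dict(scores)


def dense_score(features, weights, origin):
    """Reference: linear classifier score of the window anchored at origin."""
    total = 0.0
    for offset, w in weights.items():
        cell = tuple(o + d for o, d in zip(origin, offset))
        if cell in features:
            total += dot(w, features[cell])
    return total


# Hypothetical sparse grid with two occupied cells and a two-cell window.
features = {(0, 0, 0): [1.0, 2.0], (1, 0, 0): [0.5, 0.0]}
weights = {(0, 0, 0): [1.0, 0.0], (1, 0, 0): [0.0, 1.0]}
votes = vote(features, weights)
```

The work done by `vote` scales with the number of occupied cells rather than with the size of the grid, which is where the sparsity pays off.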

Datasets & Benchmarks Real Data
	TorontoCity: Seeing the World with a Million Eyes [pdf] [slide] Shenlong Wang and Min Bai and Gellert Mattyus and Hang Chu and Wenjie Luo and Bin Yang and Justin Liang and Joel Cheverie and Sanja Fidler and Raquel Urtasun	ARXIV 2016 Wang2016ARXIV

Large-scale benchmark for multiple tasks covering full greater Toronto area (GTA) with 712.5 km² of land, 8439 km of road and around 400,000 buildings
Limitations of current benchmarks:
- Small set of sensors
- Lack of rich semantics and 3D information at a large-scale
Joint reasoning about geometry, grouping and semantics (three R's of computer vision)
Captured from airplanes, drones and cars driving around the city
Tasks: building height estimation (reconstruction), road centerline and curb extraction, building instance segmentation, building contour extraction (reorganization), semantic labeling and scene type classification (recognition)
Utilizing different sources of high-precision maps to create ground truth
Aligning all data sources with the maps while requiring minimal human supervision
State-of-the-art methods work well on semantic segmentation and scene classification.
Open challenges: instance segmentation, contour extraction and height estimation
Many possible extensions

Motion & Pose Estimation 3D Motion Estimation -- Scene Flow
	Stereoscopic scene flow computation for 3D motion understanding [pdf] [slide] A. Wedel and T. Brox and T. Vaudrey and C. Rabe and U. Franke and D. Cremers	IJCV 2011 Wedel2011IJCV

3D motion estimation using a variational framework and depth estimation
Decoupling motion from depth estimation
- Allows using the most suitable method for each of the two problems
- Stereo matching used as constraint for the motion estimation
- Faster computation on FPGA (depth) and GPU (motion)
Use TV-L2 smoothing to remove illumination differences between images
Energy-based uncertainty measure from motion estimation improves motion segmentation
Evaluation on the synthetic data (rotating sphere and Povray Traffic Scene)
Qualitative results on real-world scenes

Semantic Segmentation Road/Lane Detection
	B-Spline Modeling of Road Surfaces with an Application to Free Space Estimation [pdf] [slide] A. Wedel and C. Rabe and H. Badino and H. Loose and U. Franke and D. Cremers	TITS 2009 Wedel2009TITS

The planar road surface assumption does not model slope changes and cannot be used to restrict the free space
Representation of the visible road surface based on general parametric B-spline curve
Surface parameters are estimated from stereo measurements in the free space and are tracked over time using a Kalman filter
Adapts a road-obstacle segmentation algorithm to use the B-spline road representation
Evaluation on recorded data shows accurate free space estimation when the planar assumption fails

Semantic Segmentation Semantic Segmentation of Aerial Images
	Cataloging Public Objects Using Aerial and Street-Level Images - Urban Trees [pdf] [slide] Wegner, Jan D. and Branson, Steven and Hall, David and Schindler, Konrad and Perona, Pietro	CVPR 2016 Wegner2016CVPR

Public tree cataloguing (of location and species of trees) system from online maps
Motivation:
- Large-scale tree mapping project called Opentreemap
- Currently carried out with specialized imagery (LiDAR, hyperspectral) that is collected ad-hoc, and/or with in-person visits
det2geo: detects the set of locations of objects of a given category
geo2cat: computes the fine-grained category of the 3D object at a given location
Challenge: Combining multiple aerial and street-level views
Adapting state-of-the-art CNN-based object detectors and classifiers
Pasadena Urban Trees dataset: 80,000 trees with geographic and species annotations
Multi-view recognition improves over single view:
- Mean average precision from 42 to 71 for tree detection
- Accuracy from 70 to 80 for tree species recognition

Semantic Segmentation Semantic Segmentation of Aerial Images
	A Higher-Order CRF Model for Road Network Extraction [pdf] [slide] Jan Dirk Wegner and Javier A. Montoya-Zegarra and Konrad Schindler	CVPR 2013 Wegner2013CVPR

Extract road network from aerial images
Problem: Pairwise potentials smooth out thin structures
Novel CRF with higher-order cliques connecting superpixels along line segments as prior
Sampling scheme that concentrates on most relevant cliques with a data-driven approach
Random Forest unaries
Evaluation on Graz and Vaihingen road network dataset
Outperforms a simple smoothness and heuristic rule-based baseline

Semantic Segmentation Semantic Segmentation of Aerial Images
	Road networks as collections of minimum cost paths [pdf] [slide] Wegner, Jan Dirk and Montoya-Zegarra, Javier Alexander and Schindler, Konrad	JPRS 2015 Wegner2015JPRS

Road extraction usually tackled with rule-based approaches
Extension of their previous work, which constrained roads to lie on line segments
Create a large, over-complete set of candidates with minimum cost paths
Minimum cost paths allow the regularization to follow arbitrary paths
MAP inference in a higher-order CRF is used to select the optimal candidates
Random forest classifier used as unary
Evaluation on Graz and Vaihingen road network dataset
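
A single candidate of this kind can be sketched as a minimum cost path on a grid using Dijkstra's algorithm. The sketch assumes a per-cell cost that is low where a classifier believes there is road, 4-connected moves, and a reachable goal. The paper's candidate sampling and CRF selection are not reproduced, and `min_cost_path` is a hypothetical name.

```python
import heapq


def min_cost_path(cost, start, goal):
    """Dijkstra on a 4-connected grid; path cost is the sum of visited cell costs."""
    rows, cols = len(cost), len(cost[0])
    dist = {start: cost[start[0]][start[1]]}
    prev = {}
    heap = [(dist[start], start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            break
        if d > dist[node]:
            continue  # stale queue entry
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + cost[nr][nc]
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = node
                    heapq.heappush(heap, (nd, (nr, nc)))
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return dist[goal], path[::-1]


# Hypothetical cost map: a cheap U-shaped "road" around an expensive middle column.
cost = [[1, 9, 1],
        [1, 9, 1],
        [1, 1, 1]]
total, path = min_cost_path(cost, (0, 0), (0, 2))
```

The path follows the curved low-cost region instead of cutting straight across, which a line-segment prior could not express.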

Reconstruction Reconstruction & Recognition
	A Data-driven Regularization Model for Stereo and Flow [pdf] [slide] D. Wei and C. Liu and W.T. Freeman	THREEDV 2014 Wei2014THREEDV

Resolving local ambiguity of the disparity or flow
- by considering the semantic information
- without explicit object modelling
Data driven approach:
- Transferring shape information from semantically matched patches in the database
- Relative-relationship transfer (by subtracting disparity at the center pixel) rather than data-term transfer (absolute values)
- Similar local shape information while absolute disparity values differ
A standard MRF model using gradient descent for inference
Comparable or better results on the KITTI stereo and flow datasets
Improved results on the Sintel flow dataset
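
The relative-relationship idea can be shown in a few lines: subtracting the disparity at the centre pixel leaves only the local shape, so patches at different depths become comparable. This is a minimal sketch on nested lists; `relative_patch` is a hypothetical name and the paper's patch matching and MRF are not reproduced.

```python
def relative_patch(disparity_patch):
    """Subtract the centre disparity so that only the local shape remains."""
    h, w = len(disparity_patch), len(disparity_patch[0])
    centre = disparity_patch[h // 2][w // 2]
    return [[d - centre for d in row] for row in disparity_patch]


# Hypothetical patches: the same slanted surface at two different depths.
near = [[40, 41, 42], [40, 41, 42], [40, 41, 42]]
far = [[10, 11, 12], [10, 11, 12], [10, 11, 12]]
same_shape = relative_patch(near) == relative_patch(far)
```

The absolute disparities differ by 30, yet both patches reduce to the same relative pattern, so the shape of one can be transferred to the other.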

General Literature
	Handbook of Driver Assistance Systems [pdf] [slide] Winner, H. and Hakuli, S. and Lotz, F. and Singer, C. and Geiger, Andreas and others	Book 2015 Winner2015eng

Scene Understanding
	Monocular 3D Scene Modeling and Inference: Understanding Multi-Object Traffic Scenes [pdf] [slide] Wojek, C. and Roth, S. and Schindler, K. and Schiele, B.	ECCV 2010 Wojek2010ECCV

A probabilistic 3D scene model for multi-class object detection, object tracking, scene labelling, and 3D geometric relations
A consistent 3D description of a scene using only monocular video
Complex interactions like inter-object occlusion, physical exclusion between objects, geometric context
RJMCMC for inference and HMM for long-term associations in scene tracking
Better than state-of-the-art in 3D multi-people tracking (ETH-Loewenplatz)
A new, challenging dataset for 3D tracking of cars and trucks: MPI-VehicleScenes

Scene Understanding
	A Dynamic Conditional Random Field Model for Joint Labeling of Object and Scene Classes [pdf] [slide] Wojek, Christian and Schiele, Bernt	ECCV 2008 Wojek2008ECCV

Proposes a novel approach based on conditional random field (CRF) models to integrate both object detection and scene labeling in one framework

Contributions:
- Formulates the integration as a joint labeling problem of object and scene classes
- Systematically integrates dynamic information for the object detection task as well as for the scene labeling task

Evaluates on Sowerby database and a new dynamic scenes dataset

Scene Understanding
	Monocular 3D Scene Understanding with Explicit Occlusion Reasoning [pdf] [slide] Christian Wojek and Stefan Walk and Stefan Roth and Bernt Schiele	CVPR 2011 Wojek2011CVPR

Monocular 3D scene tracking-by-detection with explicit object-object occlusion reasoning
Tracking the complete scene rather than an assembly of individuals
Extension of detection approaches HOG and DPM to enable the detection of partially visible humans
Integration of the detections into a 3D scene model
Full object and object part detectors are combined in a mixture of experts based on visibility
Visibility is obtained from the 3D scene model
More robust detection and tracking of partially visible pedestrians
Evaluation on two challenging sequences ETH-Linthescher and ETH-PedCross2 recorded from a moving car in busy pedestrian zones

Scene Understanding
	Monocular Visual Scene Understanding: Understanding Multi-Object Traffic Scenes [pdf] [slide] Christian Wojek and Stefan Walk and Stefan Roth and Konrad Schindler and Schiele, Bernt	PAMI 2013 Wojek2013PAMI

A probabilistic 3D scene model for multi-class object detection, object tracking, scene labelling, and 3D geometric relations using monocular video as input
Extension of Wojek2010ECCV¹ with explicit occlusion reasoning for tracking objects that are partially occluded or that have never been observed to their full extent
Evaluated on ETH-Loewenplatz, ETH-Linthescher, ETH-PedCross2, MPI-VehicleScenes
Robust performance due to
- a strong tracking-by-detection framework with tracklets
- exploiting 3D scene context by combining multiple cues
Explicit occlusion reasoning improves results on all sequences.
Long-term tracking with an HMM does not lead to additional gains.
Improvement over state-of-the-art object detectors, a stereo-based system, a competing monocular system, basic Kalman filters

^{1. Monocular 3D Scene Modeling and Inference: Understanding Multi-Object Traffic Scenes, ECCV 2010}

Object Detection Person Detection
	Multi-Cue Onboard Pedestrian Detection [pdf] [slide] C. Wojek and S. Walk and B. Schiele	CVPR 2009 Wojek2009CVPR

Detecting pedestrians using an onboard camera
Existing methods rely on static image features only despite the obvious potential of motion information for people detection

Contributions:
- Shows that motion cues provide a valuable feature, also for detection from a moving platform
- Shows that MPLBoost and histogram intersection kernel SVMs can successfully learn a multi-viewpoint pedestrian detector and often outperform linear SVMs
- Introduces a new realistic and publicly available onboard dataset (TUD-Brussels) containing multi-viewpoint data

Evaluates on ETH-Person, TUD-Brussels dataset

Motion & Pose Estimation Localization
	Regularity-Driven Facade Matching Between Aerial and Street Views [pdf] [slide] Wolff, Mark and Collins, Robert T. and Liu, Yanxi	CVPR 2016 Wolff2016CVPR

Detecting and matching building facades between aerial view and street-view images
Challenges beyond patch matching and ground-level-only wide-baseline facade matching
Exploiting the regularity of urban scene facades
Using a lattice and its associated median tiles (motifs) as the basis for matching
Joint regularity optimization problem, seeking well-defined features that reoccur across both facades to serve as match indicators
Matching costs based on edge shape contexts, color features, and Gabor filter responses
Evaluated on three cities
Superior performance over baselines SIFT, Root-SIFT, and Scale-Selective Self-Similarity and Binary Coherent Edge descriptors

Motion & Pose Estimation Localization
	Wide-Area Image Geolocalization with Aerial Reference Imagery [pdf] [slide] Scott Workman and Richard Souvenir and Nathan Jacobs	ICCV 2015 Workman2015ICCV

Proposes to use deep convolutional neural networks to address the problem of cross-view image geolocalization
Geolocation of a ground-level query image is estimated by matching to georeferenced aerial images

Contributions:
- Evaluation of off-the-shelf CNN network architectures & target label spaces for the problem of cross-view localization
- Cross-view training for learning a joint semantic feature space from different image sources

Evaluates on new dataset that contains pairs of aerial and ground-level images from across the United States.

Object Detection 2D Object Detection
	Learning And-Or Model to Represent Context and Occlusion for Car Detection and Viewpoint Estimation [pdf] [slide] Tianfu Wu and Bo Li and Song-Chun Zhu	PAMI 2016 Wu2016PAMI

Car detection and viewpoint estimation from images
And-Or model embeds a grammar for representing large structural and appearance variations in a reconfigurable hierarchy
Learning an And-Or model that takes into account structural and appearance variations at multi-car, single-car and part levels jointly
Learning process consists of two stages in a weakly supervised way
- The structure of the model is learned mining multi-car contextual patterns, occlusion configurations, combination of parts
- Model parameters are jointly trained using Weak-Label Structural SVM
Evaluation of car detection on KITTI, the PASCAL VOC2007 car dataset, and two self-collected car datasets, and of car viewpoint estimation on PASCAL VOC2006 and PASCAL3D+

Semantic Segmentation Semantic Segmentation
	Wider or Deeper: Revisiting the ResNet Model for Visual Recognition [pdf] [slide] Zifeng Wu and Chunhua Shen and Anton van den Hengel	ARXIV 2016 Wu2016ARXIV

Motivation: Need to investigate whether making network architectures deeper or wider is the right strategy

Contributions:
- Analyses the ResNet architecture, in terms of the ensemble classifiers therein and the effective depths of the residual units
- Calculates a new, more spatially efficient and better performing architecture which achieves end-to-end training for large networks
- Designs a group of correspondingly shallow networks, shown to outperform very deep residual networks

Evaluates on PASCAL VOC, PASCAL Context, Cityscapes semantic image segmentation datasets & ImageNet classification dataset

Motion & Pose Estimation 2D Motion Estimation -- Optical Flow
	Efficient Sparse-to-Dense Optical Flow Estimation using a Learned Basis and Layers [pdf] [slide] Wulff, Jonas and Black, Michael J.	CVPR 2015 Wulff2015CVPR

Representing optical flow as a weighted sum of the basis flow fields
Given a set of sparse matches, regressing to dense optical flow using a learned set of full-frame basis flow fields
Learning the principal components using flow computed from four Hollywood movies
Very fast (200ms/frame), but too smooth
Sparse layered flow, each layer is PCA-Flow (3.2s/frame)
Evaluated on Sintel and KITTI 2012 benchmarks
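
The regression from sparse matches to a dense field can be sketched with ordinary least squares over two basis flow fields. The sketch assumes hand-made basis fields stored as dictionaries from pixel to (u, v) and a closed-form 2x2 solve. The paper learns its basis with PCA, and its exact regression scheme is not reproduced here; `fit_coefficients` and `dense_flow` are hypothetical names.

```python
def fit_coefficients(basis, matches):
    """Least-squares weights of two basis flow fields given sparse matches."""
    b0, b1 = basis
    a00 = a01 = a11 = r0 = r1 = 0.0
    for px, flow in matches.items():
        for k in range(2):  # horizontal and vertical flow components
            p, q, f = b0[px][k], b1[px][k], flow[k]
            a00 += p * p
            a01 += p * q
            a11 += q * q
            r0 += p * f
            r1 += q * f
    det = a00 * a11 - a01 * a01  # assumed non-zero: the matches constrain both fields
    return ((r0 * a11 - r1 * a01) / det, (a00 * r1 - a01 * r0) / det)


def dense_flow(basis, coeffs, pixels):
    """Dense flow as the weighted sum of the basis flow fields."""
    return {
        px: tuple(sum(c * b[px][k] for c, b in zip(coeffs, basis)) for k in range(2))
        for px in pixels
    }


# Hypothetical 1D image of four pixels: a translation field and an expansion field.
translation = {x: (1.0, 0.0) for x in range(4)}
expansion = {x: (x - 1.5, 0.0) for x in range(4)}
matches = {0: (1.25, 0.0), 3: (2.75, 0.0)}  # sparse matches at two pixels only
coeffs = fit_coefficients([translation, expansion], matches)
flow = dense_flow([translation, expansion], coeffs, range(4))
```

Two matches are enough to recover a flow value at every pixel because the unknowns are the basis weights rather than per-pixel vectors, which is also why the result tends to be smooth.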

Tracking State-of-the-Art on KITTI
	Learning to Track: Online Multi-object Tracking by Decision Making [pdf] [slide] Yu Xiang and Alexandre Alahi and Silvio Savarese	ICCV 2015 Xiang2015ICCV

Online multi-object tracking (MOT)
Challenge: robustly associating noisy, new detections with previously tracked objects
Formulated as decision making in Markov Decision Processes (MDPs), where the lifetime of an object is modeled with an MDP
Data association (learning a similarity function) as learning a policy for the MDP as in reinforcement learning
Benefiting from both offline- and online-learning for data association
The birth/death and appearance/disappearance of targets by treating them as state transitions in the MDP
Better than the state-of-the-art on MOT Benchmark
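
The treatment of target birth, death, appearance and disappearance as state transitions can be sketched as a small state machine. The state names, action labels and transition table below are assumptions for illustration; the learned policy that chooses the action, which is the core of the method, is not reproduced.

```python
# Hypothetical target lifecycle: (state, action) -> next state.
TRANSITIONS = {
    ("active", "confirm"): "tracked",    # birth: a new detection becomes a target
    ("active", "reject"): "inactive",    # detection judged to be a false alarm
    ("tracked", "keep"): "tracked",
    ("tracked", "lose"): "lost",         # disappearance, e.g. occlusion
    ("lost", "reassociate"): "tracked",  # reappearance via data association
    ("lost", "keep"): "lost",
    ("lost", "terminate"): "inactive",   # death: the target leaves for good
}


def step(state, action):
    """Apply one decision to a target's state."""
    return TRANSITIONS[(state, action)]


def run(actions, state="active"):
    """Replay a sequence of decisions for one target, starting at its first detection."""
    for action in actions:
        state = step(state, action)
    return state


final_state = run(["confirm", "lose", "reassociate"])
```

In this framing, data association is the decision taken in the lost state: whether a new detection should bring the target back to the tracked state.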

Object Detection 2D Object Detection
	Subcategory-aware Convolutional Neural Networks for Object Proposals and Detection [pdf] [slide] Yu Xiang and Wongun Choi and Yuanqing Lin and Silvio Savarese	ARXIV 2016 Xiang2016ARXIV

Existing methods mainly focus on 2D object detection, cannot estimate detailed properties of objects, and rely on region proposal methods to generate object candidates
Introduces a novel region proposal network that uses subcategory information to guide the proposal generating process
A subcategory can be objects with similar attributes such as 2D appearance, 3D pose or 3D shape
Introduces a new object detection network for joint detection and subcategory classification
By using sub-categories related to object pose, they achieve state-of-the-art performance on both detection and pose estimation on commonly used benchmarks

Evaluates on KITTI, PASCAL3D+ and PASCAL VOC 2007 datasets

Semantic Segmentation Semantic Segmentation of Facades
	Image-based street-side city modeling [pdf] [slide] Jianxiong Xiao and Tian Fang and Peng Zhao and Maxime Lhuillier and Long Quan	SIGGRAPH 2009 Xiao2009SIGGRAPH

Proposes an automatic approach to generate street-side 3D photo-realistic models from images captured along the streets at ground level

Develops a multi-view semantic segmentation method that recognizes and segments each image into semantically meaningful areas, each labeled with a specific object class, such as building, sky, ground, vegetation and car
A partition scheme is then introduced to separate buildings into independent blocks using the major line structures of the scene
For each block, proposes an inverse patch-based orthographic composition and structure analysis method for facade modeling that efficiently regularizes the noisy and missing reconstructed 3D data
System has the distinct advantage of producing visually compelling results by imposing strong priors of building regularity

Semantic Segmentation Semantic Segmentation of Facades
	Multiple view semantic segmentation for street view images. [pdf] [slide] Xiao, Jianxiong and Quan, Long	ICCV 2009 Xiao2009ICCV

Multi view semantic segmentation framework for images captured by a car driving along streets
Superpixel pairwise MRF over the entire sequence
Spatial and temporal smoothness of semantic labels
Boosting classifier as unary using image-based and geometric features from 3D reconstruction
Training speedup and quality improvement with adaptive training that selects most similar training data for each scene from label pool
Approach can be used for large-scale labeling in 2D and 3D space simultaneously
Demonstration on Google Street View images
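
The pairwise MRF above can be illustrated with a minimal sketch of the energy it minimizes: unary costs per superpixel plus a smoothness penalty over neighboring superpixels. The Potts penalty, the function name `mrf_energy`, and the toy costs are assumptions for illustration, not the paper's exact potentials.

```python
def mrf_energy(labels, unary, edges, smooth_weight=1.0):
    """Energy of one labeling under a pairwise MRF with a Potts smoothness term.

    labels[i]   : label index assigned to superpixel i
    unary[i][l] : cost of giving superpixel i label l (e.g. from a classifier)
    edges       : (i, j) pairs of neighboring superpixels, spatial or temporal
    """
    data_term = sum(unary[i][labels[i]] for i in range(len(labels)))
    # Potts penalty (assumed): pay smooth_weight whenever two neighbors disagree.
    smooth_term = sum(smooth_weight for i, j in edges if labels[i] != labels[j])
    return data_term + smooth_term


# Hypothetical toy problem: three superpixels, two labels, a chain of neighbors.
unary = [[0.1, 2.0], [1.5, 0.2], [0.9, 1.0]]
edges = [(0, 1), (1, 2)]
smooth_labeling = mrf_energy([0, 1, 1], unary, edges)
noisy_labeling = mrf_energy([0, 1, 0], unary, edges)
```

Here the third superpixel slightly prefers label 0 on its own, but the smoothness term makes the consistent labeling cheaper overall.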

Semantic Segmentation Label Propagation
	Semantic Instance Annotation of Street Scenes by 3D to 2D Label Transfer [pdf] [slide] Jun Xie and Martin Kiefel and Ming-Ting Sun and Andreas Geiger	CVPR 2016 Xie2016CVPR

Motivation for 3D to 2D Label Transfer:
- Objects often project into several images of the video sequence, thus lowering annotation efforts considerably.
- 2D instance annotations are temporally coherent as they are associated with a single object in 3D
- 3D annotations might be useful by themselves for reasoning in 3D or to enrich 2D annotations with approximate 3D geometry

Contributions:
- Present a novel geo-registered dataset of suburban scenes recorded by a moving platform
- Provides semantic 3D annotations for all static scene elements
- Proposes a method to transfer these labels from 3D into 2D, yielding pixelwise semantic instance annotations
- The dataset comprises over 400k images and over 100k laser scans
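
The geometric core of a 3D to 2D label transfer can be sketched as a pinhole projection with a z-buffer. This covers only the projection step, not the paper's full transfer model; the function name `project_labels` and the camera parameters `fx, fy, cx, cy` are hypothetical, and points are assumed to be given in camera coordinates.

```python
def project_labels(points, labels, fx, fy, cx, cy, width, height):
    """Project labeled 3D points into a sparse 2D label image.

    points : (x, y, z) tuples in camera coordinates, z pointing forward
    labels : one label per point
    Returns a height x width nested list; pixels hit by no point stay None.
    """
    label_img = [[None] * width for _ in range(height)]
    depth = [[float("inf")] * width for _ in range(height)]
    for (x, y, z), label in zip(points, labels):
        if z <= 0:
            continue  # point lies behind the camera
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        # z-buffer: the closest point wins, so occluded labels are discarded
        if 0 <= u < width and 0 <= v < height and z < depth[v][u]:
            depth[v][u] = z
            label_img[v][u] = label
    return label_img


# Hypothetical example: a car in front of a building along the same viewing ray.
points = [(0.0, 0.0, 2.0), (0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 0.0, -1.0)]
labels = ["building", "car", "road", "sky"]
img = project_labels(points, labels, fx=1.0, fy=1.0, cx=1.0, cy=1.0, width=3, height=3)
```

The result is sparse, which is one reason a dense pixelwise annotation needs more than projection alone.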

End-to-End Learning of Sensorimotor Control
	End-to-end Learning of Driving Models from Large-scale Video Dataset [pdf] [slide] Huazhe Xu and Yang Gao and Fisher Yu and Trevor Darrell	ARXIV 2016 Xu2016ARXIV

Learning a generic driving model/policy:
- Future egomotion prediction given the present state
- A probability distribution over actions conditioned on a state
Rule-based methods vs. learning-based approaches for autonomous driving
Limitations of learning a mapping from pixels to actuation: data collection
Using large scale online and/or crowdsourced dashcam videos
An end-to-end FCN-LSTM model: fusing an LSTM temporal encoder with a fully convolutional visual encoder
Joint driving demonstration and segmentation loss
Faster training and better results by using semantic segmentation as side task

Reconstruction Stereo
	Continuous Markov Random Fields for Robust Stereo Estimation [pdf] [slide] Yamaguchi, Koichiro and Hazan, Tamir and McAllester, David and Urtasun, Raquel	ECCV 2012 Yamaguchi2012ECCV

Slanted-plane model which reasons jointly about occlusion boundaries and depth
Existing slanted-plane methods involve time-consuming optimization algorithms

Contributions:
- Novel model involving "boundary labels", "junction potentials" & "edge ownership"
- Faster inference by employing particle convex belief propagation (PCBP)
- More effective parameter training algorithm based on Primal-dual approximate inference

Evaluates on KITTI and Middlebury high resolution images
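
As a minimal sketch of what a slanted-plane representation means: each image segment carries a plane (a, b, c), and the disparity at pixel (u, v) is a*u + b*v + c. The function names and the toy segmentation are assumptions for illustration; the boundary labels and inference of the paper are not modeled here.

```python
def plane_disparity(plane, u, v):
    """Disparity at pixel (u, v) for a slanted plane given as (a, b, c)."""
    a, b, c = plane
    return a * u + b * v + c


def render_disparity(segment_ids, planes):
    """Build a dense disparity map from a segmentation and one plane per segment."""
    return [[plane_disparity(planes[seg], u, v) for u, seg in enumerate(row)]
            for v, row in enumerate(segment_ids)]


# Hypothetical 2x3 image: segment 0 is fronto-parallel, segment 1 is slanted.
segments = [[0, 0, 1],
            [0, 1, 1]]
planes = {0: (0.0, 0.0, 5.0), 1: (0.5, 1.0, 2.0)}
disparity = render_disparity(segments, planes)
```

A fronto-parallel surface is the special case a = b = 0, which is why slanted planes are strictly more expressive than piecewise-constant disparity.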

Motion & Pose Estimation 2D Motion Estimation -- Optical Flow
	Efficient joint segmentation, occlusion labeling, stereo and flow estimation [pdf] [slide] Yamaguchi, Koichiro and McAllester, David and Urtasun, Raquel	ECCV 2014 Yamaguchi2014ECCV

Existing slanted-plane methods involve time-consuming optimization algorithms

Contributions:
- Exploits the fact that in autonomous driving scenarios most of the scene is static
- New SGM algorithm based on the joint evidence of the stereo and video pairs
- New fast block-coordinate descent form of inference algorithm on a total energy involving the segmentation, slanted planes and occlusion labeling

Evaluates on stereo and flow KITTI benchmarks
Order of magnitude faster than competing approaches

Motion & Pose Estimation 2D Motion Estimation -- Optical Flow
	Robust Monocular Epipolar Flow Estimation [pdf] [slide] K. Yamaguchi and D. McAllester and R. Urtasun	CVPR 2013 Yamaguchi2013CVPR

Limitations of existing algorithms:
- Gradient-based methods suffer in the presence of large displacements
- Matching-based methods are computationally demanding due to the large number of candidates required

Contributions:
- Adapts slanted plane stereo models to the problem of monocular epipolar flow estimation
- Efficient flow-aware segmentation algorithm that encourages the segmentation to respect both image and flow discontinuities
- Robust data term using a new local flow matching algorithm

Evaluates on KITTI flow benchmark

Object Detection 2D Object Detection
	Exploit All the Layers: Fast and Accurate CNN Object Detector with Scale Dependent Pooling and Cascaded Rejection Classifiers [pdf] [slide] Fan Yang and Wongun Choi and Yuanqing Lin	CVPR 2016 Yang2016CVPR

Current approaches (Fast RCNN):
- Problems with small objects
- Not applicable to very deep architectures due to multi-scale input
- Other time constraints due to the huge number of candidate bounding boxes
Two new strategies for object detection using CNNs:
- Layer-wise cascaded rejection classifiers (CRC) to reject easy negatives in all layers
- Evaluating surviving proposals using scale-dependent pooling (SDP): representing a candidate bounding box using the convolutional features pooled from a layer corresponding to its scale (height)
Better accuracy compared to state-of-the-art on PASCAL, KITTI, and newly collected Inner-city dataset
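
The two strategies can be sketched in a few lines: pick the feature layer from the box height, and drop a proposal as soon as any layer's classifier rejects it. The height limits (64 and 128 pixels), the layer names, and both function names are assumptions chosen for illustration; they are not taken from the summary above.

```python
def select_pooling_layer(box_height,
                         limits=((64, "conv3"), (128, "conv4")),
                         default="conv5"):
    """Scale-dependent pooling: small boxes pool from earlier, finer layers."""
    for max_height, layer in limits:  # limits are assumed values
        if box_height < max_height:
            return layer
    return default


def survives_cascade(layer_scores, reject_below):
    """Cascaded rejection: a proposal survives only if every layer keeps it.

    all() stops at the first failing layer, mirroring early rejection.
    """
    return all(score >= threshold
               for score, threshold in zip(layer_scores, reject_below))


layers = [select_pooling_layer(h) for h in (40, 64, 100, 200)]
kept = survives_cascade([0.9, 0.7, 0.8], reject_below=[0.5, 0.5, 0.5])
dropped = survives_cascade([0.9, 0.2, 0.8], reject_below=[0.5, 0.5, 0.5])
```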

Tracking State-of-the-Art on KITTI
	Bayesian Multi-object Tracking Using Motion Context from Multiple Objects [pdf] [slide] Ju Hong Yoon and Ming-Hsuan Yang and Jongwoo Lim and Kuk-Jin Yoon	WACV 2015 Yoon2015WACV

Online multi-object tracking with a single moving camera
Conventional 2D motion models no longer hold because of global camera motion
Consider motion context from multiple objects which describes the relative movement between objects
Construct a Relative Motion Network to factor out the effects of unexpected camera motion
It consists of multiple relative motion models that describe spatial relations between objects
Can be incorporated into various multi-object tracking frameworks and is demonstrated with a tracking framework based on a Bayesian filter
Evaluation on the ETHZ dataset

Semantic Segmentation Semantic Segmentation
	Multi-Scale Context Aggregation by Dilated Convolutions [pdf] [slide] Fisher Yu and Vladlen Koltun	ICLR 2016 Yu2016ICLR

Convolutional network module that is specifically designed for dense prediction (semantic segmentation)
Dilated convolutions to systematically aggregate multi-scale contextual information without losing resolution
"The dilated convolution operator can apply the same filter at different ranges using different dilation factors."
Front end module: VGG16 with deconvolutions (FCN) by removing the last two pooling and striding layers
The front end alone is already strong: it outperforms FCN-8s, DeepLab, and even DeepLab+CRF
Identity initialization for the context module
Trained on Microsoft COCO and VOC-2012 and tested on VOC-2012
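
A minimal 1D sketch of the dilated convolution idea: the same filter taps are applied with gaps of `dilation` samples, so stacking layers with growing dilation widens the receptive field without any pooling. The sketch uses the cross-correlation form common in deep learning code and valid padding; the function names are assumptions.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Apply kernel to signal with taps spaced `dilation` samples apart.

    Cross-correlation form with valid padding: the output is shorter than
    the input by (len(kernel) - 1) * dilation samples.
    """
    span = (len(kernel) - 1) * dilation
    return [sum(kernel[k] * signal[i + k * dilation] for k in range(len(kernel)))
            for i in range(len(signal) - span)]


def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of stacked dilated layers without striding."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


signal = [1, 2, 3, 4, 5, 6, 7]
dense = dilated_conv1d(signal, [1, 0, -1], dilation=1)
wide = dilated_conv1d(signal, [1, 0, -1], dilation=2)
field = receptive_field(3, [1, 2, 4])
```

With three 3-tap layers and dilations 1, 2, 4 the receptive field is 15 samples, while three undilated layers would cover only 7.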

Motion & Pose Estimation Localization
	Semantic alignment of LiDAR data at city scale [pdf] [slide] Fisher Yu and Jianxiong Xiao and Thomas A. Funkhouser	CVPR 2015 Yu2015CVPR

Alignment of LiDAR data collected with Google Street View cars in urban environments
Problems with current approaches:
- GPS does not work well in city environments with tall buildings
- Local tracking techniques (integration of inertial sensors, SfM, etc.) drift over long ranges, causing warped and misaligned data by many meters
Approach: semantic features with object detectors (for facades, poles, cars, etc.) that
- can be matched robustly at different scales
- are selected for different iterations of an ICP algorithm
Better than baselines on data from New York, San Francisco, Paris, and Rome

Scene Understanding
	Understanding High-Level Semantics by Modeling Traffic Patterns [pdf] [slide] Hongyi Zhang and Andreas Geiger and Raquel Urtasun	ICCV 2013 Zhang2013ICCV

Understanding the semantics of outdoor scenes in the context of autonomous driving
Generative model of 3D urban scenes enables reasoning about high-level semantics in the form of traffic patterns
Learn the traffic patterns from real scenarios
Novel object likelihood which models lanes much more accurately and improves the estimation of parameters such as the street orientations
Small number of patterns is sufficient to model the vast majority of traffic scenes
High-level reasoning significantly improves the overall scene estimation as well as the vehicle-to-lane association

Motion & Pose Estimation Ego-Motion Estimation
	Visual-lidar odometry and mapping: low-drift, robust, and fast [pdf] [slide] Ji Zhang and Sanjiv Singh	ICRA 2015 Zhang2015ICRA

Combining visual and lidar odometry in a fundamental, first-principles method
Visual odometry to estimate the ego-motion and to register point clouds from a scanning lidar at a high frequency but low fidelity
Scan matching based lidar odometry to refine the motion estimation and point cloud registration simultaneously
Ranking first on the KITTI odometry benchmark
Further experiments with a wide-angle camera and a fisheye camera
Robust to aggressive motion and temporary lack of visual features

Motion & Pose Estimation Ego-Motion Estimation
	LOAM: Lidar Odometry and Mapping in Real-time [pdf] [slide] Ji Zhang and Sanjiv Singh	RSS 2014 Zhang2014RSS

A real-time odometry and mapping method from a 2-axis lidar moving in 6-DOF
Problems:
- Range measurements received at different times
- Mis-registration of the point cloud due to the errors in motion estimation
Current approaches: 3D maps by offline batch methods, using loop closure for drift
Both low-drift and low-computational complexity without the need for high accuracy ranging or inertial measurements
Division of the complex problem of simultaneous localization and mapping:
- Odometry at a high frequency but low fidelity to estimate velocity of the lidar
- Fine matching and registration of the point cloud at a frequency of an order of magnitude lower
Tested both indoors and outdoors, state-of-the-art accuracy in real time on the KITTI odometry benchmark

Tracking
	Global Data Association for Multi-Object Tracking Using Network Flows [pdf] [slide] L. Zhang and Y. Li and R. Nevatia	CVPR 2008 Zhang2008CVPR

Existing methods severely limit the search window and perform pruning of hypotheses

Contributions:
- Presents a novel data association framework for multiple object tracking that optimizes the association globally using all the observations from the entire sequence
- False alarms, initialization and termination of trajectories, and inference of occlusions are modeled intrinsically in the method
- An optimal solution is provided based on the min-cost network flow algorithms

Evaluates on the CAVIAR videos and the ETH Mobile Scene (ETHMS) datasets
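
The network-flow view of data association can be sketched as follows: each detection becomes an edge with a cost (negative when the detection is likely real), links between detections carry transition costs, and each unit of flow from source to sink is one track. This sketch swaps in a simple successive-shortest-path solver with Bellman-Ford and stops when another track would no longer lower the total cost; the solver choice, the function name `min_cost_tracks`, and all cost values are assumptions, and links are assumed to point forward in time so the graph is acyclic.

```python
def min_cost_tracks(det_costs, links, entry_cost=1.0, exit_cost=1.0):
    """Return tracks (lists of detection indices) minimizing total flow cost."""
    n = len(det_costs)
    source, sink = 2 * n, 2 * n + 1
    graph = [[] for _ in range(2 * n + 2)]  # node -> indices into edges
    edges = []  # [target, residual capacity, cost]; edge k pairs with k ^ 1

    def add_edge(a, b, cost):
        graph[a].append(len(edges))
        edges.append([b, 1, cost])
        graph[b].append(len(edges))
        edges.append([a, 0, -cost])

    for i, cost in enumerate(det_costs):
        add_edge(source, 2 * i, entry_cost)   # a track may start at detection i
        add_edge(2 * i, 2 * i + 1, cost)      # cost of using detection i
        add_edge(2 * i + 1, sink, exit_cost)  # a track may end at detection i
    for (i, j), cost in links.items():
        add_edge(2 * i + 1, 2 * j, cost)      # detection j follows detection i

    while True:
        # Bellman-Ford shortest path in the residual graph (costs may be negative)
        dist = [float("inf")] * len(graph)
        via = [-1] * len(graph)
        dist[source] = 0.0
        for _ in range(len(graph)):
            changed = False
            for a in range(len(graph)):
                if dist[a] == float("inf"):
                    continue
                for k in graph[a]:
                    b, cap, cost = edges[k]
                    if cap > 0 and dist[a] + cost < dist[b] - 1e-12:
                        dist[b], via[b], changed = dist[a] + cost, k, True
            if not changed:
                break
        if dist[sink] >= 0:
            break  # one more track would not reduce the total cost
        node = sink
        while node != source:  # push one unit of flow along the path
            k = via[node]
            edges[k][1] -= 1
            edges[k ^ 1][1] += 1
            node = edges[k ^ 1][0]

    tracks = []
    for i in range(n):
        if edges[6 * i][1] != 0:  # source -> detection i carries no flow
            continue
        track, cur = [i], i
        while True:
            nxt = [edges[k][0] // 2 for k in graph[2 * cur + 1]
                   if k % 2 == 0 and edges[k][1] == 0 and edges[k][0] != sink]
            if not nxt:
                break
            cur = nxt[0]
            track.append(cur)
        tracks.append(track)
    return tracks


# Hypothetical costs: three confident detections in a row plus one false alarm.
one_track = min_cost_tracks([-2.0, -2.0, -2.0, 3.0], {(0, 1): 0.5, (1, 2): 0.5})
two_tracks = min_cost_tracks([-2.0, -2.0, -2.0, -2.0], {(0, 1): 0.5, (2, 3): 0.5})
```

The false alarm is left out without any explicit pruning rule: routing flow through it would raise the total cost.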

Semantic Segmentation Semantic Segmentation
	Pyramid Scene Parsing Network [pdf] [slide] Hengshuang Zhao and Jianping Shi and Xiaojuan Qi and Xiaogang Wang and Jiaya Jia	ARXIV 2016 Zhao2016ARXIV

Assign each pixel in the image a category label
Pyramid Scene Parsing Network provides a superior framework for pixel-level prediction tasks
Different-region-based context aggregation through pyramid pooling module
Global prior representation is effective to produce good quality results
Evaluation on ImageNet scene parsing challenge 2016, PASCAL VOC 2012 and Cityscapes benchmark
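
The pooling step of a pyramid pooling module can be sketched on a single-channel feature map: average-pool the map into grids of several sizes, from one global bin down to finer sub-regions. The bin sizes (1, 2, 3, 6) and the function names are assumptions here; the convolution, upsampling, and concatenation that would follow in a network are omitted.

```python
def adaptive_avg_pool(feature, bins):
    """Average-pool a 2D feature map (nested lists) into a bins x bins grid."""
    height, width = len(feature), len(feature[0])
    pooled = []
    for by in range(bins):
        y0 = (by * height) // bins
        y1 = -(-(by + 1) * height // bins)  # ceiling division
        row = []
        for bx in range(bins):
            x0 = (bx * width) // bins
            x1 = -(-(bx + 1) * width // bins)
            cells = [feature[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


def pyramid_pool(feature, bin_sizes=(1, 2, 3, 6)):
    """One pooled map per pyramid level; bins=1 is the global prior."""
    return [adaptive_avg_pool(feature, bins) for bins in bin_sizes]


feature = [[1, 1, 2, 2],
           [1, 1, 2, 2],
           [3, 3, 4, 4],
           [3, 3, 4, 4]]
levels = pyramid_pool(feature)
```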

Reconstruction Multi-view 3D Reconstruction
	Exploiting Object Similarity in 3D Reconstruction [pdf] [slide] Chen Zhou and Fatma Güney and Yizhou Wang and Andreas Geiger	ICCV 2015 Zhou2015ICCV

Challenges: low frame rates, occlusions, large distortions, and difficult lighting conditions
Learning volumetric shape models for objects of similar type, such as vehicles and buildings, to complete missing surfaces and improve the reconstruction
Initial reconstruction by SfM and volumetric fusion using TSDF
3D object detection by exemplar SVMs on TSDF representation
BCD for joint inference of different blocks:
- Optimization of object poses
- Assigning proposals to shape models
- Learning shape model parameters
Improves over the initial reconstruction and PMVS2, especially in completeness
A novel multi-view reconstruction dataset from fisheye cameras
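
Volumetric fusion with a TSDF is commonly implemented as a per-voxel weighted running average of truncated signed distances. The sketch below shows that update for a single voxel; it is the standard textbook form and assumed here, not confirmed as the exact variant used in the paper, and the function name and truncation value are hypothetical.

```python
def fuse_tsdf(tsdf, weight, observed, obs_weight=1.0, trunc=1.0):
    """Fuse one new signed-distance observation into a voxel.

    tsdf, weight : current voxel value and accumulated weight
    observed     : signed distance to the surface seen in the new depth map
    Returns the updated (tsdf, weight) pair.
    """
    clipped = max(-trunc, min(trunc, observed))  # truncate far-away distances
    new_weight = weight + obs_weight
    new_tsdf = (tsdf * weight + clipped * obs_weight) / new_weight
    return new_tsdf, new_weight


# A voxel observed twice: once 0.5 in front of the surface, once 0.1 behind it.
value, w = fuse_tsdf(0.0, 0.0, 0.5)
value, w = fuse_tsdf(value, w, -0.1)
far_value, far_w = fuse_tsdf(0.0, 0.0, 5.0)
```

The surface is then taken as the zero crossing of the fused field, which averages out noise across views.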

Object Detection 3D Object Detection from 2D Images
	Detailed 3D Representations for Object Recognition and Modeling [pdf] [slide] Zia, M.Z. and Stark, M. and Schiele, B. and Schindler, K.	PAMI 2013 Zia2013PAMI

Combines detailed models of 3D geometry with modern discriminative appearance models into a richer and more fine-grained object representation

Method overview:
- Starts from a database of 3D computer aided design (CAD) models of the desired object class as training data
- Applies principal components analysis to obtain a coarse 3-dimensional wireframe model which captures the geometric intra-class variability
- Trains detectors for the vertices of the wireframe, which they call "parts"
- At test time, generates evidence for the parts by densely applying the part detectors to the image
- Explores the space of possible object geometries and poses by guided random sampling from the shape model, in order to identify the ones that best agree with the image evidence

Evaluates on 3D Object Classes and EPFL Multi-view cars datasets
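
How a PCA shape model generates wireframe hypotheses can be sketched as the mean shape plus a weighted sum of principal directions, each scaled by its standard deviation. The flattened coordinate layout, the function name `synthesize_shape`, and the toy numbers are assumptions for illustration; guided sampling would draw the coefficients and score each shape against the part detections.

```python
def synthesize_shape(mean, components, stddevs, coeffs):
    """Wireframe = mean + sum_k coeffs[k] * stddevs[k] * components[k].

    mean       : flattened vertex coordinates of the mean wireframe
    components : principal directions, same length as mean
    stddevs    : standard deviation along each direction
    coeffs     : shape parameters in units of standard deviations
    """
    shape = list(mean)
    for direction, sigma, s in zip(components, stddevs, coeffs):
        for i, value in enumerate(direction):
            shape[i] += s * sigma * value
    return shape


# Hypothetical two-component model over four coordinates.
mean = [0.0, 0.0, 1.0, 1.0]
components = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
stddevs = [2.0, 0.5]
shape = synthesize_shape(mean, components, stddevs, coeffs=[1.0, -2.0])
```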

Object Detection 3D Object Detection from 2D Images
	Towards Scene Understanding with Detailed 3D Object Representations [pdf] [slide] Zia, M.Zeeshan and Stark, Michael and Schindler, Konrad	IJCV 2015 Zia2015IJCV

Simple object representations such as bounding boxes have been used so far for semantic image and scene understanding
Propose to base scene understanding on a high-resolution object representation
Object class (cars) is modeled as a deformable 3D wireframe
Viewpoint-invariant method for 3D reconstruction of severely occluded objects
Joint estimation of the shapes and poses of multiple objects from a single view
Reconstruct scenes in a single inference framework including geometric constraints between the objects
Leverage rich detail of the 3D representation for occlusion reasoning at the individual vertex level
Ground plane is estimated by consensus among different objects
Systematic evaluation on KITTI dataset

History of Autonomous Driving Autonomous Driving Projects
	Making Bertha Drive - An Autonomous Journey on a Historic Route [pdf] [slide] Julius Ziegler and Philipp Bender and Markus Schreiber and Henning Lategahn	ITSM 2014 Ziegler2014ITSM

Gives an overview of the autonomous vehicle which completed the route from Mannheim to Pforzheim, Germany, in a fully autonomous manner
The autonomous vehicle was equipped with sensor hardware that is closer to production in terms of cost and technical maturity than that of many autonomous robots presented earlier
Presents details on vision and radar-based perception, digital road maps and video-based self-localization, as well as motion planning in complex urban scenarios

The key features of the system are:
- Radar and stereo vision sensing for object detection and free-space analysis
- Monocular vision for traffic light detection and object classification
- Digital road maps complemented with vision-based map-relative localization
- Versatile trajectory planning and reliable vehicle control


Computer Vision for Autonomous Vehicles: Problems, Datasets and State-of-the-Art