BID Data Project | Big Data Analytics with Small Footprint

Welcome to the BID Data Project! Here you will find resources for the fastest Big Data tools on the Web. See our Benchmarks on GitHub. BIDMach running on a single GPU-equipped host holds the records for many common machine learning problems, measured against systems running on single nodes or on clusters.

Try It! BIDMach is an interactive environment designed to make it extremely easy to build and use machine learning models. BIDMach runs on Linux, Windows 7 and 8, and Mac OS X, and we also provide a pre-loaded Amazon EC2 instance. See the instructions in the Download Section.

Develop with it. BIDMach includes core classes that take care of managing data sources, running optimization, and distributing data over CPUs or GPUs. It’s very easy to write your own models by generalizing from the models already included in the Toolkit.
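To illustrate that separation of concerns, here is a minimal sketch in plain Python. It is not BIDMach's API: every class and function name below (`ListDataSource`, `LinearModel`, `SGDUpdater`, `train`) is hypothetical, and the sketch leaves out CPU/GPU distribution entirely. It only shows how a data source, a model, and an optimizer can stay independent, so that writing a new model means swapping one piece.

```python
class ListDataSource:
    """Hypothetical data source: serves (x, y) pairs in fixed-size minibatches."""

    def __init__(self, pairs, batch_size):
        self.pairs = pairs
        self.batch_size = batch_size

    def batches(self):
        for start in range(0, len(self.pairs), self.batch_size):
            yield self.pairs[start:start + self.batch_size]


class LinearModel:
    """Hypothetical model: y = w * x + b with mean squared-error loss."""

    def __init__(self):
        self.params = [0.0, 0.0]  # [w, b]

    def gradient(self, batch):
        w, b = self.params
        grad_w = grad_b = 0.0
        for x, y in batch:
            err = (w * x + b) - y
            grad_w += 2 * err * x
            grad_b += 2 * err
        n = len(batch)
        return [grad_w / n, grad_b / n]


class SGDUpdater:
    """Hypothetical optimizer: gradient descent with a fixed step size."""

    def __init__(self, lr):
        self.lr = lr

    def step(self, params, grad):
        for i, g in enumerate(grad):
            params[i] -= self.lr * g


def train(source, model, updater, epochs):
    """Generic loop: knows nothing about the model's internals."""
    for _ in range(epochs):
        for batch in source.batches():
            updater.step(model.params, model.gradient(batch))
    return model.params


# Noise-free samples of y = 3x + 1 on [-1, 1].
data = [(i / 10.0, 3.0 * (i / 10.0) + 1.0) for i in range(-10, 11)]
w, b = train(ListDataSource(data, batch_size=7), LinearModel(),
             SGDUpdater(lr=0.1), epochs=500)
```

A different model would only need to expose `params` and `gradient(batch)`; the data source, updater, and training loop would be reused unchanged.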

Explore. Our Publications Section includes published reports on the project, and the topics of forthcoming papers.

Contribute. BIDMach includes many popular machine learning algorithms, but there is much more work to do. Work in progress includes Random Forests, extremely fast Gibbs samplers for Bayesian graphical models, distributed Deep Learning networks, and graph algorithms. Ask us for an unpublished report on these topics.

Lightning Overview

The BID Data Suite is a collection of hardware, software and design patterns that enable fast, large-scale data mining at very low cost.

Architecture of the Toolkit

The elements of the suite are:

Hardware. The data engine that balances storage, CPU and GPU acceleration for typical data mining workloads.
Software.
- BIDMat, an interactive matrix library that integrates CPU and GPU acceleration and novel computational kernels.
- BIDMach, a machine learning system that includes very efficient model optimizers and mixing strategies.
Scaling Up.
- Butterfly Mixing, a communication strategy that hides the latency of frequent model updates needed by fast optimizers for clusters.
- Sparse AllReduce, an efficient MapReduce-like primitive for scalable communication of power-law data.
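
To give a feel for the two communication ideas named above, the following is a single-process Python simulation of a generic butterfly all-reduce over sparse vectors. It is a sketch under stated assumptions, not the project's Butterfly Mixing or Sparse AllReduce implementation: the function names are hypothetical, the node count is assumed to be a power of two, sparse vectors are modeled as plain `{index: value}` dicts, and no real network communication takes place.

```python
def sparse_add(a, b):
    """Sum two sparse vectors stored as {index: value} dicts."""
    out = dict(a)
    for idx, val in b.items():
        out[idx] = out.get(idx, 0.0) + val
    return out


def butterfly_allreduce(vectors):
    """Simulate a butterfly all-reduce across len(vectors) nodes.

    In each round, node i combines its partial sum with that of node
    i XOR stride, and the stride doubles. After log2(n) rounds every
    node holds the sum of all inputs.
    """
    n = len(vectors)
    if n == 0 or n & (n - 1):
        raise ValueError("node count must be a power of two")
    state = [dict(v) for v in vectors]
    stride = 1
    rounds = 0
    while stride < n:
        # Every node reads the previous round's state, as if the
        # pairwise exchanges happened simultaneously.
        state = [sparse_add(state[i], state[i ^ stride]) for i in range(n)]
        stride *= 2
        rounds += 1
    return state, rounds


# Four nodes, each holding a sparse update that touches few indices.
node_updates = [
    {0: 1.0, 5: 2.0},
    {5: 1.0},
    {2: 4.0},
    {0: 3.0, 9: 0.5},
]
reduced, rounds = butterfly_allreduce(node_updates)
```

Each node talks to only one partner per round and sends only the indices it actually holds, which is the general reason this pattern suits sparse data where most entries are absent on any given node.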

The Benchmarks section presents several problems that show how the above elements combine to yield improvements of multiple orders of magnitude on each problem.
