Welcome to the BID Data Project! Here you will find resources for the fastest Big Data tools on the Web. See our Benchmarks on GitHub. BIDMach running on a single GPU-equipped host holds the records for many common machine learning problems, on single nodes or clusters.
Try It! BIDMach is an interactive environment designed to make it extremely easy to build and use machine learning models. BIDMach runs on Linux, Windows 7 and 8, and Mac OS X, and we have a pre-loaded Amazon EC2 instance. See the instructions in the Download Section.
Develop with it. BIDMach includes core classes that take care of managing data sources, optimization and distributing data over CPUs or GPUs. It’s very easy to write your own models by generalizing from the models already included in the Toolkit.
Explore. Our Publications Section includes published reports on the project, and the topics of forthcoming papers.
Contribute. BIDMach includes many popular machine learning algorithms, but there is much more work to do. Work in progress includes Random Forests, extremely fast Gibbs samplers for Bayesian graphical models, distributed Deep Learning networks, and graph algorithms. Ask us for an unpublished report on these topics.
Lightning Overview
The BID Data Suite is a collection of hardware, software and design patterns that enable fast, large-scale data mining at very low cost.
The elements of the suite are:
- Hardware. The data engine that balances storage, CPU and GPU acceleration for typical data mining workloads.
- Software.
- Scaling Up.
  - Butterfly Mixing, a communication strategy that hides the latency of the frequent model updates needed by fast optimizers on clusters.
  - Sparse AllReduce, an efficient MapReduce-like primitive for scalable communication of power-law data.
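
To make the butterfly communication pattern concrete, here is a minimal sketch in Python of a generic butterfly (recursive-doubling) all-reduce, simulated on a single machine. It is not BIDMach's implementation and it does not interleave model updates the way Butterfly Mixing does. The function name `butterfly_allreduce` and the power-of-two restriction on the node count are assumptions made for illustration.

```python
def butterfly_allreduce(values):
    """Simulate a butterfly all-reduce (sum) over len(values) nodes.

    Illustrative only: each list entry stands for one node's local value,
    and the node count is assumed to be a power of two.
    """
    n = len(values)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("node count must be a power of two")

    current = list(values)
    stage = 1
    while stage < n:
        # In this stage, node i exchanges with node (i XOR stage)
        # and both keep the sum of their two values.
        current = [current[i] + current[i ^ stage] for i in range(n)]
        stage *= 2
    return current


if __name__ == "__main__":
    # Four nodes need two exchange stages to agree on the global sum.
    print(butterfly_allreduce([1, 2, 3, 4]))
```

With n nodes, every node holds the global sum after log2(n) exchange stages, and each node talks to only one partner per stage.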
In the benchmark section, we present several benchmark problems to show how the above elements combine to yield improvements of multiple orders of magnitude on each problem.