There is a large selection of machine learning frameworks, each with its own advantages and disadvantages. As always, the right choice depends on the particular project.
If you want to do machine learning today, you cannot ignore the large open source frameworks. After a quick search on the Internet, websites overwhelm you with articles such as "Top 10 Machine Learning Frameworks" or "The Best Open Source Tools for ML": a jungle that is only sometimes worth fighting through.
In addition to the programming languages used and the setup, information about the license, support and the community is important. A feature-rich framework whose license prevents its use is just as useless as one that has not received any support for weeks or whose community is inactive. The following comparison therefore offers an overview of common machine learning frameworks, all of which are available under open source licenses. We present the following twelve ML platforms.
- **Apache Spark MLlib:** Formerly known as part of the Hadoop universe, Apache Spark is now a well-known machine learning framework. Its extensive range of algorithms is constantly being revised and expanded.
- **Apache Singa:** Singa, recently accepted into the Apache Incubator, is an open source framework intended to "train" deep learning models on large data volumes. Singa provides a simple programming model for deep learning networks and supports various network types.
- **Caffe:** Caffe comprises a whole set of freely available reference models for common classification routines; the growing Caffe community contributes further models. Caffe supports Nvidia's CUDA programming technology, with which program parts can optionally be processed by the graphics processor (GPU).
- **Microsoft Azure ML Studio:** Because the cloud is the ideal environment for ML applications, Microsoft has equipped its Azure cloud with its own "pay as you go" ML service: with Azure ML Studio, users can develop and train AI models and then convert them into APIs in order to make them available to others.
- **Amazon Machine Learning:** Amazon Machine Learning works with data stored in an Amazon cloud service such as S3, Redshift or RDS and can build new AI models using binary classification and multi-class categorization of given data.
- **Microsoft DMTK:** Microsoft's DMTK (Distributed Machine Learning Toolkit) is designed to scale ML applications across multiple machines. It is intended more as a framework and less as an out-of-the-box solution; the number of supported algorithms is correspondingly small.
- **Google TensorFlow:** TensorFlow is based on so-called data flow graphs, in which bundles of data ("tensors") are processed by a series of algorithms described by a graph. The movement patterns of the data within the system are called "flows". The graphs can be assembled using C++ and Python and processed on the CPU or GPU.
- **Microsoft CNTK:** The Microsoft Computational Network Toolkit works similarly to Google TensorFlow: neural networks can be defined as directed graphs. According to Microsoft's own description, CNTK can be compared with projects like Caffe, Theano and Torch, but it is faster and, in contrast to those mentioned, can even use CPU and GPU in parallel.
- **Samsung Veles:** The Samsung framework is intended to analyze and automatically normalize data sets before they go into production, which is immediately possible via its own REST API, provided the hardware used has sufficient power. Thanks to its use of Python, Veles also includes the analysis and visualization tool Jupyter (formerly IPython) for displaying individual application clusters.
- **Brainstorm:** Brainstorm relies on Python to provide two data management APIs (called "handlers"): one for CPU processing through the NumPy library and one for GPU processing through CUDA. A user-friendly GUI is in the works.
- **mlpack 2:** The new version of the machine learning library mlpack, written in C++ and first published in 2011, brings a lot of innovations, including new algorithms and revisions of old ones.
- **Marvin:** Marvin's source code is very clear; the pre-trained models included (see picture) allow extensive further development.
Apache Spark MLlib
Apache Spark is probably best known as part of the Hadoop family, yet the in-memory data processing framework was actually created outside the Hadoop universe. Spark is now making a name for itself as a machine learning tool thanks to its ever-growing library of algorithms, which can process in-memory data at high speed.
The Spark algorithms are constantly being revised and expanded. Last year's release of version 1.5 brought a lot of new code as well as MLlib support for Python. With the current release, Spark 1.6, ML processes can for the first time be interrupted and resumed, thanks to persistent pipelines.
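The idea behind such resumable pipelines can be illustrated with a small, purely conceptual Python sketch. None of the names below are Spark APIs; this is only a stand-in showing how persisting fitted state lets a run be interrupted and continued:

```python
import json, os, tempfile

class Pipeline:
    """Toy stand-in for a persistable ML pipeline: each stage is a
    named fit function; fitted state can be saved to disk and loaded
    again, so an interrupted run can resume where it left off."""
    def __init__(self, stages):
        self.stages = stages          # list of (name, fit_fn)
        self.state = {}               # fitted parameters per stage

    def fit(self, data):
        for name, fit_fn in self.stages:
            if name in self.state:    # already fitted -> skip on resume
                continue
            self.state[name] = fit_fn(data)
        return self

    def save(self, path):
        with open(path, "w") as f:
            json.dump(self.state, f)

    def load(self, path):
        with open(path) as f:
            self.state = json.load(f)
        return self

# fit a "scaler" stage, persist it, then resume in a fresh pipeline
data = [1.0, 2.0, 3.0]
stages = [("scaler", lambda d: {"mean": sum(d) / len(d)})]
path = os.path.join(tempfile.mkdtemp(), "pipeline.json")
Pipeline(stages).fit(data).save(path)
resumed = Pipeline(stages).load(path).fit(data)   # no refit needed
print(resumed.state["scaler"]["mean"])            # -> 2.0
```

The point is only the save/load contract: a real Spark pipeline persists fitted stages in the same spirit, just with distributed data behind it.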
Apache Singa

Deep learning frameworks support advanced ML functions such as natural language processing and image recognition. Singa, recently accepted into the Apache Incubator, is an open source framework intended to "train" deep learning models on large data volumes. Singa provides a simple programming model for deep learning networks and supports common network types such as convolutional neural networks, restricted Boltzmann machines and recurrent neural networks. Models can be trained synchronously (one step after the other) or asynchronously (in parallel), depending on the use case. Apache Singa also simplifies cluster setup with Apache Zookeeper.
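The difference between the two training modes can be sketched in a few lines of plain Python. This is illustrative only and not the Singa API; the function names are invented:

```python
# Conceptual contrast between synchronous and asynchronous training.
# (Hypothetical helper names; not part of any Singa interface.)

def synchronous_step(weights, worker_grads, lr=0.1):
    """All workers finish first, gradients are averaged,
    then a single update is applied."""
    avg = sum(worker_grads) / len(worker_grads)
    return weights - lr * avg

def asynchronous_steps(weights, worker_grads, lr=0.1):
    """Each worker applies its gradient as soon as it arrives,
    possibly against already-updated weights."""
    for g in worker_grads:
        weights = weights - lr * g
    return weights

grads = [0.2, 0.4, 0.6]
print(synchronous_step(1.0, grads))    # one averaged update  -> 0.96
print(asynchronous_steps(1.0, grads))  # three eager updates  -> ~0.88
```

Synchronous training is easier to reason about; asynchronous training keeps workers busy at the cost of slightly stale gradients, which is why a framework may offer both.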
Caffe

The deep learning framework Caffe was started in 2013 by California students as a machine vision project; since then, new application areas have been added. Since Caffe attaches great importance to speed, it is written entirely in C++ and supports Nvidia's CUDA programming technology, with which program parts can optionally be processed by the graphics processor (GPU). This allows the user to freely decide whether the main processor (CPU) or the GPU should be used. Caffe comprises a whole set of freely available reference models for common classification routines; the growing Caffe community contributes further models.
Microsoft Azure ML Studio
To practice machine learning, computing power is required; and where can you find more computing power than in the cloud? Because the cloud is the ideal environment for ML applications, Microsoft has equipped its Azure cloud with its own "pay as you go" ML service: with Azure ML Studio, users can develop and train AI models and then convert them into APIs in order to make them available to others.
Microsoft's HowOldRobot project, for example, was developed entirely in the open this way. 10 GB of storage space is available per account; for larger projects, linking to existing Azure storage is possible without any problems. Microsoft provides a large selection of algorithms, both its own and those from third parties. You do not even need an account to test Azure ML Studio: if you want, you can log in anonymously and try it out for up to eight hours.
Amazon Machine Learning
Amazon's approach to cloud services follows a certain pattern: provide the basics, attract a core audience that is interested, let them do the building work, and then find out what users really need and deliver it.
Amazon's foray into the world of machine learning follows a similar path. Amazon Machine Learning works with data stored in an Amazon cloud service such as S3, Redshift or RDS and can build new AI models with the help of binary classification and multi-class categorization of given data. Of course, the entire service is very Amazon-centric: only data stored on Amazon can be processed, no import/export function is available, and the developed models must not be larger than 100 GB.
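The two model types the service distinguishes are easy to illustrate in plain Python. This is only the underlying concept, not the AWS API:

```python
# Binary classification: a model score is mapped onto one of exactly
# two labels via a threshold. Multi-class categorization: the label
# with the highest score among several classes wins.
# (Illustrative helper functions, not Amazon Machine Learning calls.)

def binary_classify(score, threshold=0.5):
    """Two possible outcomes: positive or negative."""
    return "positive" if score >= threshold else "negative"

def multiclass_classify(scores):
    """Several possible outcomes: pick the highest-scoring label."""
    return max(scores, key=scores.get)

print(binary_classify(0.73))                                       # -> positive
print(multiclass_classify({"cat": 0.2, "dog": 0.7, "bird": 0.1}))  # -> dog
```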
Microsoft Distributed Machine Learning Toolkit
The more computers are busy solving an ML problem, the better. However, it is not always easy to interconnect several machines and develop ML applications that run smoothly on such a network. Microsoft's DMTK (Distributed Machine Learning Toolkit) addresses this problem by distributing different types of ML routines across server clusters.
DMTK is more of a framework than an out-of-the-box solution, so the number of algorithms included is correspondingly small. The design, however, allows later expansion in all directions: each cluster node has a local buffer that noticeably reduces the traffic to the central server, which manages the parameters of running routines.
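The effect of such a local buffer can be sketched with a toy parameter server in Python. The class names are hypothetical and only demonstrate the design idea, not DMTK's actual interfaces:

```python
# Sketch of the parameter-server-with-local-cache idea: a node only
# contacts the central server on a cache miss, so repeated reads of
# the same parameters cause no further network round trips.

class ParameterServer:
    """Central store that manages parameters for running routines."""
    def __init__(self, params):
        self.params = params
        self.requests = 0            # count round trips to the server

    def get(self, key):
        self.requests += 1
        return self.params[key]

class CachingNode:
    """Cluster node with a local buffer in front of the server."""
    def __init__(self, server):
        self.server = server
        self.cache = {}

    def get(self, key):
        if key not in self.cache:    # miss -> one server round trip
            self.cache[key] = self.server.get(key)
        return self.cache[key]

server = ParameterServer({"w1": 0.5, "w2": -0.3})
node = CachingNode(server)
for _ in range(100):                 # 100 reads of the same parameters
    node.get("w1"); node.get("w2")
print(server.requests)               # -> 2: the cache absorbs the rest
```

A real system must additionally invalidate or refresh cached parameters as training updates them; the sketch omits that to keep the traffic-saving point visible.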
Google TensorFlow

Just like Microsoft's DMTK, Google TensorFlow is designed to scale across multiple machines. Initially it was planned as an internal Google tool, but it was quickly made publicly available as an open source project. TensorFlow is based on so-called data flow graphs, in which bundles of data ("tensors") are processed by a series of algorithms described by a graph. The movement patterns of the data within the system are called "flows". The graphs can be assembled using C++ and Python and processed on the CPU or GPU. Google's plan is to have TensorFlow developed further by third parties.
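The data-flow principle itself fits into a few lines of Python: operations are nodes of a graph, values flow along the edges, and a node runs once its inputs are available. This is a minimal conceptual sketch, not TensorFlow code:

```python
# Tiny data-flow evaluator: a graph is a list of named operation
# nodes; run() plays the role of a "session" that pushes values
# through the graph. (Invented names, for illustration only.)

class Node:
    def __init__(self, name, op, inputs=()):
        self.name, self.op, self.inputs = name, op, inputs

def run(graph, feeds):
    """Evaluate nodes in dependency order, starting from fed values."""
    values = dict(feeds)
    for node in graph:                       # assumed topologically sorted
        args = [values[i] for i in node.inputs]
        values[node.name] = node.op(*args)
    return values

# graph for y = (a + b) * c
graph = [
    Node("add", lambda a, b: a + b, inputs=("a", "b")),
    Node("mul", lambda s, c: s * c, inputs=("add", "c")),
]
result = run(graph, {"a": 2, "b": 3, "c": 4})
print(result["mul"])                         # -> 20
```

Because the graph is explicit data rather than executed statements, a framework is free to schedule its nodes on a CPU, a GPU, or across machines, which is precisely what makes the model attractive.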
Microsoft Computational Network Toolkit
In the course of the DMTK release, Microsoft brought out a second ML framework, the Computational Network Toolkit, CNTK for short. It works in a similar way to Google TensorFlow: neural networks can be defined as directed graphs. According to Microsoft's own description, CNTK can also be compared with projects such as Caffe, Theano and Torch, but it is faster and, unlike the ones mentioned, can even use processor and graphics processor in parallel.
According to Microsoft, the CNTK framework, combined with multi-GPU resources in the Azure cloud, ensured that the development of the Cortana voice assistant progressed much faster than originally expected. CNTK originally emerged from Microsoft's research on speech recognition systems and was first published in April 2015 under an open source license. It is now generally available on GitHub under a much more permissive MIT-like license.
Samsung Veles

Like TensorFlow and DMTK, Veles is a distributed platform for deep learning applications, written in C++ with a "touch" of Python for automation and coordination routines. It is intended to analyze and automatically normalize data sets before they go into production, which is immediately possible via its own REST API, provided the hardware used has sufficient power. Thanks to its use of Python, Veles also includes the analysis and visualization tool Jupyter (formerly IPython) for displaying individual application clusters. The ML platform operated by Samsung will soon be made available under an open source license in order to accelerate future developments, for example interfaces to Windows and Mac OS X.
Brainstorm

Developed last year by doctoral students Klaus Greff and Rupesh Srivastava at IDSIA (Dalle Molle Institute for Artificial Intelligence) in Lugano, Switzerland, the ["Brainstorm"](https://github.com/IDSIA/brainstorm) project aims to "make neural networks fast, flexible and fun". A number of recurrent network types such as LSTM are already supported. Brainstorm relies on Python to provide two data management APIs (called "handlers"): one for CPU processing through the NumPy library and one for GPU processing through CUDA. Most of the work is done by Python scripts, so do not expect a convenient graphical interface. The long-term plan provides for a dedicated, multi-platform GUI that should take into account lessons learned from previous open source projects.
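The handler idea, an interchangeable backend behind one small data API, can be sketched in plain Python. The class names are invented for illustration and only a CPU variant is shown; they are not Brainstorm's actual classes:

```python
# Sketch of the "handler" pattern: network code talks to a small
# data-management interface, and the same operations can be backed
# by NumPy on the CPU or by CUDA on the GPU. (Hypothetical names;
# the CPU handler below uses plain Python instead of NumPy so the
# sketch stays self-contained.)

class Handler:
    """Interface every backend must provide."""
    def allocate(self, n): raise NotImplementedError
    def dot(self, a, b):   raise NotImplementedError

class CpuHandler(Handler):
    """Stand-in for a NumPy-backed CPU handler."""
    def allocate(self, n):
        return [0.0] * n
    def dot(self, a, b):
        return sum(x * y for x, y in zip(a, b))

# Network code is written against Handler only, so swapping in a
# (hypothetical) CudaHandler would leave it unchanged.
h = CpuHandler()
buf = h.allocate(3)
print(len(buf), h.dot([1, 2, 3], [4, 5, 6]))  # -> 3 32
```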
mlpack 2

The new version of the machine learning library mlpack, written in C++ and first published in 2011, brings a lot of innovations, including new algorithms and revisions of old ones. Unfortunately, mlpack 2 still supports no programming language other than C++; developers who rely on R or Python cannot use the library as long as no one takes pity and publishes suitable mlpack bindings. At least MATLAB support is still available, but MATLAB does not play a major role in the ML environment.
Marvin

Marvin, a framework for neural networks, is still relatively new and comes from the Princeton Vision Group. It was "born to be hacked", according to its creators in the project documentation, and consists of just a few lines of code written in C++ against the CUDA GPU framework. But even though the code is minimalist, Marvin ships with some pre-trained models that can be put to meaningful use.