What is this Framework?
The Datumbox Machine Learning Framework is an open-source framework written in Java which enables the rapid development of Machine Learning models and Statistical applications. It is the code that currently powers the Datumbox API. The main focus of the framework is to include a large number of machine learning algorithms and statistical methods and to be able to handle small to medium-sized datasets. Even though the framework aims to assist the development of models from various fields, it also provides tools that are particularly useful in Natural Language Processing and Text Analysis applications.
What types of models/algorithms are supported?
The framework is divided into several layers such as Machine Learning, Statistics, Mathematics, Algorithms and Utilities. Each of them provides a series of classes that are used for training machine learning models. The two most important layers are the Statistics layer and the Machine Learning layer.
The Statistics layer provides classes for calculating descriptive statistics, performing various types of sampling, estimating CDFs and PDFs from commonly used probability distributions and performing over 35 parametric and non-parametric tests. Such classes are usually necessary when performing exploratory data analysis, sampling and feature selection.
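To give a feel for the kind of calculation this layer covers, here is a minimal plain-Java sketch of three descriptive statistics. It does not use the framework's own classes: the class name DescriptiveStats and its methods are hypothetical and exist only for this illustration.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; NOT a Datumbox class.
final class DescriptiveStats {

    // Arithmetic mean of a non-empty sample.
    static double mean(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("empty sample");
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    // Unbiased sample variance (divides by n - 1).
    static double sampleVariance(double[] values) {
        if (values.length < 2) {
            throw new IllegalArgumentException("need at least two values");
        }
        double m = mean(values);
        double sumOfSquares = 0.0;
        for (double v : values) {
            double d = v - m;
            sumOfSquares += d * d;
        }
        return sumOfSquares / (values.length - 1);
    }

    // Median: middle value of the sorted sample, or the mean of the two middle values.
    static double median(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("empty sample");
        }
        double[] sorted = values.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    public static void main(String[] args) {
        double[] sample = {2, 4, 4, 4, 5, 5, 7, 9};
        System.out.println("mean     = " + mean(sample));
        System.out.println("variance = " + sampleVariance(sample));
        System.out.println("median   = " + median(sample));
    }
}
```

For the framework's real API, the JUnit tests shipped with the code remain the reference.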
The Machine Learning layer provides classes that can be used in a large number of problems including Classification, Regression, Cluster Analysis, Topic Modeling, Dimensionality Reduction, Feature Selection, Ensemble Learning and Recommender Systems. Here are some of the supported algorithms: LDA, Max Entropy, Naive Bayes, SVM, Bootstrap Aggregating, Adaboost, Kmeans, Hierarchical Clustering, Dirichlet Process Mixture Models, Softmax Regression, Ordinal Regression, Linear Regression, Stepwise Regression, PCA and more.
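As an illustration of one of the listed techniques, the sketch below runs k-means (Lloyd's algorithm) on one-dimensional points in plain Java. It is a simplified toy version written for this post, not the framework's Kmeans implementation: the class KMeans1D, its method names and the fixed initial centroids are all assumptions made for the example.

```java
// Hypothetical toy implementation for illustration only; NOT the Datumbox Kmeans class.
final class KMeans1D {

    // Runs Lloyd's algorithm from the given initial centroids and returns the final centroids.
    static double[] fit(double[] points, double[] initialCentroids, int maxIterations) {
        double[] centroids = initialCentroids.clone();
        for (int iter = 0; iter < maxIterations; iter++) {
            double[] sums = new double[centroids.length];
            int[] counts = new int[centroids.length];

            // Assignment step: attach every point to its nearest centroid.
            for (double p : points) {
                int c = nearest(centroids, p);
                sums[c] += p;
                counts[c]++;
            }

            // Update step: move each centroid to the mean of its assigned points.
            boolean changed = false;
            for (int c = 0; c < centroids.length; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                double updated = sums[c] / counts[c];
                if (updated != centroids[c]) {
                    changed = true;
                }
                centroids[c] = updated;
            }
            if (!changed) {
                break; // converged: no centroid moved
            }
        }
        return centroids;
    }

    // Index of the centroid closest to the point (ties go to the lower index).
    static int nearest(double[] centroids, double point) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (Math.abs(point - centroids[c]) < Math.abs(point - centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] points = {1, 2, 3, 10, 11, 12};
        double[] centroids = fit(points, new double[]{0, 5}, 100);
        System.out.println(java.util.Arrays.toString(centroids));
    }
}
```

A production implementation would also need multi-dimensional points, a smarter initialisation and a tolerance-based stopping rule.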
Datumbox Framework VS Mahout VS Scikit-Learn
Both Mahout and Scikit-Learn are great projects, but they have completely different targets. Mahout supports only a very limited number of algorithms, namely those which can be parallelized and thus use Hadoop’s MapReduce framework to handle Big Data. Scikit-Learn, on the other hand, supports a large number of algorithms but it can’t handle huge amounts of data. Moreover it is developed in Python, which is a great language for prototyping and Scientific Computing but not my personal favourite for software development.
The Datumbox Framework sits in the middle of the two solutions. It tries to support a large number of algorithms and it is written in Java. This means that it can be incorporated more easily into production code, it can be tweaked more easily to reduce memory consumption and it can be used in real-time systems. Finally, even though the Datumbox Framework is currently capable of handling medium-sized datasets, it is within my plans to expand it to handle large datasets.
How stable is it?
The early versions of the framework (up to 0.3.x) were developed in August and September of 2013 and they were written in PHP (yep!). During May and June 2014 (versions 0.4.x), the framework was rewritten in Java and enhanced with additional features. Both branches were heavily tested in commercial applications including the Datumbox API. The current version is 0.5.0 and it seems mature enough to be released as the first public alpha version of the framework. Having said that, it is important to note that some parts of the framework are tested more thoroughly than others. Moreover, since this version is alpha, you should expect drastic changes in future releases.
Why did I write it and why am I open-sourcing it?
My involvement with Machine Learning and NLP dates back to 2009 when I co-founded WebSEOAnalytics.com. Since then I have been developing implementations of various machine learning algorithms for various projects and applications. Unfortunately most of the original implementations were very problem-specific and could hardly be used for any other problem. In August 2013 I decided to start Datumbox as a personal project and develop a framework that provides the tools for developing machine learning models, focusing on the area of NLP and Text Classification. My target was to build a framework that could be reused in the future for quickly developing machine learning models, incorporated in projects that require machine learning components or offered as a service (Machine Learning as a Service).
And here I am now, several lines of code later, open-sourcing the project. Why? The honest answer is that at this point, it is not within my plans to go through a “let’s build a new start-up” journey. At the same time I felt that keeping the code on my hard disk in case I need it in the future does not make sense. So the only logical thing to do was to open-source it. 🙂
If you read the previous two paragraphs, you have probably seen this coming. Since the framework was not developed with the intention of sharing it with others, the documentation is poor/non-existent. Most of the classes and public methods are not properly commented and there is no document describing the architecture of the code. Fortunately all the class names are self-explanatory and the framework provides JUnit tests for every public method and algorithm, which can be used as examples of how to use the code. I hope that with the help of the community we will build proper documentation, so I am counting on you!
Current Limitations and Future Development
Like every piece of software (and especially open-source projects in alpha version), the Datumbox Machine Learning Framework comes with its own unique and adorable limitations. Let’s dig into them:
- Documentation: As mentioned earlier, the documentation is poor.
- No Multithreading: Unfortunately the framework does not currently support Multithreading. Of course we should note that not all machine learning algorithms can be parallelized.
- Code Examples: Since the framework has just been published, you can’t find any code examples on the web other than those provided by the framework in the form of JUnit tests.
- Code Structure: Creating a solid architecture for any large project is always challenging, let alone when you have to deal with Machine Learning algorithms that differ significantly (supervised learning, unsupervised learning, dimensionality reduction algorithms etc).
- Model Persistence and Large Data Collections: Currently the models can be trained and stored either in files on disk or in MongoDB databases. To be able to handle large amounts of data, other solutions must be investigated. For example MapDB seems like a good candidate for storing data and parameters while training. Moreover it is important to remove any 3rd party libraries that currently handle the persistence of the models and develop a better, DRY and modular solution.
- New algorithms/tests/models: There are so many great techniques that are not currently supported (especially for time series analysis).
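To make the model persistence item above more concrete, here is a minimal sketch of storing learned parameters with nothing but the Java standard library. It is not how the framework persists its models: the class ParameterStore, its methods and the idea of keeping the parameters in a HashMap are assumptions made purely for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;

// Hypothetical helper for illustration only; NOT part of the Datumbox Framework.
final class ParameterStore {

    // Serializes the parameter map to any output stream (file, socket, byte buffer).
    static void write(OutputStream target, HashMap<String, Double> parameters) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(target)) {
            out.writeObject(parameters);
        }
    }

    // Reads back a parameter map previously produced by write().
    @SuppressWarnings("unchecked")
    static HashMap<String, Double> read(InputStream source) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(source)) {
            return (HashMap<String, Double>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        HashMap<String, Double> parameters = new HashMap<>();
        parameters.put("bias", 0.5);
        parameters.put("weight", -1.25);

        // Round trip through a temporary file on disk.
        Path file = Files.createTempFile("model", ".bin");
        try {
            write(Files.newOutputStream(file), parameters);
            System.out.println(read(Files.newInputStream(file)));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Writing against streams rather than a fixed file path keeps the storage backend swappable, which is the kind of modularity the item asks for. Loading everything into memory like this does not address the large-data problem.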
Unfortunately all of the above is a lot of work and there is so little time. That is why, if you are interested in the project, I ask you to step forward and give me a hand with any of the above. Moreover I would love to hear from people who have experience in open-sourcing medium to large projects and could provide tips on how to manage them. Additionally I would be grateful to any brave soul who would dare to look into the code and document some classes or public methods. Last but not least, if you use the framework for anything interesting, please drop me a line or share it in a blog post.
Finally I would like to thank my love Kyriaki for tolerating me while writing this project, my friend and super-ninja-Java-developer Eleftherios Bampaletakis for helping out with important Java issues and you for getting involved in the project. I’m looking forward to your comments.