Tuesday, May 12, 2015

Machine Learning as a Service

The machine learning (ML) industry continues to grow apace, and several tools have emerged which give access to advanced learning algorithms with minimal effort. Whether for personal or business gain, machine learning is becoming a service industry, available on-demand, for everyone.

This 'feed in data, get answer back' approach can certainly be a nice alternative to implementing a full ML solution. There is no implementation fuss, and you can trust the tools you use, as they are built by some of the best in the ML field.

ML can be of use to anyone. However, if those with limited knowledge in the field are benefiting from this industry, how much more can those with experience in ML gain from these services?

With this question in mind, I set out to try some machine learning services. My goal is to understand what advantages these services bring compared to a home-made solution.

Others can benefit from this knowledge, so I am sharing what I learn with you in this series.

I am currently experimenting with three services: Amazon Machine Learning, Google Prediction API, and Microsoft Azure Machine Learning Studio.

There are other machine learning services available. These three, however, also provide a variety of other services such as storage and deployment, making them possible all-in-one solutions for many applications.

My main focus for now is on the services' functionalities. In this post, we are covering data uploading and preprocessing. I will soon post on model training and model evaluation as well.

Other aspects such as integration and scalability are not going to be covered, though they may be in the future.

All tests were performed using the services' consoles. Some functionalities are available only when using the services' API, and such cases will be identified.

Most of the comparison results are presented in tables like the one shown below. I believe tables let readers process the benchmarking results more easily than wordy descriptions would.

Aspects in need of further clarification will be described in more detail.

I will summarise which aspects of data sourcing and preprocessing should be considered when deciding upon a service. I will also present some of my thoughts on the matter.

Data Sourcing

Below is a summary of the aspects I considered during data sourcing. These include data sources, formats, and maximum size, as well as supported data types.

[Table: data sourcing comparison]

All three services can train models on uploaded text files. Both AWS and MS Azure can also read data from tables in their storage services.

AWS supports the largest datasets for batch training.

Google supports update calls, which one can use to incrementally train the model - that is, to do online training.

MS Azure supports the widest variety of data sources and formats.

There is no clear winner for now, as each service has its strengths and weaknesses.
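To make the online-training idea behind Google's update calls concrete, here is a minimal sketch using scikit-learn's partial_fit. This is an illustration of incremental training in general, not Google's API; the classifier choice and the toy batches are assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Online training: feed the model one mini-batch at a time instead of
# retraining from scratch on the full dataset. The batches are toy data.
clf = SGDClassifier(random_state=0)
batches = [
    (np.array([[0.0], [1.0]]), np.array([0, 1])),
    (np.array([[0.2], [0.9]]), np.array([0, 1])),
]
for X_batch, y_batch in batches:
    # classes must be declared on the first call, since later batches
    # may contain only a subset of the labels
    clf.partial_fit(X_batch, y_batch, classes=[0, 1])
```

Each call updates the existing model in place, so new data can keep arriving indefinitely.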

Data Preprocessing

The table below lists whether certain data preprocessing operations can be performed using these services. The operations covered here are commonly used, but this is not an exhaustive list of all preprocessing techniques.

Keep in mind that you can perform most, if not all these operations using Python or some other language before sending the data to the service. What is being assessed here is whether these operations can be performed within the service.

[Table: data preprocessing comparison]

It may happen that some transformations are performed behind the scenes, before the actual training takes place. In this table we are referring to explicitly applying the transformations on the data.

In AWS, most operations are performed using the so-called recipes. Recipes are JSON-like scripts used to transform the data before feeding it to a machine learning model.

All the above transformations other than data visualization, data splitting and missing value imputation are done using recipes. For instance, quantile_bin(session_length, 5) would discretize the session_length variable into 5 bins. You can also apply operations to groups of variables; the groups themselves are also defined in the recipes.
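As a rough illustration, a recipe containing the binning example above might look like the following. This is only a sketch: the variable and group names are hypothetical, and the exact schema should be checked against Amazon's recipe reference.

```json
{
  "groups": {
    "NUMERIC_SESSION": "group('session_length', 'page_views')"
  },
  "assignments": {
    "binned_session": "quantile_bin('session_length', 5)"
  },
  "outputs": [
    "binned_session",
    "NUMERIC_SESSION"
  ]
}
```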

Missing value imputation is also indicated as being possible within AWS. Although the transformation is not directly implemented, one can train a simple model - a linear regression, for instance - to predict the missing values. This model can then be chained with the main model. For this reason, I consider AWS as allowing missing value imputation.
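The chaining idea can be sketched outside AWS as well. Below is a minimal scikit-learn version, assuming a purely numeric array and a hypothetical column to impute; it is an illustration of the approach, not AWS's implementation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Sketch: predict a column's missing values from the remaining columns,
# then fill them in - the 'train a simple model and chain it' idea.
def impute_with_regression(X, target_col):
    X = X.astype(float).copy()
    missing = np.isnan(X[:, target_col])
    others = np.delete(X, target_col, axis=1)
    model = LinearRegression()
    # fit only on rows where the target column is observed
    model.fit(others[~missing], X[~missing, target_col])
    # fill the gaps with the model's predictions
    X[missing, target_col] = model.predict(others[missing])
    return X
```

In a production pipeline this imputation step would run before the main model, exactly as described above.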

In MS Azure, transformations are applied sequentially using the built-in modules. The binning example above could be done using the 'Quantize Data' module. One can choose which variable or variables are affected.

R and Python scripts can also be included to apply custom transformations.
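For instance, a custom transformation in the 'Execute Python Script' module is written as a function named azureml_main that receives and returns pandas DataFrames. The sketch below assumes a hypothetical numeric column; the entry-point signature is the module's convention.

```python
import numpy as np
import pandas as pd

# Azure ML Studio's 'Execute Python Script' module calls this entry point
# with up to two input DataFrames and expects a DataFrame back (in a tuple).
def azureml_main(dataframe1=None, dataframe2=None):
    # Hypothetical custom transformation: log-scale a numeric column.
    dataframe1['session_length'] = np.log1p(dataframe1['session_length'])
    return dataframe1,
```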

When using Google, most of the data processing will have to be done before feeding the data to the service.

Strings with more than one word are separated into multiple features within Google Prediction API. 'load the data' would be split into 'load', 'the', and 'data'. This type of processing is common in Natural Language Processing (NLP) applications such as document summarization and text translation.
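The splitting behaviour can be mimicked with a simple whitespace tokenizer. This is a sketch of the idea only, not Google's actual implementation:

```python
def tokenize(text):
    # Naive whitespace tokenization: each word becomes its own feature,
    # mirroring how multi-word string values are split.
    return text.lower().split()

print(tokenize("load the data"))  # prints ['load', 'the', 'data']
```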

You may choose to do all the data processing before sending the data to any of these services. Though this may mean more work, it also gives you more control - you know exactly what you are doing to your data.

Aspects to Consider

Which service works best for your application? For now, the short answer really is 'it depends'. A number of factors need to be considered:

These services support data loaded from their own storage services, so how you store your data can prove to be a decisive factor.

Can you handle batch training? If yes, evaluate the typical size of your dataset. On the other hand, if your dataset is really large, or if you want to keep updating the model as you go, consider online training.

If you implement the data transformation tools on your side, the absence of built-in transformation tools may not be a problem at all.

If that is not a possibility, know which transformations you need to perform on your data, and understand whether the service you choose offers them. Pay special attention to missing values and text features, as typical application data are sure to have both.

Final Thoughts

Personally, I found MS Azure's flexibility both in data sourcing and preprocessing attractive. I did not use the custom R or Python scripts, mostly because I did not need to.

However, I do like to know exactly what I am doing to the data I feed a model with. Although I was able to quickly transform data using MS Azure, I would still do the data transformation using my own tools. This gives me full control, and allows me to exploit my data's specific traits to perform operations in the most efficient way.

Google provides what I believe to be a key feature in ML applications: incremental training. It allows you to use virtually infinite data. It takes the weight of assessing when to retrain a model off your shoulders.

When it comes to data processing, Amazon lies somewhere between the other two: it has some functionalities, but not many. But given how recent this service is - it was launched little more than a month ago - I see potential. If the service continues to evolve, it may become a very versatile tool.

Data processing is just the beginning, though. I find it too early to make a final decision.

Credit Source: Inês Almeida
