Tuesday, May 12, 2015

Machine Learning as a Service

The machine learning (ML) industry continues to grow apace, and several tools have emerged which give access to advanced learning algorithms with minimal effort. Whether for personal or business gain, machine learning is becoming a service industry, available on-demand, for everyone.

This 'feed in data, get answer back' approach can certainly be a nice alternative to implementing a full ML solution. There is no implementation fuss, and you can trust the tools you use, as they are built by some of the best in the ML field.

ML can be of use to anyone. However, if those with limited knowledge in the field are benefiting from this industry, how much more can those with experience in ML gain from these services?

With this question in mind, I set out to try some machine learning services. My goal is to understand what advantages these services bring compared to a home-made solution.

Others can benefit from this knowledge, so I am sharing what I learn with you in this series.

I am currently experimenting with three services: Amazon Machine Learning, Google Prediction API, and Microsoft Azure Machine Learning Studio.

There are other machine learning services available. These three, however, also provide a variety of other services such as storage and deployment, making them possible all-in-one solutions for many applications.

My main focus for now is on the services' functionalities. In this post, we are covering data uploading and preprocessing. I will soon post on model training and model evaluation as well.

Other aspects such as integration and scalability are not going to be covered, though they may be in the future.

All tests were performed using the services' consoles. Some functionalities are available only when using the services' API, and such cases will be identified.

Most of the comparison results are presented in tables like the one shown below. I believe this is an easier way for readers to process the benchmarking results than wordy descriptions.

Aspects in need of further clarification will be described in more detail.

I will summarise what aspects of data sourcing and preprocessing should be considered when deciding upon a service. I will also present some of my thoughts on the matter.

Data Sourcing

Below is a summary of the aspects I considered during data sourcing. These include data sources, formats, and maximum size, as well as supported data types.

[Table: data sourcing comparison]

All three services can train models on uploaded text files. Both AWS and MS Azure can also read data from tables in their storage services.

AWS supports the largest datasets for batch training.

Google supports update calls, which one can use to incrementally train the model - that is, to do online training.
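To make this concrete, below is a minimal sketch of such an update call using the Python client library; the project ID, model ID, and feature values are hypothetical.

```python
# A minimal sketch of online training via the Prediction API's update call
# (google-api-python-client, Prediction API v1.6). Assumes credentials are
# already configured; project ID, model ID, and values are hypothetical.
from googleapiclient.discovery import build

service = build('prediction', 'v1.6')

# Each update call feeds one new labelled example to the trained model.
service.trainedmodels().update(
    project='my-project',            # hypothetical project ID
    id='session-classifier',         # hypothetical model ID
    body={
        'csvInstance': [12.5, 'mobile', 3],  # one example's feature values
        'output': 'converted',               # its label
    },
).execute()
```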

MS Azure supports the widest variety of data sources and formats.

There is no clear winner for now, as each service has its strengths and weaknesses.

Data Preprocessing

The table below lists whether certain data preprocessing operations can be performed using these services. The operations covered here are commonly used, but this is not an exhaustive list of all preprocessing techniques.

Keep in mind that you can perform most, if not all these operations using Python or some other language before sending the data to the service. What is being assessed here is whether these operations can be performed within the service.

[Table: data preprocessing comparison]

It may happen that some transformations are performed behind the scenes, before the actual training takes place. This table refers to explicitly applying the transformations to the data.

In AWS, most operations are performed using the so-called recipes. Recipes are JSON-like scripts used to transform the data before feeding it to a machine learning model.

All the above transformations other than data visualization, data splitting, and missing value imputation are done using recipes. For instance, quantile_bin(session_length, 5) would discretize the session_length variable into 5 bins. You can also apply operations to groups of variables; the groups themselves are also defined in the recipes.
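As a sketch of what a recipe embedding the binning example might look like (the variable names and group are hypothetical, and the syntax is illustrative rather than a verified script):

```json
{
  "groups": {
    "NUMERIC_VARS": "group('session_length', 'page_views')"
  },
  "assignments": {
    "binned_session": "quantile_bin('session_length', 5)"
  },
  "outputs": [
    "ALL_CATEGORICAL",
    "binned_session"
  ]
}
```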

Missing value imputation is also indicated as being possible within AWS. Although the transformation is not directly implemented, one can train a simple model - a linear regression, for instance - to predict the missing values. This model can then be chained with the main model. For this reason, I consider AWS as allowing missing value imputation.
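Outside AWS, the same idea can be reproduced by hand. Here is a rough sketch in Python with scikit-learn, where the file and column names are hypothetical.

```python
# Sketch: impute a column by training a simple regression on the rows where
# the value is known, then predicting it for the rows where it is missing.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv('sessions.csv')          # hypothetical dataset
predictors = ['page_views', 'clicks']     # hypothetical predictor columns
known = df[df['session_length'].notnull()]
missing_mask = df['session_length'].isnull()

model = LinearRegression().fit(known[predictors], known['session_length'])
df.loc[missing_mask, 'session_length'] = model.predict(df.loc[missing_mask, predictors])
```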

In MS Azure, transformations are applied sequentially using the built-in modules. The binning example above could be done using the 'Quantize Data' module. One can choose which variable or variables are affected.

R and Python scripts can also be included to apply custom transformations.
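For reference, a custom transformation in an 'Execute Python Script' module follows the entry-point convention sketched below; the log transform itself is a hypothetical example.

```python
# Sketch of an Azure ML Studio 'Execute Python Script' module. The module
# calls azureml_main with up to two input DataFrames and expects a sequence
# of DataFrames back; the transformation is a hypothetical example.
import numpy as np

def azureml_main(dataframe1=None, dataframe2=None):
    # Log-transform a skewed column (hypothetical column name).
    dataframe1['session_length'] = np.log1p(dataframe1['session_length'])
    return dataframe1,
```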

When using Google, most of the data processing will have to be done before feeding the data to the service.

Within the Google Prediction API, strings with more than one word are split into multiple features: 'load the data' becomes 'load', 'the', and 'data'. This type of processing is common in Natural Language Processing (NLP) applications such as document summarization and text translation.
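As a toy illustration of that splitting behaviour:

```python
# Each word in a multi-word string becomes its own feature.
text = 'load the data'
features = text.split()
print(features)  # ['load', 'the', 'data']
```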

You may choose to do all the data processing before sending the data to any of these services. Though this may mean more work, it is also a way to gain more control - you know exactly what you are doing to your data.
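As an example of what that looks like, here is a sketch of the common steps done locally with pandas before uploading; the file and column names are hypothetical.

```python
# Sketch: perform the usual preprocessing locally, then upload the clean
# CSVs to whichever service you choose. Names are hypothetical.
import pandas as pd

df = pd.read_csv('raw_data.csv')

# Discretize a numeric variable into 5 quantile bins.
df['session_bin'] = pd.qcut(df['session_length'], 5, labels=False)

# Impute missing values with the column mean.
df['age'] = df['age'].fillna(df['age'].mean())

# Hold out 20% of the rows for evaluation.
test = df.sample(frac=0.2, random_state=42)
train = df.drop(test.index)

train.to_csv('train.csv', index=False)
test.to_csv('test.csv', index=False)
```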

Aspects to Consider

Which service works best for your application? For now, the short answer really is 'it depends'. A number of factors need to be considered:

These services support data loaded from their own storage services, so how you store your data can prove to be a decisive factor.

Can you handle batch training? If yes, evaluate the typical size of your dataset. On the other hand, if your dataset is really large, or if you want to keep updating the model as you go, consider online training.

If you implement data transformation tools on your side, the lack of built-in transformation tools may not be a problem at all.

If that is not a possibility, know which transformations you need to perform on your data, and understand whether the service you choose offers them. Pay special attention to missing values and text features, as typical application data are sure to have both.

Final Thoughts

Personally, I found MS Azure's flexibility both in data sourcing and preprocessing attractive. I did not use the custom R or Python scripts, mostly because I did not need to.

However, I do like to know exactly what I am doing to the data I feed a model with. Although I was able to quickly transform data using MS Azure, I would still do the data transformation using my own tools. This gives me full control, and allows me to exploit my data's specific traits to perform operations in the most efficient way.

Google provides what I believe to be a key feature in ML applications: incremental training. It allows you to use virtually infinite data. It takes the weight of assessing when to retrain a model off your shoulders.

When it comes to data processing, Amazon lies somewhere between the other two: it has some functionalities, but not many. But given how recent this service is - it was launched little more than a month ago - I see potential. If the service continues to evolve, it may become a very versatile tool.

Data processing is just the beginning, though. I find it too early to make a final decision.

Credit Source: Inês Almeida

Sunday, April 12, 2015

Big Data for Security to Defend Against APT and Zero-Day Attacks

According to Gartner, big data will change cyber security in network monitoring, identity management, fraud detection, governance, and compliance. I list the following eight companies (in no particular order of preference) that use big data to defeat zero-day and APT attacks. Big data and cyber security are in the growth phase of the hype cycle, and I believe at least 50 other companies (big, small, or even startups in stealth mode) are working on a new killer app that uses machine learning, AI, deep networks, and big data to keep ahead of hackers. So, I welcome any comments - please add your preferred tools or products in a comment.


1: Niara

Niara is making use of big data techniques and Hadoop. "The core intellectual property of Niara is in the collection, storage and analysis of the data," Ramachandran said. "We have been at work for 16 months building the platform."

While some of the components in Niara's platform are open-source, the big challenge has been in aligning an entire application stack to be able to handle the scale that is needed, Ramachandran said. "You have to be very smart about how you process data and how you move it around," Ramachandran said.


2: IBM QRadar Security Intelligence Platform and IBM Big Data Platform


IBM QRadar Security Intelligence Platform and IBM Big Data Platform provide a comprehensive, integrated approach that combines real-time correlation for continuous insight, custom analytics across massive structured and unstructured data, and forensic capabilities for irrefutable evidence. The combination can help you address advanced persistent threats, fraud and insider threats.

The IBM solution is designed to answer questions you could never ask before, by widening the scope and scale of investigation. You can now analyze a greater variety of data – such as DNS transactions, emails, documents, social media data, full packet capture data and business process data – over years of activity. By analyzing structured, enriched security data alongside unstructured data from across the enterprise, the IBM solution helps find malicious activity hidden deep in the masses of an organization's data.



3: Cyphort


The Cyphort Advanced Threat Defense Platform detects advanced malware, prioritizes remediation and automates containment. Cyphort customers benefit from early and reliable detection and fast remediation of breaches across their infrastructure. Our unique approach combines best-in-class malware detection with the knowledge of threat severity, value of targeted user and assets, and malware lifecycle to prioritize threats that matter to you while suppressing the noise. The Cyphort platform is a network-based solution that is designed to be deployed across the entire organization cost effectively. Flexibility to deploy as hardware, software and virtual machine makes Cyphort an ideal solution for large and distributed organizations. 

4: Teradata

www.teradata.com/Cyber-Security-Analytics


5: Intel Security Connected System

For Intel, "intelligence awareness" translates to a new security product architecture that weaves the existing portfolio of McAfee products, including everything from PC software to data center firewalls, into a data collection backbone feeding a centralized repository used to correlate security anomalies from across multiple systems.

6: Sqrrl

Sqrrl is the Big Data Analytics company that lets organizations pinpoint and react to unusual activity by uncovering hidden connections in their data. Sqrrl Enterprise is Sqrrl's linked data analysis platform that gives analysts a way to visually investigate these connections, allowing them to rapidly understand their surrounding contexts and take action. At the core of Sqrrl's architecture are a variety of Big Data technologies, including Hadoop, link analysis, machine learning, Data-Centric Security, and advanced visualization.

7: Platfora and MapR Technology 

Platfora provided a wide range of capabilities for preparing the data for analysis, which considerably reduced data preparation time. Once the data was prepared, the emphasis shifted to exploring and understanding it using a variety of visualization techniques.

8: Splunk

While Splunk can certainly address the tier-1 needs of reduction and correlation, Splunk was designed to support a new paradigm of data discovery. This shift rejects a data reduction strategy in favor of a data inclusion strategy. This supports analysis of very large datasets through data indexing and MapReduce functionality pioneered by Google. This gives Splunk the ability to collect data from virtually any available data source without normalization at collection time and analyze security incidents using analytics and statistical analysis.