Recently, Microsoft announced the availability of Machine Learning (ML) Studio for its cloud platform Azure. ML Studio offers a suite of tools to easily build and deploy predictive models; a model-building workflow is termed an experiment, and once you have built your model, it can be deployed through the web services component of ML Studio. ML Studio is a browser-based application where you drag different modules onto a canvas and interconnect them as needed. It comes with plenty of ready-made predictive models to train for a variety of situations, and several well-known datasets popular in the machine learning community are included, offering a convenient path to quickly explore what the studio can do. Another noteworthy feature of ML Studio is that you can use R inside the studio, with over 400 R packages pre-installed. Yet another notable feature is the presence of modules, such as Named Entity Recognizer, for text analytics. With all these goodies, it is tempting to play with ML Studio. Microsoft has made that easy by offering a free Azure trial account, good for 30 days, that you can sign up for quickly.
Assuming you have created an account on Azure, the first thing to do is to set up an ML Workspace in the Azure management portal.
After you sign in to ML Studio, you will be presented with the ML Studio home page, with links to various information pages and the user guide.
Step 1 of Model Building: Getting Data
Now you are ready to build your predictive model. The first thing you need is data. ML Studio comes with many datasets for building different types of predictive models, for example classification, regression, and clustering. You can also upload your own data. We will build a classification model using the sonar data from the UCI Machine Learning repository. The sonar dataset contains 208 data vectors, each with 60 measurements that carry information about signal energy bounced off two kinds of objects, rocks and a metal cylinder, at different angles. The dataset has 111 vectors representing readings off the metal cylinder and 97 readings off rocks. After downloading the data, save it in CSV format; it is one of the many formats that ML Studio accepts. A word of caution here: readying data for predictive modeling is the most time-consuming step. It typically consumes about 60-70% of the effort in an enterprise data mining project, because data resides in several different databases and often has errors and many missing fields. So do not underestimate the effort you will need if you work on a data mining project from scratch. Check out my slide show on this aspect.
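If you want to sanity-check the downloaded file before uploading it, a quick Python sketch can confirm that each row carries 60 numeric measurements plus a class label in column 61. The two rows below are made-up stand-ins for the real data, used so the sketch is self-contained:

```python
import csv
import io

# Stand-in for two lines of the sonar CSV: 60 energy readings followed by
# the class label. The numbers are invented; the real file has 208 rows.
sample_csv = "\n".join([
    ",".join(["0.02"] * 60 + ["R"]),
    ",".join(["0.04"] * 60 + ["M"]),
])

features, labels = [], []
for row in csv.reader(io.StringIO(sample_csv)):
    features.append([float(v) for v in row[:60]])  # 60 numeric measurements
    labels.append(row[60])                         # class label in column 61

print(len(features[0]), labels)  # -> 60 ['R', 'M']
```

To check the real file, replace `sample_csv` with the contents of your downloaded CSV.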
To upload the sonar data to your ML Studio workspace, click +NEW at the bottom left of the ML Studio window and then click DATA in the resulting screen. This will bring up a dialog window for you to upload the sonar data.
Having uploaded the data, click +NEW again, followed by EXPERIMENT. This will bring up the experiment canvas with a palette of available datasets and modules on the left. The Properties window at the top right is meant for setting parameter values as appropriate for different modules. The small window at the bottom right displays information about the selected module, if any, in the experiment canvas.
Clicking Saved Datasets in the left pane will show you all the datasets currently present in ML Studio. Select the sonar data box and drag it to the experimental canvas.
Before building any predictive model, it is always a good idea to get to know the data, for example its descriptive statistics and the relations between different attributes. So double-click the output port at the bottom edge of the sonar data box and select Visualize. The resulting display shows summary statistics for each attribute/variable in a column, followed by rows of data, with each row corresponding to one instance of the sonar data.
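The summary statistics the studio shows are the familiar per-column aggregates; computing a couple of them by hand makes them concrete. The readings below are invented stand-ins for one column, grouped by class:

```python
import statistics

# Made-up stand-in readings for a single column, grouped by class label
# ("R" for rock, "M" for metal cylinder). Real values come from the CSV.
col20 = {
    "R": [0.21, 0.35, 0.28, 0.30],
    "M": [0.55, 0.61, 0.48, 0.52],
}

# Per-class mean and standard deviation, as a visualization would summarize.
for label, values in col20.items():
    print(label,
          round(statistics.mean(values), 3),
          round(statistics.stdev(values), 3))
```

A clear gap between the per-class means, as in this toy data, is exactly the kind of signal that makes a column useful for classification.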
To make use of the graphing window, let us select Col 20 and Col 61 (containing R or M for rock or metal cylinder, and telling the predictive model being trained what its output should be) to visualize how the values in Col 20 vary for rocks and the metal cylinder. You can try this with other columns as well to get a better feel for the data.
Step 2 of Model Building: Dividing Data into Training and Test Sets
When you build a predictive model, you want to know how the model is going to perform in the future. This is typically done by dividing the available data into two sets: a training set and a test set. The training set is used to train the model, and the test set is used to check its performance, giving a better estimate of how the model will do upon deployment. We will use the Split module to divide the data into two groups. You can locate this module by typing its name in the search box in the left pane. Once you drag the Split module onto the canvas and connect the output port of the sonar data to its input port, you will notice the Properties window displaying the Split module's parameters. By default, this module divides the data equally into two groups. A common practice in machine learning is to use 80% of the data for training and 20% for testing, so let us enter 0.8 in the box Fraction of rows in the first output. Also set Stratified split to True and enter Col61 in the Stratification key column box. This tells the studio that the split should preserve the class proportions found in Col61, which carries the class label, i.e. rock or cylinder in the present case.
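Under the hood, a stratified split simply performs the 80/20 cut within each class so the rock/cylinder ratio is preserved in both sets. A minimal pure-Python sketch, using stand-in labels that match the sonar class counts (111 "M", 97 "R"):

```python
import random

random.seed(0)

# Stand-in rows: (row_index, label) pairs matching the sonar class counts.
labels = ["M"] * 111 + ["R"] * 97
rows = list(enumerate(labels))

train, test = [], []
for cls in ("M", "R"):
    group = [r for r in rows if r[1] == cls]
    random.shuffle(group)                # randomize within the class
    cut = int(round(len(group) * 0.8))   # 80/20 split inside each class
    train += group[:cut]
    test += group[cut:]

print(len(train), len(test))  # -> 167 41
```

Because the cut is made inside each class, both sets keep roughly the 111:97 balance of the full dataset, which is what the Stratified split option guarantees.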
Step 3 of Model Building: Select and Train a Model
Now is the time to decide which model we want to use for our task. Since ours is a classification task, we need to select one of the available classification models. We can see the available models by clicking Machine Learning in the left pane. We will use the Two-Class Boosted Decision Tree model. A boosted decision tree is a series of small decision trees that uses the idea of boosting, with each tree trying to correct the errors of the ensemble built so far, to improve predictive performance. After dragging the boosted decision tree module to the canvas, drag the Train Model module there as well and connect them as shown. In the Properties window, you will see the default parameters for the boosted decision tree model. You can use the default values or modify them.
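ML Studio's implementation is not public, but scikit-learn's GradientBoostingClassifier captures the same idea of a boosted series of small trees. The sketch below runs it on synthetic data with the sonar data's shape; the parameter values are illustrative, not the module's actual defaults:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in with the sonar data's shape: 208 rows, 60 features.
X, y = make_classification(n_samples=208, n_features=60, random_state=0)

# A series of deliberately small trees, each one nudging the ensemble's
# predictions toward correcting the errors made so far.
model = GradientBoostingClassifier(
    n_estimators=100,   # number of trees in the boosted series
    max_depth=3,        # each individual tree is kept shallow
    learning_rate=0.1,  # how strongly each tree's correction counts
    random_state=0,
)
model.fit(X, y)
print(model.score(X, y))  # accuracy on the training data
```

The interplay between the number of trees, their depth, and the learning rate is what the module's parameters in the Properties window control.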
In the Properties window, we need to tell the Train Model module which data column contains the class labels. So enter Col61 in the Selected Column box. Once done, click Run at the bottom of the ML Studio screen. Soon you will see green check marks appearing on the different modules as the training progresses.
Step 4 of Model Building: Test the Trained Model
You can now test how good your model is by using the set-aside test data. This is done with the Score Model module. Locate and drag this module to the canvas, then connect its inputs to the output port of the Train Model module and to the right output port of the Split module, where the test data is available.
You can check the predicted output for the test data by double-clicking the output port of the Score Model module and selecting Visualize. To get the performance metrics of the trained model, we will use the Evaluate Model module.
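For intuition, the metrics Evaluate Model reports boil down to counts of true and false positives and negatives over the scored test rows. A sketch with made-up labels, treating "M" (metal cylinder) as the positive class:

```python
# Made-up true and predicted labels for a small scored test set.
true = ["M", "M", "M", "R", "R", "R", "R", "M"]
pred = ["M", "M", "R", "R", "R", "M", "R", "M"]

# Count the four outcomes, with "M" as the positive class.
tp = sum(1 for t, p in zip(true, pred) if t == "M" and p == "M")
fp = sum(1 for t, p in zip(true, pred) if t == "R" and p == "M")
fn = sum(1 for t, p in zip(true, pred) if t == "M" and p == "R")
tn = sum(1 for t, p in zip(true, pred) if t == "R" and p == "R")

accuracy = (tp + tn) / len(true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(accuracy, precision, recall)  # -> 0.75 0.75 0.75
```

These are the same accuracy, precision, and recall figures you will see in the evaluation results, just computed on a toy set small enough to verify by hand.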
Run the experiment again and double-click the output port of the Evaluate Model module to visualize the results.
We see that the model is able to achieve 80.5% accuracy on the test data. This accuracy is in line with the reported results on the sonar data. You can also play with the decision threshold, which in the above results is set to the default value of 0.5. By lowering it to 0.25, you get a new set of numbers, as shown below. In practice, the choice of the decision threshold will be governed by the costs of false negatives and false positives.
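The threshold simply decides how confident the model must be before a row is labeled a cylinder, so lowering it trades false negatives for false positives. A toy sketch with made-up scored probabilities:

```python
# Made-up scored probabilities (of being a metal cylinder) for six test rows.
scores = [0.9, 0.6, 0.45, 0.3, 0.2, 0.1]

def predict(scores, threshold):
    # Label a row "M" when its score clears the decision threshold.
    return ["M" if s >= threshold else "R" for s in scores]

print(predict(scores, 0.5))   # -> ['M', 'M', 'R', 'R', 'R', 'R']
print(predict(scores, 0.25))  # -> ['M', 'M', 'M', 'M', 'R', 'R']
```

Lowering the threshold from 0.5 to 0.25 flips two borderline rows to "M": fewer cylinders are missed, at the price of more rocks misclassified as cylinders, which is why the metrics shift as they do.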
Step 5 of Model Building: Saving the Trained Model
Now that we are satisfied with our trained model, it should be saved for later use and deployment as a web service. Click SAVE AS to save the model, along with any comments.
This completes the process of model building in ML Studio. If you are still with me, then I know you are going to build another model for the same data but using a different learning algorithm to compare their performance. Let me know if you have any questions or comments or suggestions.