Generate Machine Learning Predictions Without Writing Code

TUTORIAL

Overview

Amazon SageMaker Canvas is a visual point-and-click interface that expands the use of machine learning (ML) to business analysts, helping them to make business decisions without any ML experience. With SageMaker Canvas, business analysts can build ML models and generate predictions on their own. As a SageMaker Canvas user, you can import data from disparate sources; pick the target variables needed for predictions; and prepare and analyze data. Using the built-in AutoML capabilities, in just a few clicks you can build an ML model and generate accurate predictions, either single or in bulk, to assist with business decisions.
 
In this tutorial, you will learn how to use Amazon SageMaker Canvas to build an ML model that can predict the estimated time of arrival (ETA) of shipments (measured in days). You will use a dataset that contains complete shipping data for delivered products, including estimated time, shipping priority, carrier, and origin.

What you will accomplish

In this tutorial, you will:

  • Import datasets
  • Select the target variable for numeric prediction (regression)
  • Inspect datasets visually
  • Build an ML model with the SageMaker Canvas Quick Build feature
  • Understand model features and metrics
  • Generate and understand bulk and single predictions

Prerequisites

Before starting this tutorial, you will need:

 AWS experience

Beginner

 Time to complete

20 minutes

 Cost to complete

Less than $1.00, free tier eligible.

 Requires

An AWS account.

 Services used

 Last updated

May 17, 2024

Implementation

  • If you already have a SageMaker Studio domain in the US East (N. Virginia) Region, follow the SageMaker Studio tutorial to attach the required AWS IAM policies to your SageMaker Studio account. Once completed, skip directly to Step 2.

    If you don't have an existing SageMaker Studio domain, continue with this step to launch an AWS CloudFormation template that creates a SageMaker Studio domain and adds the permissions required for the rest of this tutorial.

    Note: This stack assumes that you already have a public VPC set up in your account. If you do not have a public VPC, see VPC with a single public subnet to learn how to create a public VPC.

    1. Launch the AWS CloudFormation stack template.

    • This link opens the AWS CloudFormation console and creates your SageMaker Studio domain and a user named studio-user. It also adds the required permissions to your SageMaker Studio account.

    2. On the Quick create stack page, confirm that US East (N. Virginia) is the Region displayed in the upper right corner.

    3. For Provide a stack name, enter CFN-SM-IM-Lambda-Catalog.

    4. In the Parameters section, verify the following:

    • BusinessAnalyst is 1-business-analyst
    • DomainName is immersionday-domain

    5. In the Capabilities section, select the checkbox for I acknowledge that AWS CloudFormation might create IAM resources with custom names.

    6. Then, choose Create stack.

    Note: This stack takes about 10 minutes to create all the resources.

    7. Once you receive a CREATE_COMPLETE message, you can proceed to the next step.

  • 1. Navigate to the SageMaker Canvas console.

    2. In the Get Started section, choose Open Canvas.

    • Verify that you are in the US East (N. Virginia) Region.
       

    The SageMaker Canvas Creating application screen will be displayed. The application will take a few minutes to load.

    Note: If this is your first time using SageMaker in the us-east-1 Region, SageMaker Canvas creates an Amazon S3 bucket with a name that uses the following pattern: sagemaker-<your-Region>-<your-account-id>.

    3. Download the following datasets and save them to your local computer.

  • Import data into SageMaker Canvas, visually inspect, and prepare the data for model building.

    1. Navigate back to the Amazon SageMaker Canvas application to import the previously downloaded datasets into SageMaker Canvas.

    2. In the left-hand section, choose Data Wrangler, and then choose Create data flow.

    3. In the Create data flow pop up, for Data flow name, enter product_descriptions, and select Create.

    4. Choose Import data, and select Tabular.

    5. On the Import data page, choose Select files from your computer.

    6. Upload the product_descriptions.csv dataset you previously downloaded to your local machine, and choose Preview data.
     

    7. When a preview of the data is displayed, choose Import data.

    8. Repeat Step 2 through Step 7 to upload the shipping_logs dataset.

    • In the Create data flow pop up, for Data flow name, enter shipping_logs, and select Create.

     

    9. In the left-hand navigation, choose Data Wrangler, and select the Datasets tab. Then, choose Join data.

    10. On the Join Datasets page, drag canvas-sample-product-descriptions.csv and canvas-sample-shipping-logs.csv from the left panel onto the right pane. Select the join icon between the two datasets.

    11. A pop-up showing details about the join will appear. Make sure that the join type is Inner and the joining column is ProductId. Then, choose Save & close, and choose Import data.

    12. In the Import data dialog box, enter ConsolidatedShippingData in the Import dataset name field, and choose Import data.

    13. On the Data Wrangler page, check the box beside ConsolidatedShippingData. Then, choose Create a data flow.

    14. In the Create a data flow dialog box, enter ConsolidatedShippingData-Prep, and select Create.

    15. On the Data flow page, you will see a preview of the dataset and visualizations for the first 2,000 rows of the data, but before you explore the data analysis, you will complete some initial clean-up.

    Note: The chat for data prep requires access to Amazon Bedrock and the Anthropic Claude model within it. If you haven’t provisioned this access, use the following steps before proceeding:

    16. Navigate to the Amazon Bedrock console and, in the left-hand navigation, choose Model access.
     

    17. On the Model access page, choose Manage model access.

    18. On the Manage model access page, select Anthropic. Then, choose Request model access.

    19. Choose the Chat for data prep icon.

    • Using chat for data prep, you can explore, visualize, and transform your data using natural language.
       

    20. In the chat box, type Plot actual vs expected shipping days, and press enter.

    • The graph shows there is a close relationship between expected and actual shipping days. In addition to the graph, you will see a message explaining the visualization and even how you can leverage it.
    • If you select View code, you will see the code that was used to generate the visualization.
       

    21. In the chat box, type Drop ProductID and OrderID, and press enter.

    • ProductID and OrderID are primary keys and are not expected to contain any valuable information for training a model, so you will want to drop them.
       

    You will see an explanation in the chat box and the code used to achieve the step, but this time you will also see a preview of how this will impact your dataset.

    22. Select Add to steps to include the step in your workflow.
     

    23. Select the Analyses tab. In the Create analysis section, select the drop-down, and choose the Data Quality And Insights Report.

    24. For Analysis name, enter Initial-Report, for Target Column, select ActualShippingDays, for Problem type, select Regression, and for Data size, select Sampled dataset (20k). Then, choose Create.

    25. Once the report is finished generating, take a moment to review the insights and visualizations for the dataset. You should notice that the data is already in good shape for training a model, with no missing data and no meaningful outliers.
     
    • The Features Prediction power graph shows a few features that don’t provide useful information for predicting the target. You will want to remove them before training the model.
    • The feature providing the second highest Prediction power, OnTimeDelivery, is not information that would be available when predictions are made using the generated model. This is referred to as data leakage, and that column will also need to be dropped before training the model.
       

    26. To act on these insights, at the top of the page, choose ConsolidatedShippingData-Prep.flow to navigate back to the Data flow page.

    27. Select the ellipses icon next to the chat transform step, and choose Add transform.

    28. Select Add step.

    29. On the Add transform page, scroll down and choose Manage Columns.

    30. On the Manage columns page, make the following selections:
     
    • Leave Transform set to Drop column
    • For Columns to drop, select OnTimeDelivery, ComputerBrand, ComputerModel, ScreenSize, PackageWeight, and OrderDate
    • Confirm the correct columns are selected, then choose Add
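For readers who want to see what the join and drop steps above amount to, here is a minimal sketch in plain Python. It is not the code SageMaker Canvas generates. The helper names `inner_join` and `drop_columns` are hypothetical, the rows are invented toy data, and the column-name casing (`ProductId`, `OrderID`) is an assumption that only approximates the real CSV headers.

```python
from typing import Dict, List

Row = Dict[str, object]


def inner_join(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """Keep only rows whose key value appears in both tables."""
    # Index the right-hand table by key so each lookup is a dict access.
    right_by_key: Dict[object, List[Row]] = {}
    for row in right:
        right_by_key.setdefault(row[key], []).append(row)

    joined: List[Row] = []
    for left_row in left:
        for right_row in right_by_key.get(left_row[key], []):
            # Merge the two rows; the shared key column appears once.
            joined.append({**right_row, **left_row})
    return joined


def drop_columns(rows: List[Row], columns: List[str]) -> List[Row]:
    """Return copies of the rows without the named columns."""
    to_drop = set(columns)
    return [{k: v for k, v in row.items() if k not in to_drop} for row in rows]


# Toy stand-ins for the two CSV files (all values are invented).
products = [
    {"ProductId": "P1", "ComputerBrand": "BrandA", "ScreenSize": 15},
    {"ProductId": "P2", "ComputerBrand": "BrandB", "ScreenSize": 13},
]
shipments = [
    {"OrderID": "O1", "ProductId": "P1", "ExpectedShippingDays": 4,
     "OnTimeDelivery": "yes", "ActualShippingDays": 4},
    {"OrderID": "O2", "ProductId": "P3", "ExpectedShippingDays": 6,
     "OnTimeDelivery": "no", "ActualShippingDays": 9},
]

# Order O2 refers to a product with no description row, so the inner join drops it.
consolidated = inner_join(shipments, products, key="ProductId")

# Remove identifiers, the leaking OnTimeDelivery column, and low-value features.
clean = drop_columns(
    consolidated,
    ["ProductId", "OrderID", "OnTimeDelivery", "ComputerBrand", "ScreenSize"],
)
print(clean)
```

An inner join keeps only shipments that have a matching product description, which is why the join type matters in Step 11. OnTimeDelivery is dropped because it is only known after a shipment arrives.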
  • Now that you have completed the data preparation, you are ready to train a model. In this step, you will set up the target variable and initiate the model building process.

    1. Choose Create model.

    2. In the Export data and create model pop-up, name the new dataset ConsolidatedShippingData-Clean, and select Export.

    3. Once the export completes, select Create model on the pop-up at the bottom left of the screen.

    4. In the Create new model pop-up, for Model name, enter ShippingForecast, and leave the default selection for Predictive analysis. Then, choose Create.

    The model view page consists of five tabs, which represent the steps involved in building a model and getting predictions. The tabs are:

    • Select – Set up the input data.
    • Build – Build the ML model.
    • Analyze – Analyze the model output and features.
    • Predict – Run predictions in bulk or on a single sample.
    • Deploy – Deploy your model to a SageMaker endpoint to consume programmatically.

    5. On the Select tab, choose the radio button for the ConsolidatedShippingData-Clean dataset that you created in the previous step, and choose Select dataset.

    • This dataset contains 9 columns and 1,000 rows. It also contains a high-level description of dataset shape and size.
       

    SageMaker Canvas automatically moves to the Build phase.

    6. For Select column to predict, in the Target column drop down, choose ActualShippingDays.

    • Since this column contains the historical number of days required for goods to arrive, it is suitable to be used as the target column.
    • Once the target column is selected, SageMaker Canvas automatically tries to infer the problem type. Because you are interested in how many days it will take for the goods to arrive for the customer, this is a regression or numerical prediction problem.
    • Regression estimates the values of a dependent target variable based on one or more other variables or attributes that are correlated with it. In this case, SageMaker Canvas may initially infer a 3+ category prediction problem type. However, you can manually change the problem type to a Numeric model type.
    • In the Model type section, choose Configure model.

    7. In the Model type section, choose Numeric model type, and choose Save.
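The tutorial does not say why a day count might be inferred as a 3+ category problem. One plausible reason is that an integer target with only a few distinct values looks like a set of labels. The sketch below illustrates that idea with a hypothetical heuristic. The function `guess_problem_type` and its `max_categories` threshold are assumptions made for illustration and are not the logic SageMaker Canvas uses.

```python
from typing import Sequence


def guess_problem_type(target: Sequence[object], max_categories: int = 10) -> str:
    """Hypothetical heuristic for illustration; NOT SageMaker Canvas's logic."""
    distinct = set(target)
    # bool is a subclass of int in Python, so exclude it explicitly.
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in distinct
    )
    if len(distinct) == 2:
        return "2 category"
    if all_numeric and len(distinct) > max_categories:
        return "numeric"
    # Few distinct values: numbers and labels look alike to this heuristic.
    return "3+ category"


# Invented shipping-day values: numeric, but only four distinct values.
few_values = [3, 4, 5, 3, 4, 6, 5, 4]
many_values = list(range(1, 31))

print(guess_problem_type(few_values))
print(guess_problem_type(many_values))
```

Whatever the tool infers, the decision comes down to the question being asked. Here the question is how many days a shipment will take, which is a quantity, so the Numeric model type is the right choice.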

  • In SageMaker Canvas, there are two methods for training: Quick build and Standard build. Quick build usually takes 2-15 minutes to build the model, whereas Standard build usually takes 2-4 hours and generally achieves higher accuracy. Quick build trains fewer combinations of models and hyperparameters to prioritize speed. For this tutorial, you will use Quick build to begin model building.

    1. Choose Quick build.

    The build takes approximately 5 minutes to validate and complete. Once completed, the model page moves to the Analyze step, where you can view the Quick build results.

    In this example, the model built using Quick build predicts the number of shipping days within +/- 1.693 of the actual value.

    • Machine learning introduces some stochasticity into the process of training models, which can lead to different results across builds. Therefore, the exact metric values that you see might differ.

    2. On the Overview tab, SageMaker Canvas shows the Column impact or the estimated importance of each input column in predicting the target column. In this example, the ExpectedShippingDays column has the most significant impact on predicting the number of shipping days.

    • On the right panel, you can see the direction of impact of a feature as well. For example, the higher the value of ExpectedShippingDays, the more positive its impact on the number of shipping days prediction.
       
    • On the Scoring tab, you can see a plot representing the best-fit regression line for ActualShippingDays. On average, the model prediction has a difference of +/- 1.693 from the actual value of ActualShippingDays.
    • The Scoring section for numeric prediction shows a line to indicate the model's predicted value in relation to the data used to make predictions.
    • The value that the model predicts is often within +/- the RMSE (root mean square error) of the actual value. The width of the purple band around the line indicates the RMSE range.
    • On the Advanced metrics tab you can see more details of the model performance.
      • The various metrics shown on the Advanced metrics tab are R2, mean absolute error (MAE), mean absolute percentage error (MAPE), and root mean square error (RMSE). The Advanced metrics page also shows plots for visual inspection of the model performance. One image shows a graph of the residuals or errors. The horizontal line indicates an error of 0 or a perfect prediction. The blue dots are the errors. Their distance from the horizontal line indicates the magnitude of the errors.
         
    • If you choose Error density you can see the distribution of the errors and their spread with respect to MAE and RMSE of the model. An error density with a shape similar to a normal distribution is indicative of good model performance.
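To make the four metrics on the Advanced metrics tab concrete, the following sketch computes them from their standard definitions using only the Python standard library. The function name `regression_metrics` and the sample values are invented for illustration. SageMaker Canvas may compute or round its metrics differently.

```python
import math
from typing import Dict, Sequence


def regression_metrics(
    actual: Sequence[float], predicted: Sequence[float]
) -> Dict[str, float]:
    """Compute MAE, RMSE, MAPE (in percent), and R2 from paired values."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and equal length")

    n = len(actual)
    # Residual = actual minus predicted; 0 means a perfect prediction.
    residuals = [a - p for a, p in zip(actual, predicted)]

    mae = sum(abs(r) for r in residuals) / n
    ss_res = sum(r * r for r in residuals)
    rmse = math.sqrt(ss_res / n)

    # MAPE is undefined when an actual value is 0; this sketch assumes none are.
    mape = 100.0 * sum(abs(r) / abs(a) for r, a in zip(residuals, actual)) / n

    # R2 compares the model's squared error with that of always predicting the mean.
    # It is undefined when every actual value is identical (ss_tot == 0).
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - ss_res / ss_tot

    return {"mae": mae, "rmse": rmse, "mape": mape, "r2": r2}


# Invented shipping-day values, not output from the tutorial's model.
actual_days = [3.0, 5.0, 2.0, 7.0]
predicted_days = [2.5, 5.0, 4.0, 8.0]

metrics = regression_metrics(actual_days, predicted_days)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

RMSE squares each error before averaging, so it penalizes large misses more heavily than MAE does. A large gap between the two suggests that a few predictions are far off.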
  • Now that you have a regression model, you can either use the model to run predictions, or you can create a new version of this model to train with the Standard build process. In this step, you use SageMaker Canvas to generate predictions, both single and in bulk, from a dataset.

    1. To start generating predictions, choose the Predict tab.

    2. In the Predict target values section, choose Manual to make a one-time batch prediction.

    Note: Selecting Automatic would allow you to make batch predictions for a dataset every time the dataset is updated. In actual ML workflows, this dataset should be separate from the training dataset. However, for simplicity, this tutorial uses the same dataset to demonstrate how SageMaker Canvas generates predictions. Choose Generate predictions.
     

    3. On the Select datasets for prediction page, select the ConsolidatedShippingData-Clean dataset, and choose Generate predictions.

    After a few seconds, you will be notified that the prediction is done.

    4. Choose View from the message window at the bottom of the page to see a preview of the predictions.

    • You can also choose Download to download a CSV file containing the full output. SageMaker Canvas returns a prediction for each row of data.
    • In this tutorial, the feature with the highest importance is the ExpectedShippingDays feature. It is also presented beside the predictions for a visual comparison.
       

    5. On the Predict page, you can generate predictions for a single sample by selecting Single prediction.

    • SageMaker Canvas presents an interface in which you can manually enter values for each of the input variables used in the model. This type of analysis is ideal for what-if scenarios where you want to know how the prediction changes when one or more variables increase or decrease in value.
       

    6. After the model building process, SageMaker Canvas uploads all artifacts, including the trained model saved as a pickle file, metrics, datasets, and predictions, into the S3 bucket that SageMaker Canvas previously created for you (sagemaker-<your-Region>-<your-account-id>) under a location named Canvas/1-business-analyst. You can inspect the contents and use them as necessary for further development.
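If you want to locate these artifacts from a script, a small helper can assemble the S3 location from the naming pattern described in this tutorial. This is only a sketch. The function name `canvas_artifact_uri` is hypothetical, the account ID is a placeholder, and treating the last path segment as the user profile name is an assumption based on the 1-business-analyst value used here.

```python
def canvas_artifact_uri(region: str, account_id: str, user_profile: str) -> str:
    """Build the S3 URI where this tutorial says Canvas stores its artifacts."""
    # Bucket name pattern from the tutorial: sagemaker-<your-Region>-<your-account-id>
    bucket = f"sagemaker-{region}-{account_id}"
    # Assumption: the folder under Canvas/ matches the user profile name.
    return f"s3://{bucket}/Canvas/{user_profile}/"


# Placeholder account ID; replace it with your own 12-digit AWS account ID.
uri = canvas_artifact_uri("us-east-1", "111122223333", "1-business-analyst")
print(uri)
```

You can pass the resulting URI to the S3 console search box or to an AWS SDK or CLI listing command.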

  • It is a best practice to delete resources that you are no longer using so that you don't incur unintended charges.

    1. Navigate to the S3 console and choose Buckets. Select the radio button next to the sagemaker-<your-Region>-<your-account-id> bucket, and choose Empty.

    2. In the Permanently delete all objects in bucket section, confirm by entering permanently delete in the text field, and choose Empty.

    3. Navigate back to the Buckets list, select the radio button next to the sagemaker-<your-Region>-<your-account-id> bucket, and choose Delete.

    4. In the Delete bucket section, confirm by entering the bucket name in the text field, and choose Delete bucket.

    5. On the SageMaker Canvas main page, choose My Models. For the ShippingForecast model, choose the vertical ellipsis, and select Delete.

    6. In the pop up, select Delete to confirm that you want to delete the model.

    7. In the left-hand navigation, choose Log out to end your Canvas session.

    Note: If you used an existing SageMaker Studio domain, you can skip the rest of the steps.

    If you ran the CloudFormation template in Step 1 to create a new SageMaker Studio domain, continue with the following steps to delete the domain, user, and the resources created by the CloudFormation template.

    8. Navigate to the CloudFormation console, choose CFN-SM-IM-Lambda-Catalog, and choose Delete to delete the stack along with the resources it created.
     

Congratulations

You have successfully used Amazon SageMaker Canvas to import and prepare a dataset for ML, select the target variable, build an ML model using the Quick build mode, and generate bulk and single predictions using the visual interface.
