
Create a model using JupyterLab Notebooks

Last update: August 5, 2024

Topics:
Data Science Workspace

CREATED FOR:

User
Developer

NOTE

Data Science Workspace is no longer available for purchase.

This documentation is intended for existing customers with prior entitlements to Data Science Workspace.

This tutorial walks you through the required steps to create a model using the JupyterLab notebooks recipe builder template.

Concepts introduced:

Recipes: A recipe is Adobe’s term for a model specification and is a top-level container representing a specific machine learning, AI algorithm or ensemble of algorithms, processing logic, and configuration required to build and execute a trained model.
Model: A model is an instance of a machine learning recipe that is trained using historical data and configurations to solve for a business use case.
Training: Training is the process of learning patterns and insights from labeled data.
Scoring: Scoring is the process of generating insights from data using a trained model.

Download the required assets

Before you proceed with this tutorial, you must create the required schemas and datasets. Visit the tutorial for creating the Luma propensity model schemas and datasets to download the required assets and set up the prerequisites.

Get started with the JupyterLab notebook environment

Creating a recipe from scratch can be done within Data Science Workspace. To start, navigate to Adobe Experience Platform and select the Notebooks tab on the left. To create a new notebook, select the Recipe Builder template from the JupyterLab Launcher.

The Recipe Builder notebook allows you to run training and scoring runs inside the notebook. This gives you the flexibility to change the train() and score() methods between experiments on the training and scoring data. Once you are happy with the training and scoring outputs, you can create a recipe and then publish it as a model using the recipe to model functionality.

NOTE

The Recipe Builder notebook supports working with all file formats, but the create recipe functionality currently supports only Python.

When you select the Recipe Builder notebook from the launcher, the notebook is opened in a new tab.

At the top of the new notebook tab, a toolbar loads containing three additional actions: Train, Score, and Create Recipe. These icons only appear in the Recipe Builder notebook. More information about these actions is provided in the training and scoring section after building your recipe in the notebook.

Get started with the Recipe Builder notebook

In the provided assets folder is a Luma propensity model propensity_model.ipynb. Using the upload notebook option in JupyterLab, upload the provided model and open the notebook.

upload notebook

The remainder of this tutorial covers the following files that are pre-defined in the propensity model notebook:

Requirements file
Configuration files
Training data loader
Scoring data loader
Pipeline file
Evaluator file
Data Saver file

The following video tutorial explains the Luma propensity model notebook:


https://video.tv.adobe.com/v/333570

Transcript

In this video, we are going to walk you through using the Recipe Builder template to train and score a propensity model which will be used to create a recipe. In order to use the Recipe Builder template, some prerequisites are required. For this demo, I have already created my training, scoring, and scoring output datasets. Let’s start by navigating to the Notebooks tab under Data Science and select JupyterLab. Next, if you’re following along, import the Data Science Course Propensity Model notebook. This notebook uses a modified version of the Recipe Builder template found in the JupyterLab launcher. In order for the Recipe Builder template to work, you will need to set the configuration parameters to point to the correct datasets. Additionally, if your model was using any specific libraries, you would want to install the libraries in the optional requirements file. If you are experimenting or building a model for the first time, you should use a different scoring output dataset that is not enabled for profile. Using a profile-enabled dataset will result in additional costs and once a dataset is in profile, it is difficult to remove the data. In this example, we have a working model and have already revised the notebook, hence we are using a profile-enabled dataset. To recap, we have already explored our data to understand the distributions behind the data, what columns exist and their unique values, and also, we know our important features. The next step would be to perform feature engineering to train our model. It is important to note that if feature engineering is done correctly, even simplistic models can produce great results. We start by defining the training data loader which is used to read all our platform data and perform feature engineering. I start by cleaning the data and drop product list items. This model is about predicting the purchase propensity of whether a customer will buy a product or not.
Because we are not looking at specific products, I can drop the product list items. Next, I drop additional columns that contain only a single value or two different values in a single column. This kind of data is not very useful for training our model and will make it a little bit more inconsistent. Once we have dropped any unnecessary data, we can begin the feature engineering process. You might have noticed that the provided dataset does not contain any session information. Normally, to make a prediction like this, we would want to know the current and past session of a particular customer or user. Because we do not have session information, the following code is needed to help differentiate between the previous and the most recent journeys. We do this so that the model can differentiate between the actions certain users take when they buy a product versus just browsing. So if we look at the Journey Demarcation column here, I am using demarcation to define when the past and present journeys should be divided. Once the demarcation is done, I mark whether the customer has ordered in the current session or not. I then proceed to label all the data based on the demarcation to create a journey of the customer, then divide this journey into past and present. Once that is done, I create the features such as total past checkouts, how many times a user checked out in the past, total current checkouts, how many times this user checked out in the current session, and so on. Once we have this data, we proceed to drop any column that we do not need. This should leave us with both the past and current journeys where a user or customer bought a product or did not buy a product, and how they got there. The idea is to use the patterns that lead to a successful checkout to predict if the customer will buy a product based on the past and current features we created. Below the training data loader is the scoring data loader file. 
The reason that these files are separate is because the training data could come from a different source. This means that for the model to score the data, additional pre-processing steps may be needed, such as altering the mapping, to work with our scoring data. Next, we have the pipeline file where all our training code is defined. To evaluate our data, we break it down into two parts using a train-test split. For the training part, we need the X and the labels. For the testing part, we need the Y and the labels. The comparison is done based on the X and Y values. The score function uses the config properties, data, and model. I call model.predict on the data, and after the prediction is done, return the remaining data. The evaluator file below implements the split on our X and Y values. The split method is responsible for splitting our data into the training and holdout set. We use the training set to train our model and the holdout set for the final evaluation. Once the split is returned, I use the evaluate method. I run model.predict X, and based on the prediction value, I try to calculate a number of metrics such as the accuracy score, precision, recall, and F1. Scrolling down, we can look at the returned values for our model once the training is done. Initially, our accuracy and other values might not be within an acceptable range. This is normal. You may have to iterate and tweak hyperparameters or use different data combinations. If you plan on using hyperparameters, they can be added directly to the training and scoring configuration cells as needed. You also have the option to add or modify these parameters once everything is packaged into a recipe. The last cell in this notebook, the data saver file, is used to write our model output results back to the scoring result dataset. The predictions that were made using the predict method are passed along with the config properties. I then convert my data into the format used by my scoring results dataset. 
Once I have the proper format, I use the dataset writer to write my model output to the scoring results dataset. Since this dataset is already enabled for profile, the data is automatically ingested and sent to real-time customer profile. Once all our code follows this structure, we can train and score our model by selecting the train button followed by the score button. The train, score, and create recipe buttons are unique to JupyterLab notebooks on platform. In order to run scoring, at least one successful training run needs to have been completed. Once you are happy with the training and scoring output, you can select the create recipe button. A popover appears asking you to enter a recipe name. After entering a recipe name, select OK and a new popover appears to confirm the recipe creation. It will take some time to create the recipe. You can see the status of the recipe next to the create recipe button while you wait. Once complete, the status bar will change to an icon which links to the latest recipe created using this notebook. The recipe UI contains feature mapping and configuration parameters that we set up in the notebook. If you wish to alter the recipe training or scoring parameters, we can select configuration from within the recipe UI followed by selecting edit within either the training or scoring parameters tab. This allows us to add new values or edit certain existing values as well as upload new configurations. With Data Science Workspace, data scientists can explore data, build models, and operationalize them in the same workflow by packaging their work into reusable recipes. These recipes can then be used by various roles in your enterprise without the explicit need for data science knowledge. Thanks for watching.

Requirements file

The requirements file is used to declare additional libraries you wish to use in the model. You can specify the version number if there is a dependency. To look for additional libraries, visit anaconda.org. To learn how to format the requirements file, see the Conda documentation. The main libraries already in use include:

python=3.6.7
scikit-learn
pandas
numpy
data_access_sdk_python

NOTE

Libraries or specific versions you add may be incompatible with the above libraries. Additionally, if you choose to create an environment file manually, the name field is not allowed to be overridden.

For the Luma propensity model notebook, the requirements do not need to be updated.

Configuration files

The configuration files, training.conf and scoring.conf, are used to specify the datasets you wish to use for training and scoring and to add hyperparameters. There are separate configurations for training and scoring.

In order for a model to run training, you must provide the trainingDataSetId, ACP_DSW_TRAINING_XDM_SCHEMA, and tenantId. Additionally, for scoring, you must provide the scoringDataSetId, tenantId, and scoringResultsDataSetId.
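The required keys listed above can be checked before starting a run. The following is a minimal sketch, not part of the Recipe Builder template: the missing_keys() helper and the placeholder values are hypothetical, and it assumes the configuration has already been parsed into a Python dictionary.

```python
# Required keys per run type, as listed in the paragraph above.
REQUIRED_KEYS = {
    "training": ["trainingDataSetId", "ACP_DSW_TRAINING_XDM_SCHEMA", "tenantId"],
    "scoring": ["scoringDataSetId", "tenantId", "scoringResultsDataSetId"],
}


def missing_keys(config_properties, run_type):
    """Return the required keys that are absent or empty for the given run type.

    Hypothetical helper for illustration; it is not part of the template.
    """
    return [
        key
        for key in REQUIRED_KEYS[run_type]
        if not config_properties.get(key)
    ]


# Placeholder values for illustration only; replace them with your own IDs.
training_conf = {
    "trainingDataSetId": "<TRAINING_DATASET_ID>",
    "ACP_DSW_TRAINING_XDM_SCHEMA": "<SCHEMA_ID>",
    "tenantId": "<TENANT_ID>",
}
scoring_conf = {
    "scoringDataSetId": "<SCORING_DATASET_ID>",
    "tenantId": "<TENANT_ID>",
}

print(missing_keys(training_conf, "training"))  # []
print(missing_keys(scoring_conf, "scoring"))    # ['scoringResultsDataSetId']
```

A check like this fails fast on a missing dataset ID instead of surfacing the problem partway through a training or scoring run.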

To find the dataset and schema IDs, go to the data tab within notebooks on the left navigation bar (under the folder icon). Three different dataset IDs need to be provided. The scoringResultsDataSetId is used to store the model scoring results and should be an empty dataset. These datasets were created previously in the Download the required assets step.

The same information can be found on Adobe Experience Platform under the Schema and Datasets tabs.

Once complete, your training and scoring configuration should look similar to the following screenshot:

configuration

By default, the following configuration parameters are set for you when you train and score data:

ML_FRAMEWORK_IMS_USER_CLIENT_ID
ML_FRAMEWORK_IMS_TOKEN
ML_FRAMEWORK_IMS_ML_TOKEN
ML_FRAMEWORK_IMS_TENANT_ID

Understanding the Training Data Loader

The purpose of the Training Data Loader is to instantiate data used for creating the machine learning model. Typically, there are two tasks that the training data loader accomplishes:

Loading data from Platform
Data preparation and feature engineering

The following two sections will go over loading data and data preparation.

Loading data

This step uses the pandas dataframe. Data can be loaded either from files in Adobe Experience Platform using the Platform SDK (platform_sdk), or from external sources using the pandas read_csv() or read_json() functions.

Platform SDK
External sources

NOTE

In the Recipe Builder notebook, data is loaded via the platform_sdk data loader.

Platform SDK

For an in-depth tutorial on using the platform_sdk data loader, please visit the Platform SDK guide. This tutorial provides information on build authentication, basic reading of data, and basic writing of data.

External sources

This section shows you how to import a JSON or CSV file to a pandas object. See the official pandas documentation for the full list of options.

First, here is an example of importing a CSV file. The data argument is the path to the CSV file. This variable was imported from the configProperties in the previous section.

df = pd.read_csv(data)

You can also import from a JSON file. The data argument is the path to the JSON file. This variable was imported from the configProperties in the previous section.

df = pd.read_json(data)

Now your data is in the dataframe object and can be analyzed and manipulated in the next section.
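To try both readers without any files on disk, the following sketch loads the same two records from in-memory CSV and JSON text. The column names (userId, eventType) and the sample rows are made up for illustration; in the notebook, the data argument would instead be a path taken from the configuration.

```python
import io

import pandas as pd

# In-memory stand-ins for files; both readers accept file-like objects.
csv_data = io.StringIO("userId,eventType\nu1,checkout\nu2,pageView\n")
json_data = io.StringIO(
    '[{"userId": "u1", "eventType": "checkout"},'
    ' {"userId": "u2", "eventType": "pageView"}]'
)

df_csv = pd.read_csv(csv_data)
df_json = pd.read_json(json_data)

print(df_csv.shape)           # (2, 2)
print(list(df_json.columns))  # ['userId', 'eventType']
```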

Training Data Loader File

In this example, data is loaded using the Platform SDK. The library can be imported at the top of the page by including the line:

from platform_sdk.dataset_reader import DatasetReader

You can then use the load() method to grab the training dataset from the trainingDataSetId as set in the training configuration (training.conf) file.

def load(config_properties):
    print("Training Data Load Start")

    #########################################
    # Load Data
    #########################################
    client_context = get_client_context(config_properties)
    dataset_reader = DatasetReader(client_context, dataset_id=config_properties['trainingDataSetId'])

NOTE

As mentioned in the Configuration files section, the following configuration parameters are set for you when you access data from Experience Platform using client_context = get_client_context(config_properties):

ML_FRAMEWORK_IMS_USER_CLIENT_ID
ML_FRAMEWORK_IMS_TOKEN
ML_FRAMEWORK_IMS_ML_TOKEN
ML_FRAMEWORK_IMS_TENANT_ID

Now that you have your data, you can begin with data preparation and feature engineering.

Data preparation and feature engineering

After the data is loaded, the data needs to be cleaned and undergo data preparation. In this example, the goal of the model is to predict whether a customer is going to order a product or not. Because the model is not looking at specific products, you do not need productListItems and therefore the column is dropped. Next, additional columns are dropped that only contain a single value or two values in a single column. When training a model, it’s important to only keep useful data that will assist in predicting your goal.

example of data prep
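The cleaning step described above can be sketched with pandas as follows. This is an illustration under assumptions rather than the notebook's actual code: the helper name, the max_unique threshold, and every column except productListItems are hypothetical.

```python
import pandas as pd


def drop_low_information_columns(df, max_unique=2):
    """Drop productListItems plus any column with at most `max_unique` distinct values."""
    # Remove the product list first; errors="ignore" tolerates its absence.
    df = df.drop(columns=["productListItems"], errors="ignore")
    low_info = [
        col for col in df.columns
        if df[col].nunique(dropna=False) <= max_unique
    ]
    return df.drop(columns=low_info)


# Made-up sample data for illustration.
events = pd.DataFrame({
    "userId": ["u1", "u2", "u3", "u4"],
    "eventType": ["pageView", "addToCart", "checkout", "pageView"],
    "channel": ["web", "web", "web", "web"],      # single value
    "isMobile": [True, False, True, False],       # two values
    "productListItems": ["[...]", "[...]", "[...]", "[...]"],
})

cleaned = drop_low_information_columns(events)
print(list(cleaned.columns))  # ['userId', 'eventType']
```

Note that a binary label column would also be dropped by this rule, so a filter like this belongs before the labels are created.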

Once you have dropped any unnecessary data, you can begin feature engineering. The demo data used for this example does not contain any session information. Normally, you would want to have data on the current and past sessions for a particular customer. Due to the lack of session information, this example instead mimics current and past sessions via journey demarcation.

Journey demarcation

After the demarcation is complete, the data is labeled and a journey is created.

label the data

Next, the features are created and divided into past and present. Then, any columns that are unnecessary are dropped, leaving you with both the past and current journeys for Luma customers. These journeys contain information such as whether a customer purchased an item and the journey they took leading up to the purchase.

final current training
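The idea of dividing each customer's events into past and current journeys and then counting checkouts in each can be sketched as below. The notebook's actual demarcation logic is not reproduced here. This sketch assumes a single fixed cutoff timestamp as the demarcation rule, and the column names (userId, timestamp, eventType) and feature names are hypothetical.

```python
import pandas as pd


def build_journey_features(events, cutoff):
    """Count past and current checkouts per user, split at an assumed cutoff."""
    events = events.copy()
    events["timestamp"] = pd.to_datetime(events["timestamp"])

    # Assumed demarcation rule: events at or after the cutoff are "current".
    is_current = events["timestamp"] >= pd.Timestamp(cutoff)
    is_checkout = events["eventType"] == "checkout"
    events["past_checkout"] = (is_checkout & ~is_current).astype(int)
    events["current_checkout"] = (is_checkout & is_current).astype(int)

    features = events.groupby("userId", as_index=False).agg(
        total_past_checkouts=("past_checkout", "sum"),
        total_current_checkouts=("current_checkout", "sum"),
    )
    # Label: did the user order during the current journey?
    features["ordered_in_current"] = (
        features["total_current_checkouts"] > 0
    ).astype(int)
    return features


# Made-up sample events for illustration.
events = pd.DataFrame({
    "userId": ["u1", "u1", "u1", "u2", "u2"],
    "timestamp": ["2024-01-01", "2024-01-10", "2024-02-05", "2024-01-03", "2024-02-02"],
    "eventType": ["checkout", "pageView", "checkout", "pageView", "pageView"],
})

features = build_journey_features(events, cutoff="2024-02-01")
print(features)
```

In this sample, u1 has one past and one current checkout and is labeled 1, while u2 has no checkouts and is labeled 0.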

Scoring data loader

The procedure to load data for scoring is similar to loading training data. Looking closely at the code, you can see that everything is the same except for the scoringDataSetId in the dataset_reader. This is because the same Luma data source is used for both training and scoring.

The training and scoring data loaders are kept separate in case you want to use different data files for training and scoring. This allows you to perform additional pre-processing, such as mapping your training data to your scoring data, if necessary.

Pipeline file

The pipeline.py file includes logic for training and scoring.

The purpose of training is to create a model using the features and labels in your training dataset. After choosing your training model, you fit the model to your x (features) and y (labels) training data, and the function returns the trained model.

NOTE

Features refer to the input variables used by the machine learning model to predict the labels.

def train
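A train() function of this shape might look like the following sketch. It is not the notebook's implementation: the (config_properties, data) signature, the choice of a scikit-learn random forest, the "label" and "n_estimators" configuration keys, and the column names are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def train(config_properties, data):
    """Fit a classifier on the feature columns and return the trained model."""
    # Hypothetical config keys; the notebook may name these differently.
    label_column = config_properties.get("label", "ordered_in_current")
    n_estimators = int(config_properties.get("n_estimators", 50))

    X = data.drop(columns=[label_column])  # features
    y = data[label_column]                 # labels

    model = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    model.fit(X, y)
    return model


# Made-up training data for illustration.
training_data = pd.DataFrame({
    "total_past_checkouts": [0, 0, 3, 4, 1, 5],
    "total_current_page_views": [1, 2, 6, 7, 1, 9],
    "ordered_in_current": [0, 0, 1, 1, 0, 1],
})

model = train({"n_estimators": 10}, training_data)
print(type(model).__name__)  # RandomForestClassifier
```

Reading hyperparameters from the configuration keeps them editable later without touching the training code.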

The score() function should contain the scoring algorithm and return a measurement that indicates how well the model performs. In general, the score() function uses the features in the scoring dataset and the trained model to generate a set of predicted labels, which can then be compared with the actual labels in the scoring dataset. In this example, the score() function uses the trained model to predict labels from the features in the scoring dataset, and the predictions are returned.

def score
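As a rough sketch of the scoring step, the function below calls model.predict() on the scoring features and returns the data with the predictions attached. The (config_properties, data, model) parameter order follows the video transcript. The "prediction" column name, the decision tree stand-in model, and the sample data are assumptions.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier


def score(config_properties, data, model):
    """Predict a label for each row of `data` and return the rows with predictions."""
    predictions = model.predict(data)
    scored = data.copy()
    scored["prediction"] = predictions  # hypothetical output column name
    return scored


# A tiny stand-in model so the example is self-contained.
train_features = pd.DataFrame({"total_past_checkouts": [0, 0, 3, 4]})
train_labels = [0, 0, 1, 1]
model = DecisionTreeClassifier(random_state=0).fit(train_features, train_labels)

scoring_data = pd.DataFrame({"total_past_checkouts": [0, 5]})
scored = score({}, scoring_data, model)
print(scored["prediction"].tolist())  # [0, 1]
```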

Evaluator file

The evaluator.py file contains logic for how you wish to evaluate your trained recipe as well as how your training data should be split.

Split the dataset

The data preparation phase for training requires splitting the dataset into a portion used for training and a portion used for testing. This validation data is used implicitly to evaluate the model after it is trained. This process is separate from scoring.

This section shows the split() function which loads data into the notebook, then cleans up the data by removing unrelated columns in the dataset. From there, you can perform feature engineering which is the process to create additional relevant features from existing raw features in the data.

Split function
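The splitting part of this step can be sketched with scikit-learn's train_test_split(). The standalone function signature, the 80/20 ratio, and the sample data are assumptions; the notebook's split() also loads and cleans the data, which is omitted here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split(data, test_size=0.2, random_state=0):
    """Split prepared data into a training set and a holdout (validation) set."""
    train_df, val_df = train_test_split(
        data, test_size=test_size, random_state=random_state
    )
    return train_df, val_df


# Made-up prepared data for illustration.
prepared = pd.DataFrame({
    "total_past_checkouts": range(10),
    "ordered_in_current": [0, 1] * 5,
})

train_df, val_df = split(prepared)
print(len(train_df), len(val_df))  # 8 2
```

Fixing random_state makes the split reproducible between runs, which helps when comparing metrics across experiments.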

Evaluate the trained model

The evaluate() function is performed after the model is trained and returns a metric that indicates how well the model performs. The evaluate() function uses the features in the testing dataset and the trained model to predict a set of labels. These predicted values are then compared with the actual labels in the testing dataset. In this example, the metrics used are precision, recall, f1, and accuracy. Notice that the function returns a metric object containing an array of evaluation metrics. These metrics are used to evaluate how well the trained model performs.

evaluate
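The four metrics named above are available in scikit-learn. The sketch below computes them from actual and predicted labels and returns them as a list of name/value entries. Taking the labels directly (rather than a model and a test dataframe) and the exact layout of the returned entries are assumptions made to keep the example self-contained.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def evaluate(y_true, y_pred):
    """Compare predicted labels with actual labels and return a list of metrics."""
    return [
        {"name": "precision", "value": float(precision_score(y_true, y_pred))},
        {"name": "recall", "value": float(recall_score(y_true, y_pred))},
        {"name": "f1", "value": float(f1_score(y_true, y_pred))},
        {"name": "accuracy", "value": float(accuracy_score(y_true, y_pred))},
    ]


# Made-up labels: 2 true positives, 1 false negative, 1 false positive, 2 true negatives.
actual = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 0, 1]

metric = evaluate(actual, predicted)
print(metric)
```

With these sample labels, all four metrics work out to 2/3.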

Adding print(metric) allows you to view the metric results.

metric results

Data Saver file

The datasaver.py file contains the save() function, which is used to save your predictions while testing scoring. The save() function takes your prediction and, using Experience Platform Catalog APIs, writes the data to the scoringResultsDataSetId you specified in your scoring.conf file.

Data saver
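The general shape of a save() step (reformat the predictions, then hand them to a writer) can be sketched as follows. This does not call the Platform SDK: write_fn is a hypothetical stand-in for the dataset writer, and prefixing column names with the tenant ID is an assumption about the output format that your scoring results schema may not share.

```python
import pandas as pd


def save(config_properties, prediction, write_fn):
    """Reformat predictions for the results dataset and pass them to a writer.

    `write_fn(dataset_id, frame)` is a hypothetical stand-in for the real
    dataset writer so that this sketch runs outside Experience Platform.
    """
    tenant_id = config_properties["tenantId"]
    # Assumed output format: fields nested under the tenant namespace.
    output = prediction.add_prefix(f"{tenant_id}.")
    write_fn(config_properties["scoringResultsDataSetId"], output)
    return output


# Collect writes in memory instead of sending them anywhere.
written = []


def fake_writer(dataset_id, frame):
    written.append((dataset_id, frame))


config = {"tenantId": "_tenant", "scoringResultsDataSetId": "<RESULTS_DATASET_ID>"}
prediction = pd.DataFrame({"userId": ["u1", "u2"], "prediction": [0, 1]})

output = save(config, prediction, fake_writer)
print(list(output.columns))  # ['_tenant.userId', '_tenant.prediction']
```

Passing the writer in as a parameter makes the formatting logic easy to test locally before pointing it at a real results dataset.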

Training and scoring

When you are done making changes to your notebook and want to train your recipe, select the associated button in the toolbar at the top to create a training run in the cell. Upon selecting the button, a log of commands and outputs from the training script appears in the notebook (under the evaluator.py cell). Conda first installs all the dependencies, then the training is initiated.

Note that you must run training at least once before you can run scoring. Selecting the Run Scoring button will score on the trained model that was generated during training. The scoring script appears under datasaver.py.

For debugging purposes, if you wish to see the hidden output, add debug to the end of the output cell and re-run it.

train and score

Create a recipe

When you are done editing the recipe and satisfied with the training/scoring output, you can create a recipe from the notebook by selecting Create Recipe in the top-right.

After selecting Create Recipe, you are prompted to enter a recipe name. This name represents the actual recipe created on Platform.

Once you select OK, the recipe creation process begins. This can take some time, and a progress bar is displayed in place of the Create Recipe button. Once complete, you can select the View Recipes button to go to the Recipes tab under ML Models.

CAUTION

Do not delete any of the file cells
Do not edit the %%writefile line at the top of the file cells
Do not create recipes in different notebooks at the same time

Next steps

By completing this tutorial, you have learned how to create a machine learning model in the Recipe Builder notebook. You have also learned how to exercise the notebook to recipe workflow.

To continue learning how to work with resources within Data Science Workspace, please visit the Data Science Workspace recipes and models dropdown.
