This document provides a walkthrough for Adobe Experience Platform Data Science Workspace. It outlines a general data science workflow and shows how a data scientist might approach and solve a problem using machine learning.

- A registered Adobe ID account
- The Adobe ID account must have been added to an Organization with access to Adobe Experience Platform and Data Science Workspace.

A retailer faces many challenges in staying competitive in the current market. One of the retailer’s main concerns is deciding on the optimal pricing of a product and predicting sales trends. With an accurate prediction model, a retailer can find the relationship between demand and pricing policies and make optimized pricing decisions to maximize sales and revenue.

A data scientist’s solution is to use the wealth of historical information provided by the retailer to predict future trends and optimize pricing decisions. This walkthrough uses past sales data to train a machine learning model and then uses the model to predict future sales trends. With these predictions, you can generate insights that help you make optimal pricing changes.

This overview mirrors the steps a data scientist would take to go from a dataset to a model that predicts weekly sales. It follows the Sample Retail Sales Notebook on Adobe Experience Platform Data Science Workspace through four stages: loading the required libraries, exploring the data, feature engineering, and training and verifying models.

In the Adobe Experience Platform UI, select **Notebooks** from within the **Data Science** tab to open the Notebooks overview page. From this page, select the **JupyterLab** tab to launch your JupyterLab environment. The default landing page for JupyterLab is the **Launcher**.

This tutorial uses Python 3 in JupyterLab Notebooks to show how to access and explore the data. The Launcher page provides several sample notebooks. The **Retail Sales** sample notebook is used in the examples below.

With the Retail Sales notebook opened, the first thing you should do is to load the libraries required for your workflow. The following list gives a short description for each of the libraries used in the examples in later steps.

- **numpy**: Scientific computing library that adds support for large, multi-dimensional arrays and matrices
- **pandas**: Library that offers data structures and operations used for data manipulation and analysis
- **matplotlib.pyplot**: Plotting library that provides a MATLAB-like experience when plotting
- **seaborn**: High-level interface data visualization library based on matplotlib
- **sklearn**: Machine learning library that features classification, regression, support vector, and cluster algorithms
- **warnings**: Library that controls warning messages
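The import cell might look like the following minimal sketch. It assumes the listed libraries are installed in the notebook kernel; the aliases (`np`, `pd`, `plt`, `sns`) are common conventions, not names confirmed by the notebook.

```python
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

# Suppress warning messages so they do not clutter the notebook output.
warnings.filterwarnings("ignore")
```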

After the libraries are loaded, you can start looking at the data. The following Python code uses the pandas `DataFrame` data structure and the `read_csv()` function to read the CSV hosted on GitHub into a pandas DataFrame:
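The URL of the hosted file is not reproduced in this walkthrough, so the sketch below reads a small inline CSV instead. The column names are the ones mentioned later in this document; the values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the hosted CSV (illustrative values only).
csv_text = """date,store,storeType,weeklySales,isHoliday
2010-02-05,1,A,24924.50,False
2010-02-12,1,A,46039.49,True
2010-02-05,2,B,35034.06,False
"""

# read_csv accepts a URL, a file path, or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```

With the real dataset, you would pass the CSV's URL to `pd.read_csv()` in place of the `StringIO` object.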

The pandas DataFrame is a two-dimensional labeled data structure. To quickly see the dimensions of your data, you can use `df.shape`. This returns a tuple that represents the dimensionality of the DataFrame:

Finally, you can preview what your data looks like. You can use `df.head(n)` to view the first `n` rows of the DataFrame:
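As a small illustration of both calls, the sketch below uses a toy DataFrame rather than the retail dataset, so the printed values are not the notebook's output.

```python
import pandas as pd

# Toy DataFrame standing in for the retail dataset (illustrative values).
df = pd.DataFrame({
    "store": [1, 1, 2, 2],
    "weeklySales": [100.0, 120.0, 90.0, 95.0],
})

rows, columns = df.shape   # (number of rows, number of columns)
preview = df.head(2)       # first 2 rows; head() defaults to 5 rows

print(rows, columns)
print(preview)
```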

You can use Python’s pandas library to get the data type of each attribute. The output of the following call gives you the number of entries and the data type for each of the columns:

```
df.info()
```

This information is useful because knowing the data type of each column tells you how to treat the data.

Now look at the statistical summary. Only the numeric data types are shown, so `date`, `storeType`, and `isHoliday` are not included in the output:

```
df.describe()
```

With this, you can see there are 6435 instances for each characteristic. Additionally, statistical information such as the mean, standard deviation (std), min, max, and quartiles is given. This tells you how the data is spread. The next section covers visualization, which works together with this information to give you a complete understanding of your data.
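To see why the non-numeric columns drop out of the summary, the sketch below runs `describe()` on a toy DataFrame. It assumes `date` is stored as a string, as it would be straight after reading a CSV; the values are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2010-02-05", "2010-02-12", "2010-02-19"],  # strings, not datetimes
    "storeType": ["A", "A", "B"],
    "isHoliday": [False, True, False],
    "weeklySales": [100.0, 120.0, 90.0],
})

# With mixed column types, describe() summarizes only the numeric columns.
summary = df.describe()
print(summary.columns.tolist())
```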

Looking at the minimum and maximum values for `store`, you can see that the data represents 45 unique stores. There is also a `storeType` column that categorizes each store. You can see the distribution of `storeType` values by doing the following:
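One way to compute this distribution is sketched below on toy data. The exact call used in the notebook is not shown in this document, so treat the `groupby` and `nunique` combination as an assumption.

```python
import pandas as pd

# Toy data: one row per store per week (illustrative values).
df = pd.DataFrame({
    "store":     [1, 1, 2, 2, 3, 3],
    "storeType": ["A", "A", "A", "A", "B", "B"],
})

# Count distinct stores per type so repeated weekly rows are not double-counted.
stores_per_type = df.groupby("storeType")["store"].nunique()
print(stores_per_type)
```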

This means 22 stores are of `storeType A`, 17 are `storeType B`, and 6 are `storeType C`.

Now that you know the values in your DataFrame, you can supplement them with visualizations that make patterns clearer and easier to identify. These graphs are also useful when conveying results to an audience.

Univariate graphs are plots of an individual variable. A common univariate graph used to visualize data is the box and whisker plot.

Using your retail dataset from before, you can generate the box and whisker plot for each of the 45 stores and their weekly sales. The plot is generated using the `seaborn.boxplot` function.
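A minimal sketch of such a call is shown below on a two-store toy DataFrame. The figure size and title are arbitrary choices, and the `Agg` backend line is only there so the example runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy data standing in for the retail dataset (illustrative values).
df = pd.DataFrame({
    "store": [1, 1, 1, 1, 2, 2, 2, 2],
    "weeklySales": [100.0, 110.0, 120.0, 300.0, 80.0, 85.0, 90.0, 95.0],
})

# One box per store, summarizing that store's weekly sales distribution.
fig, ax = plt.subplots(figsize=(8, 4))
sns.boxplot(x="store", y="weeklySales", data=df, ax=ax)
ax.set_title("Weekly sales by store")
```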

A box and whisker plot is used to show the distribution of data. The edges of the box mark the lower and upper quartiles, so the box spans the interquartile range, and the line in the box marks the median. The whiskers extend to the most extreme points that lie within 1.5 times the interquartile range of the box. Any points beyond the whiskers are drawn individually and are considered outliers.
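Under the common convention (the default in matplotlib and seaborn), a point is flagged as an outlier when it lies more than 1.5 times the interquartile range beyond either quartile. The sketch below works through that arithmetic on a small made-up sample.

```python
import numpy as np

sales = np.array([10, 12, 13, 14, 15, 16, 18, 40], dtype=float)

# Quartiles and interquartile range (IQR).
q1, median, q3 = np.percentile(sales, [25, 50, 75])
iqr = q3 - q1

# Points outside these fences are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = sales[(sales < lower_fence) | (sales > upper_fence)]

print(q1, median, q3, iqr)
print(outliers)
```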

Next, you can plot the weekly sales over time. The code in the notebook generates 6 plots corresponding to 6 of the 45 stores in the dataset. Only the output for the first store is shown here.
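A single-store version of such a plot could be sketched as follows. This is not the notebook's code: the toy data, figure size, and labels are assumptions, and the `Agg` backend line only matters when running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy data standing in for the retail dataset (illustrative values).
df = pd.DataFrame({
    "date": pd.to_datetime(["2010-02-05", "2010-02-12", "2010-02-19", "2010-02-26"]),
    "store": [1, 1, 1, 1],
    "weeklySales": [100.0, 140.0, 90.0, 110.0],
})

# Select one store and order its rows chronologically before plotting.
store_1 = df[df["store"] == 1].sort_values("date")

fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(store_1["date"], store_1["weeklySales"])
ax.set_title("Store 1")
ax.set_xlabel("date")
ax.set_ylabel("weeklySales")
```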

With this diagram, you can compare the weekly sales over a period of 2 years. It is easy to see the peaks and troughs in sales over time.

Multivariate plots are used to see the interaction between variables. With the visualization, data scientists can see if there are any correlations or patterns between the variables. A common multivariate graph used is a correlation matrix. With a correlation matrix, dependencies between multiple variables are quantified with the correlation coefficient.

Using the same retail dataset, you can generate the correlation matrix.
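As a sketch of the computation, pandas can produce the coefficient matrix directly. The toy columns below, including `temperature`, are hypothetical and only illustrate the call.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "weeklySales": [100.0, 120.0, 90.0, 130.0, 95.0],
    "temperature": [40.0, 45.0, 38.0, 50.0, 39.0],   # hypothetical column
    "storeType": ["A", "A", "B", "B", "C"],
})

# Correlation is only defined for numeric columns, so select those first.
corr = df.select_dtypes(include="number").corr()
print(corr)
```

The resulting matrix can then be drawn as a heatmap, for example with `seaborn.heatmap(corr, annot=True)`.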

Notice the diagonal of ones down the center. This shows that when comparing a variable to itself, it has complete positive correlation. Strong positive correlation will have a magnitude closer to 1 while weak correlations will be closer to 0. Negative correlation is shown with a negative coefficient showing an inverse trend.

In this section, feature engineering is used to make modifications to your Retail dataset by performing the following operations:

- Add week and year columns
- Convert storeType to an indicator variable
- Convert isHoliday to a numeric variable
- Predict weeklySales of next week

The current format for date (`2010-02-05`) can make it hard to see that the data is weekly. Because of this, you should convert the date into week and year columns.
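One way to derive these columns is sketched below. It assumes ISO week numbering via `dt.isocalendar()`; the notebook may use a different accessor.

```python
import pandas as pd

df = pd.DataFrame({"date": ["2010-02-05", "2010-02-12"]})

# Parse the strings into datetimes, then extract ISO week and year.
df["date"] = pd.to_datetime(df["date"])
iso = df["date"].dt.isocalendar()      # columns: year, week, day
df["week"] = iso["week"].astype(int)
df["year"] = iso["year"].astype(int)   # ISO year; may differ from the calendar year around January 1

print(df)
```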

Now the week and year are as follows:

Next, you want to convert the `storeType` column into one column per store type. There are 3 store types (`A`, `B`, `C`), so you create 3 new columns. Each new column holds an indicator value: `1` in the column that matches the row’s `storeType` and `0` in the other 2 columns.

The current `storeType` column is dropped.
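A minimal sketch of this conversion uses `pandas.get_dummies`. The resulting column names (`storeType_A` and so on) come from the `prefix` argument chosen here and are not necessarily the names used in the notebook.

```python
import pandas as pd

df = pd.DataFrame({"store": [1, 2, 3], "storeType": ["A", "B", "C"]})

# One 0/1 indicator column per store type.
indicators = pd.get_dummies(df["storeType"], prefix="storeType", dtype=int)

# Replace the original column with the indicator columns.
df = pd.concat([df.drop(columns="storeType"), indicators], axis=1)
print(df)
```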

The next modification is to change the `isHoliday` boolean to a numerical representation.
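For example, casting the boolean column to integers maps `False` to `0` and `True` to `1`. This is a sketch on toy data; the notebook may perform the conversion differently.

```python
import pandas as pd

df = pd.DataFrame({"isHoliday": [False, True, False]})

# Cast booleans to integers: False -> 0, True -> 1.
df["isHoliday"] = df["isHoliday"].astype(int)
print(df)
```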

Now you want to add the previous and future weekly sales to each row of your dataset. You can do this by offsetting your `weeklySales` column. Additionally, the `weeklySales` difference is calculated by subtracting the previous week’s `weeklySales` from the current week’s `weeklySales`.

Since you are offsetting the `weeklySales` data 45 rows forwards and 45 rows backwards to create the new columns, the first and last 45 rows contain NaN values. You can remove these rows from your dataset with the `df.dropna()` function, which removes all rows that have NaN values.
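The offsets described above can be sketched with `Series.shift`. This example assumes the rows are sorted by week with exactly one row per store per week, so a shift of 45 rows lands on the same store one week away. The names `weeklySalesLag` and `weeklySalesDiff` are hypothetical; only `weeklySalesAhead` appears in this document.

```python
import pandas as pd

n_stores, n_weeks = 45, 4

# Synthetic data sorted by week, then store (illustrative values).
rows = [
    {"week": week, "store": store, "weeklySales": 1000.0 * week + store}
    for week in range(1, n_weeks + 1)
    for store in range(1, n_stores + 1)
]
df = pd.DataFrame(rows)

df["weeklySalesLag"] = df["weeklySales"].shift(n_stores)      # previous week
df["weeklySalesAhead"] = df["weeklySales"].shift(-n_stores)   # next week
df["weeklySalesDiff"] = df["weeklySales"] - df["weeklySalesLag"]

# The first and last 45 rows have no neighbor to copy from, so they hold NaN.
df = df.dropna()
print(len(df))
```

If the data were not perfectly aligned in this way, grouping by `store` before shifting would be a safer choice.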

A summary of the dataset after your modifications is shown below:

Now it is time to create some models of the data and select the best performer for predicting future sales. You will evaluate the following five algorithms:

- Linear Regression
- Decision Tree
- Random Forest
- Gradient Boosting
- K Neighbors

You need a way to know how accurately your model predicts values. You can evaluate this by setting aside part of the dataset for validation and using the rest as training data. Since `weeklySalesAhead` holds the actual future values of `weeklySales`, you can use it to evaluate how accurately the model predicts the value. The splitting is done below:
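A chronological split could be sketched as follows. The 80/20 ratio and the toy data are assumptions for illustration; the notebook's actual split point is not shown in this document.

```python
import numpy as np
import pandas as pd

# Toy weekly data, already in time order (illustrative values).
df = pd.DataFrame({
    "week": np.arange(1, 11),
    "weeklySales": np.linspace(100.0, 190.0, 10),
})
df["weeklySalesAhead"] = df["weeklySales"].shift(-1)
df = df.dropna()   # the last week has no future value

X = df.drop(columns="weeklySalesAhead")   # features
y = df["weeklySalesAhead"]                # target

# Keep time order: earlier rows for training, later rows for testing.
split = int(len(df) * 0.8)
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y.iloc[:split], y.iloc[split:]
```

Splitting by position rather than at random avoids training on weeks that come after the ones being predicted.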

You now have `X_train` and `y_train` for preparing the models and `X_test` and `y_test` for evaluation later.

In this section, you declare all the algorithms in an array called `model`. Next, you iterate through this array and, for each algorithm, fit your training data with `model.fit()`, which creates a model `mdl`. Using this model, you can predict `weeklySalesAhead` with your `X_test` data.
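The loop described above can be sketched with scikit-learn regressors. The synthetic data, the `random_state` values, and the `predictions` dictionary are assumptions for illustration; the notebook's hyperparameters are not shown in this document.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the engineered retail features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 100 + 5 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, size=200)
X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

# The five algorithms to compare.
model = [
    LinearRegression(),
    DecisionTreeRegressor(random_state=0),
    RandomForestRegressor(random_state=0),
    GradientBoostingRegressor(random_state=0),
    KNeighborsRegressor(),
]

predictions = {}
for algorithm in model:
    mdl = algorithm.fit(X_train, y_train)                  # fit() returns the fitted estimator
    predictions[type(mdl).__name__] = mdl.predict(X_test)  # predictions for the held-out rows
```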

For the scoring, you take the mean percentage difference between the predicted `weeklySalesAhead` and the actual values in the `y_test` data. Since you want to minimize the difference between your prediction and the actual outcome, the model with the lowest score performs best. In this walkthrough, that is the Gradient Boosting Regressor.
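One plausible reading of "mean percentage difference" is the mean absolute percentage error, sketched below. The function name, the model labels, and the numbers are hypothetical, and the notebook's exact formula may differ.

```python
import numpy as np

def mean_percentage_difference(y_true, y_pred):
    """Mean absolute difference between prediction and actual, as a percentage of the actual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_pred - y_true) / np.abs(y_true)) * 100)

y_test = [100.0, 200.0]
scores = {
    "model_a": mean_percentage_difference(y_test, [110.0, 180.0]),  # off by 10% on each point
    "model_b": mean_percentage_difference(y_test, [150.0, 100.0]),  # off by 50% on each point
}

# Lower is better, so pick the model with the smallest score.
best = min(scores, key=scores.get)
print(scores, best)
```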

Finally, you visualize your prediction model against the actual weekly sales values. The blue line represents the actual numbers, while the green line represents your prediction using Gradient Boosting. The following code generates 6 plots which represent 6 of the 45 stores in your dataset. Only `Store 1` is shown here:

This document covered a general data scientist workflow to solve a retail sales problem. To summarize:

- Load the libraries required for your workflow.
- After the libraries are loaded, you can start looking at the data using statistical summaries, visualizations, and graphs.
- Next, feature engineering is used to make modifications to your retail dataset.
- Lastly, create models of the data and select which model is the best performer for predicting future sales.

Once you are ready, start by reading the JupyterLab user guide for a quick overview of notebooks in Adobe Experience Platform Data Science Workspace. Additionally, if you are interested in learning about Models and Recipes, start by reading the retail sales schema and dataset tutorial.