This document provides a walkthrough for Adobe Experience Platform Data Science Workspace. This tutorial outlines a general data scientist workflow and how they might approach and solve a problem using machine learning.
A retailer faces many challenges to stay competitive in the current market. One of the retailer’s main concerns is to decide on the optimal pricing of a product and to predict sale trends. With an accurate prediction model, a retailer would be able to find the relationship between demand and pricing policies and make optimized pricing decisions to maximize sales and revenue.
A data scientist’s solution is to leverage the wealth of historical information provided by a retailer, to predict future trends and to optimize pricing decisions. This walkthrough uses past sales data to train a machine learning model and uses the model to predict future sale trends. With this, you can generate insights to help make optimal pricing changes.
This overview mirrors the steps a data scientist would go through to take a dataset and to create a model to predict weekly sales. This tutorial covers the following sections in the Sample Retail Sales Notebook on Adobe Experience Platform Data Science Workspace:
In the Adobe Experience Platform UI, select Notebooks from within the Data Science tab, to bring you to the Notebooks overview page. From this page, select the JupyterLab tab to launch your JupyterLab environment. The default landing page for JupyterLab is the Launcher.
This tutorial uses Python 3 in JupyterLab Notebooks to show how to access and explore the data. In the Launcher page there are sample notebooks provided. The Retail Sales sample notebook is used in the examples provided below.
With the Retail Sales notebook opened, the first thing you should do is to load the libraries required for your workflow. The following list gives a short description for each of the libraries used in the examples in later steps.
After the libraries are loaded, you can start looking at the data. The following Python code uses pandas’
DataFrame data structure and the read_csv() function to read the CSV hosted on Github into the pandas DataFrame:
Pandas’ DataFrame data structure is a two-dimensional labeled data structure. To quickly see the dimensions of your data, you can use
df.shape. This returns a tuple that represents the dimensionality of the DataFrame:
Finally, you can preview what your data looks like. You can use
df.head(n) to view the first
n rows of the DataFrame:
We can leverage Python’s pandas library to get the data type of each attribute. The output of the following call will give us information about the number of entries and the data type for each of the columns:
This information is useful since knowing the data type for each column will enable us to know how to treat the data.
Now let’s look at the statistical summary. Only the numeric data types will be shown so
isHoliday will not be outputted:
With this, you can see there are 6435 instances for each characteristic. Additionally, statistical information such as mean, standard deviation (std), min, max, and interquartiles are given. This gives us information about the deviation for the data. In the next section, you will go over visualization which works together with this information to give us a complete understanding of your data.
Looking at the minimum and maximum values for
store, you can see that there are 45 unique stores the data represents. There are also
storeTypes which differentiate what a store is. you can see the distribution of
storeTypes by doing the following:
This means 22 stores are of
storeType A , 17 are
storeType B, and 6 are
Now that you know your data frame values, you want to supplement this with visualizations to make things clearer and easier to identify patterns. These graphs are also useful when conveying results to an audience.
Univariate graphs are plots of an individual variable. A common univariate graph used to visualize your data are box and whisker plots.
Using your retail dataset from before, you can generate the box and whisker plot for each of the 45 stores and their weekly sales. The plot is generated using the
A box and whisker plot is used to show the distribution of data. The outer lines of the plot show the upper and lower quartiles while the box spans the interquartile range. The line in the box marks the median. Any points of data more than 1.5 times the upper or lower quartile are marked as a circle. These points are considered outliers.
Next, you can plot the weekly sales with time. You will only show the output of the first store. The code in the notebook generates 6 plots corresponding to 6 of the 45 stores in our dataset.
With this diagram, you can compare the weekly sales over a period of 2 years. It is easy to see sale peaks and trough patterns over time.
Multivariate plots are used to see the interaction between variables. With the visualization, data scientists can see if there are any correlations or patterns between the variables. A common multivariate graph used is a correlation matrix. With a correlation matrix, dependencies between multiple variables are quantified with the correlation coefficient.
Using the same retail dataset, you can generate the correlation matrix.
Notice the diagonal of ones down the center. This shows that when comparing a variable to itself, it has complete positive correlation. Strong positive correlation will have a magnitude closer to 1 while weak correlations will be closer to 0. Negative correlation is shown with a negative coefficient showing an inverse trend.
In this section, feature engineering is used to make modifications to your Retail dataset by performing the following operations:
The current format for date (
2010-02-05) can make it hard to differentiate that the data is for every week. Because of this, you should convert the date to contain week and year.
Now the week and date are as follows:
Next, you want to convert the storeType column to columns representing each
storeType. There are 3 store types, (
C), from which you are creating 3 new columns. The value set in each is a boolean value where a ‘1’ is set depending on what the
storeType was and
0 for the other 2 columns.
storeType column is dropped.
The next modification is to change the
isHoliday boolean to a numerical representation.
Now you want to add previous and future weekly sales to each of your datasets. You can do this by offsetting your
weeklySales. Additionally, the
weeklySales difference is calculated. This is done by subtracting
weeklySales with the previous week’s
Since you are offsetting the
weeklySales data 45 datasets forwards and 45 datasets backwards to create new columns, the first and last 45 data points have NaN values. You can remove these points from your dataset by using the
df.dropna() function which removes all rows that have NaN values.
A summary of the dataset after your modifications is shown below:
Now, it is time to create some models of the data and select which model is the best performer for predicting future sales. You will evaluate the 5 following algorithms:
You need a way to know how accurate your model will be able to predict values. This evaluation can be done by allocating part of dataset to use as validation and the rest as training data. Since
weeklySalesAhead is the actual future values of
weeklySales, you can use this to evaluate how accurate the model is at predicting the value. The splitting is done below:
You now have
y_train for preparing the models and
y_test for evaluation later.
In this section, you declare all the algorithms into an array called
model. Next, you iterate through this array and for each algorithm, input your training data with
model.fit() which creates a model
mdl. Using this model, you can predict
weeklySalesAhead with your
For the scoring, you are taking the mean percentage difference between the predicted
weeklySalesAhead with the actual values in the
y_test data. Since you want to minimize the difference between your prediction and the actual outcome, Gradient Boosting Regressor is the best performing model.
Finally, you visualize your prediction model with the actual weekly sales values. The blue line represents the actual numbers, while the green represents your prediction using Gradient Boosting. The following code generates 6 plots which represent 6 of the 45 stores in your dataset. Only
Store 1 is shown here:
This document covered a general data scientist workflow to solve a retail sales problem. To summarize:
Once you are ready, start by reading the JupyterLab user guide for a quick overview of notebooks in Adobe Experience Platform Data Science Workspace. Additionally, if you are interested in learning about Models and Recipes, start by reading the retail sales schema and dataset tutorial.