Choose your workspace
When launching JupyterLab, we are presented with a web-based interface for Jupyter Notebooks. Depending on which type of notebook we pick, a corresponding kernel will be launched.
When choosing an environment, we must consider each service’s limitations. For example, if we are using the pandas library with Python, the RAM limit is 2 GB for a regular user and 20 GB for a power user. For larger computations, it makes sense to use Spark, which offers 1.5 TB of RAM shared across all notebook instances.
By default, TensorFlow recipes work in a GPU cluster and Python runs within a CPU cluster.
Create a new notebook
In the Adobe Experience Platform UI, select Data Science in the top menu to take you to the Data Science Workspace. From this page, select JupyterLab to open the JupyterLab launcher. You should see a page similar to this.
In our tutorial, we will be using Python 3 in the Jupyter Notebook to show how to access and explore the data. In the Launcher page, there are sample notebooks provided. We will be using the Retail Sales recipe for Python 3.
The Retail Sales recipe is a standalone example which uses the same Retail Sales dataset to show how data can be explored and visualized in Jupyter Notebook. Additionally, the notebook goes further in depth with training and verification. More information about this specific notebook can be found in this walkthrough.
Access data
The data_access_sdk_python library is deprecated and no longer recommended. Please refer to the converting data access SDK to Platform SDK tutorial to convert your code. The steps below still apply to this tutorial.
We will go over accessing data both internally from Adobe Experience Platform and externally. We will use the data_access_sdk_python library to access internal data such as datasets and XDM schemas. For external data, we will use the pandas Python library.
External data
With the Retail Sales notebook open, find the “Load Data” header. The following Python code uses pandas’ DataFrame
data structure and the read_csv() function to read the CSV hosted on GitHub into the DataFrame:
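The notebook’s cell reads the dataset from its GitHub URL, which is not reproduced here. A minimal sketch of the same pattern, using an in-memory CSV with hypothetical Retail Sales-style columns in place of the hosted file:

```python
import io
import pandas as pd

# Hypothetical sample standing in for the GitHub-hosted Retail Sales CSV;
# in the notebook, the argument to read_csv() is the file's raw URL instead.
csv_data = io.StringIO(
    "date,store,weeklySales\n"
    "2010-02-05,1,24924.50\n"
    "2010-02-12,1,46039.49\n"
)

# read_csv() parses the CSV into a 2-dimensional labeled DataFrame
df = pd.read_csv(csv_data)
```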
Pandas’ DataFrame data structure is a 2-dimensional labeled data structure. To quickly see the dimensions of our data, we can use the df.shape
attribute. This returns a tuple representing the dimensionality of the DataFrame:
Finally, we can take a peek at what our data looks like. We can use df.head(n)
to view the first n
rows of the DataFrame:
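As a sketch of these two inspection steps, using a small hypothetical DataFrame in place of the Retail Sales data:

```python
import pandas as pd

# Small stand-in DataFrame; the notebook inspects the Retail Sales data instead
df = pd.DataFrame({
    "store": [1, 1, 2, 2, 3],
    "weeklySales": [100.0, 120.0, 90.0, 95.0, 80.0],
})

# shape is a (rows, columns) tuple
print(df.shape)    # (5, 2)

# head(n) returns the first n rows; with no argument, the first 5
print(df.head(3))
```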
Experience Platform data
Now, we will go over accessing Experience Platform data.
By Dataset ID
For this section, we are using the Retail Sales dataset which is the same dataset used in the Retail Sales sample notebook.
In Jupyter Notebook, you can access your data from the Data tab.
Now in the Datasets directory, you can see all the ingested datasets. Note that it may take a minute to load all the entries if your directory is heavily populated with datasets.
Since the dataset is the same, we want to replace the load code from the previous section, which uses external data. Select the code block under Load Data and press the ‘d’ key twice. Make sure the focus is on the block and not in the text; you can press ‘esc’ to exit text focus before pressing ‘d’ twice.
Now, we can right click on the Retail-Training-<your-alias>
dataset and select the “Explore Data in Notebook” option in the dropdown. An executable code entry will appear in your notebook.
from data_access_sdk_python.reader import DataSetReader

# Create a reader and load the dataset into a DataFrame,
# identified by its dataset ID and your IMS organization ID
reader = DataSetReader()
df = reader.load(data_set_id="xxxxxxxx", ims_org="xxxxxxxx@AdobeOrg")

# Preview the first rows of the dataset
df.head()
If you are working on kernels other than Python, please refer to this page to access data on Adobe Experience Platform.
Selecting the executable cell and then pressing the play button in the toolbar runs the code. The output of head()
will be a table with your dataset’s keys as columns and the first n rows of the dataset. head()
accepts an integer argument specifying how many rows to output; the default is 5.
If you restart your kernel and run all the cells again, you should get the same outputs as before.