Connect to Data Distiller from a Jupyter Notebook

To enrich your machine learning pipelines with high-value customer experience data, you must first connect to Data Distiller from Jupyter Notebooks. This document covers the steps to connect to Data Distiller from a Python notebook in your machine learning environment.

Getting started

This guide assumes that you are familiar with interactive Python notebooks and have access to a notebook environment. The notebook may be hosted within a cloud-based machine learning environment, or locally with Jupyter Notebook.

Obtain connection credentials

To connect to Data Distiller and other Adobe Experience Platform services, you need an Experience Platform API credential. API credentials can be created in the Adobe Developer Console by someone with Developer access to Experience Platform. It is recommended that you create an OAuth2 API credential specifically for data science workflows and have an Adobe system admin from your organization assign the credential to a role with appropriate permissions.

See Authenticate and access Experience Platform APIs for detailed instructions on creating an API credential and obtaining the required permissions.

Recommended permissions for data science include:

  • Sandbox(es) that will be used for data science (usually prod)
  • Data modeling: Manage Schemas
  • Data management: Manage Datasets
  • Data ingestion: View Sources
  • Destinations: Manage and Activate Dataset Destinations
  • Query Service: Manage Queries

By default, a role (and any API credential assigned to that role) is blocked from accessing labeled data. Subject to the organization’s data governance policies, a System Admin may grant the role access to certain labeled data that is deemed appropriate for data science usage. Platform customers are responsible for managing label access and policies appropriately in order to comply with relevant regulations and organizational policies.

Store credentials in a separate configuration file

To keep your credential secure, it is recommended that you avoid writing credential information directly into your code. Instead, keep the credential information in a separate configuration file and read in the values needed to connect to the Experience Platform and Data Distiller.

As an example, you can create a file called config.ini and include the following information (along with any other information, such as dataset IDs, that would be useful to save between sessions):

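[Credential]
ims_org_id=<YOUR_IMS_ORG_ID>
sandbox_name=<YOUR_SANDBOX_NAME>
client_id=<YOUR_CLIENT_ID>
client_secret=<YOUR_CLIENT_SECRET>
scopes=openid, AdobeID, read_organizations, additional_info.projectedProductContext, session
tech_acct_id=<YOUR_TECHNICAL_ACCOUNT_ID>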

In your notebook, you can then read the credential information into memory using the configparser module from the Python standard library:

from configparser import ConfigParser

# Create a ConfigParser object to read and store information from config.ini
config = ConfigParser()
config_path = '<PATH_TO_YOUR_CONFIG.INI_FILE>'
config.read(config_path)

You can then reference credential values within your code as follows:

org_id = config.get('Credential', 'ims_org_id')
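A misspelled key or wrong file path otherwise only surfaces later, when a connection attempt fails, so it can help to validate the configuration as soon as it is loaded. The following is a minimal sketch, assuming a [Credential] section with the key names referenced by the config.get calls in this guide. The helper name load_credentials and all sample values are hypothetical. Passing interpolation=None is a design choice so that a literal % character in a secret is read verbatim instead of being treated as interpolation syntax.

```python
from configparser import ConfigParser

# Key names assumed from the config.get calls used in this guide.
REQUIRED_KEYS = (
    "ims_org_id", "sandbox_name", "client_id",
    "client_secret", "scopes", "tech_acct_id",
)


def load_credentials(text, section="Credential"):
    """Parse INI text and return the section as a dict, failing fast on gaps."""
    # interpolation=None: a literal '%' in a value is not treated as syntax.
    parser = ConfigParser(interpolation=None)
    parser.read_string(text)
    if not parser.has_section(section):
        raise KeyError(f"missing section [{section}]")
    values = dict(parser.items(section))
    missing = [key for key in REQUIRED_KEYS if not values.get(key)]
    if missing:
        raise KeyError(f"missing keys in [{section}]: {', '.join(missing)}")
    return values


# Hypothetical placeholder values, for illustration only.
SAMPLE_INI = """
[Credential]
ims_org_id = EXAMPLE@AdobeOrg
sandbox_name = prod
client_id = example-client-id
client_secret = s3cr%t
scopes = openid, AdobeID, read_organizations, additional_info.projectedProductContext, session
tech_acct_id = example@techacct.adobe.com
"""

creds = load_credentials(SAMPLE_INI)
print(creds["sandbox_name"])
```

For a file on disk, parser.read(config_path) can replace read_string. Because read silently skips files it cannot find, the section check above also catches a wrong path.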

Install aepp Python library

aepp is an Adobe-managed open-source Python library that provides functions for connecting to Data Distiller and submitting queries, as well as for making requests to other Experience Platform services. The aepp library in turn relies on the PostgreSQL database adapter package psycopg2 for interactive Data Distiller queries. It is possible to connect to Data Distiller and query Experience Platform datasets with psycopg2 alone, but aepp is more convenient and adds the ability to make requests to all Experience Platform API services.

To install or upgrade aepp and psycopg2 in your environment, you can use the %pip magic command in your notebook:

%pip install --upgrade aepp
%pip install --upgrade psycopg2-binary
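To confirm which versions ended up in the notebook environment, you can query the installed package metadata. This is a minimal sketch using only the Python standard library (Python 3.8 or later). The helper name installed_version is hypothetical, and the distribution names are the ones used in the install commands above.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as passed to pip in this guide.
for name in ("aepp", "psycopg2-binary"):
    version = installed_version(name)
    print(f"{name}: {version or 'not installed'}")
```

If a package was already imported before it was upgraded, the running kernel keeps using the old version until it is restarted.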

You can then configure the aepp library with your credential using the following code:

from configparser import ConfigParser

# Create a ConfigParser object to read and store information from config.ini
config = ConfigParser()
config_path = '<PATH_TO_YOUR_CONFIG.INI_FILE>'
config.read(config_path)

# Configure aepp with your credentials
import aepp

aepp.configure(
  org_id=config.get('Credential', 'ims_org_id'),
  sandbox=config.get('Credential', 'sandbox_name'),
  client_id=config.get('Credential', 'client_id'),
  secret=config.get('Credential', 'client_secret'),
  scopes=config.get('Credential', 'scopes'),
  tech_id=config.get('Credential', 'tech_acct_id')
)
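When checking that the right values were picked up, it is easy to print the client secret into notebook output by accident. The sketch below collects the keyword arguments shown above into a dictionary and produces a masked copy that is safe to display. The helper names build_configure_kwargs and masked, and all sample values, are hypothetical. The keyword-to-key mapping is taken from the configuration code in this guide.

```python
from configparser import ConfigParser

# aepp keyword name (as used in this guide) -> key name in config.ini.
KEYWORD_TO_CONFIG_KEY = {
    "org_id": "ims_org_id",
    "sandbox": "sandbox_name",
    "client_id": "client_id",
    "secret": "client_secret",
    "scopes": "scopes",
    "tech_id": "tech_acct_id",
}


def build_configure_kwargs(config, section="Credential"):
    """Read each credential value into a dict keyed by keyword name."""
    return {
        keyword: config.get(section, key)
        for keyword, key in KEYWORD_TO_CONFIG_KEY.items()
    }


def masked(kwargs, sensitive=("secret",)):
    """Return a copy with sensitive values hidden, suitable for printing."""
    return {k: ("****" if k in sensitive else v) for k, v in kwargs.items()}


# Hypothetical placeholder values, for illustration only.
config = ConfigParser(interpolation=None)
config.read_string("""
[Credential]
ims_org_id = EXAMPLE@AdobeOrg
sandbox_name = prod
client_id = example-client-id
client_secret = example-secret
scopes = openid, AdobeID
tech_acct_id = example@techacct.adobe.com
""")

kwargs = build_configure_kwargs(config)
print(masked(kwargs))
```

With aepp installed, the unmasked dictionary could then be passed as aepp.configure(**kwargs), which is equivalent to supplying the same keywords one by one.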

Create a connection to Data Distiller

Once aepp is configured with your credentials, you can use the following code to create a connection to Data Distiller and start an interactive session:

from aepp import queryservice

dd_conn = queryservice.QueryService().connection()
dd_cursor = queryservice.InteractiveQuery2(dd_conn)

You can then query the datasets in your Experience Platform sandbox. Given the ID of a dataset you want to query, you can retrieve the corresponding table name from the Catalog service and run queries on the table. The following example uses a table name that is already known:

table_name = 'ecommerce_events'
simple_query = f'''SELECT * FROM {table_name} LIMIT 5'''
dd_cursor.query(simple_query)
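Because the table name is interpolated directly into the SQL text, it is worth validating it first when it comes from a configuration file or another service. The following is a minimal sketch. The helper name preview_query is hypothetical, and the identifier pattern is a deliberately conservative assumption (letters, digits, and underscores), not a rule defined by Data Distiller.

```python
import re

# Conservative assumption: plain, unquoted identifiers only.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def preview_query(table_name, limit=5):
    """Build a small preview query after validating both inputs."""
    if not _IDENTIFIER.fullmatch(table_name):
        raise ValueError(f"unexpected table name: {table_name!r}")
    # bool is a subclass of int, so it is rejected explicitly.
    if isinstance(limit, bool) or not isinstance(limit, int) or limit < 1:
        raise ValueError(f"limit must be a positive integer, got {limit!r}")
    return f"SELECT * FROM {table_name} LIMIT {limit}"


print(preview_query("ecommerce_events"))
# prints: SELECT * FROM ecommerce_events LIMIT 5
```

The returned string can be submitted through the interactive session in the same way as the query above.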

Connect to a single dataset for faster query performance

By default, the Data Distiller connection connects to all datasets in your sandbox. For faster queries and reduced resource usage, you can instead connect to a specific dataset of interest. You can do this by changing the dbname in the Data Distiller connection object to {sandbox}:{table_name}:

from aepp import queryservice

sandbox = config.get('Credential', 'sandbox_name')
table_name = 'ecommerce_events'

dd_conn = queryservice.QueryService().connection()
dd_conn['dbname'] = f'{sandbox}:{table_name}'
dd_cursor = queryservice.InteractiveQuery2(dd_conn)
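If you want to keep the sandbox-wide connection parameters available alongside a dataset-specific set, you can derive a copy instead of modifying the original dictionary in place. This is a minimal sketch that assumes the connection object behaves like a plain dictionary with a dbname key, as in the code above. The helper name single_dataset_connection and the sample dictionary contents are hypothetical.

```python
def single_dataset_connection(conn, sandbox, table_name):
    """Return a copy of conn whose dbname targets a single dataset."""
    scoped = dict(conn)  # shallow copy; the original stays unchanged
    scoped["dbname"] = f"{sandbox}:{table_name}"
    return scoped


# Hypothetical stand-in for the connection parameters, for illustration only.
base_conn = {"host": "example.invalid", "port": 80, "dbname": "placeholder"}

scoped_conn = single_dataset_connection(base_conn, "prod", "ecommerce_events")
print(scoped_conn["dbname"])
# prints: prod:ecommerce_events
```

Each dictionary can then be passed to its own interactive session, so that narrow queries use the single-dataset connection while the original remains available.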

Next steps

By reading this document, you have learned how to connect to Data Distiller from a Python notebook in your machine learning environment. The next step in creating feature pipelines from Experience Platform to feed custom models in your machine learning environment is to explore and analyze your datasets.