Data Science Workspace troubleshooting guide

This document provides answers to frequently asked questions about Adobe Experience Platform Data Science Workspace. For questions and troubleshooting regarding Platform APIs in general, see the Adobe Experience Platform API troubleshooting guide.

JupyterLab environment is not loading in Google Chrome

IMPORTANT

This issue has been resolved but could still be present in the Google Chrome 80.x browser. Please ensure that your Chrome browser is up to date.

With the Google Chrome browser version 80.x, all third-party cookies are blocked by default. This policy can prevent JupyterLab from loading within Adobe Experience Platform.

To remedy this issue, use the following steps:

In your Chrome browser, navigate to the top-right and select Settings (alternatively you can copy and paste “chrome://settings/” in the address bar). Next, scroll to the bottom of the page and click the Advanced dropdown.

The Privacy and security section appears. Next, click on Site settings followed by Cookies and site data.

Lastly, toggle “Block third-party cookies” to “OFF”.

NOTE

Alternatively, you can keep third-party cookies blocked and add [*.]ds.adobe.net to the allow list.

Navigate to “chrome://flags/” in your address bar. Search for and disable the flag titled “SameSite by default cookies” by using the dropdown menu on the right.

After you disable the flag, you are prompted to relaunch your browser. After you relaunch, JupyterLab should be accessible.

Why am I unable to access JupyterLab in Safari?

In Safari versions earlier than 12, third-party cookies are disabled by default. Because your Jupyter virtual machine instance resides on a different domain than its parent frame, Adobe Experience Platform currently requires that third-party cookies be enabled. Please enable third-party cookies or switch to a different browser such as Google Chrome.

For Safari 12, you need to switch your User Agent to ‘Chrome’ or ‘Firefox’. To switch your User Agent, start by opening the Safari menu and selecting Preferences. The preferences window appears.

Within the Safari preferences window, select Advanced. Then check the Show Develop menu in menu bar box. You can close the preferences window after this step is complete.

Next, from the top navigation bar select the Develop menu. From within the Develop dropdown, hover over User Agent. You can select the Chrome or Firefox User Agent string you would like to use.

Why am I seeing a ‘403 Forbidden’ message when trying to upload or delete a file in JupyterLab?

If your browser has ad-blocking software such as Ghostery or AdBlock Plus enabled, the domain “*.adobe.net” must be allowed in each ad blocker for JupyterLab to operate normally. This is because JupyterLab virtual machines run on a different domain than the Experience Platform domain.

Why do some parts of my Jupyter Notebook look scrambled or do not render as code?

This can happen if the cell in question is accidentally changed from “Code” to “Markdown”. While a code cell is focused, pressing the key combination ESC+M changes the type of the cell to Markdown. You can change a cell’s type using the dropdown indicator at the top of the notebook for the selected cell(s). To change a cell type to code, start by selecting the cell you want to change. Next, click the dropdown that indicates the cell’s current type, then select “Code”.

How do I install custom Python libraries?

The Python kernel comes pre-installed with many popular machine learning libraries. However, you can install additional custom libraries by executing the following command within a code cell:

!pip install {LIBRARY_NAME}

For a complete list of pre-installed Python libraries, see the appendix section of the JupyterLab User Guide.
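
Before running the install command, it can be useful to check whether a library is already available in the kernel. The following is a minimal sketch that uses only the Python standard library. The helper name is_installed is hypothetical and is not part of Data Science Workspace.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the current kernel can locate the top-level module."""
    return importlib.util.find_spec(module_name) is not None


# "json" ships with Python, so this check is expected to succeed.
print(is_installed("json"))
```

Note that the name you import can differ from the name you pass to pip. For example, the scikit-learn package is imported as sklearn, so pass the import name to this helper.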

Can I install custom PySpark libraries?

Unfortunately, you cannot install additional libraries for the PySpark kernel. However, you can contact your Adobe customer service representative to have custom PySpark libraries installed for you.

For a list of pre-installed PySpark libraries, see the appendix section of the JupyterLab User Guide.

Is it possible to configure Spark cluster resources for JupyterLab Spark or PySpark kernel?

You can configure resources by adding the following block to the first cell of your notebook:

%%configure -f
{
    "numExecutors": 10,
    "executorMemory": "8G",
    "executorCores": 4,
    "driverMemory": "2G",
    "driverCores": 2,
    "conf": {
        "spark.cores.max": "40"
    }
}

For more information on Spark cluster resource configuration, including the complete list of configurable properties, see the JupyterLab User Guide.
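
In the example above, spark.cores.max (40) equals numExecutors multiplied by executorCores (10 × 4). Assuming that relationship is what you want to preserve, the sketch below generates the JSON body so that the values stay consistent when you change the executor count. The function name build_spark_config and its default values are illustrative assumptions, not part of the JupyterLab configuration API.

```python
import json


def build_spark_config(num_executors: int, executor_cores: int,
                       executor_memory: str = "8G",
                       driver_memory: str = "2G",
                       driver_cores: int = 2) -> str:
    """Return a JSON body for %%configure.

    Assumption: spark.cores.max is set to num_executors * executor_cores,
    mirroring the 10 x 4 = 40 relationship in the example above.
    """
    config = {
        "numExecutors": num_executors,
        "executorMemory": executor_memory,
        "executorCores": executor_cores,
        "driverMemory": driver_memory,
        "driverCores": driver_cores,
        "conf": {"spark.cores.max": str(num_executors * executor_cores)},
    }
    return json.dumps(config, indent=4)


# Print a body that can be pasted under "%%configure -f" in the first cell.
print(build_spark_config(num_executors=10, executor_cores=4))
```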

Why am I receiving an error when trying to execute certain tasks for larger datasets?

If you are receiving an error with a reason such as “Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues.”, this typically means that the driver or an executor is running out of memory. See the JupyterLab Notebooks data access documentation for more information on data limits and how to execute tasks on large datasets. Typically, this error can be solved by changing the mode from interactive to batch.

Additionally, while writing large Spark/PySpark datasets, caching your data (df.cache()) before executing the write code can greatly improve performance.

If you are experiencing problems while reading data and are applying transformations to the data, try caching your data before the transformations. Caching your data prevents multiple reads across the network. Start by reading the data. Next, cache (df.cache()) the data. Lastly, perform your transformations.
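
The read, cache, transform order described above can be sketched as a small helper. This is an illustration under stated assumptions: read_fn and transform_fn are hypothetical placeholders for whatever read and transformation code your notebook already uses, and the returned object is assumed to expose cache() the way a Spark DataFrame does.

```python
def read_cache_transform(read_fn, transform_fn):
    """Read once, cache the result, then transform the cached DataFrame."""
    df = read_fn()           # 1. read the data
    df = df.cache()          # 2. cache it (DataFrame.cache() returns the DataFrame)
    return transform_fn(df)  # 3. apply the transformations to the cached data
```

In Spark, cache() is lazy: the data is only materialized when the first action runs, and later actions reuse the cached copy instead of reading across the network again.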

Why are my Spark/PySpark notebooks taking so long to read and write data?

If you are performing transformations on data, such as using fit(), the transformations may be executing multiple times. To increase performance, cache your data using df.cache() before performing the fit(). This ensures that the transformations are only executed a single time and prevents multiple reads across the network.

Recommended order: Start by reading the data. Next, perform transformations followed by caching (df.cache()) the data. Lastly, perform a fit().
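
The recommended order above can be sketched as follows. The names transform_cache_fit, transform_fn, and estimator are hypothetical; the sketch only assumes that the DataFrame exposes cache() and that the estimator exposes fit(), as Spark ML estimators do.

```python
def transform_cache_fit(df, transform_fn, estimator):
    """Transform the data, cache the transformed result, then fit on it."""
    transformed = transform_fn(df)     # perform the transformations
    transformed = transformed.cache()  # cache so fit() does not recompute them
    return estimator.fit(transformed)  # fit on the cached, transformed data
```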

Why are my Spark/PySpark notebooks failing to run?

If you encounter any of the following issues:

  • Job aborted due to stage failure … Can only zip RDDs with same number of elements in each partition.
  • Remote RPC client disassociated and other memory errors.
  • Poor performance when reading and writing datasets.

Make sure that you are caching the data (df.cache()) before writing it. When executing code in notebooks, using df.cache() before an action such as fit() can greatly improve notebook performance. Using df.cache() before writing a dataset ensures that the transformations are executed only once instead of multiple times.

Docker Hub limit restrictions in Data Science Workspace

As of November 20, 2020, rate limits for anonymous and free authenticated use of Docker Hub went into effect. Anonymous and free Docker Hub users are limited to 100 container image pull requests every six hours. If you are affected by these changes, you will receive one of the following error messages: “ERROR: toomanyrequests: Too Many Requests.” or “You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limits.”

Currently, this limit only affects your organization if you attempt to build 100 Notebook to Recipes within the six-hour timeframe, or if you use Spark-based notebooks within Data Science Workspace that frequently scale up and down. However, this is unlikely, because the cluster these notebooks run on remains active for two hours before idling out, which reduces the number of pulls required while the cluster is active. If you receive any of the above errors, you need to wait until your Docker limit is reset.
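
If you collect build or notebook logs, a simple text check can tell you whether a failure was caused by the Docker Hub rate limit rather than by your own code. The sketch below matches the two error messages quoted above. The helper name is_docker_rate_limit_error is hypothetical, and the matching is a plain substring test, not an official error-classification API.

```python
# Substrings taken from the two Docker Hub error messages quoted above.
RATE_LIMIT_MARKERS = (
    "toomanyrequests",
    "you have reached your pull rate limit",
)


def is_docker_rate_limit_error(message: str) -> bool:
    """Return True if the message looks like a Docker Hub rate-limit error."""
    lowered = message.lower()
    return any(marker in lowered for marker in RATE_LIMIT_MARKERS)


print(is_docker_rate_limit_error("ERROR: toomanyrequests: Too Many Requests."))
```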

For more information about Docker Hub rate limits, visit the Docker Hub documentation. A solution for this is being worked on and is expected in a subsequent release.
