Connect to Adobe Experience Platform Query Service
JupyterLab on Platform allows you to use SQL in a Python notebook to access data through Adobe Experience Platform Query Service. Accessing data through Query Service can be useful for working with large datasets due to its superior running times. Be advised that querying data through Query Service has a processing time limit of ten minutes.
Before you use Query Service in JupyterLab, make sure you have a working understanding of the Query Service SQL syntax.
To use Query Service in JupyterLab, you must first create a connection between your working Python notebook and Query Service. This can be achieved by executing the following cell.
qs_connect()
Define the midvalues dataset for exploration
To begin querying and exploring your data, a midvalues dataset table must be provided. Copy the table_name and table_id values and replace them with your own data table values.
target_table = "table_name"
target_table_id = "table_id"
Once complete, this cell should look similar to the following example:
target_table = "cross_industry_demo_midvalues"
target_table_id = "5f7c40ef488de5194ba0157a"
Explore the dataset for available dates
Using the cell provided below, you can view the range of dates contained in the table. The purpose of exploring the number of days, the first date, and the last date is to assist in selecting a date range for further analysis.
%%read_sql -c QS_CONNECTION
SELECT distinct Year(timestamp) as Year, Month(timestamp) as Month, count(distinct DAY(timestamp)) as Count_days, min(DAY(timestamp)) as First_date, max(DAY(timestamp)) as Last_date, count(timestamp) as Count_hits
from {target_table}
group by Month(timestamp), Year(timestamp)
order by Year, Month;
Running the cell returns a summary of the years and months in the dataset, along with the number of days, first date, last date, and hit count for each month.
Configure dates for dataset discovery
After determining the available dates for dataset discovery, update the parameters below. The dates configured in this cell are used only for data discovery in the form of queries. The dates are updated again to suitable ranges for exploratory data analysis later in this guide.
target_year = "2020" ## The target year
target_month = "02" ## The target month
target_day = "(01,02,03)" ## The target days
Dataset discovery
Once all the parameters are configured, Query Service is connected, and the date ranges are set, you are ready to begin reading rows of data. You should limit the number of rows you read.
from platform_sdk.dataset_reader import DatasetReader
from datetime import date
dataset_reader = DatasetReader(PLATFORM_SDK_CLIENT_CONTEXT, dataset_id=target_table_id)
# If you do not see any data or would like to expand the default date range, change the following query
Table = dataset_reader.limit(5).read()
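The comment above notes that the query can be changed if no data appears. A minimal sketch of a date filter is shown below; it assumes the where() and gt() filter helpers exposed by the platform_sdk DatasetReader, so verify them against the SDK version you are running.
# Assumed platform_sdk filter helpers: restrict the read to rows after a given timestamp.
# Adjust the timestamp to the window found during the date exploration above.
filtered = dataset_reader.where(dataset_reader['timestamp'].gt('2020-02-01 00:00:00'))
Table = filtered.limit(5).read()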
To see the number of columns available in the dataset, use the following cell:
print("\nNumber of columns:",len(Table.columns))
To view the rows of the dataset, use the following cell. In this example, the number of rows is limited to five.
Table.head(5)
After learning what data is contained in the dataset, it can be valuable to break the dataset down further. In this example, the column names and data types for each of the columns are listed, and the output is used to check whether a data type is correct.
import pandas as pd  # required for the DataFrame operations below

ColumnNames_Types = pd.DataFrame(Table.dtypes)
ColumnNames_Types = ColumnNames_Types.reset_index()
ColumnNames_Types.columns = ["Column_Name", "Data_Type"]
ColumnNames_Types
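Because the goal of the cell above is to verify data types, a quick programmatic check can complement the visual scan. The sketch below is plain pandas and flags columns that came back as the generic object dtype, which often indicates mixed or string-encoded values.
# Flag columns whose dtype pandas could not narrow beyond 'object';
# these are the usual candidates for incorrect or mixed-type data.
suspect_columns = ColumnNames_Types[ColumnNames_Types["Data_Type"] == object]
print(suspect_columns["Column_Name"].tolist())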
Explore trends in the dataset
The following section contains four example queries for exploring trends and patterns in the data. The examples provided below are not exhaustive, but they cover some of the more commonly examined features.
Hourly activity count for a given day
This query analyzes the number of actions and clicks throughout the day. The output is represented in the form of a table containing metrics on the activity count for each hour of the day.
%%read_sql query_2_df -c QS_CONNECTION
SELECT Substring(timestamp, 12, 2) AS Hour,
Count(enduserids._experience.aaid.id) AS Count
FROM {target_table}
WHERE Year(timestamp) = {target_year}
AND Month(timestamp) = {target_month}
AND Day(timestamp) in {target_day}
GROUP BY Hour
ORDER BY Hour;
After confirming the query works, the data can be presented in a univariate plot histogram for visual clarity.
import plotly.graph_objs as go  # Plotly graph objects used for all plots in this section
from plotly.offline import init_notebook_mode, iplot  # render Plotly figures inline in the notebook
init_notebook_mode(connected=True)

trace = go.Bar(
x = query_2_df['Hour'],
y = query_2_df['Count'],
name = "Activity Count"
)
layout = go.Layout(
title = 'Activity Count by Hour of Day',
width = 1200,
height = 600,
xaxis = dict(title = 'Hour of Day'),
yaxis = dict(title = 'Count')
)
fig = go.Figure(data = [trace], layout = layout)
iplot(fig)
Top ten viewed pages for a given day
This query analyzes which pages are viewed the most for a given day. The output is represented in the form of a table containing metrics on the page name and page view count.
%%read_sql query_4_df -c QS_CONNECTION
SELECT web.webpagedetails.name AS Page_Name,
Sum(web.webpagedetails.pageviews.value) AS Page_Views
FROM {target_table}
WHERE Year(timestamp) = {target_year}
AND Month(timestamp) = {target_month}
AND Day(timestamp) in {target_day}
GROUP BY web.webpagedetails.name
ORDER BY page_views DESC
LIMIT 10;
After confirming the query works, the data can be presented in a univariate plot histogram for visual clarity.
trace = go.Bar(
x = query_4_df['Page_Name'],
y = query_4_df['Page_Views'],
name = "Page Views"
)
layout = go.Layout(
title = 'Top Ten Viewed Pages For a Given Day',
width = 1000,
height = 600,
xaxis = dict(title = 'Page_Name'),
yaxis = dict(title = 'Page_Views')
)
fig = go.Figure(data = [trace], layout = layout)
iplot(fig)
Top ten cities grouped by user activity
This query analyzes which cities the data originates from.
%%read_sql query_6_df -c QS_CONNECTION
SELECT concat(placeContext.geo.stateProvince, ' - ', placeContext.geo.city) AS state_city,
Count(timestamp) AS Count
FROM {target_table}
WHERE Year(timestamp) = {target_year}
AND Month(timestamp) = {target_month}
AND Day(timestamp) in {target_day}
GROUP BY state_city
ORDER BY Count DESC
LIMIT 10;
After confirming the query works, the data can be presented in a univariate plot histogram for visual clarity.
trace = go.Bar(
x = query_6_df['state_city'],
y = query_6_df['Count'],
name = "Activity by City"
)
layout = go.Layout(
title = 'Top Ten Cities by User Activity',
width = 1200,
height = 600,
xaxis = dict(title = 'City'),
yaxis = dict(title = 'Count')
)
fig = go.Figure(data = [trace], layout = layout)
iplot(fig)
Top ten viewed products
This query provides a list of the top ten products viewed. In the example below, the Explode() function is used to return each product in the productlistitems object to its own row. This allows you to do a nested query to aggregate product views for different SKUs. A client-side pandas equivalent is sketched after the plot below.
%%read_sql query_7_df -c QS_CONNECTION
SELECT Product_List_Items.sku AS Product_SKU,
Sum(Product_Views) AS Total_Product_Views
FROM (SELECT Explode(productlistitems) AS Product_List_Items,
commerce.productviews.value AS Product_Views
FROM {target_table}
WHERE Year(timestamp) = {target_year}
AND Month(timestamp) = {target_month}
AND Day(timestamp) in {target_day}
AND commerce.productviews.value IS NOT NULL)
GROUP BY Product_SKU
ORDER BY Total_Product_Views DESC
LIMIT 10;
After confirming the query works, the data can be presented in a univariate plot histogram for visual clarity.
trace = go.Bar(
x = "SKU-" + query_7_df['Product_SKU'],
y = query_7_df['Total_Product_Views'],
name = "Product View"
)
layout = go.Layout(
title = 'Top Ten Viewed Products',
width = 1200,
height = 600,
xaxis = dict(title = 'SKU'),
yaxis = dict(title = 'Product View Count')
)
fig = go.Figure(data = [trace], layout = layout)
iplot(fig)
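As referenced above, the same flattening that Explode() performs in SQL can be reproduced client-side with the pandas explode() method. The sketch below runs it against the small Table sample read earlier; the productListItems column name is assumed from the XDM schema, so adjust it to match your dataset.
# Client-side equivalent of the SQL Explode(): unnest a list-valued column
# into one row per product. 'productListItems' is an assumed column name.
exploded = Table.explode('productListItems').reset_index(drop=True)
exploded[['productListItems']].head()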
After exploring the trends and patterns in the data, you should have a good idea of the features that you want to build toward a prediction goal. Skimming through a table can quickly highlight the form of each data attribute, obvious misrepresentations, and large outliers in the values, and it can begin to suggest candidate relationships to explore between the attributes.
Exploratory data analysis
Exploratory data analysis is used to refine your understanding of the data and to build an intuition for compelling questions that can be used as the basis for modeling.
After completing the data discovery step, you have explored the event-level data, along with a few aggregations at the event, city, or user ID level, to see trends for a single day. Although this data is important, it does not paint the full picture. You still do not understand what drives a purchase on your website.
To understand this, you need to aggregate the data at a profile/visitor level, define a purchase goal, and apply statistical concepts such as correlation, box plots, and scatter plots. These methods are used to compare activity patterns between buyers and non-buyers in the prediction window you define. A sketch of what these comparisons look like in code follows the aggregation queries at the end of this section.
The following features are created and explored in this section:
COUNT_UNIQUE_PRODUCTS_PURCHASED: Number of unique products purchased.
COUNT_CHECK_OUTS: Number of check-outs.
COUNT_PURCHASES: Number of purchases.
COUNT_INSTANCE_PRODUCTADDS: Number of product-add instances.
NUMBER_VISITS: Number of visits.
COUNT_PAID_SEARCHES: Number of paid searches.
DAYS_SINCE_VISIT: Number of days since the last visit.
TOTAL_ORDER_REVENUE: Total order revenue.
DAYS_SINCE_PURCHASE: Number of days since the last purchase.
AVG_GAP_BETWEEN_ORDERS_DAYS: Average gap between purchases in days.
STATE_CITY: Contains the state and city.
Before continuing with the data aggregation, the parameters used for the prediction variables in the exploratory data analysis need to be defined. In other words, what do you want from your data science model? Common parameters include a goal, a prediction window, and an analysis window.
If you are using the EDA notebook, you need to replace the following values before continuing.
goal = "commerce.`order`.purchaseID" #### prediction variable
goal_column_type = "numerical" #### choose either "categorical" or "numerical"
prediction_window_day_start = "2020-01-01" #### YYYY-MM-DD
prediction_window_day_end = "2020-01-31" #### YYYY-MM-DD
analysis_period_day_start = "2020-02-01" #### YYYY-MM-DD
analysis_period_day_end = "2020-02-28" #### YYYY-MM-DD
#### If the goal is categorical, select the threshold for defining the category and creating bins. 0 is no order placed, and 1 is at least one order placed:
threshold = 1
Data aggregation for creating features and a goal
To begin your exploratory analysis, you need to create a goal at the profile level and then aggregate the dataset. In this example, two queries are provided. The first query contains the creation of a goal. The second query needs to be updated to include any variables beyond the ones in the first query. You may want to update the limit on the query. After executing the following queries, the aggregated data is available for exploration.
%%read_sql target_df -d -c QS_CONNECTION
SELECT DISTINCT endUserIDs._experience.aaid.id AS ID,
Count({goal}) AS TARGET
FROM {target_table}
WHERE DATE(TIMESTAMP) BETWEEN '{prediction_window_day_start}' AND '{prediction_window_day_end}'
GROUP BY endUserIDs._experience.aaid.id;
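With the goal query above executed, a categorical goal can be binned using the threshold configured earlier. The following sketch is illustrative rather than a step from the notebook; TARGET_CLASS is an assumed column name. The second query, which builds the profile-level features, follows.
# Illustrative binning for a categorical goal: profiles with at least
# 'threshold' orders become class 1, everything else class 0.
# TARGET comes from the goal query above; TARGET_CLASS is an assumed name.
if goal_column_type == "categorical":
    target_df["TARGET_CLASS"] = (target_df["TARGET"] >= threshold).astype(int)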
%%read_sql agg_data -d -c QS_CONNECTION
SELECT z.*, z1.state_city as STATE_CITY
from
((SELECT y.*,a2.AVG_GAP_BETWEEN_ORDERS_DAYS as AVG_GAP_BETWEEN_ORDERS_DAYS
from
(select a1.*, f.DAYS_SINCE_PURCHASE as DAYS_SINCE_PURCHASE
from
(SELECT DISTINCT a.ID AS ID,
COUNT(DISTINCT Product_Items.SKU) as COUNT_UNIQUE_PRODUCTS_PURCHASED,
COUNT(a.check_out) as COUNT_CHECK_OUTS,
COUNT(a.purchases) as COUNT_PURCHASES,
COUNT(a.product_list_adds) as COUNT_INSTANCE_PRODUCTADDS,
sum(CASE WHEN a.search_paid = 'TRUE' THEN 1 ELSE 0 END) as COUNT_PAID_SEARCHES,
DATEDIFF('{analysis_period_day_end}', MAX(a.date_a)) as DAYS_SINCE_VISIT,
ROUND(SUM(Product_Items.priceTotal * Product_Items.quantity), 2) AS TOTAL_ORDER_REVENUE
from
(SELECT endUserIDs._experience.aaid.id as ID,
commerce.`checkouts`.value as check_out,
commerce.`order`.purchaseID as purchases,
commerce.`productListAdds`.value as product_list_adds,
search.isPaid as search_paid,
DATE(TIMESTAMP) as date_a,
Explode(productlistitems) AS Product_Items
from {target_table}
Where DATE(TIMESTAMP) BETWEEN '{analysis_period_day_start}' AND '{analysis_period_day_end}') as a
group by a.ID) as a1
left join
(SELECT DISTINCT endUserIDs._experience.aaid.id as ID,
DATEDIFF('{analysis_period_day_end}', max(DATE(TIMESTAMP))) as DAYS_SINCE_PURCHASE
from {target_table}
where DATE(TIMESTAMP) BETWEEN '{analysis_period_day_start}' AND '{analysis_period_day_end}'
and commerce.`order`.purchaseid is not null
GROUP BY endUserIDs._experience.aaid.id) as f
on f.ID = a1.ID
where a1.COUNT_PURCHASES>0) as y
left join
(select ab.ID, avg(DATEDIFF(ab.ORDER_DATES, ab.PriorDate)) as AVG_GAP_BETWEEN_ORDERS_DAYS
from
(SELECT distinct endUserIDs._experience.aaid.id as ID, TO_DATE(DATE(TIMESTAMP)) as ORDER_DATES,
TO_DATE(LAG(DATE(TIMESTAMP),1) OVER (PARTITION BY endUserIDs._experience.aaid.id ORDER BY DATE(TIMESTAMP))) as PriorDate
FROM {target_table}
where DATE(TIMESTAMP) BETWEEN '{analysis_period_day_start}' AND '{analysis_period_day_end}'
AND commerce.`order`.purchaseid is not null) AS ab
where ab.PriorDate is not null
GROUP BY ab.ID) as a2
on a2.ID = y.ID) z
left join
(select t.ID, t.state_city from
(
SELECT DISTINCT endUserIDs._experience.aaid.id as ID,
concat(placeContext.geo.stateProvince, ' - ', placeContext.geo.city) as state_city,
ROW_NUMBER() OVER(PARTITION BY endUserIDs._experience.aaid.id ORDER BY DATE(TIMESTAMP) DESC) AS ROWNUMBER
FROM {target_table}
WHERE DATE(TIMESTAMP) BETWEEN '{analysis_period_day_start}' AND '{analysis_period_day_end}') as t
where t.ROWNUMBER = 1) z1
on z.ID = z1.ID)
limit 500000;
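With both queries executed, target_df holds the goal and agg_data holds the profile-level features. A minimal sketch of the buyer versus non-buyer comparison described earlier is shown below; the merged_df name and the fillna(0) treatment of profiles with no purchase in the prediction window are assumptions for illustration, not steps prescribed by the notebook.
# Join the features to the goal; profiles absent from the prediction
# window are treated as non-buyers (TARGET = 0). Names are illustrative.
merged_df = agg_data.merge(target_df, on="ID", how="left")
merged_df["TARGET"] = merged_df["TARGET"].fillna(0)

# Rank numeric features by their correlation with the goal.
print(merged_df.corr(numeric_only=True)["TARGET"].sort_values(ascending=False))

# Compare one feature between buyers and non-buyers with box plots.
trace_buyers = go.Box(y=merged_df.loc[merged_df["TARGET"] > 0, "COUNT_CHECK_OUTS"], name="Buyers")
trace_non_buyers = go.Box(y=merged_df.loc[merged_df["TARGET"] == 0, "COUNT_CHECK_OUTS"], name="Non-buyers")
iplot(go.Figure(data=[trace_buyers, trace_non_buyers], layout=go.Layout(title="COUNT_CHECK_OUTS by buyer status")))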