探索資料分析
本檔案提供一些使用資料Distiller來探索和分析Python筆記本資料的基本範例和最佳實務。
快速入門
繼續本指南之前,請確定您已在Python筆記本中建立與Data Distiller的連線。 請參閱檔案,瞭解如何將 Python 筆記本連線至Data Distiller的說明。
取得基本統計資料 basic-statistics
使用以下程式碼來擷取資料集中的列數和相異設定檔數。
table_name = 'ecommerce_events'
basic_statistics_query = f"""
SELECT
COUNT(_id) as "totalRows",
COUNT(DISTINCT _id) as "distinctUsers"
FROM {table_name}"""
df = qs_cursor.query(basic_statistics_query, output="dataframe")
df
範例輸出
建立大型資料集的取樣版本 create-dataset-sample
如果您要查詢的資料集非常大,或不需要探索查詢的精確結果,請使用資料Distiller查詢可用的取樣功能。 此程式分為兩個步驟:
- 首先,分析 資料集,以使用指定的取樣比率建立取樣版本
- 接下來,查詢資料集的抽樣版本。 根據您套用至樣本資料集的函式,您可能想要將輸出縮放為數字以取得完整資料集
建立5%樣本 create-sample
以下範例會分析資料集並建立5%範例:
# A sampling rate of 10 is 100% in Query Service, so for 5% use a sampling rate 0.5
sampling_rate = 0.5
analyze_table_query=f"""
SET aqp=true;
ANALYZE TABLE {table_name} TABLESAMPLE SAMPLERATE {sampling_rate}"""
qs_cursor.query(analyze_table_query, output="raw")
檢視您的範例 view-sample
您可以使用sample_meta
函式來檢視從指定資料集建立的任何樣本。 下列程式碼片段示範如何使用sample_meta
函式。
sampled_version_of_table_query = f'''SELECT sample_meta('{table_name}')'''
df_samples = qs_cursor.query(sampled_version_of_table_query, output="dataframe")
df_samples
範例輸出:
查詢您的範例 query-sample-data
您可以從傳回的中繼資料中參照範例表格名稱,直接查詢範例。 然後,您可將結果乘以取樣比例,以取得預估值。
sample_table_name = df_samples[df_samples["sampling_rate"] == sampling_rate]["sample_table_name"].iloc[0]
count_query=f'''SELECT count(*) as cnt from {sample_table_name}'''
df = qs_cursor.query(count_query, output="dataframe")
# Divide by the sampling rate to extrapolate to the full dataset
approx_count = df["cnt"].iloc[0] / (sampling_rate / 100)
print(f"Approximate count: {approx_count} using {sampling_rate *10}% sample")
範例輸出
Approximate count: 1284600.0 using 5.0% sample
電子郵件漏斗分析 email-funnel-analysis
漏斗分析是瞭解達成目標結果所需步驟,以及有多少使用者完成每個步驟的方法。 以下範例說明通往使用者訂閱電子報的步驟的簡單漏斗分析。 訂閱結果以web.formFilledOut
的事件型別表示。
首先,執行查詢以取得每個步驟的使用者人數。
simple_funnel_analysis_query = f'''SELECT eventType, COUNT(DISTINCT _id) as "distinctUsers",COUNT(_id) as "distinctEvents" FROM {table_name} GROUP BY eventType ORDER BY distinctUsers DESC'''
funnel_df = qs_cursor.query(simple_funnel_analysis_query, output="dataframe")
funnel_df
範例輸出
繪製查詢結果 plot-results
接下來,使用Python plotly
資料庫繪製查詢結果:
import plotly.express as px
email_funnel_events = ["directMarketing.emailSent", "directMarketing.emailOpened", "directMarketing.emailClicked", "web.formFilledOut"]
email_funnel_df = funnel_df[funnel_df["eventType"].isin(email_funnel_events)]
fig = px.funnel(email_funnel_df, y='eventType', x='distinctUsers')
fig.show()
範例輸出
事件關聯 event-correlations
另一種常見分析是計算事件型別與目標轉換事件型別之間的關聯。 在此範例中,訂閱事件由web.formFilledOut
表示。 此範例使用資料Distiller查詢中可用的Spark函式來達成下列步驟:
- 依設定檔計算每種事件型別的事件數。
- 彙總設定檔中每個事件型別的計數,並計算每個事件型別與
web,formFilledOut
的相關性。 - 將計數和相關性的資料流轉換為每個功能的Pearson相關係數(事件型別計數)表格與目標事件。
- 將結果以視覺化方式顯示在繪圖中。
Spark函式彙總資料以傳回小型結果表格,因此您可以在完整資料集上執行這類查詢。
large_correlation_query=f'''
SELECT SUM(webFormsFilled) as webFormsFilled_totalUsers,
SUM(advertisingClicks) as advertisingClicks_totalUsers,
SUM(productViews) as productViews_totalUsers,
SUM(productPurchases) as productPurchases_totalUsers,
SUM(propositionDismisses) as propositionDismisses_totaUsers,
SUM(propositionDisplays) as propositionDisplays_totaUsers,
SUM(propositionInteracts) as propositionInteracts_totalUsers,
SUM(emailClicks) as emailClicks_totalUsers,
SUM(emailOpens) as emailOpens_totalUsers,
SUM(webLinkClicks) as webLinksClicks_totalUsers,
SUM(webPageViews) as webPageViews_totalusers,
corr(webFormsFilled, emailOpens) as webForms_EmailOpens,
corr(webFormsFilled, advertisingClicks) as webForms_advertisingClicks,
corr(webFormsFilled, productViews) as webForms_productViews,
corr(webFormsFilled, productPurchases) as webForms_productPurchases,
corr(webFormsFilled, propositionDismisses) as webForms_propositionDismisses,
corr(webFormsFilled, propositionInteracts) as webForms_propositionInteracts,
corr(webFormsFilled, emailClicks) as webForms_emailClicks,
corr(webFormsFilled, emailOpens) as webForms_emailOpens,
corr(webFormsFilled, emailSends) as webForms_emailSends,
corr(webFormsFilled, webLinkClicks) as webForms_webLinkClicks,
corr(webFormsFilled, webPageViews) as webForms_webPageViews
FROM(
SELECT _{tenant_id}.cmle_id as userID,
SUM(CASE WHEN eventType='web.formFilledOut' THEN 1 ELSE 0 END) as webFormsFilled,
SUM(CASE WHEN eventType='advertising.clicks' THEN 1 ELSE 0 END) as advertisingClicks,
SUM(CASE WHEN eventType='commerce.productViews' THEN 1 ELSE 0 END) as productViews,
SUM(CASE WHEN eventType='commerce.productPurchases' THEN 1 ELSE 0 END) as productPurchases,
SUM(CASE WHEN eventType='decisioning.propositionDismiss' THEN 1 ELSE 0 END) as propositionDismisses,
SUM(CASE WHEN eventType='decisioning.propositionDisplay' THEN 1 ELSE 0 END) as propositionDisplays,
SUM(CASE WHEN eventType='decisioning.propositionInteract' THEN 1 ELSE 0 END) as propositionInteracts,
SUM(CASE WHEN eventType='directMarketing.emailClicked' THEN 1 ELSE 0 END) as emailClicks,
SUM(CASE WHEN eventType='directMarketing.emailOpened' THEN 1 ELSE 0 END) as emailOpens,
SUM(CASE WHEN eventType='directMarketing.emailSent' THEN 1 ELSE 0 END) as emailSends,
SUM(CASE WHEN eventType='web.webinteraction.linkClicks' THEN 1 ELSE 0 END) as webLinkClicks,
SUM(CASE WHEN eventType='web.webinteraction.pageViews' THEN 1 ELSE 0 END) as webPageViews
FROM {table_name}
GROUP BY userId
)
'''
large_correlation_df = qs_cursor.query(large_correlation_query, output="dataframe")
large_correlation_df
範例輸出:
將列轉換為事件型別關聯 event-type-correlation
接下來,將以上查詢輸出中的單一資料列轉換為表格,顯示每個事件型別與目標訂閱事件的關聯:
cols = large_correlation_df.columns
corrdf = large_correlation_df[[col for col in cols if ("webForms_" in col)]].melt()
corrdf["feature"] = corrdf["variable"].apply(lambda x: x.replace("webForms_", ""))
corrdf["pearsonCorrelation"] = corrdf["value"]
corrdf.fillna(0)
範例輸出:
webForms_EmailOpens
webForms_advertisingClicks
webForms_productViews
webForms_productPurchases
webForms_propositionDismisses
webForms_propositionInteracts
webForms_emailClicks
webForms_emailOpens
webForms_emailSends
webForms_webLinkClicks
webForms_webPageViews
最後,您可以視覺化與matplotlib
Python資料庫的關聯:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(5,10))
sns.barplot(data=corrdf.fillna(0), y="feature", x="pearsonCorrelation")
ax.set_title("Pearson Correlation of Events with the outcome event")
後續步驟
閱讀本檔案後,您已瞭解如何使用Data Distiller來探索和分析Python筆記本中的資料。 在機器學習環境中建立從Experience Platform到摘要自訂模型的功能管道的下一個步驟是為機器學習設計功能。