머신 러닝을 위한 엔지니어 기능

이 문서에서는 Adobe Experience Platform의 데이터를 기계 학습 모델에서 사용할 수 있는 기능 또는 변수로 변환하는 방법을 보여 줍니다. 이 프로세스를 기능 엔지니어링 ​이라고 합니다. Data Distiller을 사용하여 규모에 맞게 ML 기능을 계산하고 이러한 기능을 머신 러닝 환경에 공유할 수 있습니다. 여기에는 다음 항목이 포함됩니다.

  1. 쿼리 템플릿을 만들어 모델에 대해 계산할 대상 레이블과 기능을 정의합니다
  2. 쿼리를 실행하고 결과를 교육 데이터 세트에 저장

교육 데이터 정의 define-training-data

다음 예제에서는 사용자의 뉴스레터 구독 성향을 예측하기 위해 모델에 대한 경험 이벤트 데이터 세트에서 교육 데이터를 도출하는 쿼리를 보여 줍니다. 구독 이벤트는 이벤트 유형 web.formFilledOut에 의해 표시되고 데이터 집합의 다른 동작 이벤트는 구독을 예측하는 프로필 수준 기능을 파생하는 데 사용됩니다.

양수 및 음수 레이블 쿼리 query-positive-and-negative-labels

(감독된) 기계 학습 모델을 교육하기 위한 전체 데이터 세트에는 예측될 결과를 나타내는 대상 변수 또는 레이블과 모델을 교육하는 데 사용되는 예제 프로필을 설명하는 데 사용되는 기능 또는 설명 변수 세트가 포함됩니다.

이 경우 레이블은 subscriptionOccurred이라는 변수입니다. 이 변수는 사용자 프로필에 유형이 web.formFilledOut인 이벤트가 있는 경우 1이고, 그렇지 않은 경우 0입니다. 다음 쿼리는 양의 레이블(subscriptionOccurred = 1)을 가진 모든 사용자와 음의 레이블을 가진 임의로 선택한 사용자 집합을 포함하여 이벤트 데이터 집합에서 50,000명의 사용자 집합을 반환하여 50,000명의 사용자 샘플 크기를 완료합니다. 이렇게 하면 학습 데이터에 모델이 학습할 긍정적인 사례와 부정적인 예가 모두 포함됩니다.

from aepp import queryservice

dd_conn = queryservice.QueryService().connection()
dd_cursor = queryservice.InteractiveQuery2(dd_conn)

query_labels = f"""
SELECT *
FROM (
    SELECT
        eventType,
        _{tenant_id}.user_id as userId,
        SUM(CASE WHEN eventType='web.formFilledOut' THEN 1 ELSE 0 END)
            OVER (PARTITION BY _{tenant_id}.user_id)
            AS "subscriptionOccurred",
        row_number() OVER (PARTITION BY _{tenant_id}.user_id ORDER BY randn()) AS random_row_number_for_user
    FROM {table_name}
)
WHERE (subscriptionOccurred = 1 AND eventType = 'web.formFilledOut') OR (subscriptionOccurred = 0 AND random_row_number_for_user = 1)
"""

df_labels = dd_cursor.query(query_labels, output="dataframe")
print(f"Number of classes: {len(df_labels)}")
df_labels.head()

샘플 출력

클래스 수: 50000

eventType
userId
subscriptionOccured
random_row_number_for_user
0
directMarketing.emailClicked
01027994177972439148069092698714414382
0
1
1
directMarketing.emailOpened
01054714817856066632264746967668888198
0
1
2
web.formFilledOut
01117296890525140996735553609305695042
1
15
3
directMarketing.emailClicked
01149554820363915324573708359099551093
0
1
4
directMarketing.emailClicked
01172121447143590196349410086995740317
0
1

이벤트를 집계하여 ML용 기능 정의 define-features

적절한 쿼리를 통해 데이터 세트의 이벤트를 취합하여 성향 모델을 교육하는 데 사용할 수 있는 의미 있는 숫자 기능으로 만들 수 있습니다. 예시 이벤트는 아래에 나와 있습니다.

  • 마케팅 용도로 보내고 사용자가 받은 전자 메일 수 ​입니다.
  • 열림 ​된 전자 메일 중 일부입니다.
  • 사용자가 링크를 선택 ​한 전자 메일 중 일부입니다.
  • 조회된 제품 수.
  • 과(와) 상호 작용한 ​제안 수입니다.
  • 해제된 제안 수.
  • 선택한 링크 수.
  • 두 개의 연속 이메일 수신 시간(분)
  • 두 개의 연속 이메일이 열린 시간(분).
  • 사용자가 실제로 링크를 선택한 두 개의 연속 이메일 사이의 시간(분).
  • 두 개의 연속 제품 보기 간 시간(분).
  • 상호 작용한 두 제안 사이의 시간(분).
  • 기각된 두 제안 사이의 분 수.
  • 선택한 두 링크 간의 시간(분).

다음 쿼리는 이러한 이벤트를 집계합니다.

예제 쿼리를 보려면 선택
code language-python
query_features = f"""
SELECT
    _{tenant_id}.user_id as userId,
    SUM(CASE WHEN eventType='directMarketing.emailSent' THEN 1 ELSE 0 END)
       OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
       AS "emailsReceived",
    SUM(CASE WHEN eventType='directMarketing.emailOpened' THEN 1 ELSE 0 END)
       OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
       AS "emailsOpened",
    SUM(CASE WHEN eventType='directMarketing.emailClicked' THEN 1 ELSE 0 END)
       OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
       AS "emailsClicked",
    SUM(CASE WHEN eventType='commerce.productViews' THEN 1 ELSE 0 END)
       OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
       AS "productsViewed",
    SUM(CASE WHEN eventType='decisioning.propositionInteract' THEN 1 ELSE 0 END)
       OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
       AS "propositionInteracts",
    SUM(CASE WHEN eventType='decisioning.propositionDismiss' THEN 1 ELSE 0 END)
       OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
       AS "propositionDismissed",
    SUM(CASE WHEN eventType='web.webinteraction.linkClicks' THEN 1 ELSE 0 END)
       OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
       AS "webLinkClicks" ,
    TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'directMarketing.emailSent', 'minutes')
       OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
       AS "minutes_since_emailSent",
    TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'directMarketing.emailOpened', 'minutes')
       OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
       AS "minutes_since_emailOpened",
    TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'directMarketing.emailClicked', 'minutes')
       OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
       AS "minutes_since_emailClick",
    TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'commerce.productViews', 'minutes')
       OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
       AS "minutes_since_productView",
    TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'decisioning.propositionInteract', 'minutes')
       OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
       AS "minutes_since_propositionInteract",
    TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'propositionDismiss', 'minutes')
       OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
       AS "minutes_since_propositionDismiss",
    TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'web.webinteraction.linkClicks', 'minutes')
       OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
       AS "minutes_since_linkClick"
FROM {table_name}
"""

df_features = dd_cursor.query(query_features, output="dataframe")
df_features.head()

샘플 출력

userId
이메일 수신됨
이메일 열림
이메일 클릭됨
productsViewed
propositionInteract
propositionDismissed
webLinkClicks
minutes_since_emailSent
minutes_since_emailOpened
minutes_since_emailClick
minutes_since_productView
minutes_since_propositionInteract
minutes_since_propositionDismiss
minutes_since_linkClick
0
01102546977582484968046916668339306826
1
0
0
0
0
0
0
0.0
Nan
Nan
Nan
Nan
None
Nan
1
01102546977582484968046916668339306826
2
0
0
0
0
0
0
0.0
Nan
Nan
Nan
Nan
None
Nan
2
01102546977582484968046916668339306826
3
0
0
0
0
0
0
0.0
Nan
Nan
Nan
Nan
None
Nan
3
01102546977582484968046916668339306826
3
1
0
0
0
0
0
540.0
0.0
Nan
Nan
Nan
None
Nan
4
01102546977582484968046916668339306826
3
2
0
0
0
0
0
588.0
0.0
Nan
Nan
Nan
None
Nan

레이블 및 기능 쿼리 결합 combine-queries

마지막으로 레이블 쿼리 및 기능 쿼리를 결합하여 레이블 및 기능의 교육 데이터 세트를 반환하는 단일 쿼리로 만들 수 있습니다.

예제 쿼리를 보려면 선택
code language-python
query_training_set = f"""
SELECT *
FROM (
    SELECT _{tenant_id}.user_id as userId,
       eventType,
       timestamp,
       SUM(CASE WHEN eventType='web.formFilledOut' THEN 1 ELSE 0 END)
           OVER (PARTITION BY _{tenant_id}.user_id)
           AS "subscriptionOccurred",
       SUM(CASE WHEN eventType='directMarketing.emailSent' THEN 1 ELSE 0 END)
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "emailsReceived",
       SUM(CASE WHEN eventType='directMarketing.emailOpened' THEN 1 ELSE 0 END)
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "emailsOpened",
       SUM(CASE WHEN eventType='directMarketing.emailClicked' THEN 1 ELSE 0 END)
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "emailsClicked",
       SUM(CASE WHEN eventType='commerce.productViews' THEN 1 ELSE 0 END)
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "productsViewed",
       SUM(CASE WHEN eventType='decisioning.propositionInteract' THEN 1 ELSE 0 END)
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "propositionInteracts",
       SUM(CASE WHEN eventType='decisioning.propositionDismiss' THEN 1 ELSE 0 END)
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "propositionDismissed",
       SUM(CASE WHEN eventType='web.webinteraction.linkClicks' THEN 1 ELSE 0 END)
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "webLinkClicks" ,
       TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'directMarketing.emailSent', 'minutes')
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "minutes_since_emailSent",
       TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'directMarketing.emailOpened', 'minutes')
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "minutes_since_emailOpened",
       TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'directMarketing.emailClicked', 'minutes')
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "minutes_since_emailClick",
       TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'commerce.productViews', 'minutes')
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "minutes_since_productView",
       TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'decisioning.propositionInteract', 'minutes')
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "minutes_since_propositionInteract",
       TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'propositionDismiss', 'minutes')
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "minutes_since_propositionDismiss",
       TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'web.webinteraction.linkClicks', 'minutes')
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "minutes_since_linkClick",
        row_number() OVER (PARTITION BY _{tenant_id}.user_id ORDER BY randn()) AS random_row_number_for_user
    FROM {table_name} LIMIT 1000
)
WHERE (subscriptionOccurred = 1 AND eventType = 'web.formFilledOut') OR (subscriptionOccurred = 0 AND random_row_number_for_user = 1)
ORDER BY timestamp
"""

df_training_set = dd_cursor.query(query_training_set, output="dataframe")
df_training_set.head()

샘플 출력

userId
eventType
타임스탬프
subscriptionOccured
이메일 수신됨
이메일 열림
이메일 클릭됨
productsViewed
propositionInteract
propositionDismissed
webLinkClicks
minutes_since_emailSent
minutes_since_emailOpened
minutes_since_emailClick
minutes_since_productView
minutes_since_propositionInteract
minutes_since_propositionDismiss
minutes_since_linkClick
random_row_number_for_user
0
02554909162592418347780983091131567290
directMarketing.emailSent
2023-06-17 13:44:59.086
0
2
0
0
0
0
0
0
0.0
Nan
Nan
Nan
Nan
None
Nan
1
1
01130334080340815140184601481559659945
directMarketing.emailOpened
2023-06-19 06:01:55.366
0
1
3
0
1
0
0
0
1921.0
0.0
Nan
1703.0
Nan
None
Nan
1
2
01708961660028351393477273586554010192
web.formFilledOut
2023-06-19 18:36:49.083
1
1
2
2
0
0
0
0
2365.0
26.0
1.0
Nan
Nan
None
Nan
7
3
01809182902320674899156240602124740853
directMarketing.emailSent
2023-06-21 19:17:12.535
0
1
0
0
0
0
0
0
0.0
Nan
Nan
Nan
Nan
None
Nan
1
4
03441761949943678951106193028739001197
directMarketing.emailSent
2023-06-21 21:58:29.482
0
1
0
0
0
0
0
0
0.0
Nan
Nan
Nan
Nan
None
Nan
1

쿼리 템플릿을 만들어 교육 데이터를 증분적으로 계산

시간에 따라 모델의 정확도를 유지하기 위해 업데이트된 트레이닝 데이터로 모델을 주기적으로 재트레이닝하는 것이 일반적이다. 교육 데이터 세트를 효율적으로 업데이트하는 우수 사례로, 교육 세트 쿼리에서 템플릿을 만들어 새 교육 데이터를 증분적으로 계산할 수 있습니다. 이렇게 하면 교육 데이터가 마지막으로 업데이트된 이후 원래 경험 이벤트 데이터 세트에 추가된 데이터에서만 레이블 및 기능을 계산하고 새 레이블 및 기능을 기존 교육 데이터 세트에 삽입할 수 있습니다.

이렇게 하려면 교육 세트 쿼리를 몇 가지 수정해야 합니다.

  • 존재하지 않는 경우 새 교육 데이터 세트를 만들고, 그렇지 않은 경우 기존 교육 데이터 세트에 새 레이블과 기능을 삽입하는 논리를 추가합니다. 이렇게 하려면 일련의 두 버전의 교육 집합 쿼리가 필요합니다.

    • 먼저 CREATE TABLE IF NOT EXISTS {table_name} AS 문을 사용합니다.
    • 다음으로, 교육 데이터 세트가 이미 있는 경우 INSERT INTO {table_name} 문을 사용합니다.
  • 쿼리를 지정된 간격 내에 추가된 이벤트 데이터로 제한하려면 SNAPSHOT BETWEEN $from_snapshot_id AND $to_snapshot_id 문을 추가하십시오. 스냅숏 ID의 $ 접두사는 쿼리 템플릿이 실행될 때 전달될 변수임을 나타냅니다.

이러한 변경 사항을 적용하면 다음 쿼리가 생성됩니다.

예제 쿼리를 보려면 선택
code language-python
ctas_table_name = "propensity_training_set"

query_training_set_template = f"""
$$ BEGIN

SET @my_table_exists = SELECT table_exists('{ctas_table_name}');

CREATE TABLE IF NOT EXISTS {ctas_table_name} AS
SELECT *
FROM (
    SELECT _{tenant_id}.user_id as userId,
       eventType,
       timestamp,
       SUM(CASE WHEN eventType='web.formFilledOut' THEN 1 ELSE 0 END)
           OVER (PARTITION BY _{tenant_id}.user_id)
           AS "subscriptionOccurred",
       SUM(CASE WHEN eventType='directMarketing.emailSent' THEN 1 ELSE 0 END)
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "emailsReceived",
       SUM(CASE WHEN eventType='directMarketing.emailOpened' THEN 1 ELSE 0 END)
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "emailsOpened",
       SUM(CASE WHEN eventType='directMarketing.emailClicked' THEN 1 ELSE 0 END)
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "emailsClicked",
       SUM(CASE WHEN eventType='commerce.productViews' THEN 1 ELSE 0 END)
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "productsViewed",
       SUM(CASE WHEN eventType='decisioning.propositionInteract' THEN 1 ELSE 0 END)
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "propositionInteracts",
       SUM(CASE WHEN eventType='decisioning.propositionDismiss' THEN 1 ELSE 0 END)
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "propositionDismissed",
       SUM(CASE WHEN eventType='web.webinteraction.linkClicks' THEN 1 ELSE 0 END)
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "webLinkClicks" ,
       TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'directMarketing.emailSent', 'minutes')
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "minutes_since_emailSent",
       TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'directMarketing.emailOpened', 'minutes')
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "minutes_since_emailOpened",
       TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'directMarketing.emailClicked', 'minutes')
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "minutes_since_emailClick",
       TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'commerce.productViews', 'minutes')
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "minutes_since_productView",
       TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'decisioning.propositionInteract', 'minutes')
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "minutes_since_propositionInteract",
       TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'propositionDismiss', 'minutes')
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "minutes_since_propositionDismiss",
       TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'web.webinteraction.linkClicks', 'minutes')
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "minutes_since_linkClick",
        row_number() OVER (PARTITION BY _{tenant_id}.user_id ORDER BY randn()) AS random_row_number_for_user
    FROM {table_name}
    SNAPSHOT BETWEEN $from_snapshot_id AND $to_snapshot_id
)
WHERE (subscriptionOccurred = 1 AND eventType = 'web.formFilledOut') OR (subscriptionOccurred = 0 AND random_row_number_for_user = 1)
ORDER BY timestamp;

INSERT INTO {ctas_table_name}
SELECT *
FROM (
    SELECT _{tenant_id}.user_id as userId,
       eventType,
       timestamp,
       SUM(CASE WHEN eventType='web.formFilledOut' THEN 1 ELSE 0 END)
           OVER (PARTITION BY _{tenant_id}.user_id)
           AS "subscriptionOccurred",
       SUM(CASE WHEN eventType='directMarketing.emailSent' THEN 1 ELSE 0 END)
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "emailsReceived",
       SUM(CASE WHEN eventType='directMarketing.emailOpened' THEN 1 ELSE 0 END)
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "emailsOpened",
       SUM(CASE WHEN eventType='directMarketing.emailClicked' THEN 1 ELSE 0 END)
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "emailsClicked",
       SUM(CASE WHEN eventType='commerce.productViews' THEN 1 ELSE 0 END)
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "productsViewed",
       SUM(CASE WHEN eventType='decisioning.propositionInteract' THEN 1 ELSE 0 END)
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "propositionInteracts",
       SUM(CASE WHEN eventType='decisioning.propositionDismiss' THEN 1 ELSE 0 END)
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "propositionDismissed",
       SUM(CASE WHEN eventType='web.webinteraction.linkClicks' THEN 1 ELSE 0 END)
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "webLinkClicks" ,
       TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'directMarketing.emailSent', 'minutes')
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "minutes_since_emailSent",
       TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'directMarketing.emailOpened', 'minutes')
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "minutes_since_emailOpened",
       TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'directMarketing.emailClicked', 'minutes')
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "minutes_since_emailClick",
       TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'commerce.productViews', 'minutes')
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "minutes_since_productView",
       TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'decisioning.propositionInteract', 'minutes')
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "minutes_since_propositionInteract",
       TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'propositionDismiss', 'minutes')
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "minutes_since_propositionDismiss",
       TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'web.webinteraction.linkClicks', 'minutes')
           OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS "minutes_since_linkClick",
        row_number() OVER (PARTITION BY _{tenant_id}.user_id ORDER BY randn()) AS random_row_number_for_user
    FROM {table_name}
    SNAPSHOT BETWEEN $from_snapshot_id AND $to_snapshot_id
)
WHERE
    @my_table_exists = 't' AND
    ((subscriptionOccurred = 1 AND eventType = 'web.formFilledOut') OR (subscriptionOccurred = 0 AND random_row_number_for_user = 1))
ORDER BY timestamp;

EXCEPTION
  WHEN OTHER THEN
    SELECT 'ERROR';

END $$;
"""

마지막으로 다음 코드는 쿼리 템플릿을 데이터 Distiller에 저장합니다.

template_res = dd.createQueryTemplate({
  "sql": query_training_set_template,
  "queryParameters": {},
  "name": "Template for propensity training data"
})
template_id = template_res["id"]

print(f"Template for propensity training data created as ID {template_id}")

샘플 출력

Template for propensity training data created as ID f3d1ec6b-40c2-4d13-93b6-734c1b3c7235

템플릿이 저장된 경우 언제든지 템플릿 ID를 참조하여 쿼리를 실행하고 쿼리에 포함되어야 하는 스냅숏 ID의 범위를 지정할 수 있습니다. 다음 쿼리는 원래 경험 이벤트 데이터 세트의 스냅샷을 검색합니다.

query_snapshots = f"""
SELECT snapshot_id
FROM (
    SELECT history_meta('{table_name}')
)
WHERE is_current = true OR snapshot_generation = 0
ORDER BY snapshot_generation ASC
"""

df_snapshots = dd_cursor.query(query_snapshots, output="dataframe")

다음 코드는 첫 번째 및 마지막 스냅샷을 사용하여 전체 데이터 세트를 쿼리하는 쿼리 템플릿 실행을 보여 줍니다.

snapshot_start_id = str(df_snapshots["snapshot_id"].iloc[0])
snapshot_end_id = str(df_snapshots["snapshot_id"].iloc[1])

query_final_res = qs.postQueries(
    name=f"[CMLE][Week2] Query to generate training data created by {username}",
    templateId=template_id,
    queryParameters={
        "from_snapshot_id": snapshot_start_id,
        "to_snapshot_id": snapshot_end_id,
    },
    dbname=f"{cat_conn.sandbox}:all"
)
query_final_id = query_final_res["id"]
print(f"Query started successfully and got assigned ID {query_final_id} - it will take some time to execute")

샘플 출력

Query started successfully and got assigned ID c6ea5009-1315-4839-b072-089ae01e74fd - it will take some time to execute

다음 함수를 정의하여 쿼리의 상태를 정기적으로 확인할 수 있습니다.

def wait_for_query_completion(query_id):
    while True:
        query_info = qs.getQuery(query_id)
        query_state = query_info["state"]
        if query_state in ["SUCCESS", "FAILED"]:
            break
        print("Query is still in progress, sleeping…")
        time.sleep(60)

    duration_secs = query_info["elapsedTime"] / 1000
    if query_state == "SUCCESS":
        print(f"Query completed successfully in {duration_secs} seconds")
    else:
        print(f"Query failed with the following errors:", file=sys.stderr)
        for error in query_info["errors"]:
            print(f"Error code {error['code']}: {error['message']}", file=sys.stderr)

wait_for_query_completion(query_final_id)

샘플 출력

Query is still in progress, sleeping…
Query is still in progress, sleeping…
Query is still in progress, sleeping…
Query is still in progress, sleeping…
Query is still in progress, sleeping…
Query is still in progress, sleeping…
Query is still in progress, sleeping…
Query is still in progress, sleeping…
Query completed successfully in 473.8 seconds

다음 단계:

이 문서를 읽고 Adobe Experience Platform의 데이터를 기계 학습 모델에서 사용할 수 있는 기능 또는 변수로 변환하는 방법을 배웠습니다. 머신 러닝 환경에서 Experience Platform에서 피드 사용자 지정 모델까지 기능 파이프라인을 만드는 다음 단계는 기능 데이터 세트를 내보내기하는 것입니다.

recommendation-more-help
ccf2b369-4031-483f-af63-a93b5ae5e3fb