Funzioni di ingegnere per l’apprendimento automatico
- Argomenti:
- Query
Creato per:
- Utente
- Sviluppatore
In questo documento viene illustrato come trasformare i dati in Adobe Experience Platform in funzionalità o variabili utilizzabili da un modello di apprendimento automatico. Questo processo è denominato ingegneria delle funzionalità. Utilizza Data Distiller per calcolare le funzioni ML su larga scala e condividerle con il tuo ambiente di apprendimento automatico. Ciò comporta quanto segue:
- Creare un modello di query per definire le etichette di destinazione e le feature da calcolare per il modello
- Eseguire la query e memorizzare i risultati in un set di dati di formazione
Definire i dati di apprendimento
L’esempio seguente illustra una query per derivare i dati di formazione da un set di dati Experience Events per un modello al fine di prevedere la propensione di un utente ad abbonarsi a una newsletter. Gli eventi di abbonamento sono rappresentati dal tipo di evento web.formFilledOut
e altri eventi comportamentali nel set di dati vengono utilizzati per derivare funzionalità a livello di profilo per prevedere gli abbonamenti.
Etichette per query positive e negative
Un set di dati completo per l’addestramento di un modello di apprendimento automatico (supervisionato) include una variabile target o un’etichetta che rappresenta il risultato da prevedere e un set di funzioni o variabili esplicative utilizzate per descrivere i profili di esempio utilizzati per addestrare il modello.
In questo caso, l'etichetta è una variabile denominata subscriptionOccurred
che è uguale a 1 se il profilo utente ha un evento con tipo web.formFilledOut
e a 0 in caso contrario. La query seguente restituisce un set di 50.000 utenti dal set di dati degli eventi, inclusi tutti gli utenti con etichette positive (subscriptionOccurred = 1
) più un set di utenti selezionati in modo casuale con etichette negative per completare la dimensione del campione di 50.000 utenti. In questo modo i dati di formazione includono esempi positivi e negativi da cui il modello può imparare.
from aepp import queryservice
dd_conn = queryservice.QueryService().connection()
dd_cursor = queryservice.InteractiveQuery2(dd_conn)
query_labels = f"""
SELECT *
FROM (
SELECT
eventType,
_{tenant_id}.user_id as userId,
SUM(CASE WHEN eventType='web.formFilledOut' THEN 1 ELSE 0 END)
OVER (PARTITION BY _{tenant_id}.user_id)
AS "subscriptionOccurred",
row_number() OVER (PARTITION BY _{tenant_id}.user_id ORDER BY randn()) AS random_row_number_for_user
FROM {table_name}
)
WHERE (subscriptionOccurred = 1 AND eventType = 'web.formFilledOut') OR (subscriptionOccurred = 0 AND random_row_number_for_user = 1)
"""
df_labels = dd_cursor.query(query_labels, output="dataframe")
print(f"Number of classes: {len(df_labels)}")
df_labels.head()
Output di esempio
Numero di classi: 50000
eventType | userId | subscriptionOccurred | random_row_number_for_user | |
---|---|---|---|---|
0 | directMarketing.emailClicked | 01027994177972439148069092698714414382 | 0 | 1 |
1 | directMarketing.emailOpened | 01054714817856066632264746967668888198 | 0 | 1 |
2 | web.formFilledOut | 01117296890525140996735553609305695042 | 1 | 15 |
3 | directMarketing.emailClicked | 01149554820363915324573708359099551093 | 0 | 1 |
4 | directMarketing.emailClicked | 01172121447143590196349410086995740317 | 0 | 1 |
Aggregare gli eventi per definire le funzioni per ML
Con una query appropriata puoi raccogliere gli eventi nel set di dati in funzioni numeriche significative che possono essere utilizzate per addestrare un modello di tendenza. Di seguito sono riportati alcuni eventi di esempio:
- Numero di e-mail inviate per scopi di marketing e ricevute dall'utente.
- Parte di queste e-mail aperte .
- Parte di queste e-mail in cui l'utente ha selezionato il collegamento.
- Numero di prodotti visualizzati.
- Numero di proposte con cui si è interagito.
- Numero di proposte ignorate.
- Numero di collegamenti selezionati.
- Numero di minuti tra due e-mail consecutive ricevute.
- Numero di minuti tra due e-mail consecutive aperte.
- Numero di minuti tra due e-mail consecutive in cui l’utente ha effettivamente selezionato il collegamento.
- Numero di minuti tra due visualizzazioni di prodotto consecutive.
- Numero di minuti tra due proposte con cui si è interagito.
- Numero di minuti tra due proposte che sono state ignorate.
- Numero di minuti tra due collegamenti selezionati.
La query seguente aggrega questi eventi:
query_features = f"""
SELECT
_{tenant_id}.user_id as userId,
SUM(CASE WHEN eventType='directMarketing.emailSent' THEN 1 ELSE 0 END)
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "emailsReceived",
SUM(CASE WHEN eventType='directMarketing.emailOpened' THEN 1 ELSE 0 END)
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "emailsOpened",
SUM(CASE WHEN eventType='directMarketing.emailClicked' THEN 1 ELSE 0 END)
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "emailsClicked",
SUM(CASE WHEN eventType='commerce.productViews' THEN 1 ELSE 0 END)
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "productsViewed",
SUM(CASE WHEN eventType='decisioning.propositionInteract' THEN 1 ELSE 0 END)
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "propositionInteracts",
SUM(CASE WHEN eventType='decisioning.propositionDismiss' THEN 1 ELSE 0 END)
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "propositionDismissed",
SUM(CASE WHEN eventType='web.webinteraction.linkClicks' THEN 1 ELSE 0 END)
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "webLinkClicks" ,
TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'directMarketing.emailSent', 'minutes')
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "minutes_since_emailSent",
TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'directMarketing.emailOpened', 'minutes')
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "minutes_since_emailOpened",
TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'directMarketing.emailClicked', 'minutes')
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "minutes_since_emailClick",
TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'commerce.productViews', 'minutes')
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "minutes_since_productView",
TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'decisioning.propositionInteract', 'minutes')
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "minutes_since_propositionInteract",
TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'propositionDismiss', 'minutes')
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "minutes_since_propositionDismiss",
TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'web.webinteraction.linkClicks', 'minutes')
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "minutes_since_linkClick"
FROM {table_name}
"""
df_features = dd_cursor.query(query_features, output="dataframe")
df_features.head()
Output di esempio
Combinare le query di etichette e funzionalità
Infine, la query delle etichette e la query delle funzionalità possono essere combinate in un’unica query che restituisce un set di dati di formazione di etichette e funzionalità:
query_training_set = f"""
SELECT *
FROM (
SELECT _{tenant_id}.user_id as userId,
eventType,
timestamp,
SUM(CASE WHEN eventType='web.formFilledOut' THEN 1 ELSE 0 END)
OVER (PARTITION BY _{tenant_id}.user_id)
AS "subscriptionOccurred",
SUM(CASE WHEN eventType='directMarketing.emailSent' THEN 1 ELSE 0 END)
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "emailsReceived",
SUM(CASE WHEN eventType='directMarketing.emailOpened' THEN 1 ELSE 0 END)
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "emailsOpened",
SUM(CASE WHEN eventType='directMarketing.emailClicked' THEN 1 ELSE 0 END)
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "emailsClicked",
SUM(CASE WHEN eventType='commerce.productViews' THEN 1 ELSE 0 END)
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "productsViewed",
SUM(CASE WHEN eventType='decisioning.propositionInteract' THEN 1 ELSE 0 END)
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "propositionInteracts",
SUM(CASE WHEN eventType='decisioning.propositionDismiss' THEN 1 ELSE 0 END)
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "propositionDismissed",
SUM(CASE WHEN eventType='web.webinteraction.linkClicks' THEN 1 ELSE 0 END)
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "webLinkClicks" ,
TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'directMarketing.emailSent', 'minutes')
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "minutes_since_emailSent",
TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'directMarketing.emailOpened', 'minutes')
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "minutes_since_emailOpened",
TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'directMarketing.emailClicked', 'minutes')
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "minutes_since_emailClick",
TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'commerce.productViews', 'minutes')
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "minutes_since_productView",
TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'decisioning.propositionInteract', 'minutes')
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "minutes_since_propositionInteract",
TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'propositionDismiss', 'minutes')
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "minutes_since_propositionDismiss",
TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'web.webinteraction.linkClicks', 'minutes')
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "minutes_since_linkClick",
row_number() OVER (PARTITION BY _{tenant_id}.user_id ORDER BY randn()) AS random_row_number_for_user
FROM {table_name} LIMIT 1000
)
WHERE (subscriptionOccurred = 1 AND eventType = 'web.formFilledOut') OR (subscriptionOccurred = 0 AND random_row_number_for_user = 1)
ORDER BY timestamp
"""
df_training_set = dd_cursor.query(query_training_set, output="dataframe")
df_training_set.head()
Output di esempio
Crea un modello di query per calcolare in modo incrementale i dati di apprendimento
È tipico riaddestrare periodicamente un modello con dati di addestramento aggiornati per mantenere la precisione del modello nel tempo. Come best practice per aggiornare in modo efficiente il set di dati di formazione, puoi creare un modello dalla query del set di formazione per calcolare i nuovi dati di formazione in modo incrementale. Questo consente di calcolare etichette e funzioni solo dai dati aggiunti al set di dati Experience Events originale dall’ultimo aggiornamento dei dati di formazione e di inserire le nuove etichette e funzioni nel set di dati di formazione esistente.
Per farlo, è necessario apportare alcune modifiche alla query del set di apprendimento:
-
Aggiungi la logica per creare un nuovo set di dati di apprendimento, se non esiste, e in caso contrario inserisci le nuove etichette e funzioni nel set di dati di apprendimento esistente. Ciò richiede una serie di due versioni della query del set di apprendimento:
- Utilizzare innanzitutto l'istruzione
CREATE TABLE IF NOT EXISTS {table_name} AS
- Quindi, utilizzando l’istruzione
INSERT INTO {table_name}
per il caso in cui il set di dati di formazione esiste già
- Utilizzare innanzitutto l'istruzione
-
Aggiungere un'istruzione
SNAPSHOT BETWEEN $from_snapshot_id AND $to_snapshot_id
per limitare la query ai dati evento aggiunti entro un intervallo specificato. Il prefisso$
sugli ID snapshot indica che si tratta di variabili che verranno passate quando viene eseguito il modello di query.
L’applicazione di tali modifiche determina la seguente query:
ctas_table_name = "propensity_training_set"
query_training_set_template = f"""
$$ BEGIN
SET @my_table_exists = SELECT table_exists('{ctas_table_name}');
CREATE TABLE IF NOT EXISTS {ctas_table_name} AS
SELECT *
FROM (
SELECT _{tenant_id}.user_id as userId,
eventType,
timestamp,
SUM(CASE WHEN eventType='web.formFilledOut' THEN 1 ELSE 0 END)
OVER (PARTITION BY _{tenant_id}.user_id)
AS "subscriptionOccurred",
SUM(CASE WHEN eventType='directMarketing.emailSent' THEN 1 ELSE 0 END)
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "emailsReceived",
SUM(CASE WHEN eventType='directMarketing.emailOpened' THEN 1 ELSE 0 END)
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "emailsOpened",
SUM(CASE WHEN eventType='directMarketing.emailClicked' THEN 1 ELSE 0 END)
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "emailsClicked",
SUM(CASE WHEN eventType='commerce.productViews' THEN 1 ELSE 0 END)
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "productsViewed",
SUM(CASE WHEN eventType='decisioning.propositionInteract' THEN 1 ELSE 0 END)
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "propositionInteracts",
SUM(CASE WHEN eventType='decisioning.propositionDismiss' THEN 1 ELSE 0 END)
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "propositionDismissed",
SUM(CASE WHEN eventType='web.webinteraction.linkClicks' THEN 1 ELSE 0 END)
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "webLinkClicks" ,
TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'directMarketing.emailSent', 'minutes')
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "minutes_since_emailSent",
TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'directMarketing.emailOpened', 'minutes')
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "minutes_since_emailOpened",
TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'directMarketing.emailClicked', 'minutes')
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "minutes_since_emailClick",
TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'commerce.productViews', 'minutes')
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "minutes_since_productView",
TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'decisioning.propositionInteract', 'minutes')
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "minutes_since_propositionInteract",
TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'propositionDismiss', 'minutes')
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "minutes_since_propositionDismiss",
TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'web.webinteraction.linkClicks', 'minutes')
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "minutes_since_linkClick",
row_number() OVER (PARTITION BY _{tenant_id}.user_id ORDER BY randn()) AS random_row_number_for_user
FROM {table_name}
SNAPSHOT BETWEEN $from_snapshot_id AND $to_snapshot_id
)
WHERE (subscriptionOccurred = 1 AND eventType = 'web.formFilledOut') OR (subscriptionOccurred = 0 AND random_row_number_for_user = 1)
ORDER BY timestamp;
INSERT INTO {ctas_table_name}
SELECT *
FROM (
SELECT _{tenant_id}.user_id as userId,
eventType,
timestamp,
SUM(CASE WHEN eventType='web.formFilledOut' THEN 1 ELSE 0 END)
OVER (PARTITION BY _{tenant_id}.user_id)
AS "subscriptionOccurred",
SUM(CASE WHEN eventType='directMarketing.emailSent' THEN 1 ELSE 0 END)
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "emailsReceived",
SUM(CASE WHEN eventType='directMarketing.emailOpened' THEN 1 ELSE 0 END)
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "emailsOpened",
SUM(CASE WHEN eventType='directMarketing.emailClicked' THEN 1 ELSE 0 END)
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "emailsClicked",
SUM(CASE WHEN eventType='commerce.productViews' THEN 1 ELSE 0 END)
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "productsViewed",
SUM(CASE WHEN eventType='decisioning.propositionInteract' THEN 1 ELSE 0 END)
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "propositionInteracts",
SUM(CASE WHEN eventType='decisioning.propositionDismiss' THEN 1 ELSE 0 END)
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "propositionDismissed",
SUM(CASE WHEN eventType='web.webinteraction.linkClicks' THEN 1 ELSE 0 END)
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "webLinkClicks" ,
TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'directMarketing.emailSent', 'minutes')
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "minutes_since_emailSent",
TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'directMarketing.emailOpened', 'minutes')
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "minutes_since_emailOpened",
TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'directMarketing.emailClicked', 'minutes')
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "minutes_since_emailClick",
TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'commerce.productViews', 'minutes')
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "minutes_since_productView",
TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'decisioning.propositionInteract', 'minutes')
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "minutes_since_propositionInteract",
TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'propositionDismiss', 'minutes')
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "minutes_since_propositionDismiss",
TIME_BETWEEN_PREVIOUS_MATCH(timestamp, eventType = 'web.webinteraction.linkClicks', 'minutes')
OVER (PARTITION BY _{tenant_id}.user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "minutes_since_linkClick",
row_number() OVER (PARTITION BY _{tenant_id}.user_id ORDER BY randn()) AS random_row_number_for_user
FROM {table_name}
SNAPSHOT BETWEEN $from_snapshot_id AND $to_snapshot_id
)
WHERE
@my_table_exists = 't' AND
((subscriptionOccurred = 1 AND eventType = 'web.formFilledOut') OR (subscriptionOccurred = 0 AND random_row_number_for_user = 1))
ORDER BY timestamp;
EXCEPTION
WHEN OTHER THEN
SELECT 'ERROR';
END $$;
"""
Infine, il codice seguente salva il modello di query in Data Distiller:
template_res = dd.createQueryTemplate({
"sql": query_training_set_template,
"queryParameters": {},
"name": "Template for propensity training data"
})
template_id = template_res["id"]
print(f"Template for propensity training data created as ID {template_id}")
Output di esempio
Template for propensity training data created as ID f3d1ec6b-40c2-4d13-93b6-734c1b3c7235
Una volta salvato il modello, è possibile eseguire la query in qualsiasi momento facendo riferimento all'ID modello e specificando l'intervallo di ID snapshot da includere nella query. La query seguente recupera gli snapshot del set di dati Experience Events originale:
query_snapshots = f"""
SELECT snapshot_id
FROM (
SELECT history_meta('{table_name}')
)
WHERE is_current = true OR snapshot_generation = 0
ORDER BY snapshot_generation ASC
"""
df_snapshots = dd_cursor.query(query_snapshots, output="dataframe")
Il codice seguente illustra l’esecuzione del modello di query utilizzando il primo e l’ultimo snapshot per eseguire query sull’intero set di dati:
snapshot_start_id = str(df_snapshots["snapshot_id"].iloc[0])
snapshot_end_id = str(df_snapshots["snapshot_id"].iloc[1])
query_final_res = qs.postQueries(
name=f"[CMLE][Week2] Query to generate training data created by {username}",
templateId=template_id,
queryParameters={
"from_snapshot_id": snapshot_start_id,
"to_snapshot_id": snapshot_end_id,
},
dbname=f"{cat_conn.sandbox}:all"
)
query_final_id = query_final_res["id"]
print(f"Query started successfully and got assigned ID {query_final_id} - it will take some time to execute")
Output di esempio
Query started successfully and got assigned ID c6ea5009-1315-4839-b072-089ae01e74fd - it will take some time to execute
È possibile definire la seguente funzione per controllare periodicamente lo stato della query:
def wait_for_query_completion(query_id):
while True:
query_info = qs.getQuery(query_id)
query_state = query_info["state"]
if query_state in ["SUCCESS", "FAILED"]:
break
print("Query is still in progress, sleeping…")
time.sleep(60)
duration_secs = query_info["elapsedTime"] / 1000
if query_state == "SUCCESS":
print(f"Query completed successfully in {duration_secs} seconds")
else:
print(f"Query failed with the following errors:", file=sys.stderr)
for error in query_info["errors"]:
print(f"Error code {error['code']}: {error['message']}", file=sys.stderr)
wait_for_query_completion(query_final_id)
Output di esempio
Query is still in progress, sleeping…
Query is still in progress, sleeping…
Query is still in progress, sleeping…
Query is still in progress, sleeping…
Query is still in progress, sleeping…
Query is still in progress, sleeping…
Query is still in progress, sleeping…
Query is still in progress, sleeping…
Query completed successfully in 473.8 seconds
Passaggi successivi:
Leggendo questo documento hai imparato a trasformare i dati in Adobe Experience Platform in funzioni o variabili che possono essere utilizzate da un modello di apprendimento automatico. Il passaggio successivo nella creazione di pipeline di funzionalità da Experience Platform per alimentare modelli personalizzati nell'ambiente di apprendimento automatico è esportare set di dati di funzionalità.