15.3 Operationalize your model using a Recipe

In this exercise, you'll take your model and operationalize it in Adobe Experience Platform by creating a recipe.

The Recipe Builder notebook templatizes your model so that it can be packaged and operationalized automatically. The notebook has several template cells into which you fit your model code:

  • The Requirements and Configuration cells allow you to add additional libraries and to configure datasets and tuning parameters for your model
  • The Evaluator cell allows you to split your data and to evaluate the performance of your model
  • The Training and Scoring Data Loader cells allow you to load the data you need for training and scoring
  • Finally, the Pipeline cell contains the logic you need for both training and scoring your model (together, these cells implement the contract sketched below).
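
For orientation, here is a minimal, illustrative sketch of the contract those cells implement. The signatures mirror the code you'll paste later in this exercise; this is a sketch, not official Adobe API documentation.

# Illustrative sketch of the Recipe Builder cell contract; the signatures
# mirror the cells shown later in this exercise.

# Each Data Loader cell (training and scoring) exposes:
def load(config_properties):
    ...  # read a Platform dataset and return a prepared pandas DataFrame

# The Pipeline cell exposes:
def train(config_properties, data):
    ...  # fit and return a model object

def score(config_properties, data, model):
    ...  # return a DataFrame of predictions

# The Data Saver cell exposes:
def save(config_properties, prediction):
    ...  # write the predictions back to a Platform dataset

# The Evaluator cell defines an Evaluator class with split() and evaluate().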

We've simplified the steps needed to operationalize your model: you train, evaluate, and score it at scale, and then package it, in one seamless flow on Adobe Experience Platform. Packaging to a recipe also allows you to reuse the same code with different datasets to support different use cases in your organization. Our specific use case revolves around using the recommendation model code for users who are searching for products to buy on the website.

Log in to Adobe Experience Platform by going to this URL: https://experience.adobe.com/platform

After logging in, you'll land on the homepage of Adobe Experience Platform.

Before you continue, you need to select a sandbox. The sandbox to select is named --aepSandboxId--. You can do this by clicking the text Production Prod in the blue line at the top of your screen.

After selecting the appropriate sandbox, the screen changes and you're now in your dedicated sandbox.

In Adobe Experience Platform, click Notebooks in the menu on the left side of the screen, which takes you to JupyterLab.

In Jupyter Notebooks, open the Launcher page by clicking the + icon in the taskbar.

The Launcher page is then displayed.

Open a blank Recipe Builder notebook by clicking the Recipe Builder button on the Launcher.

You'll then have a new, blank Recipe Builder notebook. Before you continue, give your notebook a descriptive name. Right-click the file Python 3 [Recipe Builder].ipynb and click Rename.

Enter mutual365-insurance-sales-predive.ipynb as the name for your notebook and hit Enter. You'll then have your renamed, blank notebook.

In this notebook, you'll do the following:

  • Train a model
  • Score a model
  • Create a recipe from the model

Let's configure all of these steps in detail.

Configuration Files

Scroll down in the Recipe Builder notebook until you see Configuration Files.

You now need to update the cells for Training Configuration and Scoring Configuration.

Training Configuration

Click in the cell for Training Configuration.

Before you do anything, pay attention! Whatever you do, don't delete or overwrite the line that starts with %%writefile. This line is required by the Recipe Builder notebook.

%%writefile ~/my-workspace/.recipes/recipe-tFgHWnuH5/training.conf


You'll also see code similar to this in the same cell:

{
   "trainingDataSetId": "<replace with training dataset id>",
   "ACP_DSW_TRAINING_XDM_SCHEMA": "<replace with training xdm schema id>",
   "tenantId": "_<tenant_id>", 
   "learning_rate": "0.1",
   "n_estimators": "100",
   "max_depth": "3"
}

Replace that code with this code:

{
    "trainingDataSetId": "--aepCarInsuranceInteractionsDatasetId--",
    "ACP_DSW_TRAINING_XDM_SCHEMA": "https://ns.adobe.com/--aepTenantIdSchema--/schemas/--aepCarInsuranceInteractionsSchemaRef--",
    "train_records_limit":1000000,
    "n_estimators": "80",
    "max_depth": "5",
    "ten_id": "--aepTenantId--"
}
Important

The environment variables aepCarInsuranceInteractionsDatasetId and aepCarInsuranceInteractionsSchemaRef refer to the dataset ID of a dataset and the schema reference ID of a schema that were created in your Adobe Experience Platform instance.

aepCarInsuranceInteractionsDatasetId refers to the dataset ID of the dataset Demo System - Event Dataset for Website (Global v1.1), and aepCarInsuranceInteractionsSchemaRef refers to the schema reference ID of the schema Demo System - Event Schema for Website (Global v1.1). When you paste the code into the Training Configuration cell of your notebook, replace the variables with the dataset ID and the schema reference ID from your environment.

You should now have something similar to this in the Training Configuration cell.
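
To see where these values end up, here is an illustrative sketch of how the configuration keys are consumed by the cells later in this notebook; the lines mirror the loader and pipeline code shown below.

# Illustrative only: config_properties is the dict the runtime builds from
# the JSON in this cell; the cells below read it like this.
train_records_limit = config_properties['train_records_limit']           # record cap for the training read
n_estimators = int(config_properties['n_estimators'])                     # hyperparameters arrive as strings
max_depth = int(config_properties['max_depth'])
ecid_column = config_properties['ten_id'] + '.identification.core.ecid'  # the tenant id prefixes XDM fields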

Scoring Configuration

Click in the cell for Scoring Configuration.

Before you do anything, pay attention! Whatever you do, don't delete or overwrite the line that starts with %%writefile. This line is required by the Recipe Builder notebook.

%%writefile ~/my-workspace/.recipes/recipe-tFgHWnuH5/scoring.conf


You'll also see code similar to this in the same cell:

{
   "scoringDataSetId": "<replace with scoring input dataset id>",
   "scoringResultsDataSetId": "<replace with scoring results dataset id>",
   "ACP_DSW_SCORING_RESULTS_XDM_SCHEMA": "<replace with scoring results xdm schema id>",
   "tenantId": "_<tenant_id>"
}

Replace that code with this code:

{
   "scoringDataSetId": "--aepCarInsuranceInteractionsDatasetId--",
   "scoringResultsDataSetId": "--aepMlPredictionsDatasetId--",
   "ACP_DSW_SCORING_RESULTS_XDM_SCHEMA": "https://ns.adobe.com/--aepTenantIdSchema--/schemas/--aepMlPredictionsSchemaRef--",
   "ten_id": "--aepTenantId--"
}
Important

The environment variables aepCarInsuranceInteractionsDatasetId, aepMlPredictionsDatasetId, and aepMlPredictionsSchemaRef refer to the dataset IDs of datasets and the schema reference ID of a schema that were created in your Adobe Experience Platform instance.

aepCarInsuranceInteractionsDatasetId refers to the dataset ID of the dataset Demo System - Event Dataset for Website (Global v1.1), aepMlPredictionsDatasetId refers to the dataset ID of the dataset Demo System - Profile Dataset for ML Predictions (Global v1.1), and aepMlPredictionsSchemaRef refers to the schema reference ID of the schema Demo System - Profile Schema for ML Predictions (Global v1.1). When you paste the code into the Scoring Configuration cell of your notebook, replace the variables with the dataset IDs and the schema reference ID from your environment.

You should now have something similar to this in the Scoring Configuration cell.

Training Data Loader File

Scroll down in the Recipe Builder notebook until you see Training Data Loader File.

You now need to update the code for the Training Data Loader File.

Before you do anything, pay attention! Whatever you do, don't delete or overwrite the line that starts with %%writefile. This line is required by the Recipe Builder notebook.


In that cell, you'll find code similar to this:

import pandas as pd
from datetime import datetime, timedelta
from platform_sdk.dataset_reader import DatasetReader
from .utils import get_client_context

def load(config_properties):
    print("Training Data Load Start")

    #########################################
    # Load Data
    #########################################    
    client_context = get_client_context(config_properties)
    
    dataset_reader = DatasetReader(client_context, config_properties['trainingDataSetId'])
    
    timeframe = config_properties.get("timeframe")
    tenant_id = config_properties.get("tenant_id")
    
    if (timeframe is not None):
        date_before = datetime.utcnow().date()
        date_after = date_before - timedelta(minutes=int(timeframe))
        dataframe = dataset_reader.where(dataset_reader[tenant_id + '.date'].gt(str(date_after)).And(dataset_reader[tenant_id + '.date'].lt(str(date_before)))).read()
    else:
        dataframe = dataset_reader.read()

    if '_id' in dataframe.columns:
        #Rename columns to strip tenantId
        dataframe = dataframe.rename(columns = lambda x : str(x)[str(x).find('.')+1:])
        #Drop id and timestamp
        dataframe.drop(['_id', 'timestamp'], axis=1, inplace=True)
    
    #########################################
    # Data Preparation/Feature Engineering
    #########################################    
    dataframe.date = pd.to_datetime(dataframe.date)
    dataframe['week'] = dataframe.date.dt.week
    dataframe['year'] = dataframe.date.dt.year

    dataframe = pd.concat([dataframe, pd.get_dummies(dataframe['storeType'])], axis=1)
    dataframe.drop('storeType', axis=1, inplace=True)
    dataframe['isHoliday'] = dataframe['isHoliday'].astype(int)

    dataframe['weeklySalesAhead'] = dataframe.shift(-45)['weeklySales']
    dataframe['weeklySalesLag'] = dataframe.shift(45)['weeklySales']
    dataframe['weeklySalesDiff'] = (dataframe['weeklySales'] - dataframe['weeklySalesLag']) / dataframe['weeklySalesLag']
    dataframe.dropna(0, inplace=True)

    dataframe = dataframe.set_index(dataframe.date)
    dataframe.drop('date', axis=1, inplace=True)

    print("Training Data Load Finish")
    return dataframe

Replace that code with this code (without overwriting the %%writefile line):

import numpy as np
import pandas as pd
from datetime import datetime, timedelta
from platform_sdk.dataset_reader import DatasetReader
from .utils import get_client_context

def load(config_properties):
    print("Training Data Load Start")

    #########################################
    # Load Data
    #########################################    
    client_context = get_client_context(config_properties)
    train_records_limit = config_properties['train_records_limit']
    
    dataset_reader = DatasetReader(client_context, config_properties['trainingDataSetId'])
    
    #timeframe = config_properties.get("timeframe")
    #tenant_id = config_properties.get("tenant_id")
    print('Reading Training Data')
    df1 = dataset_reader.limit(train_records_limit).read()
    df1.rename(columns = {config_properties['ten_id']+'.identification.core.ecid' : 'ecid',
                     config_properties['ten_id']+'.interactionDetails.insurance.car.numberKm': 'km',
                     config_properties['ten_id']+'.interactionDetails.insurance.car.type': 'cartype',
                     config_properties['ten_id']+'.interactionDetails.insurance.car.customerAge': 'age',
                     config_properties['ten_id']+'.interactionDetails.insurance.car.customerGender' : 'gender',
                     config_properties['ten_id']+'.interactionDetails.insurance.car.brand' : 'carbrand',
                     config_properties['ten_id']+'.interactionDetails.insurance.car.leasing' : 'leasing',
                     config_properties['ten_id']+'.interactionDetails.insurance.car.customerCity' : 'city',
                     config_properties['ten_id']+'.interactionDetails.insurance.car.customerCountry' : 'country',
                     config_properties['ten_id']+'.interactionDetails.insurance.car.customerNationality' : 'nationality',
                     config_properties['ten_id']+'.interactionDetails.insurance.car.isCustomerPrimaryDriver' : 'primaryuser',
                     config_properties['ten_id']+'.interactionDetails.insurance.car.hasCustomerPurchased' : 'purchase',
                     config_properties['ten_id']+'.interactionDetails.insurance.car.priceBasic' : 'pricequote'}, inplace=True)
    df1 = df1[['ecid', 'km', 'cartype', 'age', 'gender', 'carbrand', 'leasing', 'city', 
           'country', 'nationality', 'primaryuser', 'purchase', 'pricequote', 'timestamp']]
    df1 = df1.loc[df1['age'] != 0]
    
    #########################################
    # Data Rollup
    ######################################### 
    df1['timestamp'] = pd.to_datetime(df1.timestamp)
    df1['hour'] = df1['timestamp'].dt.hour.astype(int)
    df1['dayofweek'] = df1['timestamp'].dt.dayofweek
    df1.loc[(df1['purchase'] == 'true'), 'purchase'] = 1
    df1.loc[(df1['purchase'] == 'false'), 'purchase'] = 0
    df1.loc[(df1['purchase'] == ''), 'purchase'] = 0
    df1.purchase.fillna(0, inplace=True)
    df1['purchase'] = df1['purchase'].astype(int)
    
    df1.dropna(subset = ['ecid'], inplace=True)
    df1['ecid'] = df1['ecid'].astype(str)
    df_conv_dict = df1.groupby('ecid').max()[['purchase']]
    df_conv_dict = df_conv_dict.to_dict()
    
    idx = df1.groupby('ecid')['timestamp'].transform(max) == df1['timestamp']
    df1 = df1[idx]
    df1 = df1.drop_duplicates(subset = 'ecid', keep = 'last', inplace = False)
    #df1['purchase'] = df1['ecid'].map(df_conv_dict['purchase'])

    #########################################
    # Data Preparation/Feature Engineering
    #########################################      
    
    df1['carbrand'] = df1['carbrand'].str.lower()
    df1['country'] = df1['country'].str.lower()
    df1.loc[(df1['carbrand'] == 'vw'), 'carbrand'] = 'volkswagen'
    df1.loc[(df1['carbrand'] == 'citroen'), 'carbrand'] = 'cadillac'
    df1.loc[(df1['carbrand'] == 'opel'), 'carbrand'] = 'bmw'
    df1.loc[(df1['carbrand'] == 'mini'), 'carbrand'] = 'volkswagen'
    
    df1.loc[(df1['cartype'] == 'SUV / Geländewagen') | (df1['cartype'] == 'SUV / Tout terrain'), 'cartype'] = 'suv'
    df1.loc[(df1['cartype'] == 'Kleinwagen') | (df1['cartype'] == 'Untere Mittelklasse') | (df1['cartype'] == 'Mikroklasse'), 'cartype'] = 'hatchback'
    df1.loc[(df1['cartype'] == 'Mittelklasse') | (df1['cartype'] == 'Obere Mittelklasse'), 'cartype'] = 'sedan'
    df1.loc[(df1['cartype'] == 'Kompaktvan / Minivan'), 'cartype'] = 'minivan'
    df1.loc[(df1['cartype'] == 'Cabriolet / Roadster'), 'cartype'] = 'convertible'
    df1.loc[(df1['cartype'] == 'Coupé / Sportwagen'), 'cartype'] = 'coupe'
    df1.loc[(df1['cartype'] == 'dataLayerNull'), 'cartype'] = pd.np.nan
    df1.loc[(df1['cartype'] == 'Luxusklasse'), 'cartype'] = 'luxury'
    df1.loc[(df1['cartype'] == 'Strasse'), 'cartype'] = 'mpv'
    
    df1.loc[(df1['leasing'] == 'false'), 'leasing'] = 'no'
    df1.loc[df1['country'] == 'at', 'country'] = 'austria'
    df1.loc[(df1['leasing'] == 'dataLayerNull'), 'leasing'] = pd.np.nan
    df1.loc[(df1['gender'] == 'dataLayerNull'), 'gender'] = pd.np.nan
    df1.loc[(df1['carbrand'] == 'dataLayerNull'), 'carbrand'] = pd.np.nan

    df1['age'].fillna(df1['age'].median(), inplace=True)
    df1['gender'].fillna('notgiven', inplace=True)
    df1['leasing'].fillna('notgiven', inplace=True)
    df1['carbrand'].fillna('bmw', inplace=True)
    df1['country'].fillna('germany', inplace=True)
    df1['cartype'].fillna('na', inplace=True)
    df1['primaryuser'].fillna('na', inplace=True)
    df1['nationality'].fillna('na', inplace=True)
    df1['km'].fillna('na', inplace=True)
    
    df1['city'] = df1.groupby('country')['city'].transform(lambda x : x.fillna(x.mode()))
    #df1.dropna(subset = ['pricequote'], inplace=True)
    
    #grouping
    grouping_cols = ['carbrand', 'cartype', 'city', 'country']
    
    for col in grouping_cols:
        df_idx = pd.DataFrame(df1[col].value_counts().head(6))

        def grouping(x):
            if x in df_idx.index:
                return x
            else:
                return "Others"
        df1[col] = df1[col].apply(lambda x: grouping(x))
        
    def age(x):
        if x < 20:
            return "u20"
        elif x > 19 and x < 29:
            return "20-28"
        elif x > 28 and x < 43:
            return "29-42"
        elif x > 42 and x < 55:
            return "43-54"
        elif x > 54 and x < 65:
            return "55-64"
        elif x >= 65: 
            return "65+"
        else: 
            return "Others"
        
    df1['age'] = df1['age'].astype(int)
    df1['age_bucket'] = df1['age'].apply(lambda x: age(x))
    
    df_final = df1[['hour', 'dayofweek','age_bucket', 'gender', 'city',  
       'country', 'carbrand', 'cartype', 'leasing', 'pricequote', 'purchase']]
    
    cat_cols = ['age_bucket', 'gender', 'city', 'dayofweek', 'country', 'carbrand', 'cartype', 'leasing']
    df_final = pd.get_dummies(df_final, columns = cat_cols)
    
    dataframe = df_final.copy()
    
    df_final.head(20)
 
    print("Training Data Load Finish")
    return dataframe

You should now have something similar to this in the Training Data Loader File cell.
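
To make the Data Rollup step above concrete, here is a small, self-contained example on toy data (not tutorial data): each visitor (ecid) keeps only their most recent interaction, while the purchase flag can be rolled up as the maximum across all of that visitor's interactions. Note that the Training Data Loader leaves the final map commented out, while the Scoring Data Loader applies it.

# Toy illustration of the rollup pattern used above.
import pandas as pd

df = pd.DataFrame({
    'ecid': ['A', 'A', 'B'],
    'timestamp': pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-01']),
    'purchase': [1, 0, 0],
})

# 1 if the visitor ever purchased, per ecid
conv = df.groupby('ecid').max()[['purchase']].to_dict()

# keep only each visitor's most recent interaction
latest = df[df.groupby('ecid')['timestamp'].transform(max) == df['timestamp']]
latest = latest.drop_duplicates(subset='ecid', keep='last').copy()

# map the rolled-up purchase flag back on (active in the scoring loader)
latest['purchase'] = latest['ecid'].map(conv['purchase'])
print(latest)  # visitor A keeps the 2021-01-02 row, with purchase == 1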

Scoring Data Loader File

Scroll down in the Recipe Builder notebook until you see Scoring Data Loader File.

You now need to update the code for the Scoring Data Loader File.

Before you do anything, pay attention! Whatever you do, don't delete or overwrite the line that starts with %%writefile. This line is required by the Recipe Builder notebook.


In that cell, you'll find code similar to this:

import pandas as pd
from datetime import datetime, timedelta
from .utils import get_client_context
from platform_sdk.dataset_reader import DatasetReader

def load(config_properties):

    print("Scoring Data Load Start")

    #########################################
    # Load Data
    #########################################
    client_context = get_client_context(config_properties)

    dataset_reader = DatasetReader(client_context, config_properties['scoringDataSetId'])
    timeframe = config_properties.get("timeframe")
    tenant_id = config_properties.get("tenant_id")

    if (timeframe is not None):
        date_before = datetime.utcnow().date()
        date_after = date_before - timedelta(minutes=int(timeframe))
        dataframe = dataset_reader.where(dataset_reader[tenant_id + '.date'].gt(str(date_after)).And(dataset_reader[tenant_id + '.date'].lt(str(date_before)))).read()
    else:
        dataframe = dataset_reader.read()
        print(dataframe)

    #########################################
    # Data Preparation/Feature Engineering
    #########################################
    if '_id' in dataframe.columns:
        #Rename columns to strip tenantId
        dataframe = dataframe.rename(columns = lambda x : str(x)[str(x).find('.')+1:])
        #Drop id and timestamp
        dataframe.drop(['_id', 'timestamp'], axis=1, inplace=True)

    dataframe.date = pd.to_datetime(dataframe.date)
    dataframe['week'] = dataframe.date.dt.week
    dataframe['year'] = dataframe.date.dt.year

    dataframe = pd.concat([dataframe, pd.get_dummies(dataframe['storeType'])], axis=1)
    dataframe.drop('storeType', axis=1, inplace=True)
    dataframe['isHoliday'] = dataframe['isHoliday'].astype(int)

    dataframe['weeklySalesAhead'] = dataframe.shift(-45)['weeklySales']
    dataframe['weeklySalesLag'] = dataframe.shift(45)['weeklySales']
    dataframe['weeklySalesDiff'] = (dataframe['weeklySales'] - dataframe['weeklySalesLag']) / dataframe['weeklySalesLag']
    dataframe.dropna(0, inplace=True)

    dataframe = dataframe.set_index(dataframe.date)
    dataframe.drop('date', axis=1, inplace=True)

    print("Scoring Data Load Finish")

    return dataframe

Replace that code with this code (without overwriting the %%writefile line):

import pandas as pd
from datetime import datetime, timedelta
from .utils import get_client_context
from platform_sdk.dataset_reader import DatasetReader

def load(config_properties):

    print("Scoring Data Load Start")

    #########################################
    # Load Data
    #########################################    
    client_context = get_client_context(config_properties)
    
    dataset_reader = DatasetReader(client_context, config_properties['scoringDataSetId'])
    
    #timeframe = config_properties.get("timeframe")
    #tenant_id = config_properties.get("tenant_id")
    df1 = dataset_reader.read()
    df1.rename(columns = {config_properties['ten_id']+'.identification.core.ecid' : 'ecid',
                     config_properties['ten_id']+'.interactionDetails.insurance.car.numberKm': 'km',
                     config_properties['ten_id']+'.interactionDetails.insurance.car.type': 'cartype',
                     config_properties['ten_id']+'.interactionDetails.insurance.car.customerAge': 'age',
                     config_properties['ten_id']+'.interactionDetails.insurance.car.customerGender' : 'gender',
                     config_properties['ten_id']+'.interactionDetails.insurance.car.brand' : 'carbrand',
                     config_properties['ten_id']+'.interactionDetails.insurance.car.leasing' : 'leasing',
                     config_properties['ten_id']+'.interactionDetails.insurance.car.customerCity' : 'city',
                     config_properties['ten_id']+'.interactionDetails.insurance.car.customerCountry' : 'country',
                     config_properties['ten_id']+'.interactionDetails.insurance.car.customerNationality' : 'nationality',
                     config_properties['ten_id']+'.interactionDetails.insurance.car.isCustomerPrimaryDriver' : 'primaryuser',
                     config_properties['ten_id']+'.interactionDetails.insurance.car.hasCustomerPurchased' : 'purchase',
                     config_properties['ten_id']+'.interactionDetails.insurance.car.priceBasic' : 'pricequote'}, inplace=True)
    df1 = df1[['ecid', 'km', 'cartype', 'age', 'gender', 'carbrand', 'leasing', 'city', 
           'country', 'nationality', 'primaryuser', 'purchase', 'pricequote', 'timestamp']]
    df1 = df1.loc[df1['age'] != 0]
    
    #########################################
    # Data Rollup
    ######################################### 
    df1['timestamp'] = pd.to_datetime(df1.timestamp)
    df1['hour'] = df1['timestamp'].dt.hour.astype(int)
    df1['dayofweek'] = df1['timestamp'].dt.dayofweek
    
    df1.loc[(df1['purchase'] == 'true'), 'purchase'] = 1
    df1.loc[(df1['purchase'] == 'false'), 'purchase'] = 0
    df1.loc[(df1['purchase'] == ''), 'purchase'] = 0
    df1.purchase.fillna(0, inplace=True)
    df1['purchase'] = df1['purchase'].astype(int)
    
    df1.dropna(subset = ['ecid'], inplace=True)
    df1['ecid'] = df1['ecid'].astype(str)
    df_conv_dict = df1.groupby('ecid').max()[['purchase']]
    df_conv_dict = df_conv_dict.to_dict()
    
    idx = df1.groupby('ecid')['timestamp'].transform(max) == df1['timestamp']
    df1 = df1[idx]
    df1 = df1.drop_duplicates(subset = 'ecid', keep = 'last', inplace = False)
    df1['purchase'] = df1['ecid'].map(df_conv_dict['purchase'])
    
    #########################################
    # Data Preparation/Feature Engineering
    #########################################      
    
    df1['carbrand'] = df1['carbrand'].str.lower()
    df1['country'] = df1['country'].str.lower()
    df1.loc[(df1['carbrand'] == 'vw'), 'carbrand'] = 'volkswagen'
    df1.loc[(df1['carbrand'] == 'citroen'), 'carbrand'] = 'cadillac'
    df1.loc[(df1['carbrand'] == 'opel'), 'carbrand'] = 'bmw'
    df1.loc[(df1['carbrand'] == 'mini'), 'carbrand'] = 'volkswagen'
    
    df1.loc[(df1['cartype'] == 'SUV / Geländewagen') | (df1['cartype'] == 'SUV / Tout terrain'), 'cartype'] = 'suv'
    df1.loc[(df1['cartype'] == 'Kleinwagen') | (df1['cartype'] == 'Untere Mittelklasse') | (df1['cartype'] == 'Mikroklasse'), 'cartype'] = 'hatchback'
    df1.loc[(df1['cartype'] == 'Mittelklasse') | (df1['cartype'] == 'Obere Mittelklasse'), 'cartype'] = 'sedan'
    df1.loc[(df1['cartype'] == 'Kompaktvan / Minivan'), 'cartype'] = 'minivan'
    df1.loc[(df1['cartype'] == 'Cabriolet / Roadster'), 'cartype'] = 'convertible'
    df1.loc[(df1['cartype'] == 'Coupé / Sportwagen'), 'cartype'] = 'coupe'
    df1.loc[(df1['cartype'] == 'dataLayerNull'), 'cartype'] = pd.np.nan
    df1.loc[(df1['cartype'] == 'Luxusklasse'), 'cartype'] = 'luxury'
    df1.loc[(df1['cartype'] == 'Strasse'), 'cartype'] = 'mpv'
    
    df1.loc[(df1['leasing'] == 'false'), 'leasing'] = 'no'
    df1.loc[df1['country'] == 'at', 'country'] = 'austria'
    df1.loc[(df1['leasing'] == 'dataLayerNull'), 'leasing'] = pd.np.nan
    df1.loc[(df1['gender'] == 'dataLayerNull'), 'gender'] = pd.np.nan
    df1.loc[(df1['carbrand'] == 'dataLayerNull'), 'carbrand'] = pd.np.nan

    df1['age'].fillna(df1['age'].median(), inplace=True)
    df1['gender'].fillna('notgiven', inplace=True)
    df1['leasing'].fillna('notgiven', inplace=True)
    df1['carbrand'].fillna('bmw', inplace=True)
    df1['country'].fillna('germany', inplace=True)
    df1['cartype'].fillna('na', inplace=True)
    df1['primaryuser'].fillna('na', inplace=True)
    df1['nationality'].fillna('na', inplace=True)
    df1['km'].fillna('na', inplace=True)
    
    df1['city'] = df1.groupby('country')['city'].transform(lambda x : x.fillna(x.mode()))
    df1.dropna(subset = ['pricequote'], inplace=True)
    
    #grouping
    grouping_cols = ['carbrand', 'cartype', 'city', 'country']
    
    for col in grouping_cols:
        df_idx = pd.DataFrame(df1[col].value_counts().head(6))

        def grouping(x):
            if x in df_idx.index:
                return x
            else:
                return "Others"
        df1[col] = df1[col].apply(lambda x: grouping(x))
        
    def age(x):
        if x < 20:
            return "u20"
        elif x > 19 and x < 29:
            return "20-28"
        elif x > 28 and x < 43:
            return "29-42"
        elif x > 42 and x < 55:
            return "43-54"
        elif x > 54 and x < 65:
            return "55-64"
        elif x >= 65: 
            return "65+"
        else: 
            return "Others"
        
    df1['age'] = df1['age'].astype(int)
    df1['age_bucket'] = df1['age'].apply(lambda x: age(x))
    
    df_final = df1[['ecid', 'hour', 'dayofweek','age_bucket', 'gender', 'city',  
       'country', 'carbrand', 'cartype', 'leasing', 'pricequote']]
    
    #cat_cols = ['age_bucket', 'gender', 'city', 'dayofweek', 'country', 'carbrand', 'cartype', 'leasing']
    #df_final = pd.get_dummies(df_final, columns = cat_cols)
    
    dataframe = df_final.copy()

    print("Scoring Data Load Finish")

    return dataframe

You should now have something similar to this in the Scoring Data Loader File cell.
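
Note one deliberate difference from the training loader: the scoring loader keeps the raw ecid column and leaves the categorical columns un-encoded (its get_dummies call is commented out). The pipeline's score function, shown below, performs the one-hot encoding and strips ecid before predicting. Schematically (comments only, names from the cells in this exercise):

# Hand-off between the Scoring Data Loader and the Pipeline's score():
#   data = load(config_properties)                  # raw categoricals + 'ecid'
#   data = pd.get_dummies(data, columns=cat_cols)   # encoding happens in score()
#   X_score = data.drop('ecid', axis=1)             # ecid kept only to label the output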

Pipeline File

Scroll down in the Recipe Builder notebook until you see Pipeline File.

You now need to update the code for the Pipeline File.

Before you do anything, pay attention! Whatever you do, don't delete or overwrite the line that starts with %%writefile. This line is required by the Recipe Builder notebook.


In that cell, you'll find code similar to this:

from sklearn.ensemble import GradientBoostingRegressor

def train(config_properties, data):

    print("Train Start")

    #########################################
    # Extract fields from configProperties
    #########################################
    learning_rate = float(config_properties['learning_rate'])
    n_estimators = int(config_properties['n_estimators'])
    max_depth = int(config_properties['max_depth'])


    #########################################
    # Fit model
    #########################################
    X_train = data.drop('weeklySalesAhead', axis=1).values
    y_train = data['weeklySalesAhead'].values

    seed = 1234
    model = GradientBoostingRegressor(learning_rate=learning_rate,
                                      n_estimators=n_estimators,
                                      max_depth=max_depth,
                                      random_state=seed)

    model.fit(X_train, y_train)
    print("Train Complete")
    return model

def score(config_properties, data, model):

    print("Score Start")

    X_test = data.drop('weeklySalesAhead', axis=1).values
    y_test = data['weeklySalesAhead'].values
    y_pred = model.predict(X_test)

    data['prediction'] = y_pred
    data = data[['store', 'prediction']].reset_index()
    data['date'] = data['date'].astype(str)

    print("Score Complete")
    return data

Replace that code with this code (without overwriting the %%writefile line):

import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn.metrics import roc_curve, auc, mean_squared_error
from sklearn.ensemble import RandomForestClassifier
import os
import subprocess  # used by generate_onnx_resources below
import tarfile     # used by make_tarfile below

def underSample(data):
    conv = data[data['purchase'] > 0]
    non_conv = data[data['purchase'] == 0]
    sample_size = len(conv)
    non_sample = non_conv.sample(n = sample_size)
    frames = [conv, non_sample]
    result = pd.concat(frames)
    return result

class RandomForest():
    def __init__(self, config_properties, X, y):
        print("initiating model")
        self.n_estimators = int(config_properties['n_estimators'])
        self.max_depth = int(config_properties['max_depth'])
        self.X = X
        self.y = y
        self.features = X.columns.values.tolist()
        self.feature_len = len(self.features)  # referenced by generate_onnx_resources
        self.model = RandomForestClassifier(n_estimators=self.n_estimators, max_depth=self.max_depth, random_state=32)
        self.trained_model = self.model.fit(self.X,self.y)
    def train_model(self):
        print('fitting model')
        return self.model.fit(self.X,self.y)
    
    def make_tarfile(self, output_filename, source_dir):
        with tarfile.open(output_filename, "w:gz") as tar:
            tar.add(source_dir, arcname=os.path.basename(source_dir))
    
    #for generating onnx
    def generate_onnx_resources(self):        
        install_dir = os.path.expanduser('~/my-workspace')
        print("Generating Onnx")
        try:
            subprocess.check_call(["conda", "uninstall", "-y", "protobuf"])
        except:
            print("protobuf not installed via conda")
        
        subprocess.check_call(["python", '-m', 'pip', 'install', 'skl2onnx'])
        
        from skl2onnx import convert_sklearn
        from skl2onnx.common.data_types import FloatTensorType
        
        # ONNX-ification
        initial_type = [('float_input', FloatTensorType([None, self.feature_len]))]

        print("Converting Model to Onnx")
        onx = convert_sklearn(self.model, initial_types=initial_type)
             
        with open("model_new.onnx", "wb") as f:
            f.write(onx.SerializeToString())
            
        self.make_tarfile('model_new.tar.gz', 'model.onnx')
        print("Model onnx created")

def train(config_properties, data):

    print("Train Start")  
    y_train = data['purchase']
    X_train = data.drop('purchase', axis=1)
    # Fit model
    lead_score = RandomForest(config_properties, X_train, y_train)
    lead_score.train_model()

    print("Train Complete")
    
    return lead_score

def score(config_properties, data, model):

    print("Score Start")
    cat_cols = ['age_bucket', 'gender', 'city', 'dayofweek', 'country', 'carbrand', 'cartype', 'leasing']
    data = pd.get_dummies(data, columns = cat_cols)
    
    X_score = data.drop('ecid', axis=1)
    train_feats = model.features
    print(train_feats)
    
    trained_model = model.trained_model
    score_feats = X_score.columns.values.tolist()
    missing_feats = list(set(train_feats) - set(score_feats))
    extra_feats = list(set(score_feats) - set(train_feats))
    for c in extra_feats:
        X_score.drop(c, axis=1, inplace=True)
    for c in missing_feats:
        X_score[c] = 0
    X_score = X_score[train_feats]
    print(X_score.columns.values.tolist())
    y_preds = trained_model.predict_proba(X_score)[:,1]
    
    data.rename(columns = {'ecid' : config_properties['ten_id']+'.identification.core.ecid'}, inplace=True)
    
    data[config_properties['ten_id']+'.individualScoring.insurance.carInsuranceSalesPrediction'] = y_preds
    data = data[[config_properties['ten_id']+'.identification.core.ecid', config_properties['ten_id']+'.individualScoring.insurance.carInsuranceSalesPrediction']]

    print("Score Complete")
    return data

You should now have something similar to this in the Pipeline File cell.
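
The score function above guards against a train/score mismatch: one-hot encoding the scoring data can produce dummy columns the model never saw during training, or miss columns it expects. Here is a small, self-contained illustration of that alignment logic on toy column names (not tutorial data):

# Toy illustration of the dummy-column alignment performed in score().
import pandas as pd

train_feats = ['gender_m', 'gender_f']                             # columns the model was trained on
X_score = pd.DataFrame({'gender_m': [1, 0], 'gender_x': [0, 1]})   # scoring data after get_dummies

score_feats = X_score.columns.values.tolist()
missing_feats = list(set(train_feats) - set(score_feats))  # ['gender_f']
extra_feats = list(set(score_feats) - set(train_feats))    # ['gender_x']
for c in extra_feats:
    X_score.drop(c, axis=1, inplace=True)                  # drop columns the model never saw
for c in missing_feats:
    X_score[c] = 0                                         # add expected-but-absent columns as zero
X_score = X_score[train_feats]                             # enforce the training column order
print(X_score)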

Evaluator File

Scroll down in the Recipe Builder notebook until you see Evaluator File.

You now need to update the code for the Evaluator File.

Before you do anything, pay attention! Whatever you do, don't delete or overwrite the line that starts with %%writefile. This line is required by the Recipe Builder notebook.


In that cell, you'll find code similar to this:

from ml.runtime.python.core.regressionEvaluator import RegressionEvaluator
import numpy as np

class Evaluator(RegressionEvaluator):
    def __init__(self):
        print ("Initiate")

    def split(self, config={}, dataframe=None):
        train_start = '2010-02-12'
        train_end = '2012-01-27'
        val_start = '2012-02-03'
        train = dataframe[train_start:train_end]
        val = dataframe[val_start:]

        return train, val

    def evaluate(self, data=[], model={}, config={}):
        print ("Evaluation evaluate triggered")
        val = data.drop('weeklySalesAhead', axis=1)
        y_pred = model.predict(val)
        y_actual = data['weeklySalesAhead'].values
        mape = np.mean(np.abs((y_actual - y_pred) / y_actual))
        mae = np.mean(np.abs(y_actual - y_pred))
        rmse = np.sqrt(np.mean((y_actual - y_pred) ** 2))

        metric = [{"name": "MAPE", "value": mape, "valueType": "double"},
                  {"name": "MAE", "value": mae, "valueType": "double"},
                  {"name": "RMSE", "value": rmse, "valueType": "double"}]
        
        print(metric)
        return metric

Replace that code with this code (without overwriting the %%writefile line):

from ml.runtime.python.core.regressionEvaluator import RegressionEvaluator
import numpy as np
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

class Evaluator(RegressionEvaluator):
    def __init__(self):
        print ("Initiate")

    def split(self, config={}, dataframe=None):
        
        train, val = train_test_split(dataframe, test_size = 0.2, random_state = 32)

        return train, val

    def evaluate(self, data=[], model={}, config={}):
        model = model.trained_model
        print ("Evaluation evaluate triggered")
        val = data.drop('purchase', axis=1)
        y_pred = model.predict(val)
        y_actual = data['purchase']
        
        y_pred_proba = model.predict_proba(val)[:,1]
       
        accuracy = metrics.accuracy_score(y_actual, y_pred, normalize=True, sample_weight=None)
        recall = metrics.recall_score(y_actual, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)
        precision = metrics.precision_score(y_actual, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)
        
        val_fpr, val_tpr, _ = roc_curve(y_actual, y_pred_proba)
        roc_auc = auc(val_fpr, val_tpr)

        metric = [{"name": "Accuracy", "value": accuracy, "valueType": "double"},
                  {"name": "Recall", "value": recall, "valueType": "double"},
                  {"name": "Precision", "value": precision, "valueType": "double"}]
        
        print(metric)
        return metric

You should now have something similar to this in the Evaluator File cell.
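
As a quick sanity check on what the evaluator reports, here is a tiny standalone example using the same sklearn calls on made-up labels and scores:

# Made-up labels and scores, evaluated with the same sklearn calls as above.
from sklearn import metrics
from sklearn.metrics import roc_curve, auc

y_actual = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]                # hard class predictions
y_pred_proba = [0.1, 0.6, 0.7, 0.9]  # probability of the positive class

accuracy = metrics.accuracy_score(y_actual, y_pred)    # 0.75 (3 of 4 correct)
recall = metrics.recall_score(y_actual, y_pred)        # 1.0 (both positives found)
precision = metrics.precision_score(y_actual, y_pred)  # ~0.67 (one false positive)
fpr, tpr, _ = roc_curve(y_actual, y_pred_proba)
print(accuracy, recall, precision, auc(fpr, tpr))      # AUC is 1.0 here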

Data Saver File

Scroll down in the Recipe Builder notebook until you see Data Saver File.

You now need to update the code for the Data Saver File.

Before you do anything, pay attention! Whatever you do, don't delete or overwrite the line that starts with %%writefile. This line is required by the Recipe Builder notebook.


In that cell, you'll find code similar to this:

import pandas as pd
from .utils import get_client_context
from platform_sdk.models import Dataset
from platform_sdk.dataset_writer import DatasetWriter

def save(config_properties, prediction):
  print("Datasaver Start")

  client_context = get_client_context(config_properties)
  tenant_id = config_properties.get("tenantId")
  prediction = prediction.add_prefix(tenant_id+".")

  prediction = prediction.join(pd.DataFrame(
      {
          '_id': "",
          'timestamp': '2019-01-01T00:00:00',
          'eventType': ""
      }, index=prediction.index))

  dataset = Dataset(client_context).get_by_id(config_properties['scoringResultsDataSetId'])
  dataset_writer = DatasetWriter(client_context, dataset)
  dataset_writer.write(prediction, file_format='json')

  print("Datasaver Finish")
  print(prediction)

Replace that code with this code (without overwriting the %%writefile line):

from .utils import get_client_context
from platform_sdk.models import Dataset
from platform_sdk.dataset_writer import DatasetWriter

def save(config_properties, prediction):
  print("Datasaver Start")

  client_context = get_client_context(config_properties)
  dataset = Dataset(client_context).get_by_id(config_properties['scoringResultsDataSetId'])
  dataset_writer = DatasetWriter(client_context, dataset)
  dataset_writer.write(prediction, file_format='json')

  print("Datasaver Finish")
  print(prediction)

You should now have something similar to this in the Data Saver File cell.
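
For reference, the prediction DataFrame that save receives here is the one returned by score in the Pipeline File: one row per visitor, with tenant-prefixed XDM column names. A toy illustration with fabricated ECIDs and scores (the tenant id is a stand-in):

# Shape of the DataFrame written by the Data Saver (fabricated values).
import pandas as pd

ten_id = '_yourtenantid'  # stand-in for config_properties['ten_id']
prediction = pd.DataFrame({
    ten_id + '.identification.core.ecid': ['1234', '5678'],
    ten_id + '.individualScoring.insurance.carInsuranceSalesPrediction': [0.82, 0.13],
})
# save(config_properties, prediction) writes these rows as JSON into the
# dataset identified by scoringResultsDataSetId.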

You have now configured all the code you need to execute the notebook.

Train your Model

Train your model by clicking the Train button.

Click Train in the top toolbar to create a training run in the notebook. This executes the Training Data Loader, Pipeline, and Evaluator cells and generates evaluation metrics to gauge the performance of your model. A log of commands and output from the training script appears in the notebook, beneath the Evaluator cell.

After you click Train, the training run starts; it takes a couple of minutes to complete.

When you click Train, the following cells are executed:

  • Requirements File
  • Configuration Files - Training
  • Training Data Loader File
  • Pipeline File
  • Evaluator File

Score your Model

Score your model by clicking the Score button.

Click Score in the top toolbar to create a scoring run in the notebook. This executes the Scoring Data Loader, Pipeline, and Evaluator cells and generates evaluation metrics to gauge the performance of your model. A log of commands and output from the scoring script appears in the notebook, beneath the pipeline.py cell.

After you click Score, the scoring run starts; it takes a couple of minutes to complete.

When you click Score, the following cells are executed:

  • Requirements File
  • Configuration Files - Scoring
  • Scoring Data Loader File
  • Pipeline File
  • Evaluator File

Also, at the end of the scoring run, the output with propensity scores is stored in the dataset Demo System - Profile Dataset for ML Predictions (Global v1.1) in Adobe Experience Platform.

You can verify this by opening that dataset in Adobe Experience Platform.

Create a Recipe from your Model

Create a recipe by clicking the Create Recipe button.

When you're satisfied with the output of training and scoring, you can create a recipe. Click the Create Recipe button to start that process.

Creating a recipe gives you the ability to test your model at scale.

After clicking the Create Recipe button, you have to enter a name for your recipe.

As a naming convention, please use:

  • ldapCarInsurancePropensity

Replace ldap with your own ldap.

Example: for the ldap vangeluw, the name of the recipe should be vangeluwCarInsurancePropensity.

After entering your recipe name, click Ok.

A second popup is then displayed, telling you that your recipe is being created. This may take up to 5 minutes; please wait for the process to finish.

You can view the progress of the recipe creation process in the top right corner of Jupyter Notebooks.

Don't do anything else; you need to keep this browser window open on the notebook while the recipe creation process continues.

After a couple of minutes, the recipe creation finishes.

Click View Recipes in the popup.

You'll then see all the available recipes, which now include your recipe as well.

Now that you've created your recipe, let's continue with the next exercise, in which you'll start scalable training and experimentation.

Next step: 15.4 Train and Score your Recipe

Return to Module 15

Return to all modules
