Tmall Repurchase Prediction (Tianchi)

Problem Description:

Alibaba Cloud Tianchi, Tmall Repeat Purchase Prediction Challenge Baseline: https://tianchi.aliyun.com/competition/entrance/231576/rankingList
(The official evaluation metric is roc_auc_score; the official Tianchi baseline solution scores 0.704954.)

1. Task Description:

The Double 11 (November 11) sale usually brings merchants many new customers, but only a small fraction become loyal ones. Accurately identifying which shoppers are likely to buy from a merchant again lets the merchant target its marketing, cutting costs and raising its return on investment.

This report uses a public dataset from Tianchi (the Tmall repurchase data) and aims to predict, from shoppers' purchase records in the six months before and on Double 11, the probability that they buy again from a given merchant.

The report proceeds in stages: clean the data, build features from it, train models on those features, and finally predict with the best-performing model.

2. Background:

Merchants sometimes run large promotions or hand out coupons on particular dates, such as Boxing Day, Black Friday, or Double 11 (November 11), to attract shoppers. Many of the buyers drawn in, however, are one-time shoppers, so such promotions may do little for long-term sales growth. To address this, merchants need to identify which shoppers can be converted into repeat buyers. By targeting these potentially loyal customers, merchants can greatly reduce promotion costs and improve their return on investment (ROI). Precisely targeting customers in online advertising is notoriously hard, especially for new shoppers, but Tmall's long-accumulated user behavior logs may let us solve this problem.

We are given information about a set of merchants, together with the new shoppers who bought from them during Double 11. The task is to predict which of these new shoppers will become loyal customers of a given merchant, i.e. the probability that each new shopper purchases again within six months.

3. Data Description:

The dataset contains anonymized users' shopping records from the six months before and on Double 11, labelled by whether the user is a repeat buyer. For privacy, the data is sampled with some bias, so its statistics deviate somewhat from Tmall's actual situation, but this does not affect the applicability of the solution. The training and test data are in data_format2.zip; the fields are described below.
Field|Description
:-|:-
user_id|Unique ID of the shopper
age_range|User age range (coded; see Section 1.4 for the code meanings)
gender|User gender: 0 = female, 1 = male, 2 and NULL = unknown
merchant_id|Unique ID of the merchant
label|Takes values in {0, 1, -1, NULL}. 1 means user_id is a repeat buyer of merchant_id, 0 that they are not. -1 means user_id is not a new customer of the given merchant and thus outside the prediction scope, though such records may still carry useful information. In the test set this field is NULL and must be predicted.
activity_log|Each interaction between {user_id, merchant_id} records item_id, category_id, brand_id, and time, separated by #; records are in no particular order (see the parsing sketch below).
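Since data_format2 packs a user-merchant pair's whole history into a single string column, a small parser can make it usable. The following is a hypothetical sketch: it assumes '#' separates records and ':' separates the four fields within a record, since the intra-record delimiter is not documented above.

# Hypothetical sketch for parsing one activity_log cell from data_format2.
# Assumptions: '#' separates records; ':' separates the four fields
# (item_id, category_id, brand_id, time) within a record.
def parse_activity_log(log):
    records = []
    for rec in log.split("#"):
        item_id, category_id, brand_id, time = rec.split(":")
        records.append({"item_id": item_id, "category_id": category_id,
                        "brand_id": brand_id, "time": time})  # time is mmdd
    return records

# e.g. parse_activity_log("123:456:789:1111#321:456:789:1110")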

The same dataset is also provided in a second format that may be more convenient for feature engineering; see data_format1.zip (four files), described below.

  • User behavior log (user_log_format1.csv)

Field|Description
:-|:-
user_id|Unique ID of the shopper
item_id|Unique ID of the item
cat_id|Unique ID of the item's category
merchant_id|Unique ID of the merchant
brand_id|Unique ID of the item's brand
time_stamp|Time of the action (format: mmdd)
action_type|Takes values in {0, 1, 2, 3}: 0 = click, 1 = add-to-cart, 2 = purchase, 3 = add-to-favorites

  • User profile (user_info_format1.csv)

Field|Description
:-|:-
user_id|Unique ID of the shopper
age_range|User age range (coded; see Section 1.4)
gender|User gender: 0 = female, 1 = male, 2 and NULL = unknown

  • Training and test sets (train_format1.csv / test_format1.csv)

Field|Description
:-|:-
user_id|Unique ID of the shopper
merchant_id|Unique ID of the merchant
label|Takes values in {0, 1}: 1 = repeat buyer, 0 = not. Empty in the test set, where it must be predicted.
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# display settings for pandas and matplotlib
pd.set_option('display.max_columns', 30)
plt.rcParams.update({"font.family": "SimHei", "font.size": 14})
plt.style.use("tableau-colorblind10")


# # Chinese font support in matplotlib (alternative to the rcParams update above)
# plt.rcParams['font.sans-serif'] = ['SimHei']  # render CJK labels correctly
# plt.rcParams['axes.unicode_minus'] = False    # render the minus sign correctly
%matplotlib inline

1. Data Cleaning

Cleaning steps:

  • data type checks
  • memory compression
  • missing-value handling
# load data
# data_user_log = pd.read_csv("/home/mw/input/tmall_repurch6487/data/user_log_format1.csv") # enable on the first import of the data
data_user_info = pd.read_csv("./data/user_info_format1.csv")
data_train = pd.read_csv("./data/train_format1.csv")
data_test = pd.read_csv("./data/test_format1.csv")

Memory compression: adjust the dtypes, downcasting the default int64 to the smallest sufficient type (e.g. int32, int16, int8) to reduce memory usage.
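To verify the effect, the per-column memory of a DataFrame can be summed; a one-line sketch:

# sketch: report a frame's in-memory size (deep=True also counts object columns)
print(data_user_log.memory_usage(deep=True).sum() / 1024**2, "MB")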

# on re-import, specify dtypes up front to reduce memory
d_types = {'user_id': 'int32', 'item_id': 'int32', 'cat_id': 'int16', 'seller_id': 'int16', 'brand_id': 'float32', 'time_stamp': 'int16', 'action_type': 'int8'}
data_user_log = pd.read_csv("./data/user_log_format1.csv", dtype=d_types)
# check tables
display(data_user_log.head(1))
display(data_user_info.head(1))
display(data_train.head(1))
display(data_test.head(1))

user_id item_id cat_id seller_id brand_id time_stamp action_type
0 328862 323294 833 2882 2661.0 829 0

user_id age_range gender
0 376517 6.0 1.0

user_id merchant_id label
0 34176 3906 0

user_id merchant_id prob
0 163968 4605 NaN

1.1 Data Type Checks

# check table info
display(data_user_log.info())
display(data_user_info.info())
display(data_train.info())
display(data_test.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54925330 entries, 0 to 54925329
Data columns (total 7 columns):
 #   Column       Dtype  
---  ------       -----  
 0   user_id      int32  
 1   item_id      int32  
 2   cat_id       int16  
 3   seller_id    int16  
 4   brand_id     float32
 5   time_stamp   int16  
 6   action_type  int8   
dtypes: float32(1), int16(3), int32(2), int8(1)
memory usage: 995.2 MB



None


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 424170 entries, 0 to 424169
Data columns (total 3 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   user_id    424170 non-null  int64  
 1   age_range  421953 non-null  float64
 2   gender     417734 non-null  float64
dtypes: float64(2), int64(1)
memory usage: 9.7 MB



None


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 260864 entries, 0 to 260863
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   user_id      260864 non-null  int64
 1   merchant_id  260864 non-null  int64
 2   label        260864 non-null  int64
dtypes: int64(3)
memory usage: 6.0 MB



None


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 261477 entries, 0 to 261476
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   user_id      261477 non-null  int64  
 1   merchant_id  261477 non-null  int64  
 2   prob         0 non-null       float64
dtypes: float64(1), int64(2)
memory usage: 6.0 MB



None

The training and test sets each contain roughly 260k rows.

1.2 Memory Compression

# concatenate train and test so features only need to be built once
data_train["origin"] = "train"
data_test["origin"] = "test"
data = pd.concat([data_train, data_test], sort=False)
data = data.drop(["prob"], axis=1)
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 522341 entries, 0 to 261476
Data columns (total 4 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   user_id      522341 non-null  int64  
 1   merchant_id  522341 non-null  int64  
 2   label        260864 non-null  float64
 3   origin       522341 non-null  object 
dtypes: float64(1), int64(2), object(1)
memory usage: 19.9+ MB
# all remaining columns are numeric, so downcast them directly
# first pass (initial import): compress every DataFrame, including the user log
# df_list = [data, data_user_log, data_user_info]

# on re-import, data_user_log was already read with compressed dtypes, so skip it
df_list = [data, data_user_info]

for df in df_list:
    fcols = df.select_dtypes('float').columns
    icols = df.select_dtypes('integer').columns
    df[fcols] = df[fcols].apply(pd.to_numeric, downcast='float')
    df[icols] = df[icols].apply(pd.to_numeric, downcast='integer')

# check table info again
display(data_user_log.info())
display(data_user_info.info())
display(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54925330 entries, 0 to 54925329
Data columns (total 7 columns):
 #   Column       Dtype  
---  ------       -----  
 0   user_id      int32  
 1   item_id      int32  
 2   cat_id       int16  
 3   seller_id    int16  
 4   brand_id     float32
 5   time_stamp   int16  
 6   action_type  int8   
dtypes: float32(1), int16(3), int32(2), int8(1)
memory usage: 995.2 MB



None


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 424170 entries, 0 to 424169
Data columns (total 3 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   user_id    424170 non-null  int32  
 1   age_range  421953 non-null  float32
 2   gender     417734 non-null  float32
dtypes: float32(2), int32(1)
memory usage: 4.9 MB



None


<class 'pandas.core.frame.DataFrame'>
Int64Index: 522341 entries, 0 to 261476
Data columns (total 4 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   user_id      522341 non-null  int32  
 1   merchant_id  522341 non-null  int16  
 2   label        260864 non-null  float32
 3   origin       522341 non-null  object 
dtypes: float32(1), int16(1), int32(1), object(1)
memory usage: 13.0+ MB



None
# record the compressed dtypes for use on re-import
d_col = data_user_log.dtypes.index
d_type = [i.name for i in data_user_log.dtypes.values]
column_dict = dict(zip(d_col,d_type))
print(column_dict)
{'user_id': 'int32', 'item_id': 'int32', 'cat_id': 'int16', 'seller_id': 'int16', 'brand_id': 'float32', 'time_stamp': 'int16', 'action_type': 'int8'}
# unify column names across tables
data_user_log.rename(columns={"seller_id": "merchant_id"}, inplace=True)

1.3 Missing-Value Handling

# age and gender contain nulls; fill them consistently with their coding
data_user_info["age_range"].fillna(0, inplace=True)  # 0 and NULL both mean unknown
data_user_info["gender"].fillna(2, inplace=True)  # 2 and NULL both mean unknown, so fill with 2 rather than 0 (female)

data_user_info.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 424170 entries, 0 to 424169
Data columns (total 3 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   user_id    424170 non-null  int32  
 1   age_range  424170 non-null  float32
 2   gender     424170 non-null  float32
dtypes: float32(2), int32(1)
memory usage: 4.9 MB
# check user_log for nulls
data_user_log.isna().sum()
user_id            0
item_id            0
cat_id             0
merchant_id        0
brand_id       91015
time_stamp         0
action_type        0
dtype: int64
# brand_id has many nulls; fill with 0 as an "unknown brand" placeholder
data_user_log["brand_id"].fillna(0, inplace = True)

1.4 Initial Data Exploration

Next, we visualize the data to get a first look at its characteristics.

# user age distribution
tags = data_user_info.age_range.value_counts().sort_index()
age = pd.DataFrame(tags)
age["index"] = range(len(tags))
ax = sns.barplot(x="index", y="age_range", data=age, palette="Blues")
ax.tick_params(labelsize=10)
ax.set_title('age distribution',fontsize=12)
ax.set_xlabel('')
ax.set_ylabel('')
Text(0, 0.5, '')

[Figure: bar chart of the user age distribution]

# user gender distribution
sizes = data_user_info.gender.value_counts().sort_index()

labels = ['female', 'male', 'unknown']
colors = ['lightcoral', 'lightskyblue', 'yellowgreen']
explode = (0.1, 0, 0)  # pull the first wedge (female) out slightly

patches, l_text, p_text = plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True, startangle=90)
plt.axis('equal')  # draw as a circle rather than an ellipse
for t in l_text:
    t.set_size(14)
for t in p_text:
    t.set_size(10)
plt.show()

[Figure: pie chart of the user gender distribution]

Age codes: 1 = under 18, 2 = 18-24, 3 = 25-29, 4 = 30-34, 5 = 35-39, 6 = 40-49, 7 and 8 = 50+, 0 = unknown.
Gender codes: 0 = female, 1 = male, 2 = unknown.
Users are concentrated in the 25-29 bracket, and women outnumber men.
Because of privacy-protecting sampling bias, these figures do not represent Tmall's actual user base.
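Since the bar chart above uses the raw codes as x-axis labels, an optional sketch like the following (assuming the code-to-label mapping just described) makes the axis self-explanatory:

# optional sketch: relabel the coded age_range values for plotting
age_labels = {0: "unknown", 1: "<18", 2: "18-24", 3: "25-29", 4: "30-34",
              5: "35-39", 6: "40-49", 7: "50+", 8: "50+"}
tags = data_user_info.age_range.value_counts().sort_index()
tags.index = [age_labels[int(v)] for v in tags.index]
ax = tags.plot.bar(rot=45)
ax.set_title("age distribution")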

# user action-type distribution
sizes = data_user_log.action_type.value_counts().sort_index()

labels = ['click', 'add-to-cart', 'purchase', 'favorite']  # wedge labels
colors = ['lightskyblue', 'yellowgreen', 'gold', 'lightcoral']  # wedge colors
explode = (0.1, 0, 0, 0)  # pull the first wedge (clicks) out slightly

patches, l_text, p_text = plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True, startangle=90)
plt.axis('equal')  # draw as a circle rather than an ellipse
for t in l_text:
    t.set_size(14)
for t in p_text:
    t.set_size(10)
plt.show()

[Figure: pie chart of the user action-type distribution]

Action types: 0 = click, 1 = add-to-cart, 2 = purchase, 3 = add-to-favorites.
The vast majority of actions are clicks; add-to-cart is rare, with users mostly purchasing directly or adding items to favorites.

2. Feature Construction

From a business standpoint, factors that may influence repurchase include:

  • User features: age, gender, preferred product types, shopping habits (purchase frequency, purchase-to-click ratio, etc.), and whether the user likes trying new shops or sticks to familiar ones
  • Merchant features: product mix, traffic (interaction counts and distinct interaction days), reputation (purchase-to-click ratio), product quality (user repurchase rate)
  • User-merchant features: similarity between the user's preferences and the merchant's products

We therefore build the following features at the user, merchant, and user-merchant levels:

  • number of interactions and of distinct interaction days
  • number of distinct items, categories, brands, and users/merchants interacted with
  • counts of click, add-to-cart, purchase, and favorite actions
  • purchase-to-click ratio
  • repurchase rate
  • user gender and age

2.1 User Features

# group by user_id
groups = data_user_log.groupby(["user_id"])
# u1: total number of interactions
temp = groups.size().reset_index().rename(columns = {0:"u1"})
data = pd.merge(data,temp, on ="user_id",how = "left")
data.head(3)

user_id merchant_id label origin u1
0 34176 3906 0.0 train 451
1 34176 121 0.0 train 451
2 34176 4356 1.0 train 451
# u2: number of distinct interaction days
temp = groups.time_stamp.nunique().reset_index().rename(columns = {"time_stamp":"u2"})
data = data.merge(temp,on ="user_id",how = "left")
data.head(3)

user_id merchant_id label origin u1 u2
0 34176 3906 0.0 train 451 47
1 34176 121 0.0 train 451 47
2 34176 4356 1.0 train 451 47
# u3-u6: number of distinct items, categories, merchants, and brands interacted with
temp = groups[['item_id','cat_id','merchant_id','brand_id']].nunique().reset_index().rename(columns={
'item_id':'u3','cat_id':'u4','merchant_id':'u5','brand_id':'u6'})
data = data.merge(temp,on ="user_id",how = "left")
data.head(3)

user_id merchant_id label origin u1 u2 u3 u4 u5 u6
0 34176 3906 0.0 train 451 47 256 45 109 108
1 34176 121 0.0 train 451 47 256 45 109 108
2 34176 4356 1.0 train 451 47 256 45 109 108
# u7-u10: counts of click, add-to-cart, purchase, and favorite actions
temp = groups['action_type'].value_counts().unstack().reset_index().rename(columns={0:'u7', 1:'u8', 2:'u9', 3:'u10'})
data = data.merge(temp,on ="user_id",how = "left")
data.head(3)

user_id merchant_id label origin u1 u2 u3 u4 u5 u6 u7 u8 u9 u10
0 34176 3906 0.0 train 451 47 256 45 109 108 410.0 NaN 34.0 7.0
1 34176 121 0.0 train 451 47 256 45 109 108 410.0 NaN 34.0 7.0
2 34176 4356 1.0 train 451 47 256 45 109 108 410.0 NaN 34.0 7.0
# u11: purchase-to-click ratio
data["u11"] = data["u9"]/data["u7"]
# u12: repurchase rate = merchants repurchased from / all merchants purchased from
# group by (user_id, merchant_id); flag 1 if there are >1 distinct purchase days, else 0
groups_rb = data_user_log[data_user_log["action_type"]==2].groupby(["user_id","merchant_id"])
temp_rb = groups_rb.time_stamp.nunique().reset_index().rename(columns = {"time_stamp":"n_days"})
temp_rb["label_um"] = [(1 if x > 1 else 0) for x in temp_rb["n_days"]]

# aggregate to user level and merge into data
temp = temp_rb.groupby(["user_id","label_um"]).size().unstack(fill_value=0).reset_index()
temp["u12"] = temp[1]/(temp[0]+temp[1])

data = data.merge(temp[["user_id","u12"]],on ="user_id",how = "left")
data.head(3)

user_id merchant_id label origin u1 u2 u3 u4 u5 u6 u7 u8 u9 u10 u11 u12
0 34176 3906 0.0 train 451 47 256 45 109 108 410.0 NaN 34.0 7.0 0.082927 0.045455
1 34176 121 0.0 train 451 47 256 45 109 108 410.0 NaN 34.0 7.0 0.082927 0.045455
2 34176 4356 1.0 train 451 47 256 45 109 108 410.0 NaN 34.0 7.0 0.082927 0.045455
# one-hot encode age and gender
data = data.merge(data_user_info,on ="user_id",how = "left")

temp = pd.get_dummies(data["age_range"],prefix = "age")
temp2 = pd.get_dummies(data["gender"],prefix = "gender")

data = pd.concat([data,temp,temp2],axis = 1)
data.drop(columns = ["age_range","gender"],inplace = True)
data.head(3)

user_id merchant_id label origin u1 u2 u3 u4 u5 u6 u7 u8 u9 u10 u11 u12 age_0.0 age_1.0 age_2.0 age_3.0 age_4.0 age_5.0 age_6.0 age_7.0 age_8.0 gender_0.0 gender_1.0 gender_2.0
0 34176 3906 0.0 train 451 47 256 45 109 108 410.0 NaN 34.0 7.0 0.082927 0.045455 0 0 0 0 0 0 1 0 0 1 0 0
1 34176 121 0.0 train 451 47 256 45 109 108 410.0 NaN 34.0 7.0 0.082927 0.045455 0 0 0 0 0 0 1 0 0 1 0 0
2 34176 4356 1.0 train 451 47 256 45 109 108 410.0 NaN 34.0 7.0 0.082927 0.045455 0 0 0 0 0 0 1 0 0 1 0 0

2.2 Merchant Features

# group by merchant_id
groups = data_user_log.groupby(["merchant_id"])
# m1: total number of interactions
temp = groups.size().reset_index().rename(columns = {0:"m1"})
data = pd.merge(data,temp, on ="merchant_id",how = "left")
data.head(3)

user_id merchant_id label origin u1 u2 u3 u4 u5 u6 u7 u8 u9 u10 u11 u12 age_0.0 age_1.0 age_2.0 age_3.0 age_4.0 age_5.0 age_6.0 age_7.0 age_8.0 gender_0.0 gender_1.0 gender_2.0 m1
0 34176 3906 0.0 train 451 47 256 45 109 108 410.0 NaN 34.0 7.0 0.082927 0.045455 0 0 0 0 0 0 1 0 0 1 0 0 16269
1 34176 121 0.0 train 451 47 256 45 109 108 410.0 NaN 34.0 7.0 0.082927 0.045455 0 0 0 0 0 0 1 0 0 1 0 0 79865
2 34176 4356 1.0 train 451 47 256 45 109 108 410.0 NaN 34.0 7.0 0.082927 0.045455 0 0 0 0 0 0 1 0 0 1 0 0 7269
# m2: number of distinct interaction days
temp = groups.time_stamp.nunique().reset_index().rename(columns = {"time_stamp":"m2"})
data = data.merge(temp,on ="merchant_id",how = "left")
data.head(3)

user_id merchant_id label origin u1 u2 u3 u4 u5 u6 u7 u8 u9 u10 u11 u12 age_0.0 age_1.0 age_2.0 age_3.0 age_4.0 age_5.0 age_6.0 age_7.0 age_8.0 gender_0.0 gender_1.0 gender_2.0 m1 m2
0 34176 3906 0.0 train 451 47 256 45 109 108 410.0 NaN 34.0 7.0 0.082927 0.045455 0 0 0 0 0 0 1 0 0 1 0 0 16269 185
1 34176 121 0.0 train 451 47 256 45 109 108 410.0 NaN 34.0 7.0 0.082927 0.045455 0 0 0 0 0 0 1 0 0 1 0 0 79865 185
2 34176 4356 1.0 train 451 47 256 45 109 108 410.0 NaN 34.0 7.0 0.082927 0.045455 0 0 0 0 0 0 1 0 0 1 0 0 7269 155
# m3-m6: number of distinct items, categories, users, and brands interacted with
temp = groups[['item_id','cat_id','user_id','brand_id']].nunique().reset_index().rename(columns={
'item_id':'m3','cat_id':'m4','user_id':'m5','brand_id':'m6'})
data = data.merge(temp,on ="merchant_id",how = "left")
data.head(3)

user_id merchant_id label origin u1 u2 u3 u4 u5 u6 u7 u8 u9 u10 u11 ... age_3.0 age_4.0 age_5.0 age_6.0 age_7.0 age_8.0 gender_0.0 gender_1.0 gender_2.0 m1 m2 m3 m4 m5 m6
0 34176 3906 0.0 train 451 47 256 45 109 108 410.0 NaN 34.0 7.0 0.082927 ... 0 0 0 1 0 0 1 0 0 16269 185 308 20 5819 2
1 34176 121 0.0 train 451 47 256 45 109 108 410.0 NaN 34.0 7.0 0.082927 ... 0 0 0 1 0 0 1 0 0 79865 185 1179 26 10931 2
2 34176 4356 1.0 train 451 47 256 45 109 108 410.0 NaN 34.0 7.0 0.082927 ... 0 0 0 1 0 0 1 0 0 7269 155 67 15 2281 2

3 rows × 34 columns

# m7-m10: counts of click, add-to-cart, purchase, and favorite actions
temp = groups['action_type'].value_counts().unstack().reset_index().rename(columns={0:'m7', 1:'m8', 2:'m9', 3:'m10'})
data = data.merge(temp,on ="merchant_id",how = "left")
data.head(3)

user_id merchant_id label origin u1 u2 u3 u4 u5 u6 u7 u8 u9 u10 u11 ... age_7.0 age_8.0 gender_0.0 gender_1.0 gender_2.0 m1 m2 m3 m4 m5 m6 m7 m8 m9 m10
0 34176 3906 0.0 train 451 47 256 45 109 108 410.0 NaN 34.0 7.0 0.082927 ... 0 0 1 0 0 16269 185 308 20 5819 2 14870.0 28.0 410.0 961.0
1 34176 121 0.0 train 451 47 256 45 109 108 410.0 NaN 34.0 7.0 0.082927 ... 0 0 1 0 0 79865 185 1179 26 10931 2 72265.0 121.0 4780.0 2699.0
2 34176 4356 1.0 train 451 47 256 45 109 108 410.0 NaN 34.0 7.0 0.082927 ... 0 0 1 0 0 7269 155 67 15 2281 2 6094.0 16.0 963.0 196.0

3 rows × 38 columns

# m11: purchase-to-click ratio
data["m11"] = data["m9"]/data["m7"]
# m12: repurchase rate = users who repurchased / all users who purchased
# the (user_id, merchant_id) repurchase flags label_um were computed in the previous step
# aggregate to merchant level and merge into data
temp = temp_rb.groupby(["merchant_id","label_um"]).size().unstack(fill_value=0).reset_index()
temp["m12"] = temp[1]/(temp[0]+temp[1])

data = data.merge(temp[["merchant_id","m12"]],on ="merchant_id",how = "left")
data.head(3)

user_id merchant_id label origin u1 u2 u3 u4 u5 u6 u7 u8 u9 u10 u11 ... gender_0.0 gender_1.0 gender_2.0 m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12
0 34176 3906 0.0 train 451 47 256 45 109 108 410.0 NaN 34.0 7.0 0.082927 ... 1 0 0 16269 185 308 20 5819 2 14870.0 28.0 410.0 961.0 0.027572 0.048387
1 34176 121 0.0 train 451 47 256 45 109 108 410.0 NaN 34.0 7.0 0.082927 ... 1 0 0 79865 185 1179 26 10931 2 72265.0 121.0 4780.0 2699.0 0.066145 0.053014
2 34176 4356 1.0 train 451 47 256 45 109 108 410.0 NaN 34.0 7.0 0.082927 ... 1 0 0 7269 155 67 15 2281 2 6094.0 16.0 963.0 196.0 0.158024 0.084444

3 rows × 40 columns

2.3 User-Merchant Features

# group by (user_id, merchant_id)
groups = data_user_log.groupby(['user_id','merchant_id'])
# um1: total number of interactions
temp = groups.size().reset_index().rename(columns = {0:"um1"})
data = pd.merge(data,temp, on =["merchant_id","user_id"],how = "left")
data.head(3)

user_id merchant_id label origin u1 u2 u3 u4 u5 u6 u7 u8 u9 u10 u11 ... gender_1.0 gender_2.0 m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 um1
0 34176 3906 0.0 train 451 47 256 45 109 108 410.0 NaN 34.0 7.0 0.082927 ... 0 0 16269 185 308 20 5819 2 14870.0 28.0 410.0 961.0 0.027572 0.048387 39
1 34176 121 0.0 train 451 47 256 45 109 108 410.0 NaN 34.0 7.0 0.082927 ... 0 0 79865 185 1179 26 10931 2 72265.0 121.0 4780.0 2699.0 0.066145 0.053014 14
2 34176 4356 1.0 train 451 47 256 45 109 108 410.0 NaN 34.0 7.0 0.082927 ... 0 0 7269 155 67 15 2281 2 6094.0 16.0 963.0 196.0 0.158024 0.084444 18

3 rows × 41 columns

# um2: number of distinct interaction days
temp = groups.time_stamp.nunique().reset_index().rename(columns = {"time_stamp":"um2"})
data = data.merge(temp,on =["merchant_id","user_id"],how = "left")
data.head(3)

user_id merchant_id label origin u1 u2 u3 u4 u5 u6 u7 u8 u9 u10 u11 ... gender_2.0 m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 um1 um2
0 34176 3906 0.0 train 451 47 256 45 109 108 410.0 NaN 34.0 7.0 0.082927 ... 0 16269 185 308 20 5819 2 14870.0 28.0 410.0 961.0 0.027572 0.048387 39 9
1 34176 121 0.0 train 451 47 256 45 109 108 410.0 NaN 34.0 7.0 0.082927 ... 0 79865 185 1179 26 10931 2 72265.0 121.0 4780.0 2699.0 0.066145 0.053014 14 3
2 34176 4356 1.0 train 451 47 256 45 109 108 410.0 NaN 34.0 7.0 0.082927 ... 0 7269 155 67 15 2281 2 6094.0 16.0 963.0 196.0 0.158024 0.084444 18 2

3 rows × 42 columns

# um3-um5: number of distinct items, categories, and brands interacted with
temp = groups[['item_id','cat_id','brand_id']].nunique().reset_index().rename(columns={
'item_id':'um3','cat_id':'um4','brand_id':'um5'})
data = data.merge(temp,on =["merchant_id","user_id"],how = "left")
data.head(3)

user_id merchant_id label origin u1 u2 u3 u4 u5 u6 u7 u8 u9 u10 u11 ... m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 um1 um2 um3 um4 um5
0 34176 3906 0.0 train 451 47 256 45 109 108 410.0 NaN 34.0 7.0 0.082927 ... 308 20 5819 2 14870.0 28.0 410.0 961.0 0.027572 0.048387 39 9 20 6 1
1 34176 121 0.0 train 451 47 256 45 109 108 410.0 NaN 34.0 7.0 0.082927 ... 1179 26 10931 2 72265.0 121.0 4780.0 2699.0 0.066145 0.053014 14 3 1 1 1
2 34176 4356 1.0 train 451 47 256 45 109 108 410.0 NaN 34.0 7.0 0.082927 ... 67 15 2281 2 6094.0 16.0 963.0 196.0 0.158024 0.084444 18 2 2 1 1

3 rows × 45 columns

# um6-um9: counts of click, add-to-cart, purchase, and favorite actions
temp = groups['action_type'].value_counts().unstack().reset_index().rename(columns={0:'um6', 1:'um7', 2:'um8', 3:'um9'})
data = data.merge(temp,on =["merchant_id","user_id"],how = "left")
data.head(3)

user_id merchant_id label origin u1 u2 u3 u4 u5 u6 u7 u8 u9 u10 u11 ... m7 m8 m9 m10 m11 m12 um1 um2 um3 um4 um5 um6 um7 um8 um9
0 34176 3906 0.0 train 451 47 256 45 109 108 410.0 NaN 34.0 7.0 0.082927 ... 14870.0 28.0 410.0 961.0 0.027572 0.048387 39 9 20 6 1 36.0 NaN 1.0 2.0
1 34176 121 0.0 train 451 47 256 45 109 108 410.0 NaN 34.0 7.0 0.082927 ... 72265.0 121.0 4780.0 2699.0 0.066145 0.053014 14 3 1 1 1 13.0 NaN 1.0 NaN
2 34176 4356 1.0 train 451 47 256 45 109 108 410.0 NaN 34.0 7.0 0.082927 ... 6094.0 16.0 963.0 196.0 0.158024 0.084444 18 2 2 1 1 12.0 NaN 6.0 NaN

3 rows × 49 columns

# um10: purchase-to-click ratio
data["um10"] = data["um8"]/data["um6"]
# save the extracted features so they can be reloaded next time
# data.to_csv("./data/features.csv",index=False)

3. Modeling and Prediction

Here we test several models and compare their performance:

  • Binary logistic regression: the classic model for binary classification; fast to train
  • Random forest: handles high-dimensional data and large datasets; fast to train
  • LightGBM: low memory footprint, handles missing values natively, fast to train
  • XGBoost: parallel training, regularization against overfitting, handles missing values; well suited to low- and medium-dimensional data

3.1 Preprocessing for Modeling

# # reload the previously saved features
# data = pd.read_csv("./data/features.csv")
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 522341 entries, 0 to 522340
Data columns (total 50 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   user_id      522341 non-null  int32  
 1   merchant_id  522341 non-null  int16  
 2   label        260864 non-null  float32
 3   origin       522341 non-null  object 
 4   u1           522341 non-null  int64  
 5   u2           522341 non-null  int64  
 6   u3           522341 non-null  int64  
 7   u4           522341 non-null  int64  
 8   u5           522341 non-null  int64  
 9   u6           522341 non-null  int64  
 10  u7           521981 non-null  float64
 11  u8           38179 non-null   float64
 12  u9           522341 non-null  float64
 13  u10          294859 non-null  float64
 14  u11          521981 non-null  float64
 15  u12          522341 non-null  float64
 16  age_0.0      522341 non-null  uint8  
 17  age_1.0      522341 non-null  uint8  
 18  age_2.0      522341 non-null  uint8  
 19  age_3.0      522341 non-null  uint8  
 20  age_4.0      522341 non-null  uint8  
 21  age_5.0      522341 non-null  uint8  
 22  age_6.0      522341 non-null  uint8  
 23  age_7.0      522341 non-null  uint8  
 24  age_8.0      522341 non-null  uint8  
 25  gender_0.0   522341 non-null  uint8  
 26  gender_1.0   522341 non-null  uint8  
 27  gender_2.0   522341 non-null  uint8  
 28  m1           522341 non-null  int64  
 29  m2           522341 non-null  int64  
 30  m3           522341 non-null  int64  
 31  m4           522341 non-null  int64  
 32  m5           522341 non-null  int64  
 33  m6           522341 non-null  int64  
 34  m7           522341 non-null  float64
 35  m8           518289 non-null  float64
 36  m9           522341 non-null  float64
 37  m10          522341 non-null  float64
 38  m11          522341 non-null  float64
 39  m12          522341 non-null  float64
 40  um1          522341 non-null  int64  
 41  um2          522341 non-null  int64  
 42  um3          522341 non-null  int64  
 43  um4          522341 non-null  int64  
 44  um5          522341 non-null  int64  
 45  um6          462933 non-null  float64
 46  um7          9394 non-null    float64
 47  um8          522341 non-null  float64
 48  um9          96551 non-null   float64
 49  um10         462933 non-null  float64
dtypes: float32(1), float64(17), int16(1), int32(1), int64(17), object(1), uint8(12)
memory usage: 154.4+ MB
# downcast the feature table
fcols = data.select_dtypes('float').columns
icols = data.select_dtypes('integer').columns
data[fcols] = data[fcols].apply(pd.to_numeric, downcast='float')
data[icols] = data[icols].apply(pd.to_numeric, downcast='integer')

data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 522341 entries, 0 to 522340
Data columns (total 50 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   user_id      522341 non-null  int32  
 1   merchant_id  522341 non-null  int16  
 2   label        260864 non-null  float32
 3   origin       522341 non-null  object 
 4   u1           522341 non-null  int16  
 5   u2           522341 non-null  int16  
 6   u3           522341 non-null  int16  
 7   u4           522341 non-null  int16  
 8   u5           522341 non-null  int16  
 9   u6           522341 non-null  int16  
 10  u7           521981 non-null  float32
 11  u8           38179 non-null   float32
 12  u9           522341 non-null  float32
 13  u10          294859 non-null  float32
 14  u11          521981 non-null  float32
 15  u12          522341 non-null  float32
 16  age_0.0      522341 non-null  int8   
 17  age_1.0      522341 non-null  int8   
 18  age_2.0      522341 non-null  int8   
 19  age_3.0      522341 non-null  int8   
 20  age_4.0      522341 non-null  int8   
 21  age_5.0      522341 non-null  int8   
 22  age_6.0      522341 non-null  int8   
 23  age_7.0      522341 non-null  int8   
 24  age_8.0      522341 non-null  int8   
 25  gender_0.0   522341 non-null  int8   
 26  gender_1.0   522341 non-null  int8   
 27  gender_2.0   522341 non-null  int8   
 28  m1           522341 non-null  int32  
 29  m2           522341 non-null  int16  
 30  m3           522341 non-null  int16  
 31  m4           522341 non-null  int16  
 32  m5           522341 non-null  int32  
 33  m6           522341 non-null  int8   
 34  m7           522341 non-null  float32
 35  m8           518289 non-null  float32
 36  m9           522341 non-null  float32
 37  m10          522341 non-null  float32
 38  m11          522341 non-null  float32
 39  m12          522341 non-null  float32
 40  um1          522341 non-null  int16  
 41  um2          522341 non-null  int8   
 42  um3          522341 non-null  int16  
 43  um4          522341 non-null  int8   
 44  um5          522341 non-null  int8   
 45  um6          462933 non-null  float32
 46  um7          9394 non-null    float32
 47  um8          522341 non-null  float32
 48  um9          96551 non-null   float32
 49  um10         462933 non-null  float32
dtypes: float32(18), int16(12), int32(3), int8(16), object(1)
memory usage: 69.7+ MB
# several feature columns have unmatched nulls; fill with 0 (LightGBM and XGBoost could also handle the NaNs natively)
data.fillna(0, inplace = True)
# split back into train and test sets
train = data[data["origin"]=="train"].drop(["origin"],axis = 1)
test = data[data["origin"]=="test"].drop(["origin","label"],axis = 1)
X,Y = train.drop(['label'],axis=1),train['label'] 
# split into training and validation sets
from sklearn.model_selection import train_test_split
train_x,valid_x,train_y,valid_y = train_test_split(X,Y,test_size=0.2)
# positive-sample ratio in the train and valid sets
print("Ratio of positive samples in train dataset:",train_y.mean())
print("Ratio of positive samples in valid dataset:",valid_y.mean())
Ratio of positive samples in train dataset: 0.06123886629939079
Ratio of positive samples in valid dataset: 0.06079773232340813

The positive-sample ratios in the train and valid sets are essentially identical.
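A plain random split does not guarantee this; if you want matched ratios by construction (and reproducibility), a stratified split is a one-line change, sketched here with an arbitrary random_state:

# sketch: stratify on the label so both splits share the positive ratio
train_x, valid_x, train_y, valid_y = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=42)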

# import libraries
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, learning_curve
from sklearn.metrics import roc_auc_score

3.2 Logistic Regression

from sklearn.linear_model import LogisticRegression
# fit with default parameters
model = LogisticRegression()
model.fit(train_x,train_y)
E:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(





LogisticRegression()
# evaluate the model 
print('accuracy:',model.score(valid_x,valid_y))
print('roc_auc:',roc_auc_score(valid_y,model.predict_proba(valid_x)[:,1]))
accuracy: 0.9392022693730474
roc_auc: 0.4928235566543885

With default parameters the model is essentially useless (AUC ≈ 0.49, no better than random guessing), so we tune the hyperparameters with GridSearchCV.
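The ConvergenceWarning above also points at a complementary fix: the features span very different scales (raw counts next to ratios), which makes lbfgs struggle. Scaling the inputs first, e.g. in a pipeline, usually lets even the default solver converge; a minimal sketch:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# sketch: standardize features before logistic regression so lbfgs converges;
# max_iter raised as the warning itself suggests
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(train_x, train_y)
print('roc_auc:', roc_auc_score(valid_y, scaled_lr.predict_proba(valid_x)[:, 1]))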

# hyperparameter tuning

LG = LogisticRegression()
params = {"solver": ["liblinear", "saga"],  # solvers suited to large, high-dimensional data
          "C": [0.01, 0.1, 1, 10, 100],
          "penalty": ["l1", "l2"]
          }

grid_search = GridSearchCV(LG,params,cv = 5,scoring = "roc_auc")
grid_search.fit(train_x,train_y)
# best parameters found by the grid search
display(grid_search.best_params_)
display(grid_search.best_score_)
# evaluate the model
# model = grid_search.best_estimator_

# on re-runs, rebuild the model directly from the best parameters
LG = LogisticRegression(C=0.1, penalty='l1', solver='liblinear')
LG.fit(train_x,train_y)


auc_lr = roc_auc_score(valid_y,LG.predict_proba(valid_x)[:,1])
print('accuracy:',LG.score(valid_x,valid_y))
print('roc_auc:',auc_lr)
accuracy: 0.9389339313437985
roc_auc: 0.6699541453628105
# predict on the test set
prob_lr = LG.predict_proba(test)[:,1]

3.3 Random Forest

from sklearn.ensemble import RandomForestClassifier
# fit with default parameters
model = RandomForestClassifier()
model.fit(train_x,train_y)
RandomForestClassifier()
# evaluate model
auc_rf = roc_auc_score(valid_y,model.predict_proba(valid_x)[:,1])
print('accuracy:',model.score(valid_x,valid_y))
print('roc_auc:',auc_rf)
accuracy: 0.9390297663542445
roc_auc: 0.6500745133672414
# hyperparameter tuning
RF = RandomForestClassifier()
params = {"n_estimators": [50, 100],
          "max_depth": [5, 10, 100],
          "min_samples_split": [2, 10, 500],
          "min_samples_leaf": [1, 50, 100]
          }

grid_search = GridSearchCV(RF,params,cv = 3,scoring = "roc_auc")
grid_search.fit(train_x,train_y)
# best parameters found by the grid search
display(grid_search.best_params_)
display(grid_search.best_score_)
# evaluate the model
# model = grid_search.best_estimator_

# on re-runs, rebuild the model directly from the best parameters
RF = RandomForestClassifier(max_depth=100, min_samples_leaf=50, min_samples_split=10)
RF.fit(train_x,train_y)


auc_rf = roc_auc_score(valid_y,RF.predict_proba(valid_x)[:,1])
print('accuracy:',RF.score(valid_x,valid_y))
print('roc_auc:',auc_rf)
accuracy: 0.9392022693730474
roc_auc: 0.6848666816975427
# top 10 features by importance
features = pd.Series(RF.feature_importances_, index=train_x.columns).sort_values()
features[-10:].plot.barh()
<AxesSubplot:>

[Figure: top feature importances from the random forest]

The three most important features are the merchant's user repurchase rate (m12), the user's purchase-to-click ratio (u11), and the number of distinct items in the user-merchant interactions (um3).

# predict on the test set
prob_rf = RF.predict_proba(test)[:,1]

3.4 LightGBM

from lightgbm import LGBMClassifier
# fit with default parameters
model = LGBMClassifier()
model.fit(train_x,train_y)
LGBMClassifier()
# evaluate model
auc_lgbm = roc_auc_score(valid_y,model.predict_proba(valid_x)[:,1])
print('accuracy:',model.score(valid_x,valid_y))
print('roc_auc:',auc_lgbm)
accuracy: 0.939163935368869
roc_auc: 0.6842040668650429
# hyperparameter tuning
LGBM = LGBMClassifier()
params = {"boosting_type": ["gbdt", "dart", "goss"],
          "learning_rate": [0.05, 0.1],
          "n_estimators": [100, 1000],
          "num_leaves": [30, 100, 500],
          "max_depth": [10, 50, 100],
          "subsample": [0.5],
          "min_split_gain": [0.05]
          }

grid_search = GridSearchCV(LGBM,params,cv = 3,scoring = "roc_auc")
grid_search.fit(train_x,train_y)
# best parameters found by the grid search
display(grid_search.best_params_)
display(grid_search.best_score_)
# evaluate the model
# model = grid_search.best_estimator_

# on re-runs, rebuild the model directly from the best parameters
LGBM = LGBMClassifier(
    boosting_type="dart",
    learning_rate=0.05,
    max_depth=10,
    min_split_gain=0.05,
    n_estimators=1000,
    num_leaves=30,
    subsample=0.5
)
LGBM.fit(train_x, train_y)


auc_lgbm = roc_auc_score(valid_y,LGBM.predict_proba(valid_x)[:,1])
print('accuracy:',LGBM.score(valid_x,valid_y))
print('roc_auc:',auc_lgbm)
accuracy: 0.939163935368869
roc_auc: 0.687943355403638
# top 10 features by importance
import lightgbm
lightgbm.plot_importance(LGBM,max_num_features=10)
<AxesSubplot:title={'center':'Feature importance'}, xlabel='Feature importance', ylabel='Features'>

[Figure: top 10 feature importances from LightGBM]

The three most important features are the merchant's user repurchase rate (m12), the user's purchase-to-click ratio (u11), and the number of distinct categories the merchant was interacted with (m4).

# predict on the test set
prob_lgbm = LGBM.predict_proba(test)[:,1]

3.5 XGBoost

from xgboost import XGBClassifier
# fit with default parameters
model = XGBClassifier()
model.fit(train_x,train_y)
XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
              importance_type=None, interaction_constraints='',
              learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
              missing=nan, monotone_constraints='()', n_estimators=100,
              n_jobs=0, num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, ...)
# evaluate model
auc_xgb = roc_auc_score(valid_y,model.predict_proba(valid_x)[:,1])
print('accuracy:',model.score(valid_x,valid_y))
print('roc_auc:',auc_xgb)
accuracy: 0.9386847603166388
roc_auc: 0.6774144828554727
# hyperparameter tuning
XGB = XGBClassifier()
params = {"eta": [0.05, 0.1],
          "gamma": [5, 50, 200],
          "min_child_weight": [10, 100, 1000],
          "max_depth": [5, 50, 100],
          "subsample": [0.5],
          "objective": ["binary:logistic"],
          "eval_metric": ["auc"]
          }

grid_search = GridSearchCV(XGB,params,cv = 3,scoring = "roc_auc")
grid_search.fit(train_x,train_y)
# best parameters found by the grid search
display(grid_search.best_params_)
display(grid_search.best_score_)
# evaluate the model
# model = grid_search.best_estimator_

# on re-runs, rebuild the model directly from the best parameters
XGB = XGBClassifier(
    eta=0.1,
    gamma=5,
    max_depth=50,
    min_child_weight=100,
    objective="binary:logistic",
    eval_metric="auc",
    subsample=0.5
)
XGB.fit(train_x, train_y)


auc_xgb = roc_auc_score(valid_y,XGB.predict_proba(valid_x)[:,1])
print('accuracy:',XGB.score(valid_x,valid_y))
print('roc_auc:',auc_xgb)
accuracy: 0.9392022693730474
roc_auc: 0.6886825861417296
# top 10 features by importance
import xgboost
xgboost.plot_importance(XGB,max_num_features=10)
<AxesSubplot:title={'center':'Feature importance'}, xlabel='F score', ylabel='Features'>

[Figure: top 10 feature importances from XGBoost]

The three most important features are the merchant's user repurchase rate (m12), the user's purchase-to-click ratio (u11), and the merchant's purchase-to-click ratio (m11).

# predict on the test set
prob_xgb = XGB.predict_proba(test)[:,1]

3.6 Analysis of the Base Models' Results

The models' results compare as follows:

# collect the models' AUC scores
scores = pd.DataFrame({"auc": [auc_lr, auc_rf, auc_lgbm, auc_xgb],
                       "model": ["LogisticRegression", "RandomForest", "LightGBM", "XGBoost"]})
scores.sort_values(by="auc",ascending=False)

auc model
3 0.688683 XGBoost
2 0.687943 LightGBM
1 0.684867 RandomForest
0 0.669954 LogisticRegression

Comparing AUC scores, XGBoost performs best at 0.6887, with LightGBM close behind at 0.6879. Across the models' feature-importance rankings, the user's purchase-to-click ratio and the merchant's user repurchase rate consistently sit near the top and influence the models most.

We next plot learning curves for the training and validation process to check each single model for overfitting or underfitting.

# use sklearn's learning_curve to get train/CV scores, then plot them with matplotlib

clfs = [LG, RF, LGBM, XGB]

def plot_learning_curve(clf, title, X, y, ylim=None, cv=None, n_jobs=3, train_sizes=np.linspace(.05, 1., 5)):
    train_sizes, train_scores, test_scores = learning_curve(
        clf, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    ax = plt.figure().add_subplot(111)
    ax.set_title(title)
    if ylim is not None:
        ax.set_ylim(*ylim)
    ax.set_xlabel(u"train_num_of_samples")
    ax.set_ylabel(u"score")

    ax.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std,
                    alpha=0.1, color="b")
    ax.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std,
                    alpha=0.1, color="r")
    ax.plot(train_sizes, train_scores_mean, 'o-', color="b", label=u"train score")
    ax.plot(train_sizes, test_scores_mean, 'o-', color="r", label=u"testCV score")

    ax.legend(loc="best")

    midpoint = ((train_scores_mean[-1] + train_scores_std[-1]) + (test_scores_mean[-1] - test_scores_std[-1])) / 2
    diff = (train_scores_mean[-1] + train_scores_std[-1]) - (test_scores_mean[-1] - test_scores_std[-1])
    return midpoint, diff

alg_list=['logreg', 'randomforest', 'lightGBM', 'XGBoost']

plot_learning_curve(clfs[0], alg_list[0], X, Y)
plot_learning_curve(clfs[1], alg_list[1], X, Y)
plot_learning_curve(clfs[2], alg_list[2], X, Y)
plot_learning_curve(clfs[3], alg_list[3], X, Y)

(0.9388542876912196, 2.403725037580795e-05)

[Figures: learning curves for logreg, randomforest, lightGBM, and XGBoost]

K-fold cross-validation (accuracy)

# 10-fold cross-validation, scored on accuracy

kfold = 10
cv_results = []
for classifier in clfs:
    cv_results.append(cross_val_score(classifier, X.values, y=Y.values, scoring="accuracy", cv=kfold, n_jobs=4))

# cv_results is a 4 x 10 matrix of fold scores
cv_means = []
cv_std = []
for cv_result in cv_results:
    cv_means.append(cv_result.mean())
    cv_std.append(cv_result.std())

ag = ['logreg', 'randomforest', 'lightGBM', 'XGBoost']
cv_res = pd.DataFrame({"CrossValMeans": cv_means, "CrossValerrors": cv_std,
                       "Algorithm": ag})

g = sns.barplot(x="CrossValMeans", y="Algorithm", data=cv_res, palette="Blues")
g.set_xlabel("CrossValMeans", fontsize=10)
g.set_ylabel('')
plt.xticks(rotation=30)
g = g.set_title("10-fold Cross validation scores", fontsize=12)

[Figure: 10-fold cross-validation mean accuracy by model]

# print the mean 10-fold CV accuracy for each model
for i in range(4):
    print("{} : {}".format(ag[i], cv_means[i]))
logreg : 0.9386500242771005
randomforest : 0.9388493622744981
lightGBM : 0.9388455288006042
XGBoost : 0.9388608622553315

4. Bagging

Comparing AUC scores, the XGBoost and LightGBM models already perform well. On many classification problems, ensembling can bring several potential benefits:

  • better overall generalization
  • lower risk of getting trapped in poor local minima
  • more stable predictions

We therefore combine the four base models into an ensemble and test its performance. (We keep the name Bagging below, although the implementation fits a logistic-regression meta-learner on the base models' outputs, which is closer to stacking/blending than to classical bagging.)

4.1 Building the Ensemble

# from sklearn.metrics import precision_score

# define the ensemble wrapper
class Bagging(object):
    # every sklearn estimator exposes fit/predict, so base models can be wrapped uniformly
    def __init__(self, estimators):
        self.estimator_names = []
        self.estimators = []
        for i in estimators:
            self.estimator_names.append(i[0])
            self.estimators.append(i[1])
        self.clf = LogisticRegression()  # meta-learner over the base models' outputs

    def fit(self, train_x, train_y):
        for i in self.estimators:
            i.fit(train_x, train_y)
        # note: the meta-learner is fit on the base models' hard 0/1 predictions,
        # while predict_proba below feeds it their probabilities
        x = np.array([i.predict(train_x) for i in self.estimators]).T
        y = train_y
        self.clf.fit(x, y)

    # hard 0/1 prediction (unused here)
    # def predict(self, x):
    #     x = np.array([i.predict(x) for i in self.estimators]).T
    #     return self.clf.predict(x)

    def predict_proba(self, x):
        x = np.array([i.predict_proba(x)[:, 1] for i in self.estimators]).T
        return self.clf.predict_proba(x)[:, 1]

    def score(self, x, y):
        # s_acc = precision_score(y, self.predict(x))
        s_auc = roc_auc_score(y, self.predict_proba(x))
        return s_auc

Select the models to combine:

bag = Bagging([('logreg',LG),('RandomForests',RF),('LightGBM',LGBM),('XGBoost',XGB)])

4.2 Testing the Ensemble's Performance

Scoring again with roc_auc_score, we run 10 rounds and take the average.

score = 0
for i in range(0, 10):
    num_test = 0.20
    # X_train, X_cv, Y_train, Y_cv = train_test_split(X_all.values, Y_all.values, test_size=num_test)
    X_train, X_cv, Y_train, Y_cv = train_x, valid_x, train_y, valid_y
    # note: the same split is reused each round, so round-to-round variation
    # comes only from the stochastic base learners
    bag.fit(X_train, Y_train)
    # Y_test = bag.predict(X_test)
    # auc_ = round(bag.score(X_cv, Y_cv) * 100, 2)
    auc_ = bag.score(X_cv, Y_cv)
    score += auc_
score / 10
0.6899103572351627

Compared with the best single learner, XGBoost (roc_auc_score = 0.6887), the ensemble (0.6899) shows a modest but consistent improvement.

4.3 Final Prediction

To let the models learn from all the available labelled data, we retrain on the full training set and then use the ensemble to predict the unlabelled test set.

# train the ensemble on all labelled data and predict the test set
bag.fit(X.values, Y.values)
prob_bagging = bag.predict_proba(test)

# save the ensemble's predictions to submission.csv
submission = pd.DataFrame()
submission[['user_id','merchant_id']] = test[['user_id','merchant_id']]
submission['prob'] = prob_bagging
submission.to_csv('./data/submission.csv',index=False)
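Before uploading, a quick sanity check may save a failed submission; a sketch (the expected row count comes from the test-set info above):

# sketch: the submission should have one row per test pair (261477) and
# probabilities in [0, 1]
assert len(submission) == len(data_test)
assert submission['prob'].between(0, 1).all()
submission.head()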
