Continuing from the previous lecture, Module 1: Introduction to CFPS Data and Data Cleaning
https://bbs.pinggu.org/thread-15729023-1-1.html
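Module 2: Single-Database Cleaning with AI + Python
1. Extract the target variables
2. Identify and handle missing values and outliers
3. Standardize the variables
4. Save the data
Case study: AI-assisted cleaning of cognitive and non-cognitive ability variables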
import pandas as pd
import numpy as np
# Load example data:
# simulate a small CFPS-style sample dataset
data = {
    'RESPID': [1001, 1002, 1003, 1004, 1005],
    'COGNITIVE1': [5, 8, -97, 10, 7],
    'COGNITIVE2': [12, -99, 14, 15, np.nan],
    'NONCOGNITIVE1': [3.2, 4.1, -98, 3.8, 4.0],
    'NONCOGNITIVE2': [2.9, np.nan, 3.5, -97, 3.8]
}
df = pd.DataFrame(data)
# 1. Extract the target variables
target_vars = ['RESPID', 'COGNITIVE1', 'COGNITIVE2', 'NONCOGNITIVE1', 'NONCOGNITIVE2']
df = df[target_vars]
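# 2. Identify and handle missing values and outliers
def clean_variable(var):
    # Recode the special codes -97, -98, -99 as missing (NaN)
    return var.replace([-97, -98, -99], np.nan)

for col in target_vars[1:]:
    df[col] = clean_variable(df[col])
    # Fill the gaps in numeric columns with the column median
    if pd.api.types.is_numeric_dtype(df[col]):
        # Assign the result back: calling fillna(inplace=True) on df[col] is
        # chained assignment, which pandas warns about and which does not
        # modify df under copy-on-write
        df[col] = df[col].fillna(df[col].median())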
# 3. Standardize the variables (z-score as an example)
for col in target_vars[1:]:
    mean = df[col].mean()
    std = df[col].std()  # sample standard deviation (pandas default, ddof=1)
    df[col + '_z'] = (df[col] - mean) / std
# 4. Save the data (here: keep a cleaned copy and print it)
df_cleaned = df.copy()
print("Sample of cleaned CFPS data", df_cleaned, sep="\n")
Results
   RESPID  COGNITIVE1  COGNITIVE2  NONCOGNITIVE1  NONCOGNITIVE2  COGNITIVE1_z  \
0    1001         5.0        12.0            3.2            2.9      -1.38675
1    1002         8.0        14.0            4.1            3.5       0.27735
2    1003         7.5        14.0            3.9            3.5       0.00000
3    1004        10.0        15.0            3.8            3.5       1.38675
4    1005         7.0        14.0            4.0            3.8      -0.27735

   COGNITIVE2_z  NONCOGNITIVE1_z  NONCOGNITIVE2_z
0     -1.643168        -1.697056        -1.643168
1      0.182574         0.848528         0.182574
2      0.182574         0.282843         0.182574
3      1.095445         0.000000         0.182574
4      0.182574         0.565685         1.095445
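As a quick sanity check on the z-score columns printed above, the sketch below recomputes COGNITIVE1_z from the filled values in the table. The names cog1, z_sample and z_population are illustrative and not part of the original script. One detail worth knowing: pandas' .std() defaults to the sample standard deviation (ddof=1), so these z-scores differ slightly from tools that divide by n (ddof=0), such as scikit-learn's StandardScaler.

```python
import numpy as np
import pandas as pd

# COGNITIVE1 after -97 was recoded to NaN and filled with the median (7.5),
# copied from the result table above.
cog1 = pd.Series([5.0, 8.0, 7.5, 10.0, 7.0], name="COGNITIVE1")

# Same formula as the script: pandas' default std() is the sample SD (ddof=1).
z_sample = (cog1 - cog1.mean()) / cog1.std()

# Population version (ddof=0), shown only for comparison.
z_population = (cog1 - cog1.mean()) / cog1.std(ddof=0)

print(z_sample.round(5).tolist())

# A standardized column should have mean 0 and (sample) SD 1.
print(np.isclose(z_sample.mean(), 0.0, atol=1e-12), np.isclose(z_sample.std(), 1.0))

# The two conventions differ by the constant factor sqrt(n / (n - 1)).
print((z_population / z_sample).dropna().round(4).unique())
```

With only five rows the gap between the two conventions is noticeable; on a full survey sample it becomes negligible, but it is still worth stating which one was used.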
Sample of the cleaned CFPS data (CSV format):
RESPID,COGNITIVE1,COGNITIVE2,NONCOGNITIVE1,NONCOGNITIVE2,COGNITIVE1_z,COGNITIVE2_z,NONCOGNITIVE1_z,NONCOGNITIVE2_z
1001,5.0,12.0,3.2,2.9,-1.386750490563073,-1.6431676725154991,-1.6970562748477138,-1.6431676725154987
1002,8.0,14.0,4.1,3.5,0.2773500981126146,0.18257418583505475,0.8485281374238569,0.18257418583505555
1003,7.5,14.0,3.9,3.5,0.0,0.18257418583505475,0.2828427124746194,0.18257418583505555
1004,10.0,15.0,3.8,3.5,1.386750490563073,1.0954451150103317,0.0,0.18257418583505555
1005,7.0,14.0,4.0,3.8,-0.2773500981126146,0.18257418583505475,0.5656854249492388,1.095445115010332
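CSV text like the sample above can be produced with pandas' DataFrame.to_csv. The following is a minimal sketch, assuming a small stand-in frame with two rows and three columns copied from the cleaned data; the file name in the comment is hypothetical.

```python
import io
import pandas as pd

# Two rows copied from the cleaned data above (subset of columns for brevity).
df_cleaned = pd.DataFrame({
    "RESPID": [1001, 1002],
    "COGNITIVE1": [5.0, 8.0],
    "COGNITIVE2": [12.0, 14.0],
})

# index=False drops the 0, 1, 2, ... row index so each line starts with RESPID.
# With no path given, to_csv returns the CSV text as a string.
csv_text = df_cleaned.to_csv(index=False)
print(csv_text)

# To write to disk instead, pass a path (hypothetical file name):
# df_cleaned.to_csv("cfps_cleaned.csv", index=False)

# Round-trip check: reading the text back reproduces the same table.
df_back = pd.read_csv(io.StringIO(csv_text))
print(df_back.equals(df_cleaned))
```

Leaving out index=False would add an unnamed first column holding the row index, which then has to be dropped again when the file is reloaded.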
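The detailed hands-on results for "Module 2: Single-Database Cleaning with AI + Python" cover the following:
Extracting target variables: cognitive ability (COGNITIVE1, COGNITIVE2) and non-cognitive ability (NONCOGNITIVE1, NONCOGNITIVE2) were selected as the analysis variables;
Missing values and outliers: -97, -98, and -99 were recoded as missing, and all missing values (including the NaN entries already present) were filled with the column median;
Standardization: the variables were standardized as z-scores, putting them on a common scale for later regression or cluster analysis;
Presenting and saving results: the data are cleaned and standardized, with a tidy structure ready for further analysis.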
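Median imputation silently changes which rows were observed and which were filled in, so it can help to audit the special codes first and keep a flag for the imputed rows. The sketch below is one way to do that under stated assumptions: the code list is the one used in this example (the actual codes should be checked against the codebook for the wave in use), and MISSING_CODES, value_cols and the _was_missing suffix are illustrative names, not part of the original script.

```python
import numpy as np
import pandas as pd

MISSING_CODES = [-97, -98, -99]  # codes treated as missing in this example

df = pd.DataFrame({
    "RESPID": [1001, 1002, 1003, 1004, 1005],
    "COGNITIVE1": [5, 8, -97, 10, 7],
    "COGNITIVE2": [12, -99, 14, 15, np.nan],
})
value_cols = ["COGNITIVE1", "COGNITIVE2"]

# Count how often each special code occurs per column before recoding.
code_counts = pd.DataFrame(
    {code: df[value_cols].eq(code).sum() for code in MISSING_CODES}
)
print(code_counts)

# Recode to NaN, flag the affected rows, then fill with the column median.
for col in value_cols:
    recoded = df[col].replace(MISSING_CODES, np.nan)
    df[col + "_was_missing"] = recoded.isna()
    df[col] = recoded.fillna(recoded.median())

print(df)
```

The flag columns make it possible to rerun a later model with and without the imputed rows and see whether the conclusions depend on the fill-in values.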