
Machine Learning | K-Nearest Neighbors Regression in Practice | Python Application

2022-08-06 07:53:33 Olivia_Pu

A Python application of the K-nearest neighbors (KNN) regression model

Thanks for being here. The code may look like a lot at first glance, haha.
This post is short and mostly code, with no explanation of the principles or of the code itself; it is only meant to give readers an idea and an approach. I hope you find it useful!
If you are interested in the principles or code behind KNN regression, please like or subscribe, and I will respond to your requests!

If you cite my code, please attach the source, thank you!
Please attach the source when reprinting, thank you!
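
For reference, the snippets below assume the following imports, reconstructed from the calls the functions make (the original post does not list them):

import numpy as np
import pandas as pd
from pandas import read_csv
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import (r2_score, mean_absolute_error, mean_squared_error,
                             mean_absolute_percentage_error)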

1. Load the CSV data

def load_orindata():
    # Load the dataset from a local CSV file
    dataset = read_csv('paperuse.csv', sep=',')
    return dataset

2. Isolation Forest outlier detection

def Cheak_VF(data):
    # Fit an Isolation Forest and flag each sample (1 = inlier, -1 = outlier)
    clf = IsolationForest()
    pres = clf.fit_predict(data)
    return pres
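
A minimal usage sketch (my addition, assuming the features are already numeric at this point) that keeps only the rows the forest labels as inliers:

dataset = load_orindata()
pres = Cheak_VF(dataset)
dataset = dataset[pres == 1]  # fit_predict returns 1 for inliers, -1 for outliers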

3. Inspect the discrete (categorical) values

def obser_nominal_vars(nominal_vars, testdata):
    # Print the value counts of each nominal (categorical) variable
    for each in nominal_vars:
        print(each, ':')
        print(testdata[each].agg(['value_counts']).T)
        print('=' * 35)

4. Label encoding for categorical variables

A general-purpose way of handling discrete values.

def lable_trans(testdata, lable_nominal_vars):
    # Encode each categorical column as integer labels
    label_encoder = LabelEncoder()
    for col in lable_nominal_vars:
        testdata[col] = label_encoder.fit_transform(testdata[col])
    return testdata

5. One-hot encoding

One-hot encoding performs quite well for discrete values, but I recommend applying it only to columns with no more than 10 categories (see the sketch after the function below for one way to pick such columns).

def one_hot_trans(object_cols_onehot, testdata):
    # One-hot encode the selected columns, then rejoin them with the remaining columns
    OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
    OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(testdata[object_cols_onehot]))
    OH_cols_train.columns = OH_encoder.get_feature_names_out(input_features=object_cols_onehot)
    OH_cols_train.index = testdata.index  # keep row alignment for the concat below
    num_X_train = testdata.drop(object_cols_onehot, axis=1)
    OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
    return OH_X_train
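
A small sketch (my addition) of one way to apply the 10-category guideline: one-hot encode the low-cardinality object columns and label encode the rest.

low_card_cols = [col for col in testdata.select_dtypes('object')
                 if testdata[col].nunique() <= 10]
high_card_cols = [col for col in testdata.select_dtypes('object')
                  if testdata[col].nunique() > 10]
testdata = lable_trans(testdata, high_card_cols)
testdata = one_hot_trans(low_card_cols, testdata)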

6. Split the dataset

def split_test_train(data, train_size):
    # Separate the features from the target column 'OR', then split into train and test sets
    features = data.drop('OR', axis=1)
    training_features, testing_features, training_target, testing_target = \
        train_test_split(features.values, data['OR'].values,
                         random_state=42, train_size=train_size)
    return training_features, testing_features, training_target, testing_target, features

7. Build the K-nearest neighbors regression model

name = 'kneighbormodel'
kneighbormodel = KNeighborsRegressor()
print('Start training model: ' + name)
kneighbormodel.fit(training_features, training_target)
train_prey_KNeighborsRegressor = kneighbormodel.predict(training_features)
y_pred_KNeighborsRegressor = kneighbormodel.predict(testing_features)
TestRecomValues(testing_target, y_pred_KNeighborsRegressor, name)
TrainRecomValues(training_target, train_prey_KNeighborsRegressor, name)
all_y, all_prey_KNeighborsRegressor = collection_data(training_target, testing_target, train_prey_KNeighborsRegressor, y_pred_KNeighborsRegressor)

8. Model evaluation function

Note that classification and regression use different evaluation metrics, and the two must never be mixed! This post is about regression prediction. Below are five commonly used regression evaluation metrics.

def TestRecomValues(testing_target, results, model_name):
    # Compute five common regression metrics on the test set
    r2_test = r2_score(testing_target, results)
    MAE_test = mean_absolute_error(testing_target, results)
    MSE_test = mean_squared_error(testing_target, results)
    MAPE_test = mean_absolute_percentage_error(testing_target, results)
    RMSE_test = np.sqrt(mean_squared_error(testing_target, results))
    print(f'model_name : {model_name}\nr2_test = {r2_test}\nMAE_test = {MAE_test}\n'
          f'MSE_test = {MSE_test}\nMAPE_test = {MAPE_test}\nRMSE_test = {RMSE_test}\n')

The metrics print to the terminal; if you need a nicer-looking table, you can export them to a document or to HTML (a small sketch follows the sample output below).

Start training model: kneighbormodel
model_name : kneighbormodel
r2_test = 0.9222065243506832
MAE_test = 1.3888414911329112
MSE_test = 6.040082026483679
MAPE_test = 0.10867874402440293
RMSE_test = 2.4576578334836765
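
A minimal sketch (my addition) of rendering the metrics as an HTML table with pandas, using the rounded values above:

metrics = pd.DataFrame({'r2': [0.9222], 'MAE': [1.3888], 'MSE': [6.0401],
                        'MAPE': [0.1087], 'RMSE': [2.4577]},
                       index=['kneighbormodel'])
html_table = metrics.to_html()  # paste into a report or serve from a web page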

9. Merge the training set and the test set

The merge is not strictly necessary here; it prepares the data for the whole-dataset plots below.

def collection_data(training_target, testing_target, train_prey, y_pred):
    # Stack train and test targets (and their predictions) into single arrays for plotting
    training_target = pd.DataFrame(training_target)
    testing_target = pd.DataFrame(testing_target)
    train_prey = pd.DataFrame(train_prey)
    y_pred = pd.DataFrame(y_pred)
    all_y = np.vstack([training_target, testing_target])
    all_prey = np.vstack([train_prey, y_pred])
    return all_y, all_prey

10. Plotting

DrawPredRealValueLineChart(testing_target, y_pred_KNeighborsRegressor, (6, 6), 'KNeighborsRegressor')
Scatter_Line(all_y, all_prey_KNeighborsRegressor, (12, 6), 'KNeighborsRegressor')
ScatterPreReal(y_pred_KNeighborsRegressor, testing_target, train_prey_KNeighborsRegressor, training_target, (6, 6), 'KNeighborsRegressor')

def DrawPredRealValueLineChart(all_y, all_prey, figsize, modelname):
    plt.figure(figsize=figsize)
    q = range(0, len(all_y))
    plt.plot(q, all_y, color='orange', label='Dataset actual values')
    plt.plot(q, all_prey, color='darkviolet', label='Dataset predictions')
    plt.xlabel("Detection point")    # x-axis label
    plt.ylabel("Recovery rate (%)")  # y-axis label
    plt.title("{} line chart of predicted vs. actual values for the test set".format(modelname))
    plt.legend()
    plt.savefig(r'C:\pu\fig\{}DrawPredRealValueLineChart.jpg'.format(modelname), bbox_inches='tight', dpi=800)
    plt.show()

def Scatter_Line(all_y, all_prey, figsize, modelname):
    plt.figure(figsize=figsize)
    q = range(0, len(all_y))
    plt.scatter(q, all_y, color='lightseagreen', label='Train and test set actual values')
    plt.plot(q, all_prey, color='deeppink', label='Train and test set predictions')
    plt.xlabel("Detection point")    # x-axis label
    plt.ylabel("Recovery rate (%)")  # y-axis label
    plt.title("{} train and test set predicted vs. actual values".format(modelname))
    plt.legend()
    plt.savefig(r'C:\pu\fig\{}Scatter_Line.jpg'.format(modelname), bbox_inches='tight', dpi=800)
    plt.show()

def ScatterPreReal(results, testing_target, training_pre, training_target, figsize, modelname):
    # Compare predicted values against actual values
    plt.figure(figsize=figsize)
    plt.scatter(results, testing_target, marker='o', s=10, c='deeppink', label='Test set predictions')
    plt.scatter(training_pre, training_target, marker='o', s=10, c='lightseagreen', label='Training set fitted values')
    plt.title('{} scatter plot of training and test set predictions'.format(modelname))
    plt.xlabel('Predicted value')
    plt.ylabel('Actual value')
    plt.legend()
    plt.savefig(r'C:\pu\fig\{}ScatterPreReal.jpg'.format(modelname), bbox_inches='tight', dpi=800)
    plt.show()

[Figures: the line chart, scatter-line plot, and predicted-vs-actual scatter produced by the three functions above]

11. Parameter tuning 1: n_neighbors and weights

def test_KNeighborsRegressor_k_w(*data):
    '''Test the effect of the n_neighbors and weights parameters of KNeighborsRegressor'''
    X_train, X_test, y_train, y_test = data
    Ks = np.linspace(1, y_train.size, num=100, endpoint=False, dtype='int')
    weights = ['uniform', 'distance']
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    # For each weighting scheme, plot how the score varies with n_neighbors
    for weight in weights:
        training_scores = []
        testing_scores = []
        for K in Ks:
            regr = KNeighborsRegressor(weights=weight, n_neighbors=K)
            regr.fit(X_train, y_train)
            testing_scores.append(regr.score(X_test, y_test))
            training_scores.append(regr.score(X_train, y_train))
        ax.plot(Ks, testing_scores, label="testing score:weight=%s" % weight)
        ax.plot(Ks, training_scores, label="training score:weight=%s" % weight)
    ax.legend(loc='best')
    ax.set_xlabel("K")
    ax.set_ylabel("score")
    ax.set_ylim(0, 1.05)
    ax.set_title("KNeighborsRegressor")
    plt.show()
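
A usage sketch (my addition), passing in the arrays returned by split_test_train in step 6; the two tuning functions below take the same arguments:

test_KNeighborsRegressor_k_w(training_features, testing_features, training_target, testing_target)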

[Figure: training and testing score vs. K for each weighting scheme]

12. Parameter tuning 2: n_neighbors and p

def test_KNeighborsRegressor_k_p(*data):
    '''Test the effect of the n_neighbors and p parameters of KNeighborsRegressor'''
    X_train, X_test, y_train, y_test = data
    Ks = np.linspace(1, y_train.size, endpoint=False, dtype='int')
    Ps = [1, 2, 10]
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    # For each value of p, plot how the score varies with n_neighbors
    for P in Ps:
        training_scores = []
        testing_scores = []
        for K in Ks:
            regr = KNeighborsRegressor(p=P, n_neighbors=K)
            regr.fit(X_train, y_train)
            testing_scores.append(regr.score(X_test, y_test))
            training_scores.append(regr.score(X_train, y_train))
        ax.plot(Ks, testing_scores, label="testing score:p=%d" % P)
        ax.plot(Ks, training_scores, label="training score:p=%d" % P)
    ax.legend(loc='best')
    ax.set_xlabel("K")
    ax.set_ylabel("score")
    ax.set_ylim(0, 1.05)
    ax.set_title("KNeighborsRegressor")
    plt.show()

[Figure: training and testing score vs. K for each value of p]

13. Parameter tuning 3: best n_neighbors

def best_n_neighbor(*data):
    X_train, X_test, y_train, y_test = data
    result = {}
    for i in range(100):  # n_neighbors is usually chosen below the square root of the total sample count
        knn = KNeighborsRegressor(n_neighbors=i + 1)
        knn.fit(X_train, y_train)
        prediction = knn.predict(X_test)
        score = r2_score(y_test, prediction)
        result[i + 1] = score * 100
    for i in result.keys():
        if result[i] == max(result.values()):
            print("Best number of neighbors: " + str(i))
    print("Model score: " + str(max(result.values())))

14. Parameter tuning 4: library optimizers

Machine-learning libraries ship their own hyperparameter optimizers; leave a comment if you are interested and I will update the post. Note that on large datasets grid search demands serious compute, such as GPU support. For medium-to-large data or an ordinary machine I do not recommend grid search, because training can take a very, very long time while the performance gain from the choice of hyperparameters is relatively small. See the sketch after this list.

  1. Grid search: GridSearchCV
  2. Random search: RandomizedSearchCV
  3. hyperopt
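
A minimal GridSearchCV sketch (my addition, not the author's code), tuning the same parameters explored in the sections above:

from sklearn.model_selection import GridSearchCV

param_grid = {'n_neighbors': range(1, 31),
              'weights': ['uniform', 'distance'],
              'p': [1, 2]}
search = GridSearchCV(KNeighborsRegressor(), param_grid, cv=5, scoring='r2')
search.fit(training_features, training_target)
print(search.best_params_, search.best_score_)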

15. Rebuild the model with the optimized parameters

Similar to the above; just substitute the optimized parameters into the model, as in the sketch below.
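
A minimal sketch (my addition; best_k and the weights value are hypothetical placeholders for whatever the tuning step above reported):

kneighbormodel = KNeighborsRegressor(n_neighbors=best_k, weights='distance')  # best_k: hypothetical tuned value
kneighbormodel.fit(training_features, training_target)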

16. Test error and its distribution

If you are interested, leave a comment and I will update the code. As an example, the figure below shows the error of an MLP (multilayer perceptron) model, not of the KNN regression model.
Sea-green dots mark training set errors and deep-pink dots mark test set errors.
The black line is the fitted normal distribution, the bars are the error histogram, and the red dashed line is the kernel density curve.
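
Since the author's plotting code is not included, here is a minimal sketch (my addition) of one way to draw such an error histogram with a fitted normal curve and a kernel density estimate:

from scipy import stats

errors = testing_target - y_pred_KNeighborsRegressor  # test set residuals
plt.hist(errors, bins=20, density=True, alpha=0.6)
x = np.linspace(errors.min(), errors.max(), 200)
plt.plot(x, stats.norm.pdf(x, errors.mean(), errors.std()), 'k-', label='normal fit')
plt.plot(x, stats.gaussian_kde(errors)(x), 'r--', label='kernel density')
plt.legend()
plt.show()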

[Figure: MLP error distribution with normal fit and kernel density curve]

About the author

A cute little one: undergraduate degree in software engineering, graduate student in xxxx, from a double non-key university.
I hope you learn something along the way. Forever young, forever with tears in your eyes!

Copyright notice
Author: Olivia_Pu. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/218/202208060745057287.html
