I am applying a machine learning algorithm to find the feature contributions for a specific sample.
On every iteration, I want to delete one column from the training set: the column with the highest contribution. I then re-run the fit without that column to find the next-highest contributing feature, delete that column too, and repeat until only 2 features are left. So far, I have only figured out how to do this manually. I show just the first 2 iterations to keep this short.
First, the following code runs and prints out the feature that has the top contribution.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from treeinterpreter import treeinterpreter as ti

##################### Creating the dataset #######################
x_1, y_1 = make_classification(n_samples=1000,
                               n_features=10,
                               n_informative=3,
                               n_classes=2,
                               random_state=0,
                               shuffle=False)

# Creating a DataFrame, one column per generated feature
df = pd.DataFrame({'Feature 1': x_1[:, 0],
                   'Feature 2': x_1[:, 1],
                   'Feature 3': x_1[:, 2],
                   'Feature 4': x_1[:, 3],
                   'Feature 5': x_1[:, 4],
                   'Feature 6': x_1[:, 5],
                   'Feature 7': x_1[:, 6],
                   'Feature 8': x_1[:, 7],
                   'Feature 9': x_1[:, 8],
                   'Feature 10': x_1[:, 9],
                   'Class': y_1})
##################### End of creating the dataset ################

y = df['Class']
X = df.drop('Class', axis=1)

rf = RandomForestClassifier(random_state=0)
rf.fit(X, y)

# Explain the prediction for a single sample (row 50)
instances = X.iloc[50].values.reshape(1, -1)
prediction, biases, contributions = ti.predict(rf, instances)
for i in range(len(instances)):
    maxList = 0
    maxFeature = ''
    # Sort features by their largest absolute contribution, highest first
    for c, feature in sorted(zip(contributions[i], X.columns),
                             key=lambda x: -abs(x[0]).max()):
        # c holds one contribution per class; keep the feature with the largest one
        if c.max() > maxList:
            maxList = c.max()
            maxFeature = feature
        print(feature, c)
    print("-" * 20)
    print('The highest value was found in')
    print(maxFeature)
The output of the above code names the feature with the top contribution.
I then take that column name, which is saved in maxFeature, and remove it in the next iteration, like so:
# Drop the target and the previous top feature, then refit
X_new = df.drop(['Class', maxFeature], axis=1)
X_new.shape
rf.fit(X_new, y)

instances_new = X_new.iloc[50].values.reshape(1, -1)
prediction, biases, contributions = ti.predict(rf, instances_new)

for i in range(len(instances_new)):
    maxList_new = 0
    maxFeature_new = ''
    print("Feature contributions:")
    print("-" * 20)
    # Sort features by their largest absolute contribution, highest first
    for c, feature in sorted(zip(contributions[i], X_new.columns),
                             key=lambda x: -abs(x[0]).max()):
        if c.max() > maxList_new:
            maxList_new = c.max()
            maxFeature_new = feature
        print(feature, c)
    print("-" * 20)
    print('The highest value was found in feature')
    print(maxFeature_new)
In the output of the above code, we can see that Feature 1 has been dropped and the calculation redone, giving Feature 4 as the new top feature. The next step continues the same way: drop both Feature 1 and Feature 4, refit, and do the calculation again.
This can be done manually by copying the same code over and over with adjusted variable names, but how can I build a loop that automates it?
Any help is appreciated.
To summarize the intended behavior: the code should keep running as long as there are more than 2 columns.
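
One possible way to automate the manual steps above is a while loop that refits, picks the top contributor for the chosen row, and drops that column until only 2 remain. This is only a minimal sketch, not a definitive solution. The names eliminate_features, top_feature and rf_explain are hypothetical helpers introduced for illustration. The sketch assumes that contributions for one sample have shape (n_features, n_classes), and it selects the feature by the same c.max() criterion used in the code above.

```python
import numpy as np
import pandas as pd


def top_feature(contributions, columns):
    """Return the column whose per-class contributions contain the largest value."""
    # Assumed shape for one sample: (n_features, n_classes)
    scores = np.asarray(contributions).reshape(len(columns), -1).max(axis=1)
    return columns[int(np.argmax(scores))]


def eliminate_features(X, y, row, explain, min_features=2):
    """Refit, find the top contributor for one row, drop it, and repeat.

    `explain(X_train, y_train, instance)` must return the contributions
    for the single-row DataFrame `instance`.
    """
    X_cur = X.copy()
    dropped = []
    # Keep going as long as more than `min_features` columns remain
    while X_cur.shape[1] > min_features:
        contributions = explain(X_cur, y, X_cur.iloc[[row]])
        feature = top_feature(contributions, list(X_cur.columns))
        print('The highest value was found in', feature)
        dropped.append(feature)
        X_cur = X_cur.drop(columns=feature)
    return dropped, list(X_cur.columns)


def rf_explain(X_train, y_train, instance):
    """Fit a fresh random forest and return treeinterpreter contributions."""
    # Imported here so the loop itself does not depend on these packages
    from sklearn.ensemble import RandomForestClassifier
    from treeinterpreter import treeinterpreter as ti

    rf = RandomForestClassifier(random_state=0)
    rf.fit(X_train, y_train)
    _, _, contributions = ti.predict(rf, instance.values)
    return contributions[0]  # contributions for the single sample
```

With the X and y defined earlier, the call would be dropped, kept = eliminate_features(X, y, 50, rf_explain). Here dropped lists the removed features in order, and kept holds the 2 remaining column names. Passing the explainer in as a function keeps the loop independent of the model, so a different classifier can be swapped in without touching the loop.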


