The goal is for the variable androidApps to be the original dataset, androidApps1 to be the dataset with missing data removed, and androidApps2 to be the dataset with duplicates removed. The issue arises when I run the removeDuplicates() function and save its result to androidApps2: the dataset saved in androidApps1 also gets updated to the same dataset as androidApps2, and I can't figure out why.
Here's my code:
Here I open and read my dataset file:
from csv import reader
openFile = open(r'C:\Users\Jason Minhas\Profitable App Profiles for the App Store and Google Play Markets\rawData\googleplaystore.csv', encoding="utf8")
readFile = reader(openFile)
androidApps = list(readFile)
This function checks whether a row has a blank value and returns True or False. It is used in the next function.
def hasBlank(row):
    # Return True as soon as an empty column is found.
    for value in row:
        if value == '':
            return True
    return False
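As a sanity check, here is a minimal sketch of how I expect hasBlank to behave. The two sample rows are made up for illustration and are not taken from googleplaystore.csv, and the any() form is just a compact equivalent of the loop.

```python
def hasBlank(row):
    # True if any column in the row is an empty string
    return any(value == '' for value in row)

# Hypothetical rows, not taken from googleplaystore.csv
completeRow = ['App A', 'GAME', '4.5', '100']
incompleteRow = ['App B', '', '4.0', '50']

print(hasBlank(completeRow))    # False
print(hasBlank(incompleteRow))  # True
```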
This function finds apps with missing datapoints and returns a dataset that only has rows with all the columns filled.
def removeRowsWithMissingData(dataset, hasHeader=True):
    cleanDataset = []
    # Skip the header row when there is one.
    if hasHeader:
        start = 1
    else:
        start = 0
    # Keep only the rows with every column filled.
    for row in dataset[start:]:
        if not hasBlank(row):
            cleanDataset.append(row)
    return cleanDataset
androidApps1 = removeRowsWithMissingData(androidApps)
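For reference, a small sketch of this step on a made-up three-row dataset (the rows and the name sample are hypothetical). As far as I can tell, the returned list is a new list and does not include the header row, while the input list is left as it was.

```python
def hasBlank(row):
    return any(value == '' for value in row)

def removeRowsWithMissingData(dataset, hasHeader=True):
    cleanDataset = []
    start = 1 if hasHeader else 0
    for row in dataset[start:]:
        if not hasBlank(row):
            cleanDataset.append(row)
    return cleanDataset

# Hypothetical data, not taken from googleplaystore.csv
sample = [
    ['App', 'Category', 'Rating', 'Reviews'],  # header
    ['App A', 'GAME', '4.5', '100'],
    ['App B', '', '4.0', '50'],                # blank category
]

cleaned = removeRowsWithMissingData(sample)
print(cleaned)      # [['App A', 'GAME', '4.5', '100']]
print(len(sample))  # 3, the input list is unchanged
```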
This is a function that checks for duplicates and prints the number of duplicates, uniques and total.
def dupeCount(dataset, index):
    UNQItems = []
    dupeItems = []
    for item in dataset:
        if item[index] in UNQItems:
            dupeItems.append(item[index])
        else:
            UNQItems.append(item[index])
    print('Unique Apps = ' + str(len(UNQItems)))
    print('Duplicate Apps = ' + str(len(dupeItems)))
    print('Total Apps = ' + str(len(dupeItems)+len(UNQItems)))
dupeCount(androidApps1,0)
This function removes the duplicates and keeps the rows that have the highest review count. It returns a clean dataset.
def removeDuplicates(dataset, nameIndex, reviewCountIndex, hasHeader=True):
    if hasHeader:
        start = 1
    else:
        start = 0
    # Create a temp dataset so I don't alter the original
    tempDataset = dataset
    tempDataset[start:] = sorted(tempDataset[start:], key=lambda l: int(l[reviewCountIndex]), reverse=True)
    # Create the unique-name list and the clean dataset
    UNQApp = []
    cleanDataset = []
    # Iterate through the apps and keep only the first occurrence of each name,
    # which has the highest review count since the rows are sorted.
    for row in tempDataset[start:]:
        appName = row[nameIndex]
        if appName not in UNQApp:
            UNQApp.append(appName)
            cleanDataset.append(row)
    tempDataset[start:] = cleanDataset
    return tempDataset
androidApps2 = removeDuplicates(androidApps1,0,3,hasHeader=True)
dupeCount(androidApps2,0)
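A smaller reproduction may help isolate the behaviour. This is only a sketch on made-up rows (the names original, alias, source and copied are hypothetical). It compares plain assignment, as in tempDataset = dataset, with making a shallow copy through list() before the slice assignment.

```python
# Hypothetical rows: [app name, review count]
original = [['App A', '10'], ['App A', '30'], ['App B', '20']]

# Plain assignment: both names refer to the same list object
alias = original
alias[:] = sorted(alias, key=lambda row: int(row[1]), reverse=True)
print(alias is original)  # True
print(original[0])        # ['App A', '30'], the "original" was reordered too

# Shallow copy: a new outer list, so slice assignment leaves the source alone
source = [['App A', '10'], ['App A', '30'], ['App B', '20']]
copied = list(source)
copied[:] = sorted(copied, key=lambda row: int(row[1]), reverse=True)
print(copied is source)   # False
print(source[0])          # ['App A', '10'], the source keeps its order
```

In this sketch the first case shows the same symptom as androidApps1 and androidApps2, and the second does not. Note that list() only copies the outer list, so the row lists themselves are still shared between source and copied.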