Wednesday, 18 March 2020

Python: Function is altering original input variable [duplicate]

The goal is for androidApps to hold the original dataset, androidApps1 the dataset with rows containing missing data removed, and androidApps2 the dataset with duplicates removed. The problem: when I run removeDuplicates() and save the result to androidApps2, the dataset stored in androidApps1 is also updated to match androidApps2, and I can't figure out why.

Here's my code:

Here I open and read my dataset file:

from csv import reader

openFile = open(r'C:\Users\Jason Minhas\Profitable App Profiles for the App Store and Google Play Markets\rawData\googleplaystore.csv', encoding="utf8")
readFile = reader(openFile)
androidApps = list(readFile)
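
As a side note, the file opened above is never closed. A minimal sketch of the same read using a with block, which closes the file automatically; load_csv is a hypothetical helper name that does not appear in the code above, and the path is whatever CSV file you want to read:

```python
from csv import reader


def load_csv(path, encoding="utf8"):
    # Hypothetical helper: read every row of a CSV file into a list of lists.
    # The with block closes the file even if reading raises an error.
    with open(path, encoding=encoding, newline="") as csv_file:
        return list(reader(csv_file))
```

It would be called as androidApps = load_csv(path), with path set to the location of googleplaystore.csv.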

This function checks whether a row contains a blank field and returns True or False. It is used by the next function.

def hasBlank(row):
    # Return True as soon as an empty field is found
    for value in row:
        if value == '':
            return True
    return False

This function finds apps with missing data points and returns a dataset that contains only the rows with every column filled in.

def removeRowsWithMissingData(dataset, hasHeader=True):
    if hasHeader:
        start = 1
    else:
        start = 0

    # Keep the header row (if any) so later steps can still skip it;
    # slicing creates a new list, so dataset itself is not modified
    cleanDataset = dataset[:start]

    for row in dataset[start:]:
        if not hasBlank(row):
            cleanDataset.append(row)

    return cleanDataset

androidApps1 = removeRowsWithMissingData(androidApps)

This function counts duplicates and prints the number of unique, duplicate, and total entries.

def dupeCount(dataset, index):
    UNQItems = []
    dupeItems = []
    for item in dataset:
        if item[index] in UNQItems:
            dupeItems.append(item[index])
        else:
            UNQItems.append(item[index])     
    print('Unique Apps = ' + str(len(UNQItems)))
    print('Duplicate Apps = ' + str(len(dupeItems)))
    print('Total Apps = ' + str(len(dupeItems)+len(UNQItems)))

dupeCount(androidApps1,0)
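
If the counts are needed later rather than only printed, one possible variant returns them instead. This is a sketch only: countDuplicates is a hypothetical name, the sample rows are made up, and a set replaces the list of seen names so that each membership check stays fast on a large dataset.

```python
def countDuplicates(dataset, index):
    # Hypothetical variant of dupeCount that returns the counts instead of printing
    seen = set()
    duplicates = 0
    for item in dataset:
        if item[index] in seen:
            duplicates += 1
        else:
            seen.add(item[index])
    return len(seen), duplicates


# Made-up sample rows: one app name appears twice
rows = [['App A', '10'], ['App B', '5'], ['App A', '20']]
uniqueCount, duplicateCount = countDuplicates(rows, 0)
print('Unique Apps = ' + str(uniqueCount))
print('Duplicate Apps = ' + str(duplicateCount))
print('Total Apps = ' + str(uniqueCount + duplicateCount))
```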

This function removes duplicates, keeping for each app the row with the highest review count, and returns a clean dataset.

def removeDuplicates(dataset, nameIndex, reviewCountIndex, hasHeader=True):
    if hasHeader:
        start = 1
    else:
        start = 0

    # create temp dataset so I don't alter the original
    tempDataset = dataset

    tempDataset[start:] = sorted(tempDataset[start:], key=lambda l: int(l[reviewCountIndex]), reverse=True)

    # create UNQ and dupe list
    UNQApp = []
    cleanDataset = []

    #Iterate through apps and keep only first app which will have highest review count since it's sorted.
    for row in tempDataset[start:]:
        appName = row[nameIndex]
        if appName not in UNQApp:
            UNQApp.append(appName)
            cleanDataset.append(row)

    tempDataset[start:] = cleanDataset
    return tempDataset

androidApps2 = removeDuplicates(androidApps1,0,3,hasHeader=True)

dupeCount(androidApps2,0)
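
For context on the behaviour described at the top: in Python, tempDataset = dataset does not create a new list. Both names refer to the same list object, so the slice assignments inside removeDuplicates() also change androidApps1. A minimal sketch of a non-mutating version, assuming the same row layout as above; removeDuplicatesCopy is a hypothetical name and the sample rows are made up for illustration:

```python
def removeDuplicatesCopy(dataset, nameIndex, reviewCountIndex, hasHeader=True):
    start = 1 if hasHeader else 0

    # sorted() returns a new list, so the caller's list is never modified
    rows = sorted(dataset[start:], key=lambda row: int(row[reviewCountIndex]), reverse=True)

    seenNames = set()
    cleanDataset = dataset[:start]  # slicing copies the header (if any) into a new list
    for row in rows:
        if row[nameIndex] not in seenNames:
            seenNames.add(row[nameIndex])
            cleanDataset.append(row)
    return cleanDataset


# Made-up sample data: header plus three rows, one duplicated app name
sample = [['App', 'Reviews'], ['A', '10'], ['B', '5'], ['A', '20']]
alias = sample           # same list object, not a copy
print(alias is sample)   # True

deduped = removeDuplicatesCopy(sample, 0, 1)
print(deduped)           # [['App', 'Reviews'], ['A', '20'], ['B', '5']]
print(len(sample))       # 4, the input list is unchanged
```

Note that the copy is shallow: the outer list is new, but the row lists inside it are shared with the input. That is enough here because the rows themselves are never edited.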
