Thursday, 13 August 2020

Is there a more efficient way to write this Python if-elif over product(range()) threshold combinations so that it runs faster?

I am calculating accuracy scores by generating all possible combinations of thresholds, using each combination to return the matched name, and then scoring every combination to see which one gives the highest accuracy. To do this, I created the combinations with product(range()) and applied them with an if-elif chain, but it takes a very long time (over an hour so far on 1300 rows, and it has not finished). Is there a better way?

My current code:

from itertools import product

def f(x, ngram_thresh, cosine_thresh, fuzz_thresh, fuzz_rat_thresh, fuzz_prat_thresh, jaro_thresh, jw_thresh, jaccard_thresh, lev_thresh):
    if x['Score_ngrams'] >= ngram_thresh : return x['Name_ngram']
    elif x['Score_cosine_words'] >= cosine_thresh : return x['Name_cosine_words']
    elif x['Score_fuzz'] >= fuzz_thresh : return x['Name_fuzz']
    elif x['Score_fuzz_ratio'] >= fuzz_rat_thresh : return x['Name_fuzz_ratio']
    elif x['Score_fuzz_pratio'] >= fuzz_prat_thresh : return x['Name_fuzz_pratio']
    elif x['Score_jaro'] >= jaro_thresh : return x['Name_jaro']
    elif x['Score_jw'] >= jw_thresh : return x['Name_jw']
    elif x['Score_jaccard'] >= jaccard_thresh : return x['Name_jaccard']
    elif x['Score_lev_r'] >= lev_thresh : return x['Name_lev_r']
    else: return 0

thresholds = range(50, 110, 5)
for ngram_t, cosine_t, fuzz_t, fuzz_rat_t, fuzz_prat_t, jaro_t, jw_t, jaccard_t, lev_t in product(thresholds, repeat=9):
    col = f'Name_Clean_{ngram_t}_{cosine_t}_{fuzz_t}_{fuzz_rat_t}_{fuzz_prat_t}_{jaro_t}_{jw_t}_{jaccard_t}_{lev_t}'
    df_fourth[col] = df_fourth.apply(
        f, axis=1,
        ngram_thresh=ngram_t, cosine_thresh=cosine_t, fuzz_thresh=fuzz_t,
        fuzz_rat_thresh=fuzz_rat_t, fuzz_prat_thresh=fuzz_prat_t,
        jaro_thresh=jaro_t, jw_thresh=jw_t, jaccard_thresh=jaccard_t,
        lev_thresh=lev_t,
    )
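For comparison, here is a minimal sketch of the same first-match rule done on whole NumPy arrays instead of a row-wise apply. It assumes the score and name columns are exactly the ones used in f above; PAIRS, first_match and demo are hypothetical names introduced only for this illustration. It also counts the grid: with 12 values per threshold and 9 thresholds, the loop visits 12**9 combinations, so a faster per-combination step probably helps only if the grid is reduced as well.

```python
import numpy as np
import pandas as pd

# (score column, name column) in the same priority order as the if/elif chain.
PAIRS = [
    ("Score_ngrams", "Name_ngram"),
    ("Score_cosine_words", "Name_cosine_words"),
    ("Score_fuzz", "Name_fuzz"),
    ("Score_fuzz_ratio", "Name_fuzz_ratio"),
    ("Score_fuzz_pratio", "Name_fuzz_pratio"),
    ("Score_jaro", "Name_jaro"),
    ("Score_jw", "Name_jw"),
    ("Score_jaccard", "Name_jaccard"),
    ("Score_lev_r", "Name_lev_r"),
]

def first_match(df, thresholds):
    """Return, per row, the name of the first score that meets its threshold, else 0."""
    scores = df[[s for s, _ in PAIRS]].to_numpy(dtype=float)   # shape (n_rows, 9)
    names = df[[n for _, n in PAIRS]].to_numpy(dtype=object)   # shape (n_rows, 9)
    passed = scores >= np.asarray(thresholds, dtype=float)     # broadcast over rows
    first = passed.argmax(axis=1)                              # index of first True per row
    picked = names[np.arange(len(df)), first].copy()
    picked[~passed.any(axis=1)] = 0                            # no threshold met: 0, as in f
    return pd.Series(picked, index=df.index)

# Size of the full grid in the loop above: 12 values per threshold, 9 thresholds.
n_combos = len(range(50, 110, 5)) ** 9

# Tiny made-up frame, only to exercise the function.
demo = pd.DataFrame({
    **{score: [90, 40, 10] for score, _ in PAIRS},
    **{name: [f"{name}_a", f"{name}_b", f"{name}_c"] for _, name in PAIRS},
})
demo.loc[1, "Score_jaro"] = 75
result = first_match(demo, [80, 80, 80, 80, 80, 70, 80, 80, 80])
```

In the loop, the df_fourth.apply(f, ...) call would become first_match(df_fourth, (ngram_t, cosine_t, ...)). Extracting the two arrays once outside the loop, rather than on every call, should save further time.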
