mardi 23 février 2021

Pyspark If else and loop coding questions

I never write loops or any for loops before, it is my first attempt doing it and I am stuck now.

This is my original table:

df1

ID   supply_date     supply_days_cnt   food  
1      2020/01/01       28             cheese
1      2020/02/01       28             cheese
1      2020/03/01       28              meat
1      2020/03/04       28             cheese
1      2020/04/01       28              meat
1      2020/09/01       28             cheese

and I managed to make it as below:

df2

    supply_end_Date is supply_date + supply_days_cnt
    next_supply_date is the supply_date from next row for the same food category 
    gap_until_next_supply is next_supply_date - supply_end_date, it can be negative 

ID       supply_date   supply_end_date    food   next_supply_date  gap_until_next_supply       
1         2020/01/01    2020/01/29       cheese   2020/02/01           2
1         2020/02/01    2020/02/29       cheese   2020/03/04           3
1         2020/03/01    2020/3/29        meat     2020/04/01           1
1         2020/03/04    2020/04/01       cheese   2020/09/01           150
1         2020/04/01    2020/04/29       meat    null                null
1         2020/09/01    2020/09/29       cheese   null                null

Here is the logic in word:

1.  If interval between two food shippings is great than or equal to 90 days,
    these two food shipping belong to two different lines of shipment 

2.  For intervals < 90 days between current shipping and previous shipping:
a.  If the food shipping are same, they belong to the same line
b.  If the food shipping are totally different, look back 90 days: 
        If the same current food shipping found within past 90 days, then same line
             Else look forward 90 days:
                       If the same current food shipping found within next 90 days, then same line
                                   Else different lines

so the desire output for this should be:

ID supply_start_date  supply_end_date  food
1   2020/01/01         2020/04/29      cheese,meat
1   2020/09/01         2020/09/29      cheese  
  

Here is my code, not finish yet, but I already get errors.

if food1 = food2
    then return agg(array_sort_udf(collect_list('food'))
  else if food1 != food2,
      if  food1.min(supply_date) between food2.min(supply_date) and food2.max(supply_date)
          then return agg(array_sort_udf(collect_list('food'))
        else if food1.max(supply_date) between food2.min(supply_date) and food2.max(supply_date)
          then return agg(array_sort_udf(collect_list('food'))
      if  food2.min(supply_date) between food1.min(supply_date) and food1.max(supply_date)
          then return agg(array_sort_udf(collect_list('food'))
        else if food2.max(supply_date) between food1.min(supply_date) and food1.max(supply_date)
          then return agg(array_sort_udf(collect_list('food'))

so if there are 6 groups food category, the above code will need to repeat 6 times, it is not very effienficy.

both df1 and df2 are dataframe( pyspark dataframe, should I change them to dictionary or something to use for loop and if else?

is there a way to just use pyspark code to get the desire output? not using loop?

first time writing loop, any help will be very appreciate. I really want to learn it.

Aucun commentaire:

Enregistrer un commentaire