I never write loops or any for loops before, it is my first attempt doing it and I am stuck now.
This is my original table:
df1
ID supply_date supply_days_cnt food
1 2020/01/01 28 cheese
1 2020/02/01 28 cheese
1 2020/03/01 28 meat
1 2020/03/04 28 cheese
1 2020/04/01 28 meat
1 2020/09/01 28 cheese
and I managed to make it as below:
df2
supply_end_Date is supply_date + supply_days_cnt
next_supply_date is the supply_date from next row for the same food category
gap_until_next_supply is next_supply_date - supply_end_date, it can be negative
ID supply_date supply_end_date food next_supply_date gap_until_next_supply
1 2020/01/01 2020/01/29 cheese 2020/02/01 2
1 2020/02/01 2020/02/29 cheese 2020/03/04 3
1 2020/03/01 2020/3/29 meat 2020/04/01 1
1 2020/03/04 2020/04/01 cheese 2020/09/01 150
1 2020/04/01 2020/04/29 meat null null
1 2020/09/01 2020/09/29 cheese null null
Here is the logic in word:
1. If interval between two food shippings is great than or equal to 90 days,
these two food shipping belong to two different lines of shipment
2. For intervals < 90 days between current shipping and previous shipping:
a. If the food shipping are same, they belong to the same line
b. If the food shipping are totally different, look back 90 days:
If the same current food shipping found within past 90 days, then same line
Else look forward 90 days:
If the same current food shipping found within next 90 days, then same line
Else different lines
so the desire output for this should be:
ID supply_start_date supply_end_date food
1 2020/01/01 2020/04/29 cheese,meat
1 2020/09/01 2020/09/29 cheese
Here is my code, not finish yet, but I already get errors.
if food1 = food2
then return agg(array_sort_udf(collect_list('food'))
else if food1 != food2,
if food1.min(supply_date) between food2.min(supply_date) and food2.max(supply_date)
then return agg(array_sort_udf(collect_list('food'))
else if food1.max(supply_date) between food2.min(supply_date) and food2.max(supply_date)
then return agg(array_sort_udf(collect_list('food'))
if food2.min(supply_date) between food1.min(supply_date) and food1.max(supply_date)
then return agg(array_sort_udf(collect_list('food'))
else if food2.max(supply_date) between food1.min(supply_date) and food1.max(supply_date)
then return agg(array_sort_udf(collect_list('food'))
so if there are 6 groups food category, the above code will need to repeat 6 times, it is not very effienficy.
both df1 and df2 are dataframe( pyspark dataframe, should I change them to dictionary or something to use for loop and if else?
is there a way to just use pyspark code to get the desire output? not using loop?
first time writing loop, any help will be very appreciate. I really want to learn it.
Aucun commentaire:
Enregistrer un commentaire