Wednesday, July 28, 2021

How to code a Python/PySpark if statement (medium/hard complexity)?

I'm working in PySpark with two datasets and am stuck on how to code something. I suspect the solution involves conditional logic, joins, and/or groupBy() functions.

The two DataFrames contain information essentially like the tables below. In practice the DataFrames are huge, with many instances of the pattern shown in the example.

df1:

    Name1  Name2  Key
    A      Z      1
    A      Y      1
    B      X      1
    B      W      1
    C      V      1

df2:

    Name1  Name2  Key
    A      Z      2
    B      U      2

In df1, a given Name1 can match multiple Name2 values.

In df2, Name1 is always unique (i.e. there is exactly one row per Name1).
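For concreteness, here is roughly how the example frames above could be built (a minimal sketch; the column names and values are taken from the tables above):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Toy versions of the two frames described above.
    df1 = spark.createDataFrame(
        [("A", "Z", 1), ("A", "Y", 1), ("B", "X", 1), ("B", "W", 1), ("C", "V", 1)],
        ["Name1", "Name2", "Key"],
    )
    df2 = spark.createDataFrame(
        [("A", "Z", 2), ("B", "U", 2)],
        ["Name1", "Name2", "Key"],
    )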

I would like the following: if a Name1 matches multiple Name2 values in df1, and one of those Name1/Name2 pairs also appears in df2 (the A-Z row in the example), then drop all other rows in df1 with that Name1 (i.e. drop the A-Y row), so that A-Z, B-X, B-W, and C-V remain.
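For what it's worth, one approach I can imagine (an untested sketch, assuming df1 and df2 are built as above) is to left-join df1 to df2 on both name columns to flag the pairs present in df2, then use a window over Name1 to keep only the flagged rows within groups that contain a match:

    from pyspark.sql import Window, functions as F

    # Flag each df1 row whose (Name1, Name2) pair also appears in df2.
    pairs = df2.select("Name1", "Name2").distinct().withColumn("in_df2", F.lit(1))
    flagged = df1.join(pairs, ["Name1", "Name2"], "left").fillna(0, subset=["in_df2"])

    # Within each Name1 group, check whether ANY row matched df2;
    # if so, keep only the matching rows, otherwise keep the whole group.
    w = Window.partitionBy("Name1")
    result = (
        flagged
        .withColumn("group_has_match", F.max("in_df2").over(w))
        .where((F.col("group_has_match") == 0) | (F.col("in_df2") == 1))
        .drop("in_df2", "group_has_match")
    )
    result.show()

On the toy data this should keep A-Z, B-X, B-W, and C-V and drop only A-Y, since A is the only Name1 whose group contains a pair that also exists in df2. But I'm not sure this is the idiomatic way to do it at scale.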

Any help would be much appreciated!
