I'm working in PySpark with two datasets and am stuck on how to code something. I suspect the solution involves conditional logic, some combination of joins, and/or groupBy() functions.
The two dataframes contain information essentially like the example below. In practice the dfs are huge, with many instances of the pattern shown.
df1:
| Name1 | Name2 | Key |
|---|---|---|
| A | Z | 1 |
| A | Y | 1 |
| B | X | 1 |
| B | W | 1 |
| C | V | 1 |
df2:
| Name1 | Name2 | Key |
|---|---|---|
| A | Z | 2 |
| B | U | 2 |
In df1, a given 'Name1' value can match multiple 'Name2' values.
In df2, 'Name1' is always unique (i.e. exactly one row per 'Name1').
What I would like: if a 'Name1' matches multiple 'Name2' values in df1, and one of those (Name1, Name2) pairs also appears in df2 (the A-Z row in the example), then drop all other df1 rows with that 'Name1' (i.e. drop the A-Y row), keeping everything else (so A-Z, B-X, B-W, and C-V remain).
Any help would be much appreciated!