Tuesday, 26 January 2021

Why does this nested "when" not work in PySpark?

I'm trying to divide people into age ranges with the following code:

from pyspark import SparkFiles
from pyspark.sql import functions as fn

## Import data

url_users = "https://raw.githubusercontent.com/leanhdung1994/BigData/main/users.csv"
spark.sparkContext.addFile(url_users)
users_from_file = spark.read.csv("file://" + SparkFiles.get("users.csv"), header = True, sep = ",", inferSchema = True)

## Generate column age

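from datetime import date
from pyspark.sql.types import IntegerType
reference_date = date(2017, 12, 31)
def cal_age(born):
    # Subtract one year if the birthday has not yet occurred by the reference date
    return reference_date.year - born.year - ((reference_date.month, reference_date.day) < (born.month, born.day))
# Wrap cal_age as a UDF returning an integer (this definition was missing from the listing)
cal_age_udf = fn.udf(cal_age, IntegerType())
users_from_file = users_from_file.withColumn('age', cal_age_udf(fn.to_date(fn.col('birth_date'))))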

## Generate column range

users_from_file1 = users_from_file.withColumn('range', fn.when(fn.col("age") <= 25, 1)fn.when(fn.col("age") <= 35, 2).fn.otherwise(3))

users_from_file1.show()

Then it raises an error:

SyntaxError: invalid syntax
  File "<command-2296735704765764>", line 3
    users_from_file1 = users_from_file.withColumn('range', fn.when(fn.col("age") <= 25, 1)fn.when(fn.col("age") <= 35, 2).fn.otherwise(3))
                                                                                           ^
SyntaxError: invalid syntax

Could you please explain how this nested `when` should be written? I took the syntax from this answer, but it does not work.
