I have a Python file called test.py. In this file I execute some PySpark commands.
#!/usr/bin/env python
import sys
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext

conf = SparkConf()
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)

# hivedb and table are assumed to be defined elsewhere (e.g. taken from sys.argv)

# create a data frame from a Hive table
df = sqlContext.table("testing.test")

# register the data frame as a temp table
df.registerTempTable('mytempTable')

# find the number of records in the data frame
records = df.count()
print("records='%s'" % records)

if records < 1000000:
    sqlContext.sql("create table {}.{} stored as parquet as select * from mytempTable".format(hivedb, table))
else:
    sqlContext.sql("create table {}.{} stored as parquet as select * from mytempTable where id <= 1000000".format(hivedb, table))
    sqlContext.sql("insert into table {}.{} select * from mytempTable where id > 1000000 and id <= 2000000".format(hivedb, table))
    sqlContext.sql("insert into table {}.{} select * from mytempTable where id > 2000000 and id <= 3000000".format(hivedb, table))
    sqlContext.sql("insert into table {}.{} select * from mytempTable where id > 3000000 and id <= 4000000".format(hivedb, table))
    sqlContext.sql("insert into table {}.{} select * from mytempTable where id > 4000000 and id <= 5000000".format(hivedb, table))
...and so on, up to the last million records.
I wrote the statements after the else branch by hand. I want the script to generate this part of the code automatically. How can I generate similar statements until the last million records is reached? I am new to Python and still learning it. Is there a way to simplify this code?
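A minimal sketch of one way to do this, assuming the table has a contiguous integer id column starting at 1 (so the record count is also the maximum id) and that hivedb, table, sqlContext and the registered temp table mytempTable exist as in the script above: loop over 1,000,000-row ranges and build each SQL string with str.format instead of writing it by hand.

# Minimal sketch: generate the insert statements in a loop instead of by hand.
# Assumes hivedb, table, sqlContext and the temp table 'mytempTable' are defined
# as above, and that id values are contiguous starting at 1.
chunk = 1000000

if records < chunk:
    sqlContext.sql(
        "create table {}.{} stored as parquet as select * from mytempTable"
        .format(hivedb, table))
else:
    # the first chunk creates the table
    sqlContext.sql(
        "create table {}.{} stored as parquet as "
        "select * from mytempTable where id <= {}".format(hivedb, table, chunk))
    # remaining chunks are appended one million ids at a time
    lower = chunk
    while lower < records:
        upper = lower + chunk
        sqlContext.sql(
            "insert into table {}.{} select * from mytempTable "
            "where id > {} and id <= {}".format(hivedb, table, lower, upper))
        lower = upper

If the ids are not contiguous, the loop bound could be taken from the maximum id (for example, df.selectExpr("max(id)").collect()[0][0]) instead of the record count.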