I have a Python file called test.py. In this file I execute some PySpark commands:
#!/usr/bin/env python
import sys
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
conf = SparkConf()
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)
# create a data frame from hive tables
df = sqlContext.table("testing.test")
# register the data frame as temp table
df.registerTempTable('mytempTable')
# find number of records in data frame
records = df.count()
print("records='%s'" % records)
Now I want to do the following:
if records < 1000000
then do
sqlContext.sql("create table {}.{} stored as parquet as select * from mytempTable".format(hivedb,table))
if records > 1000000
then do
sqlContext.sql("create table {}.{} stored as parquet as select * from mytempTable where id <= 1000000".format(hivedb,table))
sqlContext.sql("insert into table {}.{} select * from mytempTable where id > 1000000 and id <= 2000000".format(hivedb,table))
sqlContext.sql("insert into table {}.{} select * from mytempTable where id > 2000000 and id <= 3000000".format(hivedb,table))
sqlContext.sql("insert into table {}.{} select * from mytempTable where id > 3000000 and id <= 4000000".format(hivedb,table))
sqlContext.sql("insert into table {}.{} select * from mytempTable where id > 4000000 and id <= 5000000".format(hivedb,table))
sqlContext.sql("insert into table {}.{} select * from mytempTable where id > 5000000 and id <= 6000000".format(hivedb,table))
and so on, until the last million records are inserted.
How do I use an if statement in this script?
How can I generate these similar lines of code automatically, up to the last million records?
As you can see, there is a lot of manual work involved.
I am new to Python and still learning it. Is there a way to simplify the code?
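A minimal sketch of one way this could look, assuming hivedb and table hold the target database and table names (they are not defined in the script above) and that id is a contiguous integer column starting at 1:

# assumed placeholders, not defined in the original script
hivedb = "mydb"
table = "mytable"
batch_size = 1000000

if records < batch_size:
    # small table: create it in one statement
    sqlContext.sql(
        "create table {}.{} stored as parquet as "
        "select * from mytempTable".format(hivedb, table))
else:
    # large table: create it with the first million rows,
    # then insert the rest one million-row slice at a time
    sqlContext.sql(
        "create table {}.{} stored as parquet as "
        "select * from mytempTable where id <= {}".format(hivedb, table, batch_size))
    lower = batch_size
    while lower < records:
        upper = lower + batch_size
        sqlContext.sql(
            "insert into table {}.{} select * from mytempTable "
            "where id > {} and id <= {}".format(hivedb, table, lower, upper))
        lower = upper

The while loop would replace the hand-written insert statements: it advances lower by one million on each pass until it reaches the count returned by df.count(), so no lines need to be generated by hand.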