I have a table of utility bill segments, where each segment tells a month's kwh usage. I'd like to sum groups of 12 together to get annual bills.
The variable bill_cd indicates the last segment of the annual bill. Data download
cust_id kwh bill_cd
0 3333 1104.388683 ?
1 3333 1498.007305 ?
2 3333 662.044822 ?
3 3333 661.342412 ?
4 3333 494.070683 ?
5 3333 300.147843 ?
6 3333 836.677007 ?
7 3333 864.608037 ?
8 3333 933.232845 ?
9 3333 1191.025358 ?
10 3333 1507.119588 ?
11 3333 1980.653631 BILL
12 3333 2621.387010 ?
13 3333 2552.053789 BILL
The problem: as can be seen, there aren't 12 segments per annual bill. In this snippet, I'd like to only get the sum of the first annual bill since it will include 12 segments, but not the second because it only has two segments.
I wrote this query in python:
import pandas as pd, pandasql as ps
df = pd.read_csv('so_ex.csv')
q1 = """
select cust_id, kwh, bill_cd,
sum(kwh) over (partition by cust_id
rows between 11 preceding and current row) as kwh_total,
count(kwh) over (partition by cust_id
rows between 11 preceding and current row) as bseg_count
from df
"""
ps.sqldf(q1, locals())
And the result is below. Is there a way to make the count variable reset upon hitting another instance of "BILL" so that the bill segment count will go to 1. This way, I could filter the table later where bseg_count=12.
cust_id kwh bill_cd kwh_total bseg_count
0 3333 1104.388683 ? 1104.388683 1
1 3333 1498.007305 ? 2602.395988 2
2 3333 662.044822 ? 3264.440810 3
3 3333 661.342412 ? 3925.783223 4
4 3333 494.070683 ? 4419.853906 5
5 3333 300.147843 ? 4720.001748 6
6 3333 836.677007 ? 5556.678755 7
7 3333 864.608037 ? 6421.286792 8
8 3333 933.232845 ? 7354.519638 9
9 3333 1191.025358 ? 8545.544996 10
10 3333 1507.119588 ? 10052.664584 11
11 3333 1980.653631 BILL 12033.318215 12
12 3333 2621.387010 ? 13550.316542 12
13 3333 2552.053789 BILL 14604.363026 12
Aucun commentaire:
Enregistrer un commentaire