I have a dataset as below. I am trying to group it by district and find the total amount for each item. I want to extend my calculation to add the cumulative sales amount as well as the total sales amount per district.
Dataset:
district item salesAmount
Arba pen 10
Arba pen 20
Arba pencil 30
Arba laptop 10000
Arba coil 100
Arba coil 200
Cebu pen 100
Cebu pen 20
Cebu laptop 20000
Cebu laptop 20000
Cebu fruit 800
Cebu oil 300
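For anyone who wants to reproduce this, the table above can be reconstructed as a DataFrame (column names taken from the table header):

```python
import pandas as pd

# Reconstruction of the sample data shown above
df = pd.DataFrame({
    'district': ['Arba'] * 6 + ['Cebu'] * 6,
    'item': ['pen', 'pen', 'pencil', 'laptop', 'coil', 'coil',
             'pen', 'pen', 'laptop', 'laptop', 'fruit', 'oil'],
    'salesAmount': [10, 20, 30, 10000, 100, 200,
                    100, 20, 20000, 20000, 800, 300],
})
```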
I can group by district and find the total amount for each item as below:
df.groupby(['district', 'item']).agg({'salesAmount': 'sum'})
Results as below:
district item salesAmount
Arba laptop 10000
Arba coil 300
Arba pencil 30
Arba pen 30
Cebu laptop 40000
Cebu fruit 800
Cebu oil 300
Cebu pen 120
I want to order the rows from highest amount to lowest within each district first, then add cumulative and total sales amount columns per district, as below:
district item salesAmount cumsalesAmount totaldistrictAmount
Arba laptop 10000 10000 10360
Arba coil 300 10300 10360
Arba pencil 30 10330 10360
Arba pen 30 10360 10360
Cebu laptop 40000 40000 41220
Cebu fruit 800 40800 41220
Cebu oil 300 41100 41220
Cebu pen 120 41220 41220
Thanks.
First, aggregate the sum over both columns:
print (df.dtypes)
district object
item object
salesAmount int64
dtype: object
df1 = df.groupby(['district', 'item'], as_index=False)['salesAmount'].sum()
Or:
df1 = df.groupby(['district', 'item'], as_index=False).agg({'salesAmount': 'sum'})
print (df1)
district item salesAmount
0 Arba coil 300
1 Arba laptop 10000
2 Arba pen 30
3 Arba pencil 30
4 Cebu fruit 800
5 Cebu laptop 40000
6 Cebu oil 300
7 Cebu pen 120
Then sort by both columns with DataFrame.sort_values, use GroupBy.cumsum, and finally GroupBy.transform with 'sum':
df1 = df1.sort_values(['district','salesAmount'], ascending=[True, False])
df1['cumsalesAmount'] = df1.groupby('district')['salesAmount'].cumsum()
df1['totaldistrictAmount'] = df1.groupby('district')['salesAmount'].transform('sum')
#alternative
#df1['totaldistrictAmount'] = df1.groupby('district')['cumsalesAmount'].transform('last')
print (df1)
district item salesAmount cumsalesAmount totaldistrictAmount
1 Arba laptop 10000 10000 10360
0 Arba coil 300 10300 10360
2 Arba pen 30 10330 10360
3 Arba pencil 30 10360 10360
5 Cebu laptop 40000 40000 41220
4 Cebu fruit 800 40800 41220
6 Cebu oil 300 41100 41220
7 Cebu pen 120 41220 41220
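For completeness, the steps above can be combined into one runnable sketch (the DataFrame construction is a reconstruction of the question's sample data):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'district': ['Arba'] * 6 + ['Cebu'] * 6,
    'item': ['pen', 'pen', 'pencil', 'laptop', 'coil', 'coil',
             'pen', 'pen', 'laptop', 'laptop', 'fruit', 'oil'],
    'salesAmount': [10, 20, 30, 10000, 100, 200,
                    100, 20, 20000, 20000, 800, 300],
})

# Aggregate per district/item, then sort descending within each district
df1 = (df.groupby(['district', 'item'], as_index=False)['salesAmount'].sum()
         .sort_values(['district', 'salesAmount'], ascending=[True, False]))

# Running total within each district (order matters, hence the sort first)
df1['cumsalesAmount'] = df1.groupby('district')['salesAmount'].cumsum()

# Broadcast each district's grand total back onto every row of that district
df1['totaldistrictAmount'] = df1.groupby('district')['salesAmount'].transform('sum')
print(df1)
```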
Thanks. Your first line of code gives me an error as below:
TypeError: groupby() got an unexpected keyword argument 'level'
@Lilly - What is the output of
print (df.info())
before my solution?
district       19551 non-null object
item           19551 non-null object
salesAmount    19551 non-null object
@Lilly - I think
df1 = df.groupby(['district', 'item']).agg({'salesAmount': 'sum'})
needs to be changed to
df1 = df.groupby(['district', 'item'], as_index=False).agg({'salesAmount': 'sum'})
Shoot... Actually my mistake. I am working with a PySpark DataFrame and used Koalas to convert it to a Koalas DataFrame so I could use pandas functions on it. The issue is that the
transform
function is not available in Koalas. Hence I converted my PySpark DataFrame to pandas, and now your code works. One should be careful when using Koalas. Thanks for all your inputs; I really appreciate your time.
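As a side note: where transform is unavailable, the per-district total can also be attached with a plain groupby followed by a merge. A minimal pandas sketch, using a small stand-in for the aggregated frame:

```python
import pandas as pd

# Small stand-in for the aggregated frame df1
df1 = pd.DataFrame({
    'district': ['Arba', 'Arba', 'Cebu'],
    'item': ['laptop', 'coil', 'laptop'],
    'salesAmount': [10000, 300, 40000],
})

# Compute one total per district, then merge it back onto every row
totals = (df1.groupby('district', as_index=False)['salesAmount'].sum()
             .rename(columns={'salesAmount': 'totaldistrictAmount'}))
df1 = df1.merge(totals, on='district', how='left')
print(df1)
```

This avoids transform entirely, at the cost of an extra intermediate frame.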