How to sum over different groupby combinations?

I'm compiling a table of the top 3 crops per county. Some counties have the same crop varieties in the same order. Other counties have the same crop varieties in a different order.

 df1 = pd.DataFrame({
     "County":    ["Harney", "Baker", "Wheeler", "Hood River", "Wasco", "Morrow", "Union", "Lake"],
     "Crop1":     ["grain", "melons", "melons", "apples", "pears", "raddish", "pears", "pears"],
     "Crop2":     ["melons", "grain", "grain", "melons", "carrots", "pears", "carrots", "carrots"],
     "Crop3":     ["apples", "apples", "apples", "grain", "raddish", "carrots", "raddish", "raddish"],
     "Total_pop": [2000, 1500, 3000, 1500, 2000, 2500, 2700, 2000]
 })

I can do a groupby on Crop1, Crop2 and Crop3 and get the sum of Total_pop:

 df1_grouped = df1.groupby(['Crop1', 'Crop2', 'Crop3'])['Total_pop'].sum().reset_index()

That gives me the total for specific combinations of the crops:

 df1_grouped
      Crop1    Crop2    Crop3  Total_pop
 0   apples   melons    grain       1500
 1    grain   melons   apples       2000
 2   melons    grain   apples       4500
 3    pears  carrots  raddish       6700
 4  raddish    pears  carrots       2500

What I would like, though, is to get the total population for the different crop combinations, regardless of whether a given crop appears as Crop1, Crop2 or Crop3. The desired result would be this:

 apples  melons   grain      8000
 pears   carrots  raddish    9200

Thanks for any guidance.

Method 1:

Combine the crop columns:

 >>> df1['combined_temp'] = df1.apply(lambda x: list([x['Crop1'],
 ...                                                  x['Crop2'],
 ...                                                  x['Crop3']]), axis=1)
 >>> df1.head()
        County   Crop1    Crop2    Crop3  Total_pop              combined_temp
 0      Harney   grain   melons   apples       2000    [grain, melons, apples]
 1       Baker  melons    grain   apples       1500    [melons, grain, apples]
 2     Wheeler  melons    grain   apples       3000    [melons, grain, apples]
 3  Hood River  apples   melons    grain       1500    [apples, melons, grain]
 4       Wasco   pears  carrots  raddish       2000  [pears, carrots, raddish]

then turn it into a sorted tuple:

 >>> df1['sorted'] = df1.apply(lambda x: tuple(sorted(x['combined_temp'])), axis=1)
 >>> df1.head()
        County   Crop1    Crop2  ...  Total_pop              combined_temp                     sorted
 0      Harney   grain   melons  ...       2000    [grain, melons, apples]    (apples, grain, melons)
 1       Baker  melons    grain  ...       1500    [melons, grain, apples]    (apples, grain, melons)
 2     Wheeler  melons    grain  ...       3000    [melons, grain, apples]    (apples, grain, melons)
 3  Hood River  apples   melons  ...       1500    [apples, melons, grain]    (apples, grain, melons)
 4       Wasco   pears  carrots  ...       2000  [pears, carrots, raddish]  (carrots, pears, raddish)

then proceed with your normal groupby operation:

 >>> df1_grouped = df1.groupby(['sorted'])['Total_pop'].sum().reset_index()
 >>> df1_grouped
                       sorted  Total_pop
 0    (apples, grain, melons)       8000
 1  (carrots, pears, raddish)       9200
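For reference, Method 1 can also be condensed into a single step by passing the sorted-tuple key straight to groupby, without adding the combined_temp / sorted helper columns. This is just a sketch of the same idea, assuming the df1 from the question (the name crops_key is only illustrative):

 # build an order-independent key per row and group on it directly;
 # df1 itself is left unchanged
 crops_key = df1[['Crop1', 'Crop2', 'Crop3']].apply(lambda row: tuple(sorted(row)), axis=1)
 df1.groupby(crops_key.rename('crops'))['Total_pop'].sum().reset_index()

This should give the same two totals as above (8000 and 9200), keyed by the sorted crop tuples.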

Method 2: a shorter version, based on aws-apprentice's answer

 df = df1.copy()
 grouping_cols = ['Crop1', 'Crop2', 'Crop3']
 df[grouping_cols] = pd.DataFrame(df.loc[:, grouping_cols] \
                                    .apply(set, axis=1) \
                                    .apply(sorted) \
                                    .values \
                                    .tolist(), columns=grouping_cols)
 >>> df.head()
        County    Crop1  Crop2    Crop3  Total_pop
 0      Harney   apples  grain   melons       2000
 1       Baker   apples  grain   melons       1500
 2     Wheeler   apples  grain   melons       3000
 3  Hood River   apples  grain   melons       1500
 4       Wasco  carrots  pears  raddish       2000

now do the groupby:

 >>> df.groupby(grouping_cols).Total_pop.sum()
 Crop1    Crop2  Crop3
 apples   grain  melons      8000
 carrots  pears  raddish     9200
 Name: Total_pop, dtype: int64

But I personally prefer this answer using numpy.

Here is one way to do it.

First let's get the unique values in the columns and then reassign them back to the DataFrame. We'll work on a copy of the original data, since you may need to keep the original intact.

 df = df1.copy()
 to_sum = ['Crop1', 'Crop2', 'Crop3']
 df[to_sum] = pd.DataFrame(df.loc[:, to_sum] \
                             .apply(set, axis=1) \
                             .apply(sorted) \
                             .values \
                             .tolist(), columns=to_sum)
 print(df)

        County    Crop1    Crop2    Crop3  Total_pop
 0      Harney   apples    grain   melons       2000
 1       Baker   apples    grain   melons       1500
 2     Wheeler   apples    grain   melons       3000
 3  Hood River   apples    grain   melons       1500
 4       Wasco  carrots    pears  raddish       2000
 5      Morrow  carrots    pears  raddish       2500
 6       Union  carrots    pears  raddish       2700
 7        Lake  carrots    pears  raddish       2000

Now we can perform our groupby to get the desired results.

 df.groupby(to_sum).Total_pop.sum()

 Crop1    Crop2  Crop3
 apples   grain  melons      8000
 carrots  pears  raddish     9200
 Name: Total_pop, dtype: int64

Since your data seems to guarantee 3 unique crops per county ("I'm compiling a table of the top 3 crops per county"), it is enough to sort the values and reassign:

 import numpy as np

 cols = ['Crop1', 'Crop2', 'Crop3']
 df1[cols] = np.sort(df1[cols].values, axis=1)

        County    Crop1  Crop2    Crop3  Total_pop
 0      Harney   apples  grain   melons       2000
 1       Baker   apples  grain   melons       1500
 2     Wheeler   apples  grain   melons       3000
 3  Hood River   apples  grain   melons       1500
 4       Wasco  carrots  pears  raddish       2000
 5      Morrow  carrots  pears  raddish       2500
 6       Union  carrots  pears  raddish       2700
 7        Lake  carrots  pears  raddish       2000

Then to sum:

 df1.groupby(cols).sum()
 #                         Total_pop
 # Crop1   Crop2  Crop3
 # apples  grain  melons        8000
 # carrots pears  raddish       9200
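Note that this overwrites the Crop1/Crop2/Crop3 columns of df1 in place. If the original column order needs to be preserved, the same idea works on a copy; a minimal sketch, assuming the df1 and cols defined above (the names sorted_df and result are only illustrative):

 # sort the crop names within each row of a copy, leaving df1 untouched
 sorted_df = df1.copy()
 sorted_df[cols] = np.sort(sorted_df[cols].to_numpy(), axis=1)

 # group on the now order-independent crop columns
 result = sorted_df.groupby(cols)['Total_pop'].sum().reset_index()

This keeps the vectorized np.sort at the cost of one copy of the frame.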

The benefit is that it avoids Series.apply or .apply(axis=1). For larger DataFrames, the performance difference is noticeable:

 df1 = pd.concat([df1]*10000, ignore_index=True)

 cols = ['Crop1', 'Crop2', 'Crop3']
 %timeit df1[cols] = np.sort(df1[cols].values, axis=1)
 # 36.1 ms ± 399 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

 to_sum = ['Crop1', 'Crop2', 'Crop3']
 %timeit df1[to_sum] = pd.DataFrame(df1.loc[:, to_sum].apply(set, axis=1).apply(list).values.tolist(), columns=to_sum)
 # 1.41 s ± 51.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
 import pandas as pd

 df = pd.DataFrame({
     "County":    ["Harney", "Baker", "Wheeler", "Hood River", "Wasco", "Morrow", "Union", "Lake"],
     "Crop1":     ["grain", "melons", "melons", "apples", "pears", "raddish", "pears", "pears"],
     "Crop2":     ["melons", "grain", "grain", "melons", "carrots", "pears", "carrots", "carrots"],
     "Crop3":     ["apples", "apples", "apples", "grain", "raddish", "carrots", "raddish", "raddish"],
     "Total_pop": [2000, 1500, 3000, 1500, 2000, 2500, 2700, 2000]
 })
 print(df)

 # join the three crop columns into one list of crop names per row
 df["Merged"] = df[["Crop1", "Crop2", "Crop3"]].apply(
     lambda x: ','.join(x.dropna().astype(str).values), axis=1).str.split(",")
 # sort each row's list and collapse it back into a comma-separated string
 # (the .sort_values() call only reorders the Series and is not strictly needed)
 df["Merged"] = df["Merged"].sort_values().apply(lambda x: sorted(x)).apply(lambda x: ",".join(x))
 # split the normalized string into three order-independent key columns
 df[["x", "y", "z"]] = df["Merged"].str.split(",", expand=True)
 # group on the keys and sum the population
 df1 = df.groupby(['x', 'y', 'z'])['Total_pop'].sum().reset_index()
 print(df1)

Output:

        County    Crop1    Crop2    Crop3  Total_pop
 0      Harney    grain   melons   apples       2000
 1       Baker   melons    grain   apples       1500
 2     Wheeler   melons    grain   apples       3000
 3  Hood River   apples   melons    grain       1500
 4       Wasco    pears  carrots  raddish       2000
 5      Morrow  raddish    pears  carrots       2500
 6       Union    pears  carrots  raddish       2700
 7        Lake    pears  carrots  raddish       2000

          x      y        z  Total_pop
 0   apples  grain   melons       8000
 1  carrots  pears  raddish       9200