Python / Pandas: cómo hacer coincidir la lista de cadenas con una columna DataFrame

Quiero comparar dos columnas – Description y Employer . Quiero ver si alguna palabra clave en el Employer se encuentra en la columna Description . He desglosado la columna del Employer en palabras y he convertido a una lista. Ahora quiero ver si alguna de esas palabras está en la columna de Description correspondiente.

Ejemplo de entrada:

 print(df.head(25)) Date Description Amount AutoNumber \ 0 3/17/2015 WW120 TFR?FR xxx8690 140.00 49246 2 3/13/2015 JX154 TFR?FR xxx8690 150.00 49246 5 3/6/2015 CANSEL SURVEY E PAY 1182.08 49246 9 3/2/2015 UE200 TFR?FR xxx8690 180.00 49246 10 2/27/2015 JH401 TFR?FR xxx8690 400.00 49246 11 2/27/2015 CANSEL SURVEY E PAY 555.62 49246 12 2/25/2015 HU204 TFR?FR xxx8690 200.00 49246 13 2/23/2015 UQ263 TFR?FR xxx8690 102.00 49246 14 2/23/2015 UT460 TFR?FR xxx8690 200.00 49246 15 2/20/2015 CANSEL SURVEY E PAY 1222.05 49246 17 2/17/2015 UO414 TFR?FR xxx8690 250.00 49246 19 2/11/2015 HI540 TFR?FR xxx8690 130.00 49246 20 2/11/2015 HQ010 TFR?FR xxx8690 177.00 49246 21 2/10/2015 WU455 TFR?FR xxx8690 200.00 49246 22 2/6/2015 JJ500 TFR?FR xxx8690 301.00 49246 23 2/6/2015 CANSEL SURVEY E PAY 1182.08 49246 24 2/5/2015 IR453 TFR?FR xxx8690 168.56 49246 26 2/2/2015 RQ574 TFR?FR xxx8690 500.00 49246 27 2/2/2015 UT022 TFR?FR xxx8690 850.00 49246 28 12/31/2014 HU521 TFR?FR xxx8690 950.17 49246 Employer 0 Cansel Survey Equipment 2 Cansel Survey Equipment 5 Cansel Survey Equipment 9 Cansel Survey Equipment 10 Cansel Survey Equipment 11 Cansel Survey Equipment 12 Cansel Survey Equipment 13 Cansel Survey Equipment 14 Cansel Survey Equipment 15 Cansel Survey Equipment 17 Cansel Survey Equipment 19 Cansel Survey Equipment 20 Cansel Survey Equipment 21 Cansel Survey Equipment 22 Cansel Survey Equipment 23 Cansel Survey Equipment 24 Cansel Survey Equipment 26 Cansel Survey Equipment 27 Cansel Survey Equipment 28 Cansel Survey Equipment 

Intenté algo como esto, pero no parece funcionar.

 df['Text_Search'] = df['Employer'].apply(lambda x: x.split(" ")) df['Match'] = np.where(df['Description'].str.contains("|".join(df['Text_Search'])), "Yes", "No") 

Mi salida deseada sería como se muestra a continuación:

  Date Description Amount AutoNumber \ 0 3/17/2015 WW120 TFR?FR xxx8690 140.00 49246 2 3/13/2015 JX154 TFR?FR xxx8690 150.00 49246 5 3/6/2015 CANSEL SURVEY E PAY 1182.08 49246 9 3/2/2015 UE200 TFR?FR xxx8690 180.00 49246 10 2/27/2015 JH401 TFR?FR xxx8690 400.00 49246 11 2/27/2015 CANSEL SURVEY E PAY 555.62 49246 12 2/25/2015 HU204 TFR?FR xxx8690 200.00 49246 13 2/23/2015 UQ263 TFR?FR xxx8690 102.00 49246 14 2/23/2015 UT460 TFR?FR xxx8690 200.00 49246 15 2/20/2015 CANSEL SURVEY E PAY 1222.05 49246 17 2/17/2015 UO414 TFR?FR xxx8690 250.00 49246 19 2/11/2015 HI540 TFR?FR xxx8690 130.00 49246 20 2/11/2015 HQ010 TFR?FR xxx8690 177.00 49246 21 2/10/2015 WU455 TFR?FR xxx8690 200.00 49246 22 2/6/2015 JJ500 TFR?FR xxx8690 301.00 49246 23 2/6/2015 CANSEL SURVEY E PAY 1182.08 49246 24 2/5/2015 IR453 TFR?FR xxx8690 168.56 49246 26 2/2/2015 RQ574 TFR?FR xxx8690 500.00 49246 27 2/2/2015 UT022 TFR?FR xxx8690 850.00 49246 28 12/31/2014 HU521 TFR?FR xxx8690 950.17 49246 29 12/30/2014 WZ553 TFR?FR xxx8690 200.00 49246 32 12/29/2014 JW173 TFR?FR xxx8690 300.00 49246 33 12/24/2014 CANSEL SURVEY E PAY 1219.21 49246 34 12/24/2014 CANSEL SURVEY E PAY 434.84 49246 36 12/23/2014 WT002 TFR?FR xxx8690 160.00 49246 Employer Text_Search Match 0 Cansel Survey Equipment [Cansel, Survey, Equipment] No 2 Cansel Survey Equipment [Cansel, Survey, Equipment] No 5 Cansel Survey Equipment [Cansel, Survey, Equipment] Yes 9 Cansel Survey Equipment [Cansel, Survey, Equipment] No 10 Cansel Survey Equipment [Cansel, Survey, Equipment] No 11 Cansel Survey Equipment [Cansel, Survey, Equipment] Yes 12 Cansel Survey Equipment [Cansel, Survey, Equipment] No 13 Cansel Survey Equipment [Cansel, Survey, Equipment] No 14 Cansel Survey Equipment [Cansel, Survey, Equipment] No 15 Cansel Survey Equipment [Cansel, Survey, Equipment] Yes 17 Cansel Survey Equipment [Cansel, Survey, Equipment] No 19 Cansel Survey Equipment [Cansel, Survey, Equipment] No 20 Cansel Survey Equipment [Cansel, Survey, Equipment] No 21 Cansel Survey Equipment [Cansel, Survey, Equipment] No 22 Cansel Survey Equipment [Cansel, Survey, Equipment] No 23 Cansel Survey Equipment [Cansel, Survey, Equipment] Yes 24 Cansel Survey Equipment [Cansel, Survey, Equipment] No 26 Cansel Survey Equipment [Cansel, Survey, Equipment] No 27 Cansel Survey Equipment [Cansel, Survey, Equipment] No 28 Cansel Survey Equipment [Cansel, Survey, Equipment] No 29 Cansel Survey Equipment [Cansel, Survey, Equipment] No 32 Cansel Survey Equipment [Cansel, Survey, Equipment] No 33 Cansel Survey Equipment [Cansel, Survey, Equipment] Yes 34 Cansel Survey Equipment [Cansel, Survey, Equipment] Yes 36 Cansel Survey Equipment [Cansel, Survey, Equipment] No 

Aquí hay una solución legible usando un search_func individual:

 def search_func(row): matches = [test_value in row["Description"].lower() for test_value in row["Text_Search"]] if any(matches): return "Yes" else: return "No" 

Esta función se aplica entonces por filas:

 # create example data df = pd.DataFrame({"Description": ["CANSEL SURVEY E PAY", "JX154 TFR?FR xxx8690"], "Employer": ["Cansel Survey Equipment", "Cansel Survey Equipment"]}) print(df) Description Employer 0 CANSEL SURVEY E PAY Cansel Survey Equipment 1 JX154 TFR?FR xxx8690 Cansel Survey Equipment # create text searches and match column df["Text_Search"] = df["Employer"].str.lower().str.split() df["Match"] = df.apply(search_func, axis=1) # show result print(df) Description Employer Text_Search Match 0 CANSEL SURVEY E PAY Cansel Survey Equipment [cansel, survey, equipment] Yes 1 JX154 TFR?FR xxx8690 Cansel Survey Equipment [cansel, survey, equipment] No 

Aquí está la solución vectrorizada rápida y que ahorra memoria, que utiliza el método sklearn.feature_extraction.text.CountVectorizer :

 from sklearn.feature_extraction.text import CountVectorizer vect = CountVectorizer(min_df=1, lowercase=True) X = vect.fit_transform(df['Employer']) cols_emp = vect.get_feature_names() X = vect.fit_transform(df['Description']) cols_desc = vect.get_feature_names() common_cols_idx = [i for i,col in enumerate(cols_desc) if col in cols_emp] df['Match'] = (X.toarray()[:, common_cols_idx] == 1).any(1) 

Fuente DF:

 In [259]: df Out[259]: Date Description Amount AutoNumber Employer 0 3/17/2015 WW120 TFR?FR xxx8690 140.00 49246 Cansel Survey Equipment 2 3/13/2015 JX154 TFR?FR xxx8690 150.00 49246 Cansel Survey Equipment 5 3/6/2015 CANSEL SURVEY E PAY 1182.08 49246 Cansel Survey Equipment 9 3/2/2015 UE200 TFR?FR xxx8690 180.00 49246 Cansel Survey Equipment 10 2/27/2015 JH401 TFR?FR xxx8690 400.00 49246 Cansel Survey Equipment 11 2/27/2015 CANSEL SURVEY E PAY 555.62 49246 Cansel Survey Equipment 12 2/25/2015 HU204 TFR?FR xxx8690 200.00 49246 Cansel Survey Equipment 13 2/23/2015 UQ263 TFR?FR xxx8690 102.00 49246 Cansel Survey Equipment 14 2/23/2015 UT460 TFR?FR xxx8690 200.00 49246 Cansel Survey Equipment 15 2/20/2015 CANSEL SURVEY E PAY 1222.05 49246 Cansel Survey Equipment 17 2/17/2015 UO414 TFR?FR xxx8690 250.00 49246 Cansel Survey Equipment 19 2/11/2015 HI540 TFR?FR xxx8690 130.00 49246 Cansel Survey Equipment 20 2/11/2015 HQ010 TFR?FR xxx8690 177.00 49246 Cansel Survey Equipment 21 2/10/2015 WU455 TFR?FR xxx8690 200.00 49246 Cansel Survey Equipment 22 2/6/2015 JJ500 TFR?FR xxx8690 301.00 49246 Cansel Survey Equipment 23 2/6/2015 CANSEL SURVEY E PAY 1182.08 49246 Cansel Survey Equipment 24 2/5/2015 IR453 TFR?FR xxx8690 168.56 49246 Cansel IR453 26 2/2/2015 RQ574 TFR?FR xxx8690 500.00 49246 Cansel Survey Equipment 27 2/2/2015 UT022 TFR?FR xxx8690 850.00 49246 Cansel Survey Equipment 28 12/31/2014 HU521 TFR?FR xxx8690 950.17 49246 Cansel Survey HU521 

Resultado:

 In [261]: df Out[261]: Date Description Amount AutoNumber Employer Match 0 3/17/2015 WW120 TFR?FR xxx8690 140.00 49246 Cansel Survey Equipment False 2 3/13/2015 JX154 TFR?FR xxx8690 150.00 49246 Cansel Survey Equipment False 5 3/6/2015 CANSEL SURVEY E PAY 1182.08 49246 Cansel Survey Equipment True 9 3/2/2015 UE200 TFR?FR xxx8690 180.00 49246 Cansel Survey Equipment False 10 2/27/2015 JH401 TFR?FR xxx8690 400.00 49246 Cansel Survey Equipment False 11 2/27/2015 CANSEL SURVEY E PAY 555.62 49246 Cansel Survey Equipment True 12 2/25/2015 HU204 TFR?FR xxx8690 200.00 49246 Cansel Survey Equipment False 13 2/23/2015 UQ263 TFR?FR xxx8690 102.00 49246 Cansel Survey Equipment False 14 2/23/2015 UT460 TFR?FR xxx8690 200.00 49246 Cansel Survey Equipment False 15 2/20/2015 CANSEL SURVEY E PAY 1222.05 49246 Cansel Survey Equipment True 17 2/17/2015 UO414 TFR?FR xxx8690 250.00 49246 Cansel Survey Equipment False 19 2/11/2015 HI540 TFR?FR xxx8690 130.00 49246 Cansel Survey Equipment False 20 2/11/2015 HQ010 TFR?FR xxx8690 177.00 49246 Cansel Survey Equipment False 21 2/10/2015 WU455 TFR?FR xxx8690 200.00 49246 Cansel Survey Equipment False 22 2/6/2015 JJ500 TFR?FR xxx8690 301.00 49246 Cansel Survey Equipment False 23 2/6/2015 CANSEL SURVEY E PAY 1182.08 49246 Cansel Survey Equipment True 24 2/5/2015 IR453 TFR?FR xxx8690 168.56 49246 Cansel IR453 True 26 2/2/2015 RQ574 TFR?FR xxx8690 500.00 49246 Cansel Survey Equipment False 27 2/2/2015 UT022 TFR?FR xxx8690 850.00 49246 Cansel Survey Equipment False 28 12/31/2014 HU521 TFR?FR xxx8690 950.17 49246 Cansel Survey HU521 True 

Algunas explicaciones:

 In [266]: cols_desc Out[266]: ['cansel', 'fr', 'hi540', 'hq010', 'hu204', 'hu521', 'ir453', 'jh401', 'jj500', 'jx154', 'pay', 'rq574', 'survey', 'tfr', 'ue200', 'uo414', 'uq263', 'ut022', 'ut460', 'wu455', 'ww120', 'xxx8690'] In [267]: cols_emp Out[267]: ['cansel', 'equipment', 'hu521', 'ir453', 'survey'] In [268]: common_cols_idx = [i for i,col in enumerate(cols_desc) if col in cols_emp] In [269]: common_cols_idx Out[269]: [0, 5, 6, 12] In [270]: X.toarray() Out[270]: array([[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1], [0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1], [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1], [0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1], [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1], [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1], [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1], [0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1], [0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1], [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1], [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1]], dtype=int64) In [271]: X.toarray()[:, common_cols_idx] Out[271]: array([[0, 0, 0, 0], [0, 0, 0, 0], [1, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0], [1, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [1, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [1, 0, 0, 1], [0, 0, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 1, 0, 0]], dtype=int64) In [272]: X.toarray()[:, common_cols_idx] == 1 Out[272]: array([[False, False, False, False], [False, False, False, False], [ True, False, False, True], [False, False, False, False], [False, False, False, False], [ True, False, False, True], [False, False, False, False], [False, False, False, False], [False, False, False, False], [ True, False, False, True], [False, False, False, False], [False, False, False, False], [False, False, False, False], [False, False, False, False], [False, False, False, False], [ True, False, False, True], [False, False, True, False], [False, False, False, False], [False, False, False, False], [False, True, False, False]], dtype=bool) In [273]: (X.toarray()[:, common_cols_idx] == 1).any(1) Out[273]: array([False, False, True, False, False, True, False, False, False, True, False, False, False, False, False, True, True, Fals e, False, True], dtype=bool) 

Aquí hay una solución, que divide el texto en minúsculas y usa la intersección de los conjuntos para cada fila:

 In [160]: x['Match'] = x.Description.str.lower().str.split().map(set).to_frame('desc') \ ...: .apply(lambda r: (x.Employer.str.lower().str.split().map(set) & r.desc).any(), ...: axis=1) ...: In [161]: x Out[161]: Date Description Amount AutoNumber Employer Match 0 3/17/2015 WW120 TFR?FR xxx8690 140.00 49246 Cansel Survey Equipment False 2 3/13/2015 JX154 TFR?FR xxx8690 150.00 49246 Cansel Survey Equipment False 5 3/6/2015 CANSEL SURVEY E PAY 1182.08 49246 Cansel Survey Equipment True 9 3/2/2015 UE200 TFR?FR xxx8690 180.00 49246 Cansel Survey Equipment False 10 2/27/2015 JH401 TFR?FR xxx8690 400.00 49246 Cansel Survey Equipment False 11 2/27/2015 CANSEL SURVEY E PAY 555.62 49246 Cansel Survey Equipment True 12 2/25/2015 HU204 TFR?FR xxx8690 200.00 49246 Cansel Survey Equipment False 13 2/23/2015 UQ263 TFR?FR xxx8690 102.00 49246 Cansel Survey Equipment False 14 2/23/2015 UT460 TFR?FR xxx8690 200.00 49246 Cansel Survey Equipment False 15 2/20/2015 CANSEL SURVEY E PAY 1222.05 49246 Cansel Survey Equipment True 17 2/17/2015 UO414 TFR?FR xxx8690 250.00 49246 Cansel Survey Equipment False 19 2/11/2015 HI540 TFR?FR xxx8690 130.00 49246 Cansel Survey Equipment False 20 2/11/2015 HQ010 TFR?FR xxx8690 177.00 49246 Cansel Survey Equipment False 21 2/10/2015 WU455 TFR?FR xxx8690 200.00 49246 Cansel Survey Equipment False 22 2/6/2015 JJ500 TFR?FR xxx8690 301.00 49246 Cansel Survey Equipment False 23 2/6/2015 CANSEL SURVEY E PAY 1182.08 49246 Cansel Survey Equipment True 24 2/5/2015 IR453 TFR?FR xxx8690 168.56 49246 Cansel Survey Equipment False 26 2/2/2015 RQ574 TFR?FR xxx8690 500.00 49246 Cansel Survey Equipment False 27 2/2/2015 UT022 TFR?FR xxx8690 850.00 49246 Cansel Survey Equipment False 28 12/31/2014 HU521 TFR?FR xxx8690 950.17 49246 Cansel Survey Equipment False 

PS es bastante lento, ya que utiliza el método .apply(..., axis=1) no vectorizado

Comparación de tiempos para diferentes soluciones.

Preparemos un poco más grande DF – 2.000 filas :

 In [3]: df = pd.concat([df] * 10**2, ignore_index=True) In [4]: df.shape Out[4]: (2000, 5) 

Solución 1: df.apply(..., axis=1) :

 df["Text_Search"] = df.Employer.str.lower().str.split().map(set) In [15]: %%timeit ...: df.Description.str.lower().str.split().map(set).to_frame('desc') \ ...: .apply(lambda r: (df["Text_Search"] & r.desc).any(), ...: axis=1) ...: 1 loop, best of 3: 5.06 s per loop 

Solución 2: CountVectorizer

 from sklearn.feature_extraction.text import CountVectorizer vect = CountVectorizer(min_df=1, lowercase=True) In [8]: %%timeit ...: X = vect.fit_transform(df['Employer']) ...: cols_emp = vect.get_feature_names() ...: X = vect.fit_transform(df['Description']) ...: cols_desc = vect.get_feature_names() ...: common_cols_idx = [i for i,col in enumerate(cols_desc) if col in cols_emp] ...: (X.toarray()[:, common_cols_idx] == 1).any(1) ...: 10 loops, best of 3: 88.2 ms per loop 

Solución 3: df.apply(search_func, axis=1)

 df["Text_Search"] = df["Employer"].str.lower().str.split() In [12]: %%timeit ...: df.apply(search_func, axis=1) ...: 1 loop, best of 3: 362 ms per loop 

NOTA: La Solution 1 es demasiado lenta, por lo que no “cronometraré” esta solución para DF más grandes


Comparación de df.apply(search_func, axis=1) y CountVectorizer para 20.000 filas DF :

 In [16]: df = pd.concat([df] * 10, ignore_index=True) In [17]: df.shape Out[17]: (20000, 6) In [20]: %%timeit ...: df.apply(search_func, axis=1) ...: 1 loop, best of 3: 3.66 s per loop In [21]: %%timeit ...: X = vect.fit_transform(df['Employer']) ...: cols_emp = vect.get_feature_names() ...: X = vect.fit_transform(df['Description']) ...: cols_desc = vect.get_feature_names() ...: common_cols_idx = [i for i,col in enumerate(cols_desc) if col in cols_emp] ...: (X.toarray()[:, common_cols_idx] == 1).any(1) ...: 1 loop, best of 3: 825 ms per loop 

Comparación de df.apply(search_func, axis=1) y CountVectorizer para 200.000 filas DF :

 In [22]: df = pd.concat([df] * 10, ignore_index=True) In [23]: df.shape Out[23]: (200000, 6) In [24]: %%timeit ...: df.apply(search_func, axis=1) ...: 1 loop, best of 3: 36.8 s per loop In [25]: %%timeit ...: X = vect.fit_transform(df['Employer']) ...: cols_emp = vect.get_feature_names() ...: X = vect.fit_transform(df['Description']) ...: cols_desc = vect.get_feature_names() ...: common_cols_idx = [i for i,col in enumerate(cols_desc) if col in cols_emp] ...: (X.toarray()[:, common_cols_idx] == 1).any(1) ...: 1 loop, best of 3: 8.28 s per loop 

Conclusión: la solución CountVectorized es apporx. 4.44 veces más rápido en comparación con df.apply(search_func, axis=1)