Using graphframes with PyCharm

I have spent almost 2 days searching the Internet and have not been able to solve this problem. I am trying to install the graphframes package (version 0.2.0-spark2.0-s_2.11) to run with Spark through PyCharm but, despite my best efforts, it has been impossible.

I have tried almost everything. Please know that I also checked this site before posting this question.

Here is the code I am trying to run:

    # IMPORT OTHER LIBS -------------------------------------------------------------------------------#
    import os
    import sys
    import pandas as pd

    # IMPORT SPARK ------------------------------------------------------------------------------------#
    # Path to Spark source folder
    USER_FILE_PATH = "/Users/"
    SPARK_PATH = "/PycharmProjects/GenesAssociation"
    SPARK_FILE = "/spark-2.0.0-bin-hadoop2.7"
    SPARK_HOME = USER_FILE_PATH + SPARK_PATH + SPARK_FILE
    os.environ['SPARK_HOME'] = SPARK_HOME

    # Append pySpark to Python Path
    sys.path.append(SPARK_HOME + "/python")
    sys.path.append(SPARK_HOME + "/python" + "/lib/py4j-0.10.1-src.zip")

    try:
        from pyspark import SparkContext
        from pyspark import SparkConf
        from pyspark.sql import SQLContext
        from pyspark.graphframes import GraphFrame
    except ImportError as ex:
        print "Can not import Spark Modules", ex
        sys.exit(1)

    # GLOBAL VARIABLES --------------------------------------------------------------------------------#
    SC = SparkContext('local')
    SQL_CONTEXT = SQLContext(SC)

    # MAIN CODE ---------------------------------------------------------------------------------------#
    if __name__ == "__main__":

        # Main Path to CSV files
        DATA_PATH = '/PycharmProjects/GenesAssociation/data/'
        FILE_NAME = 'gene_gene_associations_50k.csv'

        # LOAD DATA CSV USING PANDAS -----------------------------------------------------------------#
        print "STEP 1: Loading Gene Nodes -------------------------------------------------------------"
        # Read csv file and load as df
        GENES = pd.read_csv(USER_FILE_PATH + DATA_PATH + FILE_NAME,
                            usecols=['OFFICIAL_SYMBOL_A'],
                            low_memory=True,
                            iterator=True,
                            chunksize=1000)

        # Concatenate chunks into list & convert to dataFrame
        GENES_DF = pd.DataFrame(pd.concat(list(GENES), ignore_index=True))

        # Remove duplicates
        GENES_DF_CLEAN = GENES_DF.drop_duplicates(keep='first')

        # Name Columns
        GENES_DF_CLEAN.columns = ['gene_id']

        # Output dataFrame
        print GENES_DF_CLEAN

        # Create vertices
        VERTICES = SQL_CONTEXT.createDataFrame(GENES_DF_CLEAN)

        # Show some vertices
        print VERTICES.take(5)

        print "STEP 2: Loading Gene Edges -------------------------------------------------------------"
        # Read csv file and load as df
        EDGES = pd.read_csv(USER_FILE_PATH + DATA_PATH + FILE_NAME,
                            usecols=['OFFICIAL_SYMBOL_A', 'OFFICIAL_SYMBOL_B', 'EXPERIMENTAL_SYSTEM'],
                            low_memory=True,
                            iterator=True,
                            chunksize=1000)

        # Concatenate chunks into list & convert to dataFrame
        EDGES_DF = pd.DataFrame(pd.concat(list(EDGES), ignore_index=True))

        # Name Columns
        EDGES_DF.columns = ["src", "dst", "rel_type"]

        # Output dataFrame
        print EDGES_DF

        # Create vertices
        EDGES = SQL_CONTEXT.createDataFrame(EDGES_DF)

        # Show some edges
        print EDGES.take(5)

        g = gf.GraphFrame(VERTICES, EDGES)

Needless to say, I have tried including the graphframes directory (see here to understand what I did) in Spark's pyspark directory, but it seems this is not enough. Everything else I have tried has failed as well. I would appreciate some help with this. Below is the error message I am getting:

    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel).
    16/09/19 12:46:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    16/09/19 12:46:03 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
    STEP 1: Loading Gene Nodes -------------------------------------------------------------
           gene_id
    0      MAP2K4
    1      MYPN
    2      ACVR1
    3      GATA2
    4      RPA2
    5      ARF1
    6      ARF3
    8      XRN1
    9      APP
    10     APLP1
    11     CITED2
    12     EP300
    13     APOB
    14     ARRB2
    15     CSF1R
    16     PRRC2A
    17     LSM1
    18     SLC4A1
    19     BCL3
    20     ADRB1
    21     BRCA1
    25     ARVCF
    26     PCBD1
    27     PSEN2
    28     CAPN3
    29     ITPR1
    30     MAGI1
    31     RB1
    32     TSG101
    33     ORC1
    ...    ...
    49379  WDR26
    49380  WDR5B
    49382  NLE1
    49383  WDR12
    49385  WDR53
    49386  WDR59
    49387  WDR61
    49409  CHD6
    49422  DACT1
    49424  KMT2B
    49438  SMARCA1
    49459  DCLRE1A
    49469  F2RL1
    49472  SENP8
    49475  TSPY1
    49479  SERPINB5
    49521  HOXA11
    49548  SYF2
    49553  FOXN3
    49557  MLANA
    49608  REPIN1
    49609  GMNN
    49670  HIST2H2BE
    49767  BCL7C
    49797  SIRT3
    49810  KLF4
    49858  RHO
    49896  MAGEA2
    49907  SUV420H2
    49958  SAP30L

    [6025 rows x 1 columns]
    16/09/19 12:46:08 WARN TaskSetManager: Stage 0 contains a task of very large size (107 KB). The maximum recommended task size is 100 KB.
    [Row(gene_id=u'MAP2K4'), Row(gene_id=u'MYPN'), Row(gene_id=u'ACVR1'), Row(gene_id=u'GATA2'), Row(gene_id=u'RPA2')]
    STEP 2: Loading Gene Edges -------------------------------------------------------------
           src      dst       rel_type
    0      MAP2K4   FLNC      Two-hybrid
    1      MYPN     ACTN2     Two-hybrid
    2      ACVR1    FNTA      Two-hybrid
    3      GATA2    PML       Two-hybrid
    4      RPA2     STAT3     Two-hybrid
    5      ARF1     GGA3      Two-hybrid
    6      ARF3     ARFIP2    Two-hybrid
    7      ARF3     ARFIP1    Two-hybrid
    8      XRN1     ALDOA     Two-hybrid
    9      APP      APPBP2    Two-hybrid
    10     APLP1    DAB1      Two-hybrid
    11     CITED2   TFAP2A    Two-hybrid
    12     EP300    TFAP2A    Two-hybrid
    13     APOB     MTTP      Two-hybrid
    14     ARRB2    RALGDS    Two-hybrid
    15     CSF1R    GRB2      Two-hybrid
    16     PRRC2A   GRB2      Two-hybrid
    17     LSM1     NARS      Two-hybrid
    18     SLC4A1   SLC4A1AP  Two-hybrid
    19     BCL3     BARD1     Two-hybrid
    20     ADRB1    GIPC1     Two-hybrid
    21     BRCA1    ATF1      Two-hybrid
    22     BRCA1    MSH2      Two-hybrid
    23     BRCA1    BARD1     Two-hybrid
    24     BRCA1    MSH6      Two-hybrid
    25     ARVCF    CDH15     Two-hybrid
    26     PCBD1    CACNA1C   Two-hybrid
    27     PSEN2    CAPN1     Two-hybrid
    28     CAPN3    TTN       Two-hybrid
    29     ITPR1    CA8       Two-hybrid
    ...    ...      ...       ...
    49969  SAP30    HDAC3     Affinity Capture-Western
    49970  BRCA1    RBBP8     Co-localization
    49971  BRCA1    BRCA1     Biochemical Activity
    49972  SET      TREX1     Co-purification
    49973  SET      TREX1     Reconstituted Complex
    49974  PLAGL1   EP300     Reconstituted Complex
    49975  PLAGL1   CREBBP    Reconstituted Complex
    49976  EP300    PLAGL1    Affinity Capture-Western
    49977  MTA1     ESR1      Reconstituted Complex
    49978  SIRT2    EP300     Affinity Capture-Western
    49979  EP300    SIRT2     Affinity Capture-Western
    49980  EP300    HDAC1     Affinity Capture-Western
    49981  EP300    SIRT2     Biochemical Activity
    49982  MIER1    CREBBP    Reconstituted Complex
    49983  SMARCA4  SIN3A     Affinity Capture-Western
    49984  SMARCA4  HDAC2     Affinity Capture-Western
    49985  ESR1     NCOA6     Affinity Capture-Western
    49986  ESR1     TOP2B     Affinity Capture-Western
    49987  ESR1     PRKDC     Affinity Capture-Western
    49988  ESR1     PARP1     Affinity Capture-Western
    49989  ESR1     XRCC5     Affinity Capture-Western
    49990  ESR1     XRCC6     Affinity Capture-Western
    49991  PARP1    TOP2B     Affinity Capture-Western
    49992  PARP1    PRKDC     Affinity Capture-Western
    49993  PARP1    XRCC5     Affinity Capture-Western
    49994  PARP1    XRCC6     Affinity Capture-Western
    49995  SIRT3    XRCC6     Affinity Capture-Western
    49996  SIRT3    XRCC6     Reconstituted Complex
    49997  SIRT3    XRCC6     Biochemical Activity
    49998  HDAC1    PAX3      Affinity Capture-Western

    [49999 rows x 3 columns]
    16/09/19 12:46:11 WARN TaskSetManager: Stage 1 contains a task of very large size (1211 KB). The maximum recommended task size is 100 KB.
    [Row(src=u'MAP2K4', dst=u'FLNC', rel_type=u'Two-hybrid'), Row(src=u'MYPN', dst=u'ACTN2', rel_type=u'Two-hybrid'), Row(src=u'ACVR1', dst=u'FNTA', rel_type=u'Two-hybrid'), Row(src=u'GATA2', dst=u'PML', rel_type=u'Two-hybrid'), Row(src=u'RPA2', dst=u'STAT3', rel_type=u'Two-hybrid')]
    Traceback (most recent call last):
      File "/Users/username/PycharmProjects/GenesAssociation/__init__.py", line 99, in <module>
        g = gf.GraphFrame(VERTICES, EDGES)
      File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/pyspark/graphframes/graphframe.py", line 62, in __init__
        self._jvm_gf_api = _java_api(self._sc)
      File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/pyspark/graphframes/graphframe.py", line 34, in _java_api
        return jsc._jvm.Thread.currentThread().getContextClassLoader().loadClass(javaClassName) \
      File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
      File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
        return f(*a, **kw)
      File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
    py4j.protocol.Py4JJavaError: An error occurred while calling o50.loadClass.
    : java.lang.ClassNotFoundException: org.graphframes.GraphFramePythonAPI
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:280)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:211)
        at java.lang.Thread.run(Thread.java:745)

    Process finished with exit code 1

Thanks in advance.

You can set PYSPARK_SUBMIT_ARGS in your code:

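    import os
    from pyspark.sql import SparkSession

    # Must be set before the first SparkSession/SparkContext is created
    os.environ["PYSPARK_SUBMIT_ARGS"] = (
        "--packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 pyspark-shell"
    )

    spark = SparkSession.builder.getOrCreate()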
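If more than one package is needed, the value can be assembled programmatically. The following is only a sketch: `build_submit_args` is a hypothetical helper name, not part of PySpark, and it relies on `--packages` accepting a comma-separated list of Maven coordinates with `pyspark-shell` as the final token.

```python
import os


def build_submit_args(packages):
    """Build a PYSPARK_SUBMIT_ARGS value for the given Maven coordinates.

    Hypothetical helper for illustration; not part of PySpark.
    """
    # --packages takes a comma-separated list of group:artifact:version
    # coordinates, and the value has to end with "pyspark-shell".
    return "--packages {} pyspark-shell".format(",".join(packages))


# Set the variable before any SparkContext or SparkSession is created,
# because that is when PySpark launches the JVM and reads it.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(
    ["graphframes:graphframes:0.2.0-spark2.0-s_2.11"]
)
```

Setting the variable after a context already exists in the same process has no effect on that context, so this belongs at the very top of the script.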

or in PyCharm's run configuration (Run -> Edit Configurations -> choose the configuration -> select the Configuration tab -> choose Environment variables -> add PYSPARK_SUBMIT_ARGS). Here is a minimal working example:

    import os
    import sys

    SPARK_HOME = ...
    os.environ["SPARK_HOME"] = SPARK_HOME
    # os.environ["PYSPARK_SUBMIT_ARGS"] = ...  If not set in PyCharm config

    sys.path.append(os.path.join(SPARK_HOME, "python"))
    sys.path.append(os.path.join(SPARK_HOME, "python/lib/py4j-0.10.3-src.zip"))

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    v = spark.createDataFrame([("a", "foo"), ("b", "bar"),], ["id", "attr"])
    e = spark.createDataFrame([("a", "b", "foobar")], ["src", "dst", "rel"])

    from graphframes import *

    g = GraphFrame(v, e)
    g.inDegrees.show()

    spark.stop()

You can also add the packages or the jars to your spark-defaults.conf.
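As a rough sketch of the spark-defaults.conf route, the snippet below appends a `spark.jars.packages` entry to the file under `conf/` in the Spark home directory. The function name `add_package_to_defaults` is hypothetical, and the sketch assumes the file does not already define `spark.jars.packages`; if it does, edit that existing line by hand instead.

```python
import os


def add_package_to_defaults(spark_home, coordinate):
    """Append a spark.jars.packages entry to conf/spark-defaults.conf.

    Hypothetical helper for illustration. It does not check whether the
    property is already present in the file.
    """
    conf_dir = os.path.join(spark_home, "conf")
    if not os.path.isdir(conf_dir):
        os.makedirs(conf_dir)
    conf_path = os.path.join(conf_dir, "spark-defaults.conf")
    # Each line of the file is a property name, whitespace, then the value.
    with open(conf_path, "a") as conf_file:
        conf_file.write("spark.jars.packages {}\n".format(coordinate))
    return conf_path
```

It would be called once, outside the job itself, for example as `add_package_to_defaults(SPARK_HOME, "graphframes:graphframes:0.2.0-spark2.0-s_2.11")`.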

If you use Python 3 with graphframes 0.2, there is a known issue with extracting the Python libraries from the JAR, so you will have to do it manually. For example, you can download the JAR file, unzip it, and make sure that the root directory containing graphframes is on your Python path. This has been fixed in graphframes 0.3.
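The manual workaround can be sketched in a few lines, since a JAR is an ordinary zip archive. This assumes the Python sources sit under a top-level `graphframes/` directory inside the JAR; `add_graphframes_from_jar` and both of its path arguments are hypothetical names used only for illustration.

```python
import sys
import zipfile


def add_graphframes_from_jar(jar_path, target_dir):
    """Extract the Python package bundled in the JAR and put it on sys.path.

    Hypothetical helper; assumes the package lives under "graphframes/"
    at the root of the archive.
    """
    with zipfile.ZipFile(jar_path) as jar:
        # Keep only the Python package, skipping the compiled JVM classes.
        members = [
            name for name in jar.namelist() if name.startswith("graphframes/")
        ]
        jar.extractall(target_dir, members)
    # The directory that contains graphframes/ must be importable.
    if target_dir not in sys.path:
        sys.path.append(target_dir)
    return members
```

After calling it with the downloaded JAR and a scratch directory, `from graphframes import *` should resolve against the extracted copy. The JVM side still needs the JAR itself, for example through PYSPARK_SUBMIT_ARGS.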