kedro.contrib.decorators.pandas_to_spark¶
-
kedro.contrib.decorators.pandas_to_spark(spark)[source]¶ Inspects the decorated function’s inputs and converts all pandas DataFrame inputs to spark DataFrames.
Note that in the example below we have enabled
spark.sql.execution.arrow.enabled. For this to work, you should firstpip install pyarrowand addpyarrowtorequirements.txt. Enabling this option makes the convertion between pyspark <-> DataFrames much faster.Parameters: spark ( SparkSession) –The spark session singleton object to use for the creation of the pySpark DataFrames. A possible pattern you can use here is the following:
spark.py
from pyspark.sql import SparkSession def get_spark(): return ( SparkSession.builder .master("local[*]") .appName("kedro") .config("spark.driver.memory", "4g") .config("spark.driver.maxResultSize", "3g") .config("spark.sql.execution.arrow.enabled", "true") .getOrCreate() )
nodes.py
from spark import get_spark @pandas_to_spark(get_spark()) def node_1(data): data.show() # data is pyspark.sql.DataFrame
Return type: CallableReturns: The original function with any pandas DF inputs translated to spark.