PySpark

since 2022-01-27

<html> <iframe src="www.slideshare.net/slideshow/embed_code/key/aDkJhgeS40C9cP" width="595" height="485" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;" allowfullscreen> </iframe> <div style="margin-bottom:5px"> <strong> <a href="www.slideshare.net/nishimotz/220126-pythondatalakespark" title="220126 python-datalake-spark" target="_blank">220126 python-datalake-spark</a> </strong> from <strong><a href="www.slideshare.net/nishimotz" target="_blank">Takuya Nishimoto</a></strong> </div> </html>

Let's try reading the Parquet + Snappy files that an AWS Athena CTAS query wrote to S3.

<code python>
from pyspark import SparkContext
from pyspark.sql import SparkSession

# Start a Spark context and wrap it in a SparkSession
spark_context = SparkContext()
spark = SparkSession(spark_context)

# File written by the Athena CTAS query
# (Snappy compression is handled transparently by the Parquet reader)
filename = "20220124_093108_00027_*_----"
df = spark.read.parquet(filename)
type(df)  # ⇒ pyspark.sql.dataframe.DataFrame
df.show()

# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("chap7_japan_ctas")
spark.sql("select * from chap7_japan_ctas").collect()
df.count()

# Convert to a pandas DataFrame (all rows are collected on the driver)
df2 = df.toPandas()
type(df2)  # ⇒ pandas.core.frame.DataFrame
df2.tail(10)
</code>
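The file name above has no `.parquet` extension, so a quick sanity check before handing a downloaded file to Spark may be useful. The following is a minimal sketch, not part of the original notes: it relies on the fact that an unencrypted Parquet file starts and ends with the 4-byte marker `PAR1`. The helper name `looks_like_parquet` is hypothetical, and the check assumes a local copy of the file.

```python
import os
from pathlib import Path

# 4-byte marker found at both ends of an unencrypted Parquet file
PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path):
    """Return True if the file starts and ends with the Parquet magic bytes.

    Hypothetical helper for illustration; it only inspects 8 bytes and does
    not validate the footer metadata or the row groups.
    """
    data_path = Path(path)
    # Header magic (4) + footer length (4) + footer magic (4) = 12 bytes minimum
    if data_path.stat().st_size < 12:
        return False
    with data_path.open("rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)  # jump to the last 4 bytes
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

A call such as `looks_like_parquet("some_local_copy")` (path is a placeholder) returns `True` only when both markers are present. Passing this check does not prove that the file is readable, only that it is shaped like Parquet.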

ja.nishimotz.com
