How can I write only one CSV file from Spark to S3?
06:48 28 May 2023

I have lots of CSV files. After running a Spark SQL query over them, I want to write the result as a single CSV file.

For example, I have news1.csv, news2.csv, news3.csv, etc. in S3. I load them from S3, create a DataFrame, and run Spark SQL over it. Afterwards I want to upload the result to S3 as only one CSV file.

At first I tried writing with .mode("append") for each CSV file:

df = spark.sql(...)

df.write \
  .option("header","true") \
  .option("encoding", "UTF-8") \
  .mode("append") \
  .option("extracopyoptions", "TIMEFORMAT 'auto'") \
  .csv("s3a://news/test1")

But in this case append didn't merge anything: each CSV file was saved in news/test1 as its own part-00000... file, so I ended up with part-00000..., part-00000..., part-00000...

Second, I unioned the DataFrames into one:

df = spark.sql(...)
df_total = df_total.union(df)   # repeated for each input file

df_total.write \
  .option("header","true") \
  .option("encoding", "UTF-8") \
  .mode("append") \
  .option("extracopyoptions", "TIMEFORMAT 'auto'") \
  .csv("s3a://news/test2")

But in this case, even though I made one DataFrame, the output was still saved in news/test2 as multiple files: part-00000..., part-00001..., part-00002...

How can I save only one CSV file to S3?

I'd appreciate any help.

apache-spark amazon-s3 pyspark