Why is an AWS Glue job creating partition file names that include 'unnamed'?
21:07 31 Jul 2022

We are using an AWS Glue job to load and de-duplicate data, and we are changing it so that the crawler no longer determines the schema metadata; we now define the schema explicitly.

As a result, we are using AWS's recommended method 2, described in the documentation linked below and shown in the code that follows:

https://docs.aws.amazon.com/glue/latest/dg/update-from-job.html

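# glueContext, tgt_path, partition_key, tgt_db, tgt_table and last_transform
# are defined earlier in the job script.
sink = glueContext.getSink(connection_type="s3",
                           path=tgt_path,
                           enableUpdateCatalog=True,  # update the Data Catalog from the job
                           partitionKeys=partition_key)
sink.setFormat("glueparquet")
sink.setCatalogInfo(catalogDatabase=tgt_db, catalogTableName=tgt_table)
sink.writeFrame(last_transform)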

We use this code in two separate jobs. The first job writes partition files with the following naming convention:

  • run-timestamp-part-block-0-0-r-someNumber-snappy.parquet

Example: run-1659269628417-part-block-0-0-r-00001-snappy.parquet

However, the second job is writing the files with the following naming convention:

  • run-unnamed-36-part-block-0-0-r-someNumber-snappy.parquet

Example: run-unnamed-36-part-block-0-0-r-00001.snappy.parquet
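
To find out which objects under a prefix are affected, the two patterns can be told apart with a small helper. This is only a sketch: the regular expression is inferred from the two examples above, not from any Glue specification, and `classify_run_id` is a hypothetical name.

```python
import re

# Pattern inferred from the two example file names in this question;
# it is an assumption, not an official Glue naming rule.
GLUE_FILE_RE = re.compile(
    r"^run-(?P<run_id>\d+|unnamed-\d+)"
    r"-part-block-\d+-\d+-r-(?P<part>\d+)[-.]snappy\.parquet$"
)


def classify_run_id(file_name):
    """Return 'timestamp', 'unnamed', or None if the name does not match."""
    match = GLUE_FILE_RE.match(file_name)
    if match is None:
        return None
    if match.group("run_id").startswith("unnamed"):
        return "unnamed"
    return "timestamp"
```

Calling `classify_run_id` on each file name (the last path segment of each S3 key) in the target path would show whether a given job run produced timestamped or unnamed files.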

Does anyone know why unnamed is used in the file name instead of a timestamp? I have searched the AWS documentation without finding an answer. The question linked below indicates that the target file name cannot be specified on the fly; it can only be changed afterwards.

Note: the data in the unnamed file appears to be accurate.

AWS Glue Job Output File Name
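
If renaming afterwards turns out to be the only option, a rough sketch of that workaround follows. It assumes a boto3-style S3 client is passed in (only its `copy_object` and `delete_object` calls are used); `rename_unnamed_key` and `rename_object` are hypothetical helper names, and the timestamp to substitute is whatever the caller chooses. S3 has no rename operation, so the object is copied and the original is then deleted.

```python
import re


def rename_unnamed_key(key, run_timestamp):
    """Swap the 'unnamed-N' segment of an S3 key for a caller-supplied timestamp.

    Hypothetical helper: keys without an 'unnamed' segment are returned unchanged.
    """
    return re.sub(r"run-unnamed-\d+-", f"run-{run_timestamp}-", key, count=1)


def rename_object(s3_client, bucket, old_key, new_key):
    """Emulate a rename on S3: copy to the new key, then delete the old one."""
    if old_key == new_key:
        return  # nothing to do; avoids deleting the only copy
    s3_client.copy_object(
        Bucket=bucket,
        Key=new_key,
        CopySource={"Bucket": bucket, "Key": old_key},
    )
    s3_client.delete_object(Bucket=bucket, Key=old_key)
```

In a real job the client would come from `boto3.client("s3")`, and the keys from listing the target path. This has not been verified against a job that uses enableUpdateCatalog.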

Any help is appreciated.

amazon-web-services amazon-s3 aws-glue amazon-athena aws-glue-data-catalog