We use an AWS Glue job to load and de-duplicate data. We are changing it so that it no longer relies on the crawler to determine schema metadata; we now define the schema explicitly.
As a result, we are using method 2 from AWS's documentation (see below):
https://docs.aws.amazon.com/glue/latest/dg/update-from-job.html
sink = glueContext.getSink(connection_type="s3",
                           path=tgt_path,
                           enableUpdateCatalog=True,
                           partitionKeys=partition_key)
sink.setFormat("glueparquet")
sink.setCatalogInfo(catalogDatabase=tgt_db, catalogTableName=tgt_table)
sink.writeFrame(last_transform)
We use this code in two separate jobs. The first job writes partition files with the following naming convention:
- run-timestamp-part-block-0-0-r-someNumber-snappy.parquet
Example: run-1659269628417-part-block-0-0-r-00001-snappy.parquet
However, the second job is writing the files with the following naming convention:
- run-unnamed-36-part-block-0-0-r-someNumber-snappy.parquet
Example: run-unnamed-36-part-block-0-0-r-00001.snappy.parquet
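To check which convention a given output file follows, a small sketch like the one below may help. The pattern is only inferred from the two examples above and is not an official Glue naming specification; `classify` is a hypothetical helper name, and the separator before `snappy` is accepted as either `-` or `.` as a precaution.

```python
import re

# Pattern inferred from the two example file names; this is an assumption,
# not a documented Glue format.
RUN_FILE = re.compile(
    r"^run-(?P<run_id>\d+|unnamed-\d+)-part-block-\d+-\d+-r-(?P<part>\d+)"
    r"[-.]snappy\.parquet$"
)


def classify(file_name):
    """Return 'timestamp', 'unnamed', or None if the name does not match."""
    match = RUN_FILE.match(file_name)
    if match is None:
        return None
    if match.group("run_id").startswith("unnamed"):
        return "unnamed"
    return "timestamp"
```

Running `classify` over the object names under each job's target path would show whether a job ever mixes the two conventions.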
Does anyone know why "unnamed" appears in the file name instead of a timestamp? I have searched AWS's documentation without finding an answer. From what I have read, it is not possible to specify the target file name on the fly; the file name can only be changed afterwards.
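For context, changing the name afterwards would look roughly like the sketch below. S3 has no rename operation, so each object is copied to a new key and the original is deleted. This is an untested outline under assumptions: `rename_unnamed_runs` and `run_label` are hypothetical names, and `s3` is expected to be a boto3 S3 client (or anything exposing `list_objects_v2`, `copy_object`, and `delete_object`).

```python
import re

# Matches the "run-unnamed-N-" prefix seen in the second job's output.
UNNAMED_PREFIX = re.compile(r"^run-unnamed-\d+-")


def rename_unnamed_runs(s3, bucket, prefix, run_label):
    """Copy each 'run-unnamed-N-...' object to a key that uses run_label,
    then delete the original. Returns a list of (old_key, new_key) pairs."""
    renamed = []
    # list_objects_v2 returns at most 1000 keys per call; a paginator
    # would be needed for larger prefixes.
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    for obj in response.get("Contents", []):
        old_key = obj["Key"]
        folder, _, name = old_key.rpartition("/")
        new_name = UNNAMED_PREFIX.sub(lambda _m: f"run-{run_label}-", name, count=1)
        if new_name == name:
            continue  # not an "unnamed" file, leave it alone
        new_key = f"{folder}/{new_name}" if folder else new_name
        s3.copy_object(
            Bucket=bucket,
            Key=new_key,
            CopySource={"Bucket": bucket, "Key": old_key},
        )
        s3.delete_object(Bucket=bucket, Key=old_key)
        renamed.append((old_key, new_key))
    return renamed
```

A call might look like `rename_unnamed_runs(boto3.client("s3"), "my-bucket", "my/target/path/", "1659269628417")`, where the bucket, prefix, and label are placeholders. I would still prefer to understand why the job produces "unnamed" in the first place rather than rename after every run.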
Note: the data in the unnamed file appears to be accurate.
Any help is appreciated.