I am loading data using pyspark with spark_reader.load(data_path).
However, in some cases the data can be very messy: field names use different casing from one row to the next, including inside nested structs.
Here is an example of the data:
[
  {
    "field_1": "1",
    "Field_2": 1,
    "field_3": "b",
    "field_4": [{"A": 1, "b": 2}, {"A": 3, "b": 4}]
  },
  {
    "Field_1": "2",
    "Field_2": 2,
    "Field_3": "BB",
    "Field_4": [{"a": 1, "B": 2}, {"a": 3, "B": 4}]
  }
]
In this case, the load fails with the following error:
pyspark.sql.utils.AnalysisException: Found duplicate column(s) in the data schema: `field_1`, `field_3`, `field_4`
I can't find a clean way to handle this case. I tried the following workaround:
raw = spark.read.text(data_path)
normalized_rdd = raw.rdd.mapPartitions(_normalize_partition)
raw_df = spark.read.json(normalized_rdd)
Here, _normalize_partition is a Python function that normalizes the column names. However, this does not work in my case because I use Databricks serverless compute, which rejects the RDD call with:
[NOT_IMPLEMENTED] Using custom code using PySpark RDDs is not allowed on serverless compute.