Handle inconsistent case in column names

07:17 27 May 2026

I am loading data with PySpark using spark_reader.load(data_path).

However, in some cases the data can be very messy: the same field may be written with different casing from one row to the next, including inside nested structs.
Here is an example of the data:

[
    {
        "field_1": "1",
        "Field_2": 1,
        "field_3": "b",
        "field_4": [{"A": 1, "b": 2}, {"A": 3, "b": 4}]
    },
    {
        "Field_1": "2",
        "Field_2": 2,
        "Field_3": "BB",
        "Field_4": [{"a": 1, "B": 2}, {"a": 3, "B": 4}]
    }
]

In this case, the load fails with the following error:

pyspark.sql.utils.AnalysisException: Found duplicate column(s) in the data schema: `field_1`, `field_3`, `field_4`

I can't find a clean way to handle this case. I tried the following workaround:

raw = spark.read.text(data_path)
normalized_rdd = raw.rdd.mapPartitions(_normalize_partition)
raw_df = spark.read.json(normalized_rdd)

Here _normalize_partition is a Python function that normalizes the column names. However, this does not work in my case because I am using Databricks serverless compute:

[NOT_IMPLEMENTED] Using custom code using PySpark RDDs is not allowed on serverless compute.

apache-spark pyspark databricks
