How to group connected materials and aggregate stocks in PySpark/SQL on Databricks
00:31 05 May 2026

I'm looking for ideas or advice on how to solve the following problem. Sample raw data and the expected output are given below. The solution can use SQL or Python; I'm currently working in a Databricks notebook.

As you can see in the data, not all of the materials are directly linked to each other. The group label can be any material (for example A1 or B3), as long as it is within the same group. However, if the group column already holds group A1, it must stay A1 even when I add another material to that group; I want to retain the existing label. So my plan is to save the result to ADLS. For the STOCKS column, I also want to aggregate the stocks within each group.
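To make the requirement concrete, here is a minimal pure-Python sketch of the logic I have in mind: treat the links as a graph, find the connected groups with union-find, keep an existing group label if one is present, and sum the stocks per group. The function name `assign_groups`, the input shapes, the fallback label rule (smallest material id), and the data are all hypothetical and are not the sample from this question.

```python
from collections import defaultdict


def assign_groups(links, existing_groups, stocks):
    """Group linked materials, keep existing group labels, sum stocks per group."""
    parent = {}

    def find(x):
        # Return the representative of x's component (with path halving).
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Register every material, including ones that have no link at all.
    for material in stocks:
        find(material)
    for a, b in links:
        union(a, b)

    # Collect the members of each connected component.
    components = defaultdict(list)
    for material in list(parent):
        components[find(material)].append(material)

    group_of, totals = {}, {}
    for members in components.values():
        known = sorted({existing_groups[m] for m in members if m in existing_groups})
        # Assumption: reuse an existing label when there is one; otherwise
        # fall back to the smallest material id as the new label.
        label = known[0] if known else min(members)
        for m in members:
            group_of[m] = label
        totals[label] = sum(stocks.get(m, 0) for m in members)
    return group_of, totals


# Hypothetical data, only to illustrate the expected behaviour.
links = [("M1", "M2"), ("M2", "M3"), ("M4", "M5")]
existing_groups = {"M1": "A1"}  # group already saved from a previous run
stocks = {"M1": 10, "M2": 5, "M3": 7, "M4": 1, "M5": 2, "M6": 4}

group_of, totals = assign_groups(links, existing_groups, stocks)
print(group_of)
print(totals)
```

This runs on the driver only and is meant to show the intended result, not a scalable Spark solution. In the real process, `existing_groups` would be whatever was previously saved to ADLS.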

Sample (raw data and expected output, attached as an image):

Any help would be greatly appreciated. Thanks!

python pyspark databricks data-engineering