Hugging Face: applying a transformation with nested output to datasets without loading them into memory

06:51 04 Jul 2025

I am trying to apply the transformation below to prepare my dataset for fine-tuning with Unsloth and Hugging Face. It requires the dataset to be in the following format.

import base64

def convert_to_conversation(sample):
    instruction = "OCR the image into markdown format"

    # Get raw image bytes directly
    img_bytes = sample["image"]

    # Convert bytes to base64 string
    img_b64 = base64.b64encode(img_bytes).decode("utf-8")
    img_data_uri = f"data:image/png;base64,{img_b64}"

    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": instruction},
                    {"type": "image", "image": img_data_uri}
                ]
            },
            {
                "role": "assistant",
                "content": [
                    {"type": "text", "text": sample["markdown"]}
                ]
            },
        ]
    }

However, when I tried to apply this function with dataset.map:

mapped_dataset = dataset.map(
    convert_to_conversation,
    remove_columns=dataset.column_names,  # This removes all original columns
    batched=False
)

This is the output:

{'messages': [{'content': [{'image': None,
     'text': 'OCR the image into markdown format',
     'type': 'text'},
    {'image': 'data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAABTsAAAk7CAIAAACuiIx1AAEAAElEQVR4nOzdd3wT5R8H8CerSZruAS0FSlldtEwZZZRRVtkIyPqBgEWWICKCoC

It changes the structure: the text entry gains an extra 'image': None key. If I build the list with a list comprehension instead, I run out of memory. How can I transform the dataset without loading all of it into memory? Thanks.
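
For context, the extra key most likely comes from Arrow storage: a list of dicts is stored as a list of one struct type, so the keys of all content parts are merged and missing ones are filled with None. Below is a minimal sketch of two ways around this, not a definitive fix. The helper names (strip_none_fields, to_conversation, lazy_transform) are hypothetical, and the sketch assumes the "image" column holds raw PNG bytes as in the question. The first helper removes the None keys after the fact; the second builds the conversation on the fly, one batch at a time, so nothing is written back to Arrow.

```python
import base64

INSTRUCTION = "OCR the image into markdown format"

def strip_none_fields(messages):
    """Drop keys whose value is None from every content part."""
    return [
        {
            "role": msg["role"],
            "content": [
                {k: v for k, v in part.items() if v is not None}
                for part in msg["content"]
            ],
        }
        for msg in messages
    ]

def to_conversation(image_bytes, markdown):
    """Build the nested message list for one sample (assumes PNG bytes)."""
    img_b64 = base64.b64encode(image_bytes).decode("utf-8")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": INSTRUCTION},
                {"type": "image", "image": f"data:image/png;base64,{img_b64}"},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": markdown}],
        },
    ]

def lazy_transform(batch):
    """On-the-fly transform: `batch` is a dict of lists, one entry per row."""
    return {
        "messages": [
            to_conversation(img, md)
            for img, md in zip(batch["image"], batch["markdown"])
        ]
    }

# Assumed usage with the `dataset` object from the question:
# lazy_dataset = dataset.with_transform(lazy_transform)
# lazy_dataset[0]["messages"]  # built only when the row is accessed
```

With Dataset.with_transform the function runs only when rows are read, so memory use should stay bounded by the batch being accessed. If the output of dataset.map has to be kept, strip_none_fields can be applied to each row's "messages" at read time, for example in a collate function.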

dataset huggingface huggingface-datasets
