Hugging Face datasets: applying a transformation with nested output without loading everything into memory
I am trying to apply the transformation below to prepare my dataset for fine-tuning with Unsloth and Hugging Face. It requires the dataset to be in the following format:
import base64

def convert_to_conversation(sample):
    instruction = "OCR the image into markdown format"
    # Get raw image bytes directly
    img_bytes = sample["image"]
    # Convert bytes to base64 string
    img_b64 = base64.b64encode(img_bytes).decode("utf-8")
    img_data_uri = f"data:image/png;base64,{img_b64}"
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": instruction},
                    {"type": "image", "image": img_data_uri}
                ]
            },
            {
                "role": "assistant",
                "content": [
                    {"type": "text", "text": sample["markdown"]}
                ]
            },
        ]
    }
However, when I apply this function with datasets.map:
mapped_dataset = dataset.map(
    convert_to_conversation,
    remove_columns=dataset.column_names,  # This removes all original columns
    batched=False
)
This is the output:
{'messages': [{'content': [{'image': None,
'text': 'OCR the image into markdown format',
'type': 'text'},
{'image': 'data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAABTsAAAk7CAIAAACuiIx1AAEAAElEQVR4nOzdd3wT5R8H8CerSZruAS0FSlldtEwZZZRRVtkIyPqBgEWWICKCoC
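The structure changes: every entry in content now carries all of the keys (image, text, type), and the ones I did not set are filled in as None. If I build the list with a list comprehension instead, I run out of memory. How can I transform the dataset without loading all of it into memory? Thanks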
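For illustration, here is a minimal sketch of two directions that might help, using only the standard library. It assumes each sample has an "image" field holding raw PNG bytes and a "markdown" string, as in the function above. The helper names iter_conversations, transform_batch and drop_none_fields are hypothetical and not part of any library. The first two convert samples lazily, one at a time or one batch at a time. The third strips the None-filled keys that show up in the mapped output.

```python
import base64
from typing import Any, Dict, Iterable, Iterator, List

INSTRUCTION = "OCR the image into markdown format"


def convert_to_conversation(sample: Dict[str, Any]) -> Dict[str, Any]:
    """Same conversion as in the question: raw image bytes -> base64 data URI."""
    img_b64 = base64.b64encode(sample["image"]).decode("utf-8")
    return {
        "messages": [
            {"role": "user", "content": [
                {"type": "text", "text": INSTRUCTION},
                {"type": "image", "image": f"data:image/png;base64,{img_b64}"},
            ]},
            {"role": "assistant", "content": [
                {"type": "text", "text": sample["markdown"]},
            ]},
        ]
    }


def iter_conversations(samples: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Hypothetical helper: convert lazily, holding one converted sample at a time."""
    for sample in samples:
        yield convert_to_conversation(sample)


def transform_batch(batch: Dict[str, List[Any]]) -> Dict[str, List[Any]]:
    """Hypothetical helper: batch form (dict of column lists in and out)."""
    # Rebuild row dicts from the column lists, then convert each row.
    rows = (dict(zip(batch, values)) for values in zip(*batch.values()))
    return {"messages": [convert_to_conversation(row)["messages"] for row in rows]}


def drop_none_fields(obj: Any) -> Any:
    """Hypothetical helper: recursively drop dict keys whose value is None."""
    if isinstance(obj, dict):
        return {k: drop_none_fields(v) for k, v in obj.items() if v is not None}
    if isinstance(obj, list):
        return [drop_none_fields(v) for v in obj]
    return obj
```

The batch form takes and returns a dict of column lists, which is the shape that on-the-fly transforms such as Dataset.with_transform in the datasets library are documented to work with. Wiring it up that way is an assumption and is not exercised here. drop_none_fields could instead be applied to each example read back from the mapped dataset if the padded keys are the only problem.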