How can I parse fuzzy text descriptions into structured time-series data in Python?
I am extracting streaming subscriber data from text using an LLM, and I get results like this:
{
  "raw_extractions": [
    {
      "platform_mention": "Netflix",
      "year_mention": "2012",
      "subscriber_mention": "roughly 30 million subscribers worldwide"
    },
    {
      "platform_mention": "Netflix",
      "year_mention": "2020",
      "subscriber_mention": "just under 200 million"
    },
    {
      "platform_mention": "Netflix",
      "year_mention": "2022",
      "subscriber_mention": "hovered around 220 million subscribers"
    }
  ]
}
I need to convert this into clean time-series data for analysis, with subscriber counts in millions:
| year | platform | subscribers_min | subscribers_max | confidence |
|------|----------|----------------|-----------------|------------|
| 2012 | Netflix | 30 | 30 | medium |
| 2020 | Netflix | 195 | 200 | medium |
| 2022 | Netflix | 220 | 220 | medium |
What is the best Python approach for parsing fuzzy phrases like "roughly 30 million" and "just under 200 million" into numeric ranges?
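
A minimal sketch of one possible approach, using only the standard library: pull the number and unit out with a regular expression, then let a small table of hedge phrases decide how the range is widened. Everything named here (`HEDGE_RULES`, `UNIT_TO_MILLIONS`, `parse_subscribers`, `to_rows`) is hypothetical, the 2.5% margin for "just under" is an assumption picked only because it yields 195 to 200 for the example above, and the rule "any hedge word means `medium`, otherwise `high`" is likewise an assumption rather than an established convention.

```python
import re

# Assumed hedge rules: (regex, low multiplier, high multiplier).
# The 2.5% margin is an arbitrary choice; tune it for your data.
HEDGE_RULES = [
    (r"\b(?:just|slightly|a little) under\b|\balmost\b|\bnearly\b", 0.975, 1.0),
    (r"\b(?:just|slightly|a little) over\b", 1.0, 1.025),
    # Symmetric hedges keep a point estimate, matching the desired table.
    (r"\b(?:roughly|around|about|approximately)\b", 1.0, 1.0),
]

# Scale factors that convert a stated unit to millions.
UNIT_TO_MILLIONS = {"thousand": 0.001, "million": 1.0, "billion": 1000.0}

# A number (optionally with thousands separators or decimals) and an optional unit.
NUMBER_RE = re.compile(
    r"(\d[\d,]*(?:\.\d+)?)\s*(thousand|million|billion)?", re.IGNORECASE
)


def parse_subscribers(phrase):
    """Return (min, max, confidence) in millions, or None if no number is found."""
    match = NUMBER_RE.search(phrase)
    if match is None:
        return None
    number = float(match.group(1).replace(",", ""))
    unit = match.group(2)
    if unit is None:
        # No unit word: assume a raw subscriber count.
        value = number / 1_000_000
    else:
        value = number * UNIT_TO_MILLIONS[unit.lower()]

    low_mult, high_mult, confidence = 1.0, 1.0, "high"
    for pattern, low, high in HEDGE_RULES:
        if re.search(pattern, phrase, re.IGNORECASE):
            low_mult, high_mult, confidence = low, high, "medium"
            break
    return round(value * low_mult, 2), round(value * high_mult, 2), confidence


def to_rows(payload):
    """Turn the LLM output into a list of row dicts sorted by platform and year."""
    rows = []
    for item in payload["raw_extractions"]:
        parsed = parse_subscribers(item["subscriber_mention"])
        if parsed is None:
            continue  # skip mentions without a parseable number
        low, high, confidence = parsed
        rows.append({
            "year": int(item["year_mention"]),
            "platform": item["platform_mention"],
            "subscribers_min": low,
            "subscribers_max": high,
            "confidence": confidence,
        })
    return sorted(rows, key=lambda row: (row["platform"], row["year"]))
```

Calling `to_rows` on the JSON above should reproduce the three rows of the target table (with the counts as floats). The resulting list of dicts can be passed to a DataFrame constructor if pandas is available. A rule table like this is easy to audit and extend, but it only covers phrasings it has been told about, so unrecognized hedges silently fall through to a point estimate with `high` confidence.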