I’m designing a high-volume analytics system and I’m trying to choose the data model before the dataset becomes too large to change easily.
Current setup:
Custom app events are stored in ClickHouse
Device and user records are stored in MongoDB
MongoDB is where current device/user dimensions live, such as:
os_name
os_version
device_model
app_version
current user_uuid
ClickHouse is the event store
Important concepts:
device_uuid
user_uuid
client event timestamp
server-side received_at
Important product behavior:
A device can start as anonymous
For example, at first
user_uuid = device_uuidLater the same device can become a real user
For example,
user_uuid = JohnWhen that happens, past anonymous events should also be attributed to the real user
Because of that, storing user_uuid directly in the huge ClickHouse events table creates expensive update/mutation problems.
So one design I’m considering is:
events table in ClickHouse, immutable:
project_id
device_uuid
timestamp
received_at
event_code
session_id
properties
Separate ownership table in ClickHouse:
project_id
device_uuid
valid_from
valid_to
user_uuid
Then user resolution is done by joining events to ownership intervals using device_uuid + received_at.
Short example:
10:00 device D1 is anonymous, so owner is D1
10:05 and 10:06 events are inserted for device D1
10:10 user logs in, now owner becomes John
Those past anonymous events should also appear under John
My concern is not ingestion. My concern is analytics later.
For example, I may need queries like:
unique users who did app_open -> permission_granted
windowFunnel by user
retention / cohort analysis by user
“show events from iOS 18.6 devices for users who later converted”
segment by event properties + resolved user + device/app dimensions
high-volume dashboards with many users and many events
Questions:
Is keeping user_uuid out of the main events table still the right approach at large scale?
Is a separate ownership table in ClickHouse a practical production-ready model for this?
If device/app/os dimensions stay in MongoDB while events stay in ClickHouse, is that a bad sign for analytics?
For example, would queries like “events from iOS 18.6 devices for converted users” become awkward because they depend on both MongoDB dimensions and ClickHouse events?
Notes:
I want to choose the most optimized approach in terms of RAM, CPU, and disk usage.
I also considered storing `user_uuid` directly in the `events` table and using ClickHouse lightweight updates when updates are needed, but since that feature is still experimental, it feels risky.
The strange part is that Claude, ChatGPT, and Gemini all suggested the separate ownership table approach above, but I couldn’t find any article or documentation describing a similar architecture specifically for ClickHouse.