Status: Active
Created: November 2025
Last Updated: November 2025
Summary
To position Polar for ingesting an increased number of events (100-1000x the current volume), and to minimize the number of data migrations we will need to do, the ambition of this document is to outline three alternatives for storing event data, based on current and imagined future query patterns.
Goals
- List and filter events quickly (<200 ms), independently of the number of events associated with an organization.
- Filtering should work on arbitrary hierarchies of events.
- Aggregate arbitrary fields in event hierarchies, such as duration or cost.
- Generate metrics from event lists
- Window events over time
- Group events by arbitrary properties of events (name, description)
- Events are immutable
- Events can be ingested at any point in time (not only realtime)
- We want to be able to add more metrics and query patterns in the future
- We want our users to be able to ingest a lot of metadata that they can query in the future
Current state
Events
Events are stored in an events table. The relationship / hierarchy between events is stored via the self-referencing parent_id field, which refers to the ID of a different event in the same table. To allow for speedier lookups of events in a hierarchy, there is a separate events_closure table which stores the relationship between two events (ancestor and descendant) together with the depth of that relationship.
The event_types table exists to allow us to group events that are of the same type and to assign a customizable label to them. If you ingest two events that are called “Support Request”, they are semantically identical even though they are separate events that have happened at different points in time (potentially to different customers).
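As a point of reference, a minimal sketch of this layout is shown below; the column names and types are illustrative assumptions, not the actual schema.

```sql
-- Illustrative sketch of the current Postgres layout (not the actual schema).
CREATE TABLE event_types (
    id UUID PRIMARY KEY,
    organization_id UUID NOT NULL,
    label TEXT NOT NULL                      -- customizable label, e.g. "Support Request"
);

CREATE TABLE events (
    id UUID PRIMARY KEY,
    organization_id UUID NOT NULL,
    event_type_id UUID REFERENCES event_types (id),
    name TEXT NOT NULL,                      -- e.g. "order.paid" or "Support Request"
    parent_id UUID REFERENCES events (id),   -- self-referencing hierarchy
    root_id UUID REFERENCES events (id),
    timestamp TIMESTAMPTZ NOT NULL,
    metadata JSONB NOT NULL DEFAULT '{}'
);

CREATE TABLE events_closure (
    ancestor_id UUID NOT NULL REFERENCES events (id),
    descendant_id UUID NOT NULL REFERENCES events (id),
    depth INTEGER NOT NULL,                  -- 0 = the event itself, 1 = direct child, ...
    PRIMARY KEY (ancestor_id, descendant_id)
);
```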
Event Closure
What the diagram above aims to show is that the further down the hierarchy you go, the more entries the closure table gets, since each corresponding parent (or ancestral) relationship is represented. This means that when you select all events where ancestor_id = 1, you get the event itself and all of its descendants, no matter which generation, together with their depth from that ancestor.
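As a concrete, illustrative example: a chain where event 1 is the parent of event 2, and event 2 is the parent of event 3, yields the closure rows (ancestor_id, descendant_id, depth) = (1, 1, 0), (2, 2, 0), (3, 3, 0) for the self-relationships, (1, 2, 1) and (2, 3, 1) for the direct parent/child relationships, and (1, 3, 2) for the grandparent relationship. Selecting ancestor_id = 1 therefore returns events 1, 2 and 3 with depths 0, 1 and 2.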
Query patterns
The events are queried by joining the three tables and filtering on the closure relationship (depth = 1 for direct children, etc.).
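A rough sketch of such a lookup, reusing the illustrative column names from the sketch above (the :parameter placeholders stand for bind parameters):

```sql
-- Hypothetical sketch: fetch an event and its direct children (depth <= 1)
-- by joining events, events_closure and event_types.
SELECT e.*, ec.depth, et.label
FROM events e
JOIN events_closure ec ON ec.descendant_id = e.id
LEFT JOIN event_types et ON et.id = e.event_type_id
WHERE e.organization_id = :organization_id
  AND ec.ancestor_id = :event_id
  AND ec.depth <= 1
ORDER BY e.timestamp;
```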
Metrics
Metrics are queried by building up a series of intervals based on the queried interval. Additionally, we build a number of common table expressions (CTEs), all of which are then queried together.
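A simplified, hypothetical sketch of that shape; the orders table and its columns here stand in for the actual state tables and CTEs:

```sql
-- Hypothetical sketch: build day buckets for the queried interval, aggregate
-- per bucket in a CTE, then query the pieces together.
WITH intervals AS (
    SELECT generate_series(
        date_trunc('day', :start_timestamp::timestamptz),
        date_trunc('day', :end_timestamp::timestamptz),
        interval '1 day'
    ) AS bucket
),
order_aggregates AS (
    SELECT date_trunc('day', o.created_at) AS bucket,
           count(*) AS orders,
           sum(o.net_amount) AS revenue
    FROM orders o
    WHERE o.organization_id = :organization_id
      AND o.created_at >= :start_timestamp
      AND o.created_at < :end_timestamp
    GROUP BY 1
)
SELECT i.bucket,
       coalesce(oa.orders, 0) AS orders,
       coalesce(oa.revenue, 0) AS revenue
FROM intervals i
LEFT JOIN order_aggregates oa ON oa.bucket = i.bucket
ORDER BY i.bucket;
```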
Alternatives
1. Keep Postgres and the current event and metric structures
The current setup will hold up for some time. Since we query a combination of the state tables and the events, and since we always filter by organization_id when filtering events, growth will most likely not become entirely unmanageable. Some response times will start to creep up, but we might be able to be smarter about the way we query data.
2. Expand events to include all data necessary to generate metrics
Today we lack some specific events, and some data in the events, to be able to build out the full (or almost full) metrics purely from the events. We would need to add some information to the system events, and probably backfill the existing events, so that we can move forward without having to consider before-and-after this migration.
Events
Enriching order.paid
- currency
- total_amount information: net_amount, tax_amount, applied_balance_amount
- discount information: discount_amount, discount_id
- platform_fee
- subscription information: subscription_id, subscription_type (one time, monthly, yearly)
subscription.created
- amount
- recurring_interval, recurring_interval_count
- started_at, product_id
subscription.canceled
- To be able to differentiate churn and cancellation metrics.
- cancellation_reason, cancellation_comment
- canceled_at, ends_at
subscription.revoked
- amount, to keep track of the amount reduction
- recurring_interval, recurring_interval_count
checkout.created
Adding checkout.updated
Metrics
Given the new events we have added, we should be able to migrate to fully event-based metrics generation if we wish. Certain metrics might be more involved than others. A metrics query that only looks at events could look something like the following:
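A rough, hypothetical sketch of an event-only metrics query, assuming the enriched order.paid fields above end up in the metadata column (or in dedicated columns):

```sql
-- Hypothetical sketch: monthly order count and revenue computed purely from
-- order.paid events, without touching the state tables.
SELECT date_trunc('month', e.timestamp) AS period,
       count(*) AS orders,
       sum((e.metadata ->> 'net_amount')::bigint) AS revenue
FROM events e
WHERE e.organization_id = :organization_id
  AND e.name = 'order.paid'
  AND e.timestamp >= :start_timestamp
  AND e.timestamp < :end_timestamp
GROUP BY 1
ORDER BY 1;
```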
2A Enabling TimescaleDB
Metrics
Building on the work for alternative 2, we can enable TimescaleDB as an extension in Postgres. Then we can set up materialized views with time buckets, such as:
System metrics
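As a hypothetical sketch, assuming the events table is (or becomes) a hypertable as discussed under Events below, a system-metrics aggregate could look like:

```sql
-- Hypothetical sketch: a continuous aggregate bucketing paid orders per day.
CREATE MATERIALIZED VIEW daily_order_metrics
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', timestamp) AS bucket,
       organization_id,
       count(*) AS orders,
       sum((metadata ->> 'net_amount')::bigint) AS revenue
FROM events
WHERE name = 'order.paid'
GROUP BY bucket, organization_id;
```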
User metrics
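Similarly, a hypothetical sketch of a user-metrics building block, bucketing ingested events per event name:

```sql
-- Hypothetical sketch: daily counts of ingested events per organization and name.
CREATE MATERIALIZED VIEW daily_event_counts
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', timestamp) AS bucket,
       organization_id,
       name,
       count(*) AS events
FROM events
GROUP BY bucket, organization_id, name;
```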
Events
The querying of events will take a slight hit with TimescaleDB. The following applies to hypertables in TimescaleDB:
- Foreign key constraints from a hypertable referencing a regular table are supported.
- Foreign key constraints from a regular table referencing a hypertable are supported.
- Foreign keys from a hypertable referencing another hypertable are not supported.
This means that we would have to drop the parent_id and root_id foreign keys on the events table if we convert it to a TimescaleDB hypertable.
However, we will still be able to have a foreign key relationship between events and events_closure. We will probably want to introduce timestamps on the events_closure table, since that is where TimescaleDB also shines:
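A minimal sketch of that, assuming the illustrative schema from earlier (converting events itself works the same way, but also requires dropping the parent_id and root_id foreign keys as noted above):

```sql
-- Hypothetical sketch: add a timestamp to events_closure and make it a hypertable.
CREATE EXTENSION IF NOT EXISTS timescaledb;

ALTER TABLE events_closure
    ADD COLUMN timestamp TIMESTAMPTZ NOT NULL DEFAULT now();

-- Unique constraints on a hypertable must include the partitioning column.
ALTER TABLE events_closure DROP CONSTRAINT events_closure_pkey;
ALTER TABLE events_closure
    ADD PRIMARY KEY (ancestor_id, descendant_id, timestamp);

SELECT create_hypertable('events_closure', 'timestamp', migrate_data => true);
```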
2B Moving over to ClickHouse
Based on having full event coverage, we could denormalize the events (or just expand the JSONB metadata column into its own columns). Additionally, we would denormalize the events_closure table. The ClickHouse event table would then look something similar to:
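A rough, hypothetical sketch of such a table; every column name and type here is an assumption for illustration:

```sql
-- Hypothetical sketch of a denormalized ClickHouse events table.
CREATE TABLE events
(
    id UUID,
    organization_id UUID,
    name LowCardinality(String),
    timestamp DateTime64(3),
    -- Denormalized hierarchy instead of the events_closure table.
    parent_id Nullable(UUID),
    root_id Nullable(UUID),
    ancestor_ids Array(UUID),
    -- Metadata expanded from JSONB into dedicated columns, e.g.:
    currency LowCardinality(String),
    net_amount Nullable(Int64),
    tax_amount Nullable(Int64),
    discount_amount Nullable(Int64)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(timestamp)
ORDER BY (organization_id, timestamp, id);
```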
Events
Selecting an event and all of its children would look like:
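For example, assuming the denormalized ancestor_ids array from the sketch above (using ClickHouse query parameters):

```sql
-- Hypothetical sketch: an event and all of its descendants.
SELECT *
FROM events
WHERE organization_id = {organization_id:UUID}
  AND (id = {event_id:UUID} OR has(ancestor_ids, {event_id:UUID}))
ORDER BY timestamp;
```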
Metrics
Querying metric data from ClickHouse could look something similar to the sketch at the end of this section. Moving to ClickHouse also comes with some drawbacks:
- More limited SQLAlchemy ORM magic with ClickHouse. There is support for using SQLAlchemy with ClickHouse, but it does not cover everything.
- We can no longer join events and other tables in the same query. Regardless of whether we continue to store events in Postgres or fully migrate them to ClickHouse, the data cannot be joined across the two databases in a single query.
- Dual infrastructure / Operations. We will need to maintain two databases, ensure that they are kept up-to-date, etc. It will add some complexity to our operations.
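As referenced above, a rough, hypothetical sketch of a metrics query against the denormalized events table (column names follow the earlier sketch; {name:Type} is ClickHouse's query-parameter syntax):

```sql
-- Hypothetical sketch: monthly order count and revenue from order.paid events.
SELECT
    toStartOfMonth(timestamp) AS period,
    count() AS orders,
    sum(net_amount) AS revenue
FROM events
WHERE organization_id = {organization_id:UUID}
  AND name = 'order.paid'
  AND timestamp >= {start:DateTime}
  AND timestamp < {end:DateTime}
GROUP BY period
ORDER BY period;
```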
Comparison
For certain organizations we are at >5-10 second response times for metrics queries today. Similarly, for some organizations we are at >2 s for event listing today. This hints that the current setup will not continue to scale with our customers for much longer, and that we will not be able to onboard even larger customers that ingest more event data. ClickHouse is well understood in the industry to be extremely performant, albeit at the cost of heavier operations. Cloudflare [1] has been using it for a long time. PostHog [2] is also using it and is very happy about it [3]:
"Using ClickHouse is one of the best decisions we've made here at PostHog. It has allowed us to scale performance all while allowing us to build more products on the same set of data."
There are different comparisons and benchmarks to find. Tiger Data [4] and Tinybird [5] have done different benchmarks (link).
Quoted from TigerData's [4] comparison, the points below lead us to think about what the main patterns of querying will be. We know that we will want to generate metrics from our events (i.e. aggregate and group a lot of events), and we know that we will want to ingest a lot of events from customers externally.
- TimescaleDB is 1.9x faster than ClickHouse on RTABench, even though it’s 6.8x slower on ClickBench. This is likely because TimescaleDB is optimized for real-time analytics applications, which often rely on normalized schemas and selective aggregations, while ClickHouse shines in denormalized, columnar analytics with large-scale aggregations.
- Incremental materialized views offer massive speedups. They deliver up to hundreds or even thousands of times faster performance than querying the raw data (from seconds to a few milliseconds), demonstrating their value for real-time analytics. However, among the databases tested, only ClickHouse and TimescaleDB support them.
- ClickHouse is the leader in data loading and storage efficiency. It’s 4.8x faster at loading data and uses 1.7x less disk than the next best database.
1. Cloudflare
2. PostHog
3. PostHog on HackerNews
4. TigerData
5. Tinybird