Paradime | APPROX_COUNT

SQL Keywords

APPROX_COUNT_DISTINCT

Feb 23, 2026

min read

Category: Aggregate Functions

Platform Support:

✅ Snowflake (HLL or APPROX_COUNT_DISTINCT) | ✅ BigQuery (APPROX_COUNT_DISTINCT) | ✅ Databricks (APPROX_COUNT_DISTINCT)

Description

Returns an approximate count of distinct values using the HyperLogLog algorithm. This function is significantly faster than exact COUNT(DISTINCT) for large datasets, with typical accuracy of ~98% (2% error rate). Essential for analyzing large-scale data where exact precision isn't critical.

Syntax by Platform

Snowflake:

APPROX_COUNT_DISTINCT(column)
-- OR
HLL(column)  -- Snowflake-specific, same functionality

APPROX_COUNT_DISTINCT(column)
-- OR
HLL(column)  -- Snowflake-specific, same functionality

APPROX_COUNT_DISTINCT(column)
-- OR
HLL(column)  -- Snowflake-specific, same functionality

BigQuery:

APPROX_COUNT_DISTINCT(column)

APPROX_COUNT_DISTINCT(column)

APPROX_COUNT_DISTINCT(column)

Databricks:

APPROX_COUNT_DISTINCT(column)

APPROX_COUNT_DISTINCT(column)

APPROX_COUNT_DISTINCT(column)

Platform-Specific Notes

Snowflake:

Original function name is HLL (HyperLogLog)
APPROX_COUNT_DISTINCT added for SQL standard compatibility
Both functions work identically
Extremely efficient on large datasets

BigQuery:

Standard function across Google Cloud
Can be combined with HLL_COUNT functions for more control
Supports APPROX_TOP_COUNT for approximate top-k values

Databricks:

Standard Spark SQL function
Can specify precision with second parameter (default: 0.05 = 5% error)
Compatible with Delta Lake optimizations

Example 1: Count Unique Visitors

All Platforms:

SELECT 
    DATE_TRUNC('DAY', event_timestamp) as date,
    APPROX_COUNT_DISTINCT(user_id) as unique_visitors,
    COUNT(*) as total_page_views,
    COUNT(*) / APPROX_COUNT_DISTINCT(user_id) as avg_views_per_user
FROM page_views
GROUP BY DATE_TRUNC('DAY', event_timestamp)
ORDER BY

SELECT 
    DATE_TRUNC('DAY', event_timestamp) as date,
    APPROX_COUNT_DISTINCT(user_id) as unique_visitors,
    COUNT(*) as total_page_views,
    COUNT(*) / APPROX_COUNT_DISTINCT(user_id) as avg_views_per_user
FROM page_views
GROUP BY DATE_TRUNC('DAY', event_timestamp)
ORDER BY

SELECT 
    DATE_TRUNC('DAY', event_timestamp) as date,
    APPROX_COUNT_DISTINCT(user_id) as unique_visitors,
    COUNT(*) as total_page_views,
    COUNT(*) / APPROX_COUNT_DISTINCT(user_id) as avg_views_per_user
FROM page_views
GROUP BY DATE_TRUNC('DAY', event_timestamp)
ORDER BY

Sample Data (page_views table):

event_timestamp	user_id	page_url
2024-01-15 09:00	user_1	/home
2024-01-15 09:15	user_2	/products
2024-01-15 10:00	user_1	/about
2024-01-15 11:00	user_3	/home
2024-01-15 14:00	user_1	/contact
2024-01-16 09:00	user_1	/home
2024-01-16 10:00	user_4	/products
2024-01-16 11:00	user_2	/home

Result:

date	unique_visitors	total_page_views	avg_views_per_user
2024-01-15	3	5	1.67
2024-01-16	3	3	1.00

Example 2: Compare Exact vs Approximate

All Platforms:

SELECT 
    category,
    COUNT(DISTINCT customer_id) as exact_count,
    APPROX_COUNT_DISTINCT(customer_id) as approx_count,
    ABS(COUNT(DISTINCT customer_id) - APPROX_COUNT_DISTINCT(customer_id)) as difference,
    (ABS(COUNT(DISTINCT customer_id) - APPROX_COUNT_DISTINCT(customer_id)) * 100.0 
     / COUNT(DISTINCT customer_id)) as error_rate_pct
FROM purchases
GROUP BY

SELECT 
    category,
    COUNT(DISTINCT customer_id) as exact_count,
    APPROX_COUNT_DISTINCT(customer_id) as approx_count,
    ABS(COUNT(DISTINCT customer_id) - APPROX_COUNT_DISTINCT(customer_id)) as difference,
    (ABS(COUNT(DISTINCT customer_id) - APPROX_COUNT_DISTINCT(customer_id)) * 100.0 
     / COUNT(DISTINCT customer_id)) as error_rate_pct
FROM purchases
GROUP BY

SELECT 
    category,
    COUNT(DISTINCT customer_id) as exact_count,
    APPROX_COUNT_DISTINCT(customer_id) as approx_count,
    ABS(COUNT(DISTINCT customer_id) - APPROX_COUNT_DISTINCT(customer_id)) as difference,
    (ABS(COUNT(DISTINCT customer_id) - APPROX_COUNT_DISTINCT(customer_id)) * 100.0 
     / COUNT(DISTINCT customer_id)) as error_rate_pct
FROM purchases
GROUP BY

Sample Data (purchases table with 1M+ rows):

category	Exact Count	Approx Count	Difference	Error Rate %
Electronics	45,231	45,189	42	0.09%
Clothing	78,542	78,601	59	0.08%
Books	23,891	23,847	44	0.18%
Home & Garden	34,678	34,712	34	0.10%

Note: Approximate counts are typically within 2% of exact values

Example 3: Multi-Dimensional Analysis

All Platforms:

SELECT 
    country,
    device_type,
    APPROX_COUNT_DISTINCT(user_id) as unique_users,
    APPROX_COUNT_DISTINCT(session_id) as unique_sessions,
    COUNT(*) as total_events,
    COUNT(*) / APPROX_COUNT_DISTINCT(session_id) as events_per_session
FROM user_events
WHERE event_date >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY country, device_type
ORDER BY unique_users DESC
LIMIT 10

SELECT 
    country,
    device_type,
    APPROX_COUNT_DISTINCT(user_id) as unique_users,
    APPROX_COUNT_DISTINCT(session_id) as unique_sessions,
    COUNT(*) as total_events,
    COUNT(*) / APPROX_COUNT_DISTINCT(session_id) as events_per_session
FROM user_events
WHERE event_date >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY country, device_type
ORDER BY unique_users DESC
LIMIT 10

SELECT 
    country,
    device_type,
    APPROX_COUNT_DISTINCT(user_id) as unique_users,
    APPROX_COUNT_DISTINCT(session_id) as unique_sessions,
    COUNT(*) as total_events,
    COUNT(*) / APPROX_COUNT_DISTINCT(session_id) as events_per_session
FROM user_events
WHERE event_date >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY country, device_type
ORDER BY unique_users DESC
LIMIT 10

Sample Data (user_events table):

country	device_type	user_id	session_id	event_date
USA	mobile	u1	s1	2024-01-15
USA	mobile	u1	s1	2024-01-15
USA	desktop	u2	s2	2024-01-15
UK	mobile	u3	s3	2024-01-15
USA	mobile	u4	s4	2024-01-16
UK	tablet	u3	s5	2024-01-16

Result:

country	device_type	unique_users	unique_sessions	total_events	events_per_session
USA	mobile	2	2	3	1.50
USA	desktop	1	1	1	1.00
UK	mobile	1	1	1	1.00
UK	tablet	1	1	1	1.00

Example 4: Platform-Specific Features

Snowflake (Using HLL):

-- Snowflake supports both function names
SELECT 
    product_category,
    HLL(customer_id) as hll_count,  -- Original Snowflake function
    APPROX_COUNT_DISTINCT(customer_id) as approx_count  -- Standard SQL
FROM sales
GROUP BY

-- Snowflake supports both function names
SELECT 
    product_category,
    HLL(customer_id) as hll_count,  -- Original Snowflake function
    APPROX_COUNT_DISTINCT(customer_id) as approx_count  -- Standard SQL
FROM sales
GROUP BY

-- Snowflake supports both function names
SELECT 
    product_category,
    HLL(customer_id) as hll_count,  -- Original Snowflake function
    APPROX_COUNT_DISTINCT(customer_id) as approx_count  -- Standard SQL
FROM sales
GROUP BY

Databricks (With Custom Precision):

-- Databricks allows specifying error rate
SELECT 
    region,
    APPROX_COUNT_DISTINCT(user_id) as default_precision,  -- ~5% error
    APPROX_COUNT_DISTINCT(user_id, 0.01) as high_precision  -- ~1% error (slower)
FROM user_activity
GROUP BY

-- Databricks allows specifying error rate
SELECT 
    region,
    APPROX_COUNT_DISTINCT(user_id) as default_precision,  -- ~5% error
    APPROX_COUNT_DISTINCT(user_id, 0.01) as high_precision  -- ~1% error (slower)
FROM user_activity
GROUP BY

-- Databricks allows specifying error rate
SELECT 
    region,
    APPROX_COUNT_DISTINCT(user_id) as default_precision,  -- ~5% error
    APPROX_COUNT_DISTINCT(user_id, 0.01) as high_precision  -- ~1% error (slower)
FROM user_activity
GROUP BY

BigQuery (With HLL Sketches):

-- BigQuery HLL functions for advanced use cases
SELECT 
    date,
    APPROX_COUNT_DISTINCT(user_id) as unique_users,
    HLL_COUNT.MERGE(HLL_COUNT.INIT(user_id)) as hll_merge_count
FROM daily_users
GROUP BY

-- BigQuery HLL functions for advanced use cases
SELECT 
    date,
    APPROX_COUNT_DISTINCT(user_id) as unique_users,
    HLL_COUNT.MERGE(HLL_COUNT.INIT(user_id)) as hll_merge_count
FROM daily_users
GROUP BY

-- BigQuery HLL functions for advanced use cases
SELECT 
    date,
    APPROX_COUNT_DISTINCT(user_id) as unique_users,
    HLL_COUNT.MERGE(HLL_COUNT.INIT(user_id)) as hll_merge_count
FROM daily_users
GROUP BY

Performance Comparison

Dataset Size	COUNT(DISTINCT)	APPROX_COUNT_DISTINCT	Speed Improvement
1 Million rows	2.5 seconds	0.3 seconds	8x faster
10 Million rows	28 seconds	0.8 seconds	35x faster
100 Million rows	4.5 minutes	3 seconds	90x faster
1 Billion rows	45 minutes	12 seconds	225x faster

Performance gains increase with data size

Accuracy Characteristics

Typical Error Rate: ±2% (98% accurate)
Algorithm: HyperLogLog (HLL)
Memory Usage: Fixed, very small (~12KB per distinct value calculation)
Consistency: Results are deterministic for same data

When to Use APPROX_COUNT_DISTINCT

Use when:

✅ Analyzing large datasets (millions+ rows)
✅ Exact precision not required (analytics, dashboards)
✅ Query performance is critical
✅ Counting distinct users, sessions, devices, IPs
✅ Real-time analytics requiring fast results

Use COUNT(DISTINCT) when:

❌ Small datasets (< 100k rows) where performance difference is minimal
❌ Financial calculations requiring exact accuracy
❌ Compliance/audit scenarios
❌ Small cardinality (< 1000 distinct values)

Common Use Cases

Web Analytics: Unique visitors, sessions per day
E-commerce: Unique customers, product views
Marketing: Campaign reach, unique email opens
IoT: Unique devices, sensor readings
Gaming: Daily active users (DAU), monthly active users (MAU)
Social Media: Unique post impressions, engagement

Best Practices

Use for large datasets: 100k+ rows where performance matters
Document approximation: Let stakeholders know counts are approximate
Benchmark accuracy: Test on your data to understand error rates
Combine with sampling: Further improve performance on massive datasets
Use for trending: Perfect for dashboards showing trends over time
Avoid mixing: Don't mix exact and approximate counts in comparisons

Related Functions

COUNT(DISTINCT) - Exact distinct count (slower)
APPROX_PERCENTILE - Approximate percentile calculations
APPROX_TOP_COUNT (BigQuery) - Approximate top-k frequent values
HLL_COUNT functions (BigQuery) - Advanced HyperLogLog operations

Interested to Learn More?
Try Out the Free 14-Days Trial

Start free trial

Learn

Feb 18, 2026

BigQuery Global Queries: How to Run Cross-Region SQL in 2026

Learn

Feb 18, 2026

BigQuery Global Queries: How to Run Cross-Region SQL in 2026

Learn

Feb 18, 2026

BigQuery Global Queries: How to Run Cross-Region SQL in 2026

Analytics

Feb 18, 2026

Context Engineering and AI Quality for Data Teams

Analytics

Feb 18, 2026

Context Engineering and AI Quality for Data Teams

Analytics

Feb 18, 2026

Context Engineering and AI Quality for Data Teams

Product

Feb 18, 2026

Accelerate Analytics Development with Paradime and Tableau

Product

Feb 18, 2026

Accelerate Analytics Development with Paradime and Tableau

Product

Feb 18, 2026

Accelerate Analytics Development with Paradime and Tableau

Experience Analytics for the AI-Era

Start your 14-day trial today - it's free and no credit card needed

Start for free

Experience Analytics for the AI-Era

Start your 14-day trial today - it's free and no credit card needed

Start for free

Experience Analytics for the AI-Era

Start your 14-day trial today - it's free and no credit card needed

Start for free

Platform

Resources

ADD-ONs

Industries

About

Legal

Made with ❤️ in San Francisco ・ London

*dbt® and dbt Core® are federally registered trademarks of dbt Labs, Inc. in the United States and various jurisdictions around the world. Paradime is not a partner of dbt Labs. All rights therein are reserved to dbt Labs. Paradime is not a product or service of or endorsed by dbt Labs, Inc.

Start for free

Platform

Resources