APPROX_COUNT_DISTINCT
Feb 23, 2026
·
5
min read
Category: Aggregate Functions
Platform Support:
✅ Snowflake (HLL or APPROX_COUNT_DISTINCT) | ✅ BigQuery (APPROX_COUNT_DISTINCT) | ✅ Databricks (APPROX_COUNT_DISTINCT)
Description
Returns an approximate count of distinct values using the HyperLogLog algorithm. This function is significantly faster than exact COUNT(DISTINCT) for large datasets, with typical accuracy of ~98% (2% error rate). Essential for analyzing large-scale data where exact precision isn't critical.
Syntax by Platform
Snowflake:
BigQuery:
Databricks:
Platform-Specific Notes
Snowflake:
Original function name is
HLL(HyperLogLog)APPROX_COUNT_DISTINCTadded for SQL standard compatibilityBoth functions work identically
Extremely efficient on large datasets
BigQuery:
Standard function across Google Cloud
Can be combined with HLL_COUNT functions for more control
Supports APPROX_TOP_COUNT for approximate top-k values
Databricks:
Standard Spark SQL function
Can specify precision with second parameter (default: 0.05 = 5% error)
Compatible with Delta Lake optimizations
Example 1: Count Unique Visitors
All Platforms:
Sample Data (page_views table):
event_timestamp | user_id | page_url |
|---|---|---|
2024-01-15 09:00 | user_1 | /home |
2024-01-15 09:15 | user_2 | /products |
2024-01-15 10:00 | user_1 | /about |
2024-01-15 11:00 | user_3 | /home |
2024-01-15 14:00 | user_1 | /contact |
2024-01-16 09:00 | user_1 | /home |
2024-01-16 10:00 | user_4 | /products |
2024-01-16 11:00 | user_2 | /home |
Result:
date | unique_visitors | total_page_views | avg_views_per_user |
|---|---|---|---|
2024-01-15 | 3 | 5 | 1.67 |
2024-01-16 | 3 | 3 | 1.00 |
Example 2: Compare Exact vs Approximate
All Platforms:
Sample Data (purchases table with 1M+ rows):
category | Exact Count | Approx Count | Difference | Error Rate % |
|---|---|---|---|---|
Electronics | 45,231 | 45,189 | 42 | 0.09% |
Clothing | 78,542 | 78,601 | 59 | 0.08% |
Books | 23,891 | 23,847 | 44 | 0.18% |
Home & Garden | 34,678 | 34,712 | 34 | 0.10% |
Note: Approximate counts are typically within 2% of exact values
Example 3: Multi-Dimensional Analysis
All Platforms:
Sample Data (user_events table):
country | device_type | user_id | session_id | event_date |
|---|---|---|---|---|
USA | mobile | u1 | s1 | 2024-01-15 |
USA | mobile | u1 | s1 | 2024-01-15 |
USA | desktop | u2 | s2 | 2024-01-15 |
UK | mobile | u3 | s3 | 2024-01-15 |
USA | mobile | u4 | s4 | 2024-01-16 |
UK | tablet | u3 | s5 | 2024-01-16 |
Result:
country | device_type | unique_users | unique_sessions | total_events | events_per_session |
|---|---|---|---|---|---|
USA | mobile | 2 | 2 | 3 | 1.50 |
USA | desktop | 1 | 1 | 1 | 1.00 |
UK | mobile | 1 | 1 | 1 | 1.00 |
UK | tablet | 1 | 1 | 1 | 1.00 |
Example 4: Platform-Specific Features
Snowflake (Using HLL):
Databricks (With Custom Precision):
BigQuery (With HLL Sketches):
Performance Comparison
Dataset Size | COUNT(DISTINCT) | APPROX_COUNT_DISTINCT | Speed Improvement |
|---|---|---|---|
1 Million rows | 2.5 seconds | 0.3 seconds | 8x faster |
10 Million rows | 28 seconds | 0.8 seconds | 35x faster |
100 Million rows | 4.5 minutes | 3 seconds | 90x faster |
1 Billion rows | 45 minutes | 12 seconds | 225x faster |
Performance gains increase with data size
Accuracy Characteristics
Typical Error Rate: ±2% (98% accurate)
Algorithm: HyperLogLog (HLL)
Memory Usage: Fixed, very small (~12KB per distinct value calculation)
Consistency: Results are deterministic for same data
When to Use APPROX_COUNT_DISTINCT
Use when:
✅ Analyzing large datasets (millions+ rows)
✅ Exact precision not required (analytics, dashboards)
✅ Query performance is critical
✅ Counting distinct users, sessions, devices, IPs
✅ Real-time analytics requiring fast results
Use COUNT(DISTINCT) when:
❌ Small datasets (< 100k rows) where performance difference is minimal
❌ Financial calculations requiring exact accuracy
❌ Compliance/audit scenarios
❌ Small cardinality (< 1000 distinct values)
Common Use Cases
Web Analytics: Unique visitors, sessions per day
E-commerce: Unique customers, product views
Marketing: Campaign reach, unique email opens
IoT: Unique devices, sensor readings
Gaming: Daily active users (DAU), monthly active users (MAU)
Social Media: Unique post impressions, engagement
Best Practices
Use for large datasets: 100k+ rows where performance matters
Document approximation: Let stakeholders know counts are approximate
Benchmark accuracy: Test on your data to understand error rates
Combine with sampling: Further improve performance on massive datasets
Use for trending: Perfect for dashboards showing trends over time
Avoid mixing: Don't mix exact and approximate counts in comparisons
Related Functions
COUNT(DISTINCT)- Exact distinct count (slower)APPROX_PERCENTILE- Approximate percentile calculationsAPPROX_TOP_COUNT(BigQuery) - Approximate top-k frequent valuesHLL_COUNTfunctions (BigQuery) - Advanced HyperLogLog operations





