SQL Keywords

SQL Keywords

APPROX_COUNT_DISTINCT

Feb 23, 2026

·

5

min read

Category: Aggregate Functions

Platform Support:

✅ Snowflake (HLL or APPROX_COUNT_DISTINCT) | ✅ BigQuery (APPROX_COUNT_DISTINCT) | ✅ Databricks (APPROX_COUNT_DISTINCT)

Description

Returns an approximate count of distinct values using the HyperLogLog algorithm. This function is significantly faster than exact COUNT(DISTINCT) for large datasets, with typical accuracy of ~98% (2% error rate). Essential for analyzing large-scale data where exact precision isn't critical.

Syntax by Platform

Snowflake:

APPROX_COUNT_DISTINCT(column)
-- OR
HLL(column)  -- Snowflake-specific, same functionality
APPROX_COUNT_DISTINCT(column)
-- OR
HLL(column)  -- Snowflake-specific, same functionality
APPROX_COUNT_DISTINCT(column)
-- OR
HLL(column)  -- Snowflake-specific, same functionality

BigQuery:

APPROX_COUNT_DISTINCT(column)
APPROX_COUNT_DISTINCT(column)
APPROX_COUNT_DISTINCT(column)

Databricks:

APPROX_COUNT_DISTINCT(column)
APPROX_COUNT_DISTINCT(column)
APPROX_COUNT_DISTINCT(column)

Platform-Specific Notes

Snowflake:

  • Original function name is HLL (HyperLogLog)

  • APPROX_COUNT_DISTINCT added for SQL standard compatibility

  • Both functions work identically

  • Extremely efficient on large datasets

BigQuery:

  • Standard function across Google Cloud

  • Can be combined with HLL_COUNT functions for more control

  • Supports APPROX_TOP_COUNT for approximate top-k values

Databricks:

  • Standard Spark SQL function

  • Can specify precision with second parameter (default: 0.05 = 5% error)

  • Compatible with Delta Lake optimizations

Example 1: Count Unique Visitors

All Platforms:

SELECT 
    DATE_TRUNC('DAY', event_timestamp) as date,
    APPROX_COUNT_DISTINCT(user_id) as unique_visitors,
    COUNT(*) as total_page_views,
    COUNT(*) / APPROX_COUNT_DISTINCT(user_id) as avg_views_per_user
FROM page_views
GROUP BY DATE_TRUNC('DAY', event_timestamp)
ORDER BY

SELECT 
    DATE_TRUNC('DAY', event_timestamp) as date,
    APPROX_COUNT_DISTINCT(user_id) as unique_visitors,
    COUNT(*) as total_page_views,
    COUNT(*) / APPROX_COUNT_DISTINCT(user_id) as avg_views_per_user
FROM page_views
GROUP BY DATE_TRUNC('DAY', event_timestamp)
ORDER BY

SELECT 
    DATE_TRUNC('DAY', event_timestamp) as date,
    APPROX_COUNT_DISTINCT(user_id) as unique_visitors,
    COUNT(*) as total_page_views,
    COUNT(*) / APPROX_COUNT_DISTINCT(user_id) as avg_views_per_user
FROM page_views
GROUP BY DATE_TRUNC('DAY', event_timestamp)
ORDER BY

Sample Data (page_views table):

event_timestamp

user_id

page_url

2024-01-15 09:00

user_1

/home

2024-01-15 09:15

user_2

/products

2024-01-15 10:00

user_1

/about

2024-01-15 11:00

user_3

/home

2024-01-15 14:00

user_1

/contact

2024-01-16 09:00

user_1

/home

2024-01-16 10:00

user_4

/products

2024-01-16 11:00

user_2

/home

Result:

date

unique_visitors

total_page_views

avg_views_per_user

2024-01-15

3

5

1.67

2024-01-16

3

3

1.00

Example 2: Compare Exact vs Approximate

All Platforms:

SELECT 
    category,
    COUNT(DISTINCT customer_id) as exact_count,
    APPROX_COUNT_DISTINCT(customer_id) as approx_count,
    ABS(COUNT(DISTINCT customer_id) - APPROX_COUNT_DISTINCT(customer_id)) as difference,
    (ABS(COUNT(DISTINCT customer_id) - APPROX_COUNT_DISTINCT(customer_id)) * 100.0 
     / COUNT(DISTINCT customer_id)) as error_rate_pct
FROM purchases
GROUP BY

SELECT 
    category,
    COUNT(DISTINCT customer_id) as exact_count,
    APPROX_COUNT_DISTINCT(customer_id) as approx_count,
    ABS(COUNT(DISTINCT customer_id) - APPROX_COUNT_DISTINCT(customer_id)) as difference,
    (ABS(COUNT(DISTINCT customer_id) - APPROX_COUNT_DISTINCT(customer_id)) * 100.0 
     / COUNT(DISTINCT customer_id)) as error_rate_pct
FROM purchases
GROUP BY

SELECT 
    category,
    COUNT(DISTINCT customer_id) as exact_count,
    APPROX_COUNT_DISTINCT(customer_id) as approx_count,
    ABS(COUNT(DISTINCT customer_id) - APPROX_COUNT_DISTINCT(customer_id)) as difference,
    (ABS(COUNT(DISTINCT customer_id) - APPROX_COUNT_DISTINCT(customer_id)) * 100.0 
     / COUNT(DISTINCT customer_id)) as error_rate_pct
FROM purchases
GROUP BY

Sample Data (purchases table with 1M+ rows):

category

Exact Count

Approx Count

Difference

Error Rate %

Electronics

45,231

45,189

42

0.09%

Clothing

78,542

78,601

59

0.08%

Books

23,891

23,847

44

0.18%

Home & Garden

34,678

34,712

34

0.10%

Note: Approximate counts are typically within 2% of exact values

Example 3: Multi-Dimensional Analysis

All Platforms:

SELECT 
    country,
    device_type,
    APPROX_COUNT_DISTINCT(user_id) as unique_users,
    APPROX_COUNT_DISTINCT(session_id) as unique_sessions,
    COUNT(*) as total_events,
    COUNT(*) / APPROX_COUNT_DISTINCT(session_id) as events_per_session
FROM user_events
WHERE event_date >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY country, device_type
ORDER BY unique_users DESC
LIMIT 10

SELECT 
    country,
    device_type,
    APPROX_COUNT_DISTINCT(user_id) as unique_users,
    APPROX_COUNT_DISTINCT(session_id) as unique_sessions,
    COUNT(*) as total_events,
    COUNT(*) / APPROX_COUNT_DISTINCT(session_id) as events_per_session
FROM user_events
WHERE event_date >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY country, device_type
ORDER BY unique_users DESC
LIMIT 10

SELECT 
    country,
    device_type,
    APPROX_COUNT_DISTINCT(user_id) as unique_users,
    APPROX_COUNT_DISTINCT(session_id) as unique_sessions,
    COUNT(*) as total_events,
    COUNT(*) / APPROX_COUNT_DISTINCT(session_id) as events_per_session
FROM user_events
WHERE event_date >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY country, device_type
ORDER BY unique_users DESC
LIMIT 10

Sample Data (user_events table):

country

device_type

user_id

session_id

event_date

USA

mobile

u1

s1

2024-01-15

USA

mobile

u1

s1

2024-01-15

USA

desktop

u2

s2

2024-01-15

UK

mobile

u3

s3

2024-01-15

USA

mobile

u4

s4

2024-01-16

UK

tablet

u3

s5

2024-01-16

Result:

country

device_type

unique_users

unique_sessions

total_events

events_per_session

USA

mobile

2

2

3

1.50

USA

desktop

1

1

1

1.00

UK

mobile

1

1

1

1.00

UK

tablet

1

1

1

1.00

Example 4: Platform-Specific Features

Snowflake (Using HLL):

-- Snowflake supports both function names
SELECT 
    product_category,
    HLL(customer_id) as hll_count,  -- Original Snowflake function
    APPROX_COUNT_DISTINCT(customer_id) as approx_count  -- Standard SQL
FROM sales
GROUP BY

-- Snowflake supports both function names
SELECT 
    product_category,
    HLL(customer_id) as hll_count,  -- Original Snowflake function
    APPROX_COUNT_DISTINCT(customer_id) as approx_count  -- Standard SQL
FROM sales
GROUP BY

-- Snowflake supports both function names
SELECT 
    product_category,
    HLL(customer_id) as hll_count,  -- Original Snowflake function
    APPROX_COUNT_DISTINCT(customer_id) as approx_count  -- Standard SQL
FROM sales
GROUP BY

Databricks (With Custom Precision):

-- Databricks allows specifying error rate
SELECT 
    region,
    APPROX_COUNT_DISTINCT(user_id) as default_precision,  -- ~5% error
    APPROX_COUNT_DISTINCT(user_id, 0.01) as high_precision  -- ~1% error (slower)
FROM user_activity
GROUP BY

-- Databricks allows specifying error rate
SELECT 
    region,
    APPROX_COUNT_DISTINCT(user_id) as default_precision,  -- ~5% error
    APPROX_COUNT_DISTINCT(user_id, 0.01) as high_precision  -- ~1% error (slower)
FROM user_activity
GROUP BY

-- Databricks allows specifying error rate
SELECT 
    region,
    APPROX_COUNT_DISTINCT(user_id) as default_precision,  -- ~5% error
    APPROX_COUNT_DISTINCT(user_id, 0.01) as high_precision  -- ~1% error (slower)
FROM user_activity
GROUP BY

BigQuery (With HLL Sketches):

-- BigQuery HLL functions for advanced use cases
SELECT 
    date,
    APPROX_COUNT_DISTINCT(user_id) as unique_users,
    HLL_COUNT.MERGE(HLL_COUNT.INIT(user_id)) as hll_merge_count
FROM daily_users
GROUP BY

-- BigQuery HLL functions for advanced use cases
SELECT 
    date,
    APPROX_COUNT_DISTINCT(user_id) as unique_users,
    HLL_COUNT.MERGE(HLL_COUNT.INIT(user_id)) as hll_merge_count
FROM daily_users
GROUP BY

-- BigQuery HLL functions for advanced use cases
SELECT 
    date,
    APPROX_COUNT_DISTINCT(user_id) as unique_users,
    HLL_COUNT.MERGE(HLL_COUNT.INIT(user_id)) as hll_merge_count
FROM daily_users
GROUP BY

Performance Comparison

Dataset Size

COUNT(DISTINCT)

APPROX_COUNT_DISTINCT

Speed Improvement

1 Million rows

2.5 seconds

0.3 seconds

8x faster

10 Million rows

28 seconds

0.8 seconds

35x faster

100 Million rows

4.5 minutes

3 seconds

90x faster

1 Billion rows

45 minutes

12 seconds

225x faster

Performance gains increase with data size

Accuracy Characteristics

  • Typical Error Rate: ±2% (98% accurate)

  • Algorithm: HyperLogLog (HLL)

  • Memory Usage: Fixed, very small (~12KB per distinct value calculation)

  • Consistency: Results are deterministic for same data

When to Use APPROX_COUNT_DISTINCT

Use when:

  • ✅ Analyzing large datasets (millions+ rows)

  • ✅ Exact precision not required (analytics, dashboards)

  • ✅ Query performance is critical

  • ✅ Counting distinct users, sessions, devices, IPs

  • ✅ Real-time analytics requiring fast results

Use COUNT(DISTINCT) when:

  • ❌ Small datasets (< 100k rows) where performance difference is minimal

  • ❌ Financial calculations requiring exact accuracy

  • ❌ Compliance/audit scenarios

  • ❌ Small cardinality (< 1000 distinct values)

Common Use Cases

  1. Web Analytics: Unique visitors, sessions per day

  2. E-commerce: Unique customers, product views

  3. Marketing: Campaign reach, unique email opens

  4. IoT: Unique devices, sensor readings

  5. Gaming: Daily active users (DAU), monthly active users (MAU)

  6. Social Media: Unique post impressions, engagement

Best Practices

  1. Use for large datasets: 100k+ rows where performance matters

  2. Document approximation: Let stakeholders know counts are approximate

  3. Benchmark accuracy: Test on your data to understand error rates

  4. Combine with sampling: Further improve performance on massive datasets

  5. Use for trending: Perfect for dashboards showing trends over time

  6. Avoid mixing: Don't mix exact and approximate counts in comparisons

Related Functions

  • COUNT(DISTINCT) - Exact distinct count (slower)

  • APPROX_PERCENTILE - Approximate percentile calculations

  • APPROX_TOP_COUNT (BigQuery) - Approximate top-k frequent values

  • HLL_COUNT functions (BigQuery) - Advanced HyperLogLog operations

Interested to Learn More?
Try Out the Free 14-Days Trial

More Articles

decorative icon

Experience Analytics for the AI-Era

Start your 14-day trial today - it's free and no credit card needed

decorative icon

Experience Analytics for the AI-Era

Start your 14-day trial today - it's free and no credit card needed

decorative icon

Experience Analytics for the AI-Era

Start your 14-day trial today - it's free and no credit card needed

Copyright © 2026 Paradime Labs, Inc.

Made with ❤️ in San Francisco ・ London

*dbt® and dbt Core® are federally registered trademarks of dbt Labs, Inc. in the United States and various jurisdictions around the world. Paradime is not a partner of dbt Labs. All rights therein are reserved to dbt Labs. Paradime is not a product or service of or endorsed by dbt Labs, Inc.

Copyright © 2026 Paradime Labs, Inc.

Made with ❤️ in San Francisco ・ London

*dbt® and dbt Core® are federally registered trademarks of dbt Labs, Inc. in the United States and various jurisdictions around the world. Paradime is not a partner of dbt Labs. All rights therein are reserved to dbt Labs. Paradime is not a product or service of or endorsed by dbt Labs, Inc.

Copyright © 2026 Paradime Labs, Inc.

Made with ❤️ in San Francisco ・ London

*dbt® and dbt Core® are federally registered trademarks of dbt Labs, Inc. in the United States and various jurisdictions around the world. Paradime is not a partner of dbt Labs. All rights therein are reserved to dbt Labs. Paradime is not a product or service of or endorsed by dbt Labs, Inc.