SQL Keywords
AI_SIMILARITY
Feb 23, 2026 · 5 min read
Overview
AI_SIMILARITY computes the cosine similarity between the embedding vectors of two text or image inputs, returning a score that reflects how semantically close the inputs are.
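Under the hood, the function embeds both inputs and compares the resulting vectors with cosine similarity. A minimal Python sketch of the metric itself (the short vectors here are toy stand-ins for real embedding vectors):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same direction score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Because the metric depends only on vector direction, two texts phrased differently but about the same topic can still score close to 1.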
Syntax
AI_SIMILARITY( input1, input2 )
Parameters
input1 (VARCHAR or FILE): First text string or file
input2 (VARCHAR or FILE): Second text string or file
Use Cases
Find duplicate or near-duplicate content
Content recommendation systems
Plagiarism detection
Similar document matching
Query matching
Semantic search
Code Examples
Example 1: Calculate Text Similarity
SELECT AI_SIMILARITY(
    'Snowflake is a cloud data platform',
    'Snowflake is a cloud-based data warehouse'
) AS similarity_score;
Output:
similarity_score
----------------
0.92
Example 2: Find Similar Products
WITH target_product AS (
    SELECT description
    FROM products
    WHERE product_id = 101
)
SELECT
    p.product_id,
    p.product_name,
    AI_SIMILARITY(p.description, t.description) AS similarity
FROM products p, target_product t
WHERE p.product_id != 101
ORDER BY similarity DESC
LIMIT 10;
Example 3: Detect Duplicate Support Tickets
SELECT
    t1.ticket_id AS ticket_1,
    t2.ticket_id AS ticket_2,
    AI_SIMILARITY(t1.description, t2.description) AS similarity
FROM support_tickets t1
JOIN support_tickets t2
    ON t1.ticket_id < t2.ticket_id
WHERE AI_SIMILARITY(t1.description, t2.description) > 0.85
    AND t1.created_date >= CURRENT_DATE - 7;
Example 4: Content Recommendation
WITH user_preferences AS (
    SELECT 'I enjoy reading about machine learning and AI' AS preference
)
SELECT
    a.article_id,
    a.title,
    AI_SIMILARITY(a.content_summary, u.preference) AS relevance_score
FROM articles a, user_preferences u
WHERE a.published_date >= CURRENT_DATE - 30
ORDER BY relevance_score DESC
LIMIT 20;
Example 5: Image Similarity
SELECT
    img1.image_id,
    img2.image_id,
    AI_SIMILARITY(
        TO_FILE('@images/' || img1.filename),
        TO_FILE('@images/' || img2.filename)
    ) AS image_similarity
FROM image_catalog img1
CROSS JOIN image_catalog img2
WHERE img1.image_id < img2.image_id
    AND AI_SIMILARITY(
        TO_FILE('@images/' || img1.filename),
        TO_FILE('@images/' || img2.filename)
    ) > 0.9;
Data Output Examples
Similarity Scores
Text 1: "How to optimize SQL queries"
Text 2: "SQL query optimization techniques"
Similarity: 0.94 (very similar)

Text 1: "Machine learning algorithms"
Text 2: "Database management systems"
Similarity: 0.32 (somewhat related)

Text 1: "Python programming tutorial"
Text 2: "Chocolate cake recipe"
Similarity: 0.05 (not related)
Duplicate Detection
Original Ticket: "Unable to login to application"
Potential Duplicates:
- "Cannot log into the app" - 0.91 similarity
- "Login page not working" - 0.87 similarity
- "Authentication failure" - 0.73 similarity
Model Information
Similarity Metric: Cosine similarity
Output Range: -1 to 1 (cosine similarity; 1 means identical meaning, values near 0 mean unrelated). Scores for natural-language inputs typically fall between 0 and 1.
Supported Inputs: Text and images
Limitations & Considerations
Input Size
Subject to embedding model's context window
Typically 512 to 128K tokens depending on model
Use AI_COUNT_TOKENS to verify
Cost
Generates embeddings for both inputs
Billing based on input tokens for both
Consider caching embeddings for repeated comparisons
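The caching advice can be sketched outside SQL too. In this toy Python example, a hypothetical embed() call is memoized so each distinct input is embedded exactly once, no matter how many comparisons reuse it (the character-frequency "embedding" is a stand-in, not a real model):

```python
from functools import lru_cache

EMBED_CALLS = 0

@lru_cache(maxsize=None)
def embed(text: str) -> tuple:
    """Stand-in for a real embedding call; counts invocations to show caching."""
    global EMBED_CALLS
    EMBED_CALLS += 1
    # Toy embedding: 26-dim character-frequency vector (not a real model).
    vec = [0.0] * 26
    for ch in text.lower():
        if 'a' <= ch <= 'z':
            vec[ord(ch) - ord('a')] += 1.0
    return tuple(vec)

def similarity(t1: str, t2: str) -> float:
    """Cosine similarity of the cached embeddings of two texts."""
    a, b = embed(t1), embed(t2)
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

# Comparing one query against three documents embeds 4 distinct inputs,
# not 6: the query text is embedded once and reused from the cache.
docs = ["sql query tuning", "cake recipe", "query optimization"]
scores = [similarity("optimize sql queries", d) for d in docs]
print(EMBED_CALLS)  # 4
```

The SQL analogue is the AI_EMBED + VECTOR_COSINE_SIMILARITY pattern shown under Best Practices: pay the embedding cost once per row, then reuse the stored vectors.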
Performance
Suitable for batch processing
Use MEDIUM warehouse or smaller
For large-scale similarity, pre-compute embeddings
Regional Availability
AWS US West/East: ✓
Azure East US: ✓
EU regions: ✓
Cross-region inference: ✓
Best Practices
1. Set Appropriate Thresholds
-- Define similarity categories
-- (text_pairs is a hypothetical table with text1/text2 columns)
SELECT
    CASE
        WHEN AI_SIMILARITY(text1, text2) > 0.9 THEN 'Near Duplicate'
        WHEN AI_SIMILARITY(text1, text2) > 0.7 THEN 'Highly Similar'
        WHEN AI_SIMILARITY(text1, text2) > 0.5 THEN 'Moderately Similar'
        ELSE 'Different'
    END AS similarity_category
FROM text_pairs;
2. Cache Embeddings for Efficiency
-- Instead of computing similarity repeatedly,
-- pre-compute and store embeddings
CREATE TABLE article_embeddings AS
SELECT
    article_id,
    AI_EMBED('snowflake-arctic-embed-l-v2.0', content) AS embedding
FROM articles;

-- Then use VECTOR_COSINE_SIMILARITY for faster comparisons
SELECT
    a1.article_id,
    a2.article_id,
    VECTOR_COSINE_SIMILARITY(a1.embedding, a2.embedding) AS similarity
FROM article_embeddings a1, article_embeddings a2
WHERE a1.article_id < a2.article_id;  -- evaluate each pair once
3. Optimize Comparison Queries
-- Limit comparisons to relevant subsets
SELECT
    t1.ticket_id,
    t2.ticket_id,
    AI_SIMILARITY(t1.description, t2.description) AS similarity
FROM support_tickets t1
JOIN support_tickets t2
    ON t1.category = t2.category  -- Same category only
    AND t1.ticket_id < t2.ticket_id
    AND ABS(DATEDIFF('day', t1.created_date, t2.created_date)) <= 7  -- Within a week
WHERE AI_SIMILARITY(t1.description, t2.description) > 0.8;
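This pre-filtering idea generalizes to any pairwise workload: blocking on a cheap key (here, category) shrinks the number of expensive similarity calls from all pairs to pairs within each block. A small sketch with made-up ticket data:

```python
from collections import defaultdict
from itertools import combinations

# (ticket_id, category) -- made-up data for illustration.
tickets = [
    (1, "billing"), (2, "billing"), (3, "login"),
    (4, "login"), (5, "login"), (6, "billing"),
]

# Naive all-pairs: n * (n - 1) / 2 comparisons.
all_pairs = list(combinations([t[0] for t in tickets], 2))

# Blocked: compare only tickets that share a category.
by_category = defaultdict(list)
for ticket_id, category in tickets:
    by_category[category].append(ticket_id)
blocked_pairs = [
    pair for ids in by_category.values() for pair in combinations(ids, 2)
]

print(len(all_pairs), len(blocked_pairs))  # 15 6
```

With two blocks of three tickets each, 15 candidate pairs collapse to 6; on real tables the savings grow quadratically with table size.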
4. Combine with Filters
WITH similar_products AS (
    SELECT
        p1.product_id AS product_a,
        p2.product_id AS product_b,
        AI_SIMILARITY(p1.description, p2.description) AS similarity
    FROM products p1, products p2
    WHERE p1.product_id < p2.product_id
        AND p1.category = p2.category  -- Pre-filter by category
)
SELECT *
FROM similar_products
WHERE similarity > 0.75;
Common Use Cases
Deduplication
-- Find and flag duplicate customer inquiries
SELECT
    newer.inquiry_id,
    older.inquiry_id AS potential_duplicate,
    AI_SIMILARITY(newer.inquiry_text, older.inquiry_text) AS similarity
FROM inquiries newer
JOIN inquiries older
    ON newer.created_at > older.created_at
    AND older.created_at >= newer.created_at - INTERVAL '7 days'
WHERE AI_SIMILARITY(newer.inquiry_text, older.inquiry_text) > 0.9;
Recommendation Engine
WITH user_history AS (
    SELECT ARRAY_AGG(article_content) AS read_articles
    FROM reading_history
    WHERE user_id = 123
)
SELECT
    a.article_id,
    a.title,
    MAX(AI_SIMILARITY(a.content, h.value::VARCHAR)) AS max_similarity
FROM articles a,
    user_history u,
    LATERAL FLATTEN(input => u.read_articles) h
WHERE a.article_id NOT IN (
    SELECT article_id FROM reading_history WHERE user_id = 123
)
GROUP BY a.article_id, a.title
ORDER BY max_similarity DESC
LIMIT 10;
Related Functions
AI_EMBED - Generate embeddings for storage
VECTOR_COSINE_SIMILARITY - Direct vector similarity
AI_CLASSIFY - For categorical similarity
Copyright © 2026 Paradime Labs, Inc.
Made with ❤️ in San Francisco ・ London
*dbt® and dbt Core® are federally registered trademarks of dbt Labs, Inc. in the United States and various jurisdictions around the world. Paradime is not a partner of dbt Labs. All rights therein are reserved to dbt Labs. Paradime is not a product or service of or endorsed by dbt Labs, Inc.