SQL Keywords

SQL Keywords

AI.SIMILARITY (2)

Feb 23, 2026

·

5

min read

Category: Embedding & Vector Function

Description

Calculates the cosine similarity between two embedding vectors. Returns a FLOAT64 value between -1 and 1, where 1 indicates identical vectors, 0 indicates orthogonal (unrelated) vectors, and -1 indicates opposite vectors. Commonly used to find semantically similar content.

Use Cases

Semantic Search: Rank documents by relevance to a query

Recommendation: Find similar products or content

Duplicate Detection: Identify near-duplicate content

Content Clustering: Group similar items together

Quality Scoring: Measure similarity to ideal examples

Syntax

AI.SIMILARITY(
  embedding_1,
  embedding_2
)
AI.SIMILARITY(
  embedding_1,
  embedding_2
)
AI.SIMILARITY(
  embedding_1,
  embedding_2
)

Parameters

embedding_1: ARRAY - First embedding vector

embedding_2: ARRAY - Second embedding vector

Both embeddings must have the same dimensionality.

Code Examples

Example 1: Find Similar Products

-- Get query embedding
DECLARE query_embedding ARRAY<float64>;

SET query_embedding = (
  SELECT AI.EMBED(
    model => 'text-embedding-004',
    content => 'noise cancelling wireless headphones',
    task_type => 'RETRIEVAL_QUERY',
    connection_id => 'us.my_vertex_connection'
  )
);

-- Find top 10 similar products
SELECT 
  product_id,
  product_name,
  AI.SIMILARITY(
    query_embedding,
    product_embedding
  ) AS similarity_score
FROM products_with_embeddings
ORDER BY similarity_score DESC
LIMIT 10

-- Get query embedding
DECLARE query_embedding ARRAY<float64>;

SET query_embedding = (
  SELECT AI.EMBED(
    model => 'text-embedding-004',
    content => 'noise cancelling wireless headphones',
    task_type => 'RETRIEVAL_QUERY',
    connection_id => 'us.my_vertex_connection'
  )
);

-- Find top 10 similar products
SELECT 
  product_id,
  product_name,
  AI.SIMILARITY(
    query_embedding,
    product_embedding
  ) AS similarity_score
FROM products_with_embeddings
ORDER BY similarity_score DESC
LIMIT 10

-- Get query embedding
DECLARE query_embedding ARRAY<float64>;

SET query_embedding = (
  SELECT AI.EMBED(
    model => 'text-embedding-004',
    content => 'noise cancelling wireless headphones',
    task_type => 'RETRIEVAL_QUERY',
    connection_id => 'us.my_vertex_connection'
  )
);

-- Find top 10 similar products
SELECT 
  product_id,
  product_name,
  AI.SIMILARITY(
    query_embedding,
    product_embedding
  ) AS similarity_score
FROM products_with_embeddings
ORDER BY similarity_score DESC
LIMIT 10

Example 2: Compare Two Documents

SELECT 
  doc1.document_id AS doc1_id,
  doc2.document_id AS doc2_id,
  AI.SIMILARITY(
    doc1.embedding,
    doc2.embedding
  ) AS similarity
FROM document_embeddings doc1
CROSS JOIN document_embeddings doc2
WHERE doc1.document_id < doc2.document_id  -- Avoid duplicate pairs
  AND AI.SIMILARITY(doc1.embedding, doc2.embedding) > 0.8  -- Only high similarity
ORDER BY similarity DESC

SELECT 
  doc1.document_id AS doc1_id,
  doc2.document_id AS doc2_id,
  AI.SIMILARITY(
    doc1.embedding,
    doc2.embedding
  ) AS similarity
FROM document_embeddings doc1
CROSS JOIN document_embeddings doc2
WHERE doc1.document_id < doc2.document_id  -- Avoid duplicate pairs
  AND AI.SIMILARITY(doc1.embedding, doc2.embedding) > 0.8  -- Only high similarity
ORDER BY similarity DESC

SELECT 
  doc1.document_id AS doc1_id,
  doc2.document_id AS doc2_id,
  AI.SIMILARITY(
    doc1.embedding,
    doc2.embedding
  ) AS similarity
FROM document_embeddings doc1
CROSS JOIN document_embeddings doc2
WHERE doc1.document_id < doc2.document_id  -- Avoid duplicate pairs
  AND AI.SIMILARITY(doc1.embedding, doc2.embedding) > 0.8  -- Only high similarity
ORDER BY similarity DESC

Example 3: Find Near-Duplicates

WITH pairs AS (
  SELECT 
    a.article_id AS id1,
    b.article_id AS id2,
    AI.SIMILARITY(a.embedding, b.embedding) AS similarity
  FROM article_embeddings a
  CROSS JOIN article_embeddings b
  WHERE a.article_id < b.article_id
)
SELECT 
  id1,
  id2,
  similarity
FROM pairs
WHERE similarity > 0.95  -- Very high similarity threshold for duplicates
ORDER BY similarity DESC

WITH pairs AS (
  SELECT 
    a.article_id AS id1,
    b.article_id AS id2,
    AI.SIMILARITY(a.embedding, b.embedding) AS similarity
  FROM article_embeddings a
  CROSS JOIN article_embeddings b
  WHERE a.article_id < b.article_id
)
SELECT 
  id1,
  id2,
  similarity
FROM pairs
WHERE similarity > 0.95  -- Very high similarity threshold for duplicates
ORDER BY similarity DESC

WITH pairs AS (
  SELECT 
    a.article_id AS id1,
    b.article_id AS id2,
    AI.SIMILARITY(a.embedding, b.embedding) AS similarity
  FROM article_embeddings a
  CROSS JOIN article_embeddings b
  WHERE a.article_id < b.article_id
)
SELECT 
  id1,
  id2,
  similarity
FROM pairs
WHERE similarity > 0.95  -- Very high similarity threshold for duplicates
ORDER BY similarity DESC

Example 4: Recommendation System

-- Given a user's favorite product, find similar items
DECLARE favorite_product_embedding ARRAY<float64>;

SET favorite_product_embedding = (
  SELECT product_embedding
  FROM products_with_embeddings
  WHERE product_id = 'PROD-12345'
);

SELECT 
  product_id,
  product_name,
  price,
  category,
  AI.SIMILARITY(
    favorite_product_embedding,
    product_embedding
  ) AS similarity
FROM products_with_embeddings
WHERE product_id != 'PROD-12345'  -- Exclude the original product
  AND category IN ('electronics', 'accessories')  -- Filter by category
ORDER BY similarity DESC
LIMIT 5

-- Given a user's favorite product, find similar items
DECLARE favorite_product_embedding ARRAY<float64>;

SET favorite_product_embedding = (
  SELECT product_embedding
  FROM products_with_embeddings
  WHERE product_id = 'PROD-12345'
);

SELECT 
  product_id,
  product_name,
  price,
  category,
  AI.SIMILARITY(
    favorite_product_embedding,
    product_embedding
  ) AS similarity
FROM products_with_embeddings
WHERE product_id != 'PROD-12345'  -- Exclude the original product
  AND category IN ('electronics', 'accessories')  -- Filter by category
ORDER BY similarity DESC
LIMIT 5

-- Given a user's favorite product, find similar items
DECLARE favorite_product_embedding ARRAY<float64>;

SET favorite_product_embedding = (
  SELECT product_embedding
  FROM products_with_embeddings
  WHERE product_id = 'PROD-12345'
);

SELECT 
  product_id,
  product_name,
  price,
  category,
  AI.SIMILARITY(
    favorite_product_embedding,
    product_embedding
  ) AS similarity
FROM products_with_embeddings
WHERE product_id != 'PROD-12345'  -- Exclude the original product
  AND category IN ('electronics', 'accessories')  -- Filter by category
ORDER BY similarity DESC
LIMIT 5

Example 5: Content Quality Scoring

-- Compare content against high-quality reference examples
DECLARE reference_embedding ARRAY<float64>;

SET reference_embedding = (
  SELECT AVG(embedding)  -- Average of multiple good examples
  FROM article_embeddings
  WHERE quality_rating >= 4.5
);

SELECT 
  article_id,
  title,
  AI.SIMILARITY(
    reference_embedding,
    embedding
  ) AS quality_similarity
FROM article_embeddings
WHERE published_date >= CURRENT_DATE() - 7
ORDER BY quality_similarity DESC

-- Compare content against high-quality reference examples
DECLARE reference_embedding ARRAY<float64>;

SET reference_embedding = (
  SELECT AVG(embedding)  -- Average of multiple good examples
  FROM article_embeddings
  WHERE quality_rating >= 4.5
);

SELECT 
  article_id,
  title,
  AI.SIMILARITY(
    reference_embedding,
    embedding
  ) AS quality_similarity
FROM article_embeddings
WHERE published_date >= CURRENT_DATE() - 7
ORDER BY quality_similarity DESC

-- Compare content against high-quality reference examples
DECLARE reference_embedding ARRAY<float64>;

SET reference_embedding = (
  SELECT AVG(embedding)  -- Average of multiple good examples
  FROM article_embeddings
  WHERE quality_rating >= 4.5
);

SELECT 
  article_id,
  title,
  AI.SIMILARITY(
    reference_embedding,
    embedding
  ) AS quality_similarity
FROM article_embeddings
WHERE published_date >= CURRENT_DATE() - 7
ORDER BY quality_similarity DESC

Example 6: Multi-Query Search

-- Search with multiple related queries and aggregate results
WITH query_embeddings AS (
  SELECT 
    AI.EMBED(
      model => 'text-embedding-004',
      content => query_text,
      task_type => 'RETRIEVAL_QUERY',
      connection_id => 'us.my_vertex_connection'
    ) AS embedding
  FROM UNNEST([
    'wireless headphones',
    'bluetooth earbuds',
    'noise cancelling audio'
  ]) AS query_text
),
scores AS (
  SELECT 
    p.product_id,
    p.product_name,
    MAX(AI.SIMILARITY(q.embedding, p.product_embedding)) AS max_similarity
  FROM products_with_embeddings p
  CROSS JOIN query_embeddings q
  GROUP BY p.product_id, p.product_name
)
SELECT *
FROM scores
ORDER BY max_similarity DESC
LIMIT 20

-- Search with multiple related queries and aggregate results
WITH query_embeddings AS (
  SELECT 
    AI.EMBED(
      model => 'text-embedding-004',
      content => query_text,
      task_type => 'RETRIEVAL_QUERY',
      connection_id => 'us.my_vertex_connection'
    ) AS embedding
  FROM UNNEST([
    'wireless headphones',
    'bluetooth earbuds',
    'noise cancelling audio'
  ]) AS query_text
),
scores AS (
  SELECT 
    p.product_id,
    p.product_name,
    MAX(AI.SIMILARITY(q.embedding, p.product_embedding)) AS max_similarity
  FROM products_with_embeddings p
  CROSS JOIN query_embeddings q
  GROUP BY p.product_id, p.product_name
)
SELECT *
FROM scores
ORDER BY max_similarity DESC
LIMIT 20

-- Search with multiple related queries and aggregate results
WITH query_embeddings AS (
  SELECT 
    AI.EMBED(
      model => 'text-embedding-004',
      content => query_text,
      task_type => 'RETRIEVAL_QUERY',
      connection_id => 'us.my_vertex_connection'
    ) AS embedding
  FROM UNNEST([
    'wireless headphones',
    'bluetooth earbuds',
    'noise cancelling audio'
  ]) AS query_text
),
scores AS (
  SELECT 
    p.product_id,
    p.product_name,
    MAX(AI.SIMILARITY(q.embedding, p.product_embedding)) AS max_similarity
  FROM products_with_embeddings p
  CROSS JOIN query_embeddings q
  GROUP BY p.product_id, p.product_name
)
SELECT *
FROM scores
ORDER BY max_similarity DESC
LIMIT 20

Data Output Examples

Product Similarity Search

product_name

similarity_score

"Sony WH-1000XM5 Noise Cancelling Headphones"

0.94

"Bose QuietComfort 45 Wireless"

0.91

"Apple AirPods Max"

0.88

"Sennheiser Momentum 4"

0.85

Duplicate Detection

article_id_1

article_id_2

similarity

A001

A045

0.98

A023

A089

0.97

A012

A034

0.96

Similarity Score Interpretation

0.95 - 1.0: Near duplicates or very highly related

0.85 - 0.95: Highly similar, strong semantic relationship

0.70 - 0.85: Moderately similar, related content

0.50 - 0.70: Somewhat similar, loose relationship

0.0 - 0.50: Low similarity, mostly unrelated

< 0.0: Opposite or contradictory (rare with text embeddings)

Best Practices

Use appropriate thresholds: Adjust similarity thresholds based on use case

Combine with filters: Use WHERE clauses to reduce computation

Create vector indexes: For large-scale similarity search

Normalize embeddings: Ensure embeddings are normalized (unit length)

Batch comparisons: Use efficient query patterns to avoid N×N comparisons

Consider alternatives: Use APPROXIMATE_SIMILARITY for very large datasets

When to Use

✅ Use for calculating similarity between two vectors

✅ Use in ORDER BY for ranking results

✅ Use in WHERE for filtering by similarity threshold

✅ Use for recommendation systems

Alternatives

COSINE_DISTANCE: Returns distance (1 - similarity)

EUCLIDEAN_DISTANCE: L2 distance between vectors

DOT_PRODUCT: Fast but requires normalized vectors

VECTOR_SEARCH: Optimized vector search with indexes

APPROXIMATE_SIMILARITY: Faster approximate similarity for large datasets

Platform Support

Regions: All BigQuery regions

Preview Status: Generally Available (GA)

Cost: Standard BigQuery compute pricing (no additional Vertex AI cost)

Performance: O(n) complexity for each comparison

Returns

FLOAT64 value between -1 and 1:

  • 1.0: Identical vectors

  • 0.5-1.0: Similar

  • 0.0: Unrelated (orthogonal)

  • -1.0 to 0.0: Dissimilar/opposite

Returns NULL if either embedding is NULL or if dimensions don't match.

Performance Tips

Use vector indexes: Create VECTOR INDEX for large-scale search

Filter first: Apply traditional WHERE clauses before similarity calculations

Limit results: Use LIMIT to reduce computation

Pre-compute embeddings: Store embeddings rather than generating on-the-fly

Use clustering: Partition data for faster search

Interested to Learn More?
Try Out the Free 14-Days Trial

More Articles

decorative icon

Experience Analytics for the AI-Era

Start your 14-day trial today - it's free and no credit card needed

decorative icon

Experience Analytics for the AI-Era

Start your 14-day trial today - it's free and no credit card needed

decorative icon

Experience Analytics for the AI-Era

Start your 14-day trial today - it's free and no credit card needed

Copyright © 2026 Paradime Labs, Inc.

Made with ❤️ in San Francisco ・ London

*dbt® and dbt Core® are federally registered trademarks of dbt Labs, Inc. in the United States and various jurisdictions around the world. Paradime is not a partner of dbt Labs. All rights therein are reserved to dbt Labs. Paradime is not a product or service of or endorsed by dbt Labs, Inc.

Copyright © 2026 Paradime Labs, Inc.

Made with ❤️ in San Francisco ・ London

*dbt® and dbt Core® are federally registered trademarks of dbt Labs, Inc. in the United States and various jurisdictions around the world. Paradime is not a partner of dbt Labs. All rights therein are reserved to dbt Labs. Paradime is not a product or service of or endorsed by dbt Labs, Inc.

Copyright © 2026 Paradime Labs, Inc.

Made with ❤️ in San Francisco ・ London

*dbt® and dbt Core® are federally registered trademarks of dbt Labs, Inc. in the United States and various jurisdictions around the world. Paradime is not a partner of dbt Labs. All rights therein are reserved to dbt Labs. Paradime is not a product or service of or endorsed by dbt Labs, Inc.