SQL Keywords

SQL Keywords

REGEXP_EXTRACT / REGEXP_SUBSTR

Feb 23, 2026

·

5

min read

Category: String Functions

Platform Support:

✅ Snowflake (REGEXP_SUBSTR) | ✅ BigQuery (REGEXP_EXTRACT) | ✅ Databricks (REGEXP_EXTRACT)

Description

Extracts substrings that match a regular expression pattern. Essential for parsing unstructured text, extracting domains from emails, parsing log files, extracting codes from descriptions, and working with semi-structured string data.

Syntax by Platform

Snowflake:

REGEXP_SUBSTR(
    string, 
    pattern, 
    position,      -- starting position (default 1)
    occurrence,    -- which match to return (default 1)
    parameters,    -- 'e' for extract, 'i' for case-insensitive
    group          -- which capture group (default 0)
)
REGEXP_SUBSTR(
    string, 
    pattern, 
    position,      -- starting position (default 1)
    occurrence,    -- which match to return (default 1)
    parameters,    -- 'e' for extract, 'i' for case-insensitive
    group          -- which capture group (default 0)
)
REGEXP_SUBSTR(
    string, 
    pattern, 
    position,      -- starting position (default 1)
    occurrence,    -- which match to return (default 1)
    parameters,    -- 'e' for extract, 'i' for case-insensitive
    group          -- which capture group (default 0)
)

BigQuery:

REGEXP_EXTRACT(string, pattern)
-- Returns first capturing group
-- Use REGEXP_EXTRACT_ALL for all matches
REGEXP_EXTRACT(string, pattern)
-- Returns first capturing group
-- Use REGEXP_EXTRACT_ALL for all matches
REGEXP_EXTRACT(string, pattern)
-- Returns first capturing group
-- Use REGEXP_EXTRACT_ALL for all matches

Databricks:

REGEXP_EXTRACT(
    string, 
    pattern, 
    group_index    -- 0 for full match, 1+ for capture groups
)
REGEXP_EXTRACT(
    string, 
    pattern, 
    group_index    -- 0 for full match, 1+ for capture groups
)
REGEXP_EXTRACT(
    string, 
    pattern, 
    group_index    -- 0 for full match, 1+ for capture groups
)

Platform-Specific Notes

Snowflake:

  • Function name is REGEXP_SUBSTR (not EXTRACT)

  • Most flexible with position, occurrence, and parameters

  • Use 'e' parameter to enable capture groups

  • Group 0 returns full match, 1+ returns capture groups

BigQuery:

  • Use raw string prefix r'...' for regex patterns

  • Returns only the first capturing group

  • REGEXP_EXTRACT_ALL returns array of all matches

  • REGEXP_CONTAINS checks if pattern exists

Databricks:

  • Third parameter specifies capture group (0-based)

  • Group 0 returns entire match

  • Compatible with Java regex syntax

Example 1: Extract Domain from Email

Snowflake:

SELECT 
    email,
    REGEXP_SUBSTR(email, '@(.+)$', 1, 1, 'e', 1) as domain
FROM

SELECT 
    email,
    REGEXP_SUBSTR(email, '@(.+)$', 1, 1, 'e', 1) as domain
FROM

SELECT 
    email,
    REGEXP_SUBSTR(email, '@(.+)$', 1, 1, 'e', 1) as domain
FROM

BigQuery:

SELECT 
    email,
    REGEXP_EXTRACT(email, r'@(.+)$') as domain
FROM

SELECT 
    email,
    REGEXP_EXTRACT(email, r'@(.+)$') as domain
FROM

SELECT 
    email,
    REGEXP_EXTRACT(email, r'@(.+)$') as domain
FROM

Databricks:

SELECT 
    email,
    REGEXP_EXTRACT(email, '@(.+)$', 1) as domain
FROM

SELECT 
    email,
    REGEXP_EXTRACT(email, '@(.+)$', 1) as domain
FROM

SELECT 
    email,
    REGEXP_EXTRACT(email, '@(.+)$', 1) as domain
FROM

Sample Data (users table):

user_id

email

1

john.smith@gmail.com

2

maria.garcia@company.co.uk

3

sarah@outlook.com

4

ahmed.hassan@university.edu

Result (all platforms):

email

domain

john.smith@gmail.com

gmail.com

maria.garcia@company.co.uk

company.co.uk

sarah@outlook.com

outlook.com

ahmed.hassan@university.edu

university.edu

Example 2: Extract Product Code from Description

Snowflake:

SELECT 
    description,
    REGEXP_SUBSTR(description, 'SKU-([A-Z0-9]+)', 1, 1, 'e', 1) as product_code,
    REGEXP_SUBSTR(description, '([0-9]+)GB', 1, 1, 'e', 1) as storage_gb
FROM

SELECT 
    description,
    REGEXP_SUBSTR(description, 'SKU-([A-Z0-9]+)', 1, 1, 'e', 1) as product_code,
    REGEXP_SUBSTR(description, '([0-9]+)GB', 1, 1, 'e', 1) as storage_gb
FROM

SELECT 
    description,
    REGEXP_SUBSTR(description, 'SKU-([A-Z0-9]+)', 1, 1, 'e', 1) as product_code,
    REGEXP_SUBSTR(description, '([0-9]+)GB', 1, 1, 'e', 1) as storage_gb
FROM

BigQuery:

SELECT 
    description,
    REGEXP_EXTRACT(description, r'SKU-([A-Z0-9]+)') as product_code,
    REGEXP_EXTRACT(description, r'([0-9]+)GB') as storage_gb
FROM

SELECT 
    description,
    REGEXP_EXTRACT(description, r'SKU-([A-Z0-9]+)') as product_code,
    REGEXP_EXTRACT(description, r'([0-9]+)GB') as storage_gb
FROM

SELECT 
    description,
    REGEXP_EXTRACT(description, r'SKU-([A-Z0-9]+)') as product_code,
    REGEXP_EXTRACT(description, r'([0-9]+)GB') as storage_gb
FROM

Databricks:

SELECT 
    description,
    REGEXP_EXTRACT(description, 'SKU-([A-Z0-9]+)', 1) as product_code,
    REGEXP_EXTRACT(description, '([0-9]+)GB', 1) as storage_gb
FROM

SELECT 
    description,
    REGEXP_EXTRACT(description, 'SKU-([A-Z0-9]+)', 1) as product_code,
    REGEXP_EXTRACT(description, '([0-9]+)GB', 1) as storage_gb
FROM

SELECT 
    description,
    REGEXP_EXTRACT(description, 'SKU-([A-Z0-9]+)', 1) as product_code,
    REGEXP_EXTRACT(description, '([0-9]+)GB', 1) as storage_gb
FROM

Sample Data (products table):

product_id

description

1

Premium Laptop SKU-LAP123 with 512GB SSD

2

Wireless Mouse SKU-MOU456 black edition

3

Gaming Keyboard SKU-KEY789 RGB with 256GB storage

4

External Drive SKU-HDD999 2000GB capacity

Result:

description

product_code

storage_gb

Premium Laptop SKU-LAP123 with 512GB SSD

LAP123

512

Wireless Mouse SKU-MOU456 black edition

MOU456

NULL

Gaming Keyboard SKU-KEY789 RGB with 256GB storage

KEY789

256

External Drive SKU-HDD999 2000GB capacity

HDD999

2000

Example 3: Extract Phone Number Components

All Platforms:

-- Extract area code from phone number
SELECT 
    phone_number,
    -- Pattern: (123) 456-7890 or 123-456-7890
    REGEXP_EXTRACT(phone_number, '[(]?([0-9]{3})[)]?', 1) as area_code,
    REGEXP_EXTRACT(phone_number, '[0-9]{3}[-.]([0-9]{3})', 1) as exchange,
    REGEXP_EXTRACT(phone_number, '[0-9]{3}[-.]([0-9]{4})', 1) as line_number
FROM

-- Extract area code from phone number
SELECT 
    phone_number,
    -- Pattern: (123) 456-7890 or 123-456-7890
    REGEXP_EXTRACT(phone_number, '[(]?([0-9]{3})[)]?', 1) as area_code,
    REGEXP_EXTRACT(phone_number, '[0-9]{3}[-.]([0-9]{3})', 1) as exchange,
    REGEXP_EXTRACT(phone_number, '[0-9]{3}[-.]([0-9]{4})', 1) as line_number
FROM

-- Extract area code from phone number
SELECT 
    phone_number,
    -- Pattern: (123) 456-7890 or 123-456-7890
    REGEXP_EXTRACT(phone_number, '[(]?([0-9]{3})[)]?', 1) as area_code,
    REGEXP_EXTRACT(phone_number, '[0-9]{3}[-.]([0-9]{3})', 1) as exchange,
    REGEXP_EXTRACT(phone_number, '[0-9]{3}[-.]([0-9]{4})', 1) as line_number
FROM

Sample Data (contacts table):

contact_id

phone_number

1

(555) 123-4567

2

555-234-5678

3

555.345.6789

4

(555) 456-7890

Result (BigQuery/Databricks syntax):

phone_number

area_code

exchange

line_number

(555) 123-4567

555

123

4567

555-234-5678

555

234

5678

555.345.6789

555

345

6789

(555) 456-7890

555

456

7890

Example 4: Extract Multiple Matches (BigQuery)

BigQuery Only:

SELECT 
    text,
    REGEXP_EXTRACT_ALL(text, r'#(\w+)') as hashtags,
    ARRAY_LENGTH(REGEXP_EXTRACT_ALL(text, r'#(\w+)')) as hashtag_count
FROM

SELECT 
    text,
    REGEXP_EXTRACT_ALL(text, r'#(\w+)') as hashtags,
    ARRAY_LENGTH(REGEXP_EXTRACT_ALL(text, r'#(\w+)')) as hashtag_count
FROM

SELECT 
    text,
    REGEXP_EXTRACT_ALL(text, r'#(\w+)') as hashtags,
    ARRAY_LENGTH(REGEXP_EXTRACT_ALL(text, r'#(\w+)')) as hashtag_count
FROM

Sample Data (social_posts table):

post_id

text

1

Check out our new product! #tech #innovation #cloud

2

Great day at the office #worklife #productivity

3

No hashtags in this post

Result:

text

hashtags

hashtag_count

Check out our new product! #tech #innovation #cloud

["tech", "innovation", "cloud"]

3

Great day at the office #worklife #productivity

["worklife", "productivity"]

2

No hashtags in this post

[]

0

Common Regex Patterns

Pattern

Description

Example

@(.+)$

Extract domain from email

alice@gmail.comgmail.com

^([A-Z]{2,3})

First 2-3 capital letters

USA123 → USA

([0-9]{3})-([0-9]{2})-([0-9]{4})

SSN components

123-45-6789 → groups

\b([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,})\b

Email address

text with email@test.com

(https?://[\w.-]+)

Extract URL

link: https://example.com

#(\w+)

Extract hashtags

#DataEngineering → DataEngineering

Related Functions

Snowflake:

  • REGEXP_REPLACE - Replace text matching pattern

  • REGEXP_INSTR - Find position of pattern

  • REGEXP_COUNT - Count pattern occurrences

BigQuery:

  • REGEXP_REPLACE - Replace text matching pattern

  • REGEXP_CONTAINS - Check if pattern exists (returns boolean)

  • REGEXP_EXTRACT_ALL - Extract all matches as array

Databricks:

  • REGEXP_REPLACE - Replace text matching pattern

  • REGEXP_LIKE - Check if string matches pattern

Performance Considerations

  • Regex operations can be expensive on large datasets

  • Pre-filter data before applying regex when possible

  • Cache commonly used patterns

  • Consider simpler string functions (SUBSTRING, SPLIT) when regex isn't needed

  • In BigQuery, use REGEXP_CONTAINS for existence checks (faster than EXTRACT)

Best Practices

  1. Test patterns thoroughly - Use online regex testers

  2. Use raw strings in BigQuery - Prefix with r'...'

  3. Handle NULLs - Wrap in COALESCE if needed

  4. Document complex patterns - Add comments explaining the regex

  5. Use capture groups wisely - Only capture what you need

  6. Consider case sensitivity - Use flags appropriately

Interested to Learn More?
Try Out the Free 14-Days Trial

More Articles

decorative icon

Experience Analytics for the AI-Era

Start your 14-day trial today - it's free and no credit card needed

decorative icon

Experience Analytics for the AI-Era

Start your 14-day trial today - it's free and no credit card needed

decorative icon

Experience Analytics for the AI-Era

Start your 14-day trial today - it's free and no credit card needed

Copyright © 2026 Paradime Labs, Inc.

Made with ❤️ in San Francisco ・ London

*dbt® and dbt Core® are federally registered trademarks of dbt Labs, Inc. in the United States and various jurisdictions around the world. Paradime is not a partner of dbt Labs. All rights therein are reserved to dbt Labs. Paradime is not a product or service of or endorsed by dbt Labs, Inc.

Copyright © 2026 Paradime Labs, Inc.

Made with ❤️ in San Francisco ・ London

*dbt® and dbt Core® are federally registered trademarks of dbt Labs, Inc. in the United States and various jurisdictions around the world. Paradime is not a partner of dbt Labs. All rights therein are reserved to dbt Labs. Paradime is not a product or service of or endorsed by dbt Labs, Inc.

Copyright © 2026 Paradime Labs, Inc.

Made with ❤️ in San Francisco ・ London

*dbt® and dbt Core® are federally registered trademarks of dbt Labs, Inc. in the United States and various jurisdictions around the world. Paradime is not a partner of dbt Labs. All rights therein are reserved to dbt Labs. Paradime is not a product or service of or endorsed by dbt Labs, Inc.