AI_EXTRACT
Overview
Extracts structured information from text, images, or documents based on natural language instructions. Supports multiple languages and file formats.
Syntax
Parameters
input (VARCHAR or FILE): Text string or file reference from a stage
instruction (VARCHAR): Natural language description of what to extract
response_format (OBJECT): Optional JSON schema defining the expected output structure
Use Cases
Extract entities from documents (names, dates, amounts)
Parse invoices and receipts
Extract key information from customer feedback
Structure unstructured data
Automate form filling and data entry
Analyze contracts
Code Examples
Example 1: Extract Information from Text
Output:
Example 2: Extract from Multiple Records
Output:
Example 3: Extract from PDF Documents
Output:
Example 4: Structured Output with Schema
Output:
Example 5: Batch Processing Emails
Data Output Examples
Simple Extraction
Complex Document Parsing
Model Information
Model Used: arctic-extract
Context Window: 128,000 tokens
Max Output: 51,200 tokens
Supported Languages: Multiple, including English, Spanish, French, and German
File Format Support
Text files (.txt, .md)
Documents (.pdf, .docx)
Images (.jpg, .png) - requires OCR
Structured files (.json, .xml, .csv)
Limitations & Considerations
Input Size
Maximum 128,000 tokens per input
For paginated documents, each page counts as 970 tokens
Use AI_COUNT_TOKENS to check input size
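The page arithmetic above can be sketched in a few lines of Python for a quick check before submitting a document. This is a rough estimate only: the helper names are hypothetical, and it assumes that instruction and schema tokens count against the same 128,000-token input limit. AI_COUNT_TOKENS remains the authoritative check.

```python
# Figures taken from the limits listed above.
TOKENS_PER_PAGE = 970
MAX_INPUT_TOKENS = 128_000


def page_tokens(pages: int) -> int:
    """Tokens attributed to a paginated document (hypothetical helper)."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return pages * TOKENS_PER_PAGE


def fits_in_context(pages: int, extra_tokens: int = 0) -> bool:
    """True if the pages plus any extra input tokens stay within the limit.

    extra_tokens stands for the instruction and schema; treating them as
    part of the same limit is an assumption.
    """
    return page_tokens(pages) + extra_tokens <= MAX_INPUT_TOKENS


def max_pages(extra_tokens: int = 0) -> int:
    """Largest page count that still fits after reserving extra_tokens."""
    return max(0, (MAX_INPUT_TOKENS - extra_tokens) // TOKENS_PER_PAGE)


if __name__ == "__main__":
    print(max_pages())            # largest document that fits on its own
    print(fits_in_context(132))   # one page past that limit
```

Under these assumptions a document of up to 131 pages fits in a single call; longer documents need to be split.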
Cost
Billing is based on both input and output tokens
Response format schema counts as input tokens
Document pages are billed at 970 tokens per page
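To make the billing rules above concrete, here is a minimal sketch that adds up the billable tokens for one call. The function name and the token counts in the example are illustrative assumptions, as is counting the instruction text as input. No per-token price is applied because this page does not state one.

```python
TOKENS_PER_PAGE = 970  # per-page billing figure stated above


def billable_tokens(pages: int, schema_tokens: int,
                    instruction_tokens: int, output_tokens: int) -> dict:
    """Break down billable tokens for a single extraction call.

    Hypothetical helper: pages are billed at a flat rate, the response
    format schema counts as input, and output tokens are billed as well.
    """
    input_tokens = pages * TOKENS_PER_PAGE + schema_tokens + instruction_tokens
    return {
        "input": input_tokens,
        "output": output_tokens,
        "total": input_tokens + output_tokens,
    }


if __name__ == "__main__":
    # Illustrative numbers only: a 10-page PDF with a small schema.
    usage = billable_tokens(pages=10, schema_tokens=150,
                            instruction_tokens=50, output_tokens=300)
    print(usage)  # {'input': 9900, 'output': 300, 'total': 10200}
```

Because the schema is billed on every call, a compact schema reduces cost most noticeably in batch jobs over many short inputs.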
Accuracy
Works best with clear, specific instructions
Complex extractions may require schema definition
Results may degrade on poor-quality scans or images
Performance
Optimized for batch processing
Use a MEDIUM or smaller warehouse
Processing time increases with document complexity
Regional Availability
AWS US West 2 (Oregon): ✓
AWS US East 1 (N. Virginia): ✓
Azure East US 2: ✓
Europe regions: ✓
Cross-region inference: ✓
Best Practices
1. Be Specific in Instructions
2. Use Response Format for Consistency
Define JSON schema when you need structured, predictable output
Helps with downstream processing
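As an illustration of what such a schema might look like, the sketch below builds a standard JSON Schema object in Python and serializes it. The field names and the builder function are hypothetical, and the exact schema shape that AI_EXTRACT accepts for response_format is an assumption here; check the function reference before relying on it.

```python
import json


def build_response_format(fields: dict) -> dict:
    """Build a JSON Schema object from {name: (json_type, description)}.

    Hypothetical helper. It emits plain JSON Schema; whether AI_EXTRACT
    accepts this exact shape is an assumption.
    """
    return {
        "type": "object",
        "properties": {
            name: {"type": json_type, "description": description}
            for name, (json_type, description) in fields.items()
        },
        "required": list(fields),
    }


# Example fields for an invoice; the names are illustrative only.
invoice_schema = build_response_format({
    "vendor_name": ("string", "Name of the company issuing the invoice"),
    "invoice_date": ("string", "Invoice date in YYYY-MM-DD format"),
    "total_amount": ("number", "Total amount due, without currency symbol"),
})

if __name__ == "__main__":
    print(json.dumps(invoice_schema, indent=2))
```

Putting the expected format in each description (for example, the date format) tends to matter as much as the types, since the descriptions act as per-field instructions.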
3. Handle Large Documents
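One way to handle a document that exceeds the input limit is to split it into page ranges and extract from each range separately. The sketch below only plans the ranges. It assumes the 970-tokens-per-page and 128,000-token figures from the Limitations section, and the function name is hypothetical. Splitting the file itself and merging the per-range results are left out.

```python
# 128,000 tokens // 970 tokens per page = 131 pages per call at most.
DEFAULT_MAX_PAGES = 128_000 // 970


def page_batches(total_pages: int,
                 max_pages_per_call: int = DEFAULT_MAX_PAGES) -> list:
    """Return 1-indexed, inclusive (start, end) page ranges.

    Hypothetical helper: each range holds at most max_pages_per_call pages.
    """
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if max_pages_per_call < 1:
        raise ValueError("max_pages_per_call must be at least 1")
    batches = []
    start = 1
    while start <= total_pages:
        end = min(start + max_pages_per_call - 1, total_pages)
        batches.append((start, end))
        start = end + 1
    return batches


if __name__ == "__main__":
    print(page_batches(300))  # [(1, 131), (132, 262), (263, 300)]
```

In practice a smaller batch size leaves headroom for the instruction and schema tokens, and fields that span a page boundary may need overlapping ranges.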
4. Error Handling
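A minimal sketch of client-side error handling follows: it checks that a returned result parses as a JSON object and flags fields that are missing or empty. It assumes the result arrives as a JSON string; the function name and field names are hypothetical.

```python
import json


def check_extraction(raw: str, expected_fields: list) -> tuple:
    """Return (parsed, problems) for one extraction result.

    Hypothetical helper: parsed is None when the result is unusable, and
    problems lists what went wrong (an empty list means no issues found).
    """
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(parsed, dict):
        return None, ["expected a JSON object"]
    problems = [
        f"missing or empty field: {name}"
        for name in expected_fields
        if parsed.get(name) in (None, "")
    ]
    return parsed, problems


if __name__ == "__main__":
    result, issues = check_extraction(
        '{"vendor_name": "Acme", "total_amount": null}',
        ["vendor_name", "total_amount"],
    )
    print(issues)  # ['missing or empty field: total_amount']
```

Rows with a non-empty problem list can be routed to a review table or retried with a more specific instruction instead of failing the whole batch.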
Related Functions
AI_PARSE_DOCUMENT - For OCR and layout extraction
AI_COMPLETE - For more complex text generation
TO_FILE - For referencing staged files
AI_COUNT_TOKENS - Estimate token usage





