AI Data Extraction Confidence Calculation

Overview

This PR introduces a comprehensive confidence scoring system for AI data extraction. The system calculates detailed confidence scores for each extracted attribute and an overall extraction confidence score, stores them in both the database and JSON mappings, and displays them in the review page UI.

Complete Confidence Calculation Flow

High-Level Flow

1. LLM Extraction
2. Parse LLM Response (extract LLM confidence + parsing confidence)
3. OCR Agreement Calculation (compare extracted values with OCR text)
4. OCR Confidence Calculation (evaluate OCR quality for matched text)
5. Final Score Calculation (weighted combination of all components)
6. Overall Confidence Calculation (average across all schema fields)
7. Storage (JSON mapping + database)
8. UI Display (review page widget)

Confidence Score Components

Each extracted attribute receives a ConfidenceBreakdown containing five distinct scores:

1. LLM Score (0.0-1.0)

Source: Confidence value directly from the LLM JSON response
Extraction: Parsed during JSON parsing from the confidence property
Meaning: Represents the LLM's own confidence in its extraction

2. Parsing Score (0.0-1.0)

Calculation: Determined during JSON parsing based on format validation success
Algorithm:

Dates: 1.0 if matches strict format (yyyy-MM-dd), 0.5 if parsed leniently
DateTimes: 1.0 if matches ISO 8601 format, 0.5 if parsed leniently
Numbers: 1.0 if parses successfully, 0.0 otherwise
Strings/Booleans: 1.0 if parses successfully, 0.0 otherwise
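
The date rule above can be sketched as follows. This is a minimal illustration, not the actual parser: the function name, the list of lenient fallback formats, and the 0.0 score for values that cannot be parsed at all are assumptions, since the text only specifies the 1.0 and 0.5 cases.

```python
import re
from datetime import datetime

# Strict shape for yyyy-MM-dd. strptime alone would also accept "2024-1-5",
# so the shape is checked separately.
STRICT_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

# Hypothetical fallback formats; the source does not list what "lenient" accepts.
LENIENT_FORMATS = ("%m/%d/%Y", "%Y/%m/%d", "%b %d, %Y")


def date_parsing_score(raw: str) -> float:
    """Return 1.0 for a strict yyyy-MM-dd date, 0.5 for a lenient parse, else 0.0."""
    text = raw.strip()
    if STRICT_DATE.fullmatch(text):
        try:
            datetime.strptime(text, "%Y-%m-%d")
            return 1.0
        except ValueError:
            pass  # right shape, but not a real calendar date
    for fmt in LENIENT_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return 0.5
        except ValueError:
            continue
    return 0.0  # assumption: unparseable values score zero


print(date_parsing_score("2024-01-15"))  # strict format
print(date_parsing_score("01/15/2024"))  # lenient format
```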

3. OCR Agreement (0.0-1.0)

Purpose: Measures how well extracted values match the OCR text
Calculation: Uses type-specific matching algorithms (detailed below)

4. OCR Confidence (0.0-1.0)

Purpose: Measures OCR quality for the matched text
Source: Uses symbol-level confidence from Google Cloud Vision
Calculation: Type-specific calculation methods (detailed below)

5. Final Score (0.0-1.0)

Purpose: Weighted combination of all components
Formula: Varies based on OCR agreement strength (detailed below)

OCR Agreement Calculation by Data Type

String Type

Algorithm:

Normalize both extracted value and OCR text (lowercase, trim, normalize whitespace)
Exact match check: If normalized OCR contains normalized extracted value → Score = 1.0
Fuzzy match: Use Fuzz.PartialRatio to find best substring match
Threshold check: If fuzzy score ≥ 0.75 → Return fuzzy score, else 0.0

Example:

Extracted: "Invoice Number"
OCR: "Document: Invoice Number 12345"
Result: Exact match found → Score = 1.0
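
A rough sketch of the string agreement steps above. It uses Python's standard difflib with a sliding window as a stand-in for Fuzz.PartialRatio, so fuzzy scores will not be identical to the real library's; the function names are hypothetical.

```python
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 0.75  # string agreement threshold from the text


def normalize(text: str) -> str:
    """Lowercase, trim, and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def best_window_ratio(needle: str, haystack: str) -> float:
    """Best similarity between needle and any same-length window of haystack."""
    if len(needle) >= len(haystack):
        return SequenceMatcher(None, needle, haystack).ratio()
    return max(
        SequenceMatcher(None, needle, haystack[i:i + len(needle)]).ratio()
        for i in range(len(haystack) - len(needle) + 1)
    )


def string_agreement(extracted: str, ocr_text: str) -> float:
    needle, haystack = normalize(extracted), normalize(ocr_text)
    if not needle or not haystack:
        return 0.0
    if needle in haystack:  # exact substring match
        return 1.0
    score = best_window_ratio(needle, haystack)
    return score if score >= FUZZY_THRESHOLD else 0.0


print(string_agreement("Invoice Number", "Document: Invoice Number 12345"))
```

The call reproduces the example above: the normalized extracted value is a substring of the normalized OCR text, so the score is 1.0.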

Number Type (Long/Double)

Algorithm:

Parse extracted value (remove currency symbols, commas)
Extract all numbers from OCR using regex: [$€£]?\s*[-+]?\d{1,3}(?:,?\d{3})*(?:\.\d+)?
Exact match: If any OCR number matches within tolerance (0.01) → Score = 1.0
Partial match: Find closest number and calculate relative error:

relativeError = |extracted - closest| / max(|extracted|, |closest|)

Graduated scoring:

relativeError < 0.01 (1%) → Score = 0.9
relativeError < 0.05 (5%) → Score = 0.8
relativeError < 0.1 (10%) → Score = 0.5
relativeError ≥ 0.1 → Score = 0.0

Threshold check: Only return score if ≥ 0.5

Example:

Extracted: 1234.56
OCR: "1234.65" (decimal digits swapped)
Relative error: 0.09 / 1234.65 = 0.000073 (0.0073%)
Result: Score = 0.9 (excellent match)
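
The number agreement steps can be sketched like this. The function names are hypothetical, and the regex is a simplified stand-in rather than the pattern quoted above: read literally, the quoted pattern matches at most three leading digits before looking for a thousands group, so it would appear to split an ungrouped value such as 1234.65.

```python
import re

# Simplified number pattern (an assumption, not the pattern from the text):
# either comma-grouped thousands or a plain digit run, with optional decimals.
NUMBER_RE = re.compile(r"[-+]?(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?")
TOLERANCE = 0.01


def parse_number(raw: str) -> float:
    """Strip currency symbols, commas, and spaces, then parse."""
    return float(re.sub(r"[^0-9.+-]", "", raw))


def number_agreement(extracted: str, ocr_text: str) -> float:
    try:
        value = parse_number(extracted)
    except ValueError:
        return 0.0
    candidates = [float(m.replace(",", "")) for m in NUMBER_RE.findall(ocr_text)]
    if not candidates:
        return 0.0
    closest = min(candidates, key=lambda c: abs(c - value))
    diff = abs(value - closest)
    if diff <= TOLERANCE:  # exact match within tolerance
        return 1.0
    relative_error = diff / max(abs(value), abs(closest))
    # Graduated scoring; every non-zero tier already meets the 0.5 threshold.
    if relative_error < 0.01:
        return 0.9
    if relative_error < 0.05:
        return 0.8
    if relative_error < 0.1:
        return 0.5
    return 0.0


print(number_agreement("1234.56", "1234.65"))
```

The call mirrors the example above: the absolute difference of 0.09 exceeds the tolerance, and the relative error is far below 1%, which lands in the 0.9 tier.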

Date Type

Algorithm:

Normalize extracted date to yyyy-MM-dd format
Extract all dates from OCR using multiple patterns:

MM/DD/YYYY or DD/MM/YYYY
YYYY/MM/DD
Month name formats (e.g., "Jan 15, 2024")

Normalize all OCR dates to yyyy-MM-dd
Exact match: If normalized OCR date equals normalized extracted date → Score = 1.0
Component similarity: Compare year, month, day components:

Calculate similarity: (matchingComponents / 3)
If at least 2 of 3 components match (similarity ≥ 2/3, about 0.67) → Return similarity score

No match → Score = 0.0

Example:

Extracted: "2024-01-15"
OCR: "01/15/2024"
Normalized OCR: "2024-01-15"
Result: Exact match → Score = 1.0
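
A minimal sketch of the date agreement steps. The regex patterns and function names are assumptions, slash-separated dates are read month-first (the text does not say how MM/DD/YYYY and DD/MM/YYYY are told apart), and only the abbreviated month-name form is handled.

```python
import re
from datetime import date, datetime


def extract_dates(ocr_text: str) -> list[date]:
    """Collect dates from OCR text using a few illustrative patterns."""
    found = []
    # MM/DD/YYYY (month-first is an assumption)
    for m in re.finditer(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b", ocr_text):
        month, day, year = map(int, m.groups())
        try:
            found.append(date(year, month, day))
        except ValueError:
            pass  # e.g. month 13
    # YYYY/MM/DD
    for m in re.finditer(r"\b(\d{4})/(\d{1,2})/(\d{1,2})\b", ocr_text):
        year, month, day = map(int, m.groups())
        try:
            found.append(date(year, month, day))
        except ValueError:
            pass
    # Month-name format such as "Jan 15, 2024"
    for m in re.finditer(r"\b[A-Za-z]{3} \d{1,2}, \d{4}\b", ocr_text):
        try:
            found.append(datetime.strptime(m.group(0), "%b %d, %Y").date())
        except ValueError:
            pass
    return found


def date_agreement(extracted: str, ocr_text: str) -> float:
    """extracted is expected in yyyy-MM-dd form."""
    target = date.fromisoformat(extracted)
    best = 0
    for candidate in extract_dates(ocr_text):
        matching = (
            (candidate.year == target.year)
            + (candidate.month == target.month)
            + (candidate.day == target.day)
        )
        best = max(best, matching)
    # 3 of 3 is an exact match (1.0); 2 of 3 passes the threshold; else 0.0.
    return best / 3 if best >= 2 else 0.0


print(date_agreement("2024-01-15", "Invoice date: 01/15/2024"))
```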

DateTime Type

Algorithm:

Similar to Date, but includes time components (hour, minute, second). Requires higher precision (threshold = 0.75).

Boolean Type

Algorithm:

Normalize both values (lowercase, trim)
Exact match only: If normalized values match → Score = 1.0
No partial matching → Score = 0.0 if not exact

OCR Confidence Calculation by Data Type

String Type

Algorithm:

Threshold check: If agreement score < 0.75 (and not exact match) → Return 0.0
Find OCR word matching the extracted value:

Strategy 1: Exact word match → Use word confidence directly
Strategy 2: Word-by-word matching (for multi-word strings)

For word-by-word matching:

Tokenize extracted value into words
For each word:

Step 1: Try exact match in OCR words
Step 2: If no exact match, try fuzzy matching using Fuzz.Ratio (threshold ≥ 0.75)
Step 3: If multiple fuzzy matches, use context disambiguation (previous/next words)
Step 4: For fuzzy matches, use symbol-level matching

Symbol-level matching (for fuzzy matches):

Compare characters position-by-position
Only matching characters contribute to confidence
Average confidence of matching character symbols

Return average confidence across all words

Example:

Extracted: "test"
OCR words: ["test"] with confidence 0.95
Result: Exact match → Confidence = 0.95

Number Type (Long/Double)

Algorithm:

Threshold check: If agreement score < 0.5 → Return 0.0
Find OCR word matching the matched text:

Try exact match of normalized number
Fallback: Extract numeric components (e.g., "1234.65" → ["1234", "65"])
Find all matching components in OCR words

For exact matches: Use word confidence directly
For partial matches: Use symbol-level matching (digits only):

Filter to digits only (ignore commas, periods, currency symbols)
Compare digits position-by-position
Only matching digits contribute to confidence
Average confidence of matching digit symbols

If multiple components found: Combine all components for symbol-level matching

Example:

Extracted: "1234.56"
OCR: "1,234.65" (split into ["1", "234", "65"])
Matched: "1234.65" (closest number)
Components: ["1234", "65"] → Both found in OCR
Symbol-level: Compare "123456" vs "123465" → Positions 0-3 match → Average confidence of matching digits

Date Type

Algorithm:

Only calculate for exact matches (agreement score = 1.0)
Find OCR word matching the matched date text:

Strategy 1: Try exact match as single word
Strategy 2: Word-by-word matching (tokenize by spaces)

For word-by-word matching:

Match each component (year, month, day) individually
Use context disambiguation if multiple matches exist
Average confidence across all matched components

Return average confidence

Example:

Extracted: "2024-01-15"
OCR: "2024 01 15" (spaces between components)
Components: ["2024", "01", "15"]
Match each component → Average their confidences

DateTime Type

Algorithm:

Similar to Date, but includes time components (hour, minute, second) in component matching.

Boolean Type

Algorithm:

Only calculate for exact matches (agreement score = 1.0)
Find exact OCR word match
Return word confidence directly

Final Score Calculation

The final confidence score uses a weighted formula that adapts based on OCR agreement strength:

Formula Selection Logic

if (ocrAgreement == 0.0 && ocrConfidence == 0.0) {
    // No OCR data available
    score = (0.9 × LLM) + (0.1 × Parsing)
}
else if (ocrAgreement >= 0.8) {
    // Strong OCR agreement - trust OCR more
    score = (0.35 × LLM) + (0.25 × OCR Agreement) + (0.25 × OCR Confidence) + (0.15 × Parsing)
}
else {
    // Weak OCR agreement - trust LLM more
    score = (0.65 × LLM) + (0.15 × OCR Agreement) + (0.15 × OCR Confidence) + (0.05 × Parsing)
}

Rationale

No OCR: Relies primarily on LLM confidence (90% weight)
Strong OCR agreement (≥0.8): Balanced weighting across all components
Weak OCR agreement (<0.8): Heavily weighted toward LLM (65%) since the OCR text does not reliably corroborate the extracted value

All scores are clamped to [0.0, 1.0] range.
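
The formula selection and clamping can be written as a small runnable function. The weights and the 0.8 cutoff come from the text; the function and parameter names are illustrative.

```python
def clamp(value: float) -> float:
    """Clamp a score to the [0.0, 1.0] range."""
    return max(0.0, min(1.0, value))


def final_score(llm: float, parsing: float,
                ocr_agreement: float, ocr_confidence: float) -> float:
    if ocr_agreement == 0.0 and ocr_confidence == 0.0:
        # No OCR data available
        score = 0.9 * llm + 0.1 * parsing
    elif ocr_agreement >= 0.8:
        # Strong OCR agreement: trust OCR more
        score = (0.35 * llm + 0.25 * ocr_agreement
                 + 0.25 * ocr_confidence + 0.15 * parsing)
    else:
        # Weak OCR agreement: trust the LLM more
        score = (0.65 * llm + 0.15 * ocr_agreement
                 + 0.15 * ocr_confidence + 0.05 * parsing)
    return clamp(score)


# Strong agreement: 0.315 + 0.25 + 0.2375 + 0.15 = 0.9525
print(round(final_score(0.9, 1.0, 1.0, 0.95), 4))
```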

Overall Extraction Confidence

Algorithm:

For each field in the extraction schema:

If field has extracted values → Use its FinalScore from confidence breakdown
If field has no extracted values → Use 0.0

Calculate average: overallConfidence = average(FinalScore for all schema fields)

Purpose: Provides a single confidence metric representing the quality of the entire extraction across all attributes.
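
A short sketch of the averaging rule, assuming the per-field final scores are available as a dictionary keyed by field name. The field names and the 0.0 result for an empty schema are illustrative assumptions.

```python
def overall_confidence(schema_fields: list[str],
                       final_scores: dict[str, float]) -> float:
    """Average FinalScore over every schema field; missing fields count as 0.0."""
    if not schema_fields:
        return 0.0  # assumption: an empty schema yields zero confidence
    total = sum(final_scores.get(field, 0.0) for field in schema_fields)
    return total / len(schema_fields)


# "due_date" has no extracted value, so it contributes 0.0 to the average.
schema = ["invoice_number", "total", "due_date"]
scores = {"invoice_number": 0.9, "total": 0.6}
print(round(overall_confidence(schema, scores), 2))  # (0.9 + 0.6 + 0.0) / 3
```

Because missing fields count as 0.0 rather than being skipped, an extraction that fills only some of the schema is penalized even when the filled fields score highly.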

Symbol-Level Matching Algorithm

Used for partial/fuzzy matches to calculate OCR confidence more accurately.

Process

Filter characters: Extract only relevant characters (digits for numbers, all characters for strings)
Position-by-position comparison:

Compare targetChars[i] with ocrChars[i]
If characters match → Add corresponding symbol confidence to collection
If characters don't match → Skip (don't add to confidence)

Calculate average: confidence = average(matching symbol confidences)

Example: Number Partial Match

Extracted: "1234" → digits: ['1', '2', '3', '4']
OCR: "1235" → digits: ['1', '2', '3', '5']
OCR symbols: [('1', 0.98), ('2', 0.96), ('3', 0.94), ('5', 0.92)]

Position 0: '1' == '1' → Match → Add 0.98
Position 1: '2' == '2' → Match → Add 0.96
Position 2: '3' == '3' → Match → Add 0.94
Position 3: '4' != '5' → No match → Skip

Matching confidences: [0.98, 0.96, 0.94]
Final confidence: (0.98 + 0.96 + 0.94) / 3 = 0.96
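
The position-by-position comparison can be sketched as below, with OCR symbols represented as (character, confidence) pairs. The function name, the digits_only flag, and the 0.0 result when nothing matches are assumptions for illustration.

```python
def symbol_level_confidence(target: str,
                            ocr_symbols: list[tuple[str, float]],
                            digits_only: bool = False) -> float:
    """Average the confidence of OCR symbols that match target at the same position."""
    target_chars = list(target)
    symbols = list(ocr_symbols)
    if digits_only:
        # For numbers, ignore commas, periods, and currency symbols.
        target_chars = [c for c in target_chars if c.isdigit()]
        symbols = [(c, conf) for c, conf in symbols if c.isdigit()]
    matched = [
        conf
        for expected, (actual, conf) in zip(target_chars, symbols)
        if expected == actual
    ]
    # Assumption: no matching positions means zero confidence.
    return sum(matched) / len(matched) if matched else 0.0


symbols = [("1", 0.98), ("2", 0.96), ("3", 0.94), ("5", 0.92)]
print(round(symbol_level_confidence("1234", symbols, digits_only=True), 2))
```

With the values from the example above, the three matching positions average to 0.96 and the mismatched final digit is ignored.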

Component Handling for Numbers

When numbers are split by formatting (commas, spaces, decimals), the system handles multiple components.

Algorithm

Extract numeric components: Split on non-digits (e.g., "1234.65" → ["1234", "65"])
Find matching components: Search for each component in OCR words
Combine components: If multiple components found:

Concatenate digits in order
Concatenate symbol confidences in order
Create virtual OCR word with combined data

Symbol-level matching: Use combined components for position-by-position comparison

Example

Extracted: "1234.56"
OCR words: ["1", "234", "65"] (split by comma and decimal)
Components: ["1234", "65"]
Matching: "234" and "65" found
Combined OCR: "23465" (digits from both components)
Combined symbols: [('2', 0.96), ('3', 0.94), ('4', 0.92), ('6', 0.93), ('5', 0.91)]
Symbol-level matching against "123456" (from extracted value)
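
A sketch of the combine step, building a virtual OCR word from several OCR words and then matching digits by position. The OcrWord type and function names are hypothetical. The usage assumes all three OCR words ("1", "234", "65") are selected, which corresponds to the "123456" vs "123465" comparison in the number example earlier; the 0.97 confidence for the leading "1" is made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    """Hypothetical OCR word: its text plus one confidence per character."""
    text: str
    symbol_confidences: list[float]


def combine_components(words: list[OcrWord]) -> tuple[str, list[float]]:
    """Concatenate digits and their symbol confidences, in order."""
    digits, confidences = [], []
    for word in words:
        for ch, conf in zip(word.text, word.symbol_confidences):
            if ch.isdigit():
                digits.append(ch)
                confidences.append(conf)
    return "".join(digits), confidences


def combined_digit_confidence(extracted: str, words: list[OcrWord]) -> float:
    target = [c for c in extracted if c.isdigit()]
    ocr_digits, confidences = combine_components(words)
    matched = [
        conf
        for expected, actual, conf in zip(target, ocr_digits, confidences)
        if expected == actual
    ]
    # Assumption: no matching positions means zero confidence.
    return sum(matched) / len(matched) if matched else 0.0


words = [
    OcrWord("1", [0.97]),  # illustrative confidence
    OcrWord("234", [0.96, 0.94, 0.92]),
    OcrWord("65", [0.93, 0.91]),
]
# "123456" vs "123465": positions 0-3 match, positions 4-5 do not.
print(round(combined_digit_confidence("1234.56", words), 4))
```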

Thresholds and Configuration

Agreement Thresholds

String: 0.75 (75% fuzzy match required)
Number: 0.5 (minimum agreement score; reached only when relative error is below 10%)
Date: 0.67 (2 out of 3 components must match)
DateTime: 0.75 (higher precision required)

Number Relative Error Thresholds

Excellent: < 0.01 (1% error) → Score = 0.9
Good: < 0.05 (5% error) → Score = 0.8
Acceptable: < 0.1 (10% error) → Score = 0.5
Unacceptable: ≥ 0.1 → Score = 0.0
