AI Data Extraction Confidence Calculation

Overview

This PR introduces a comprehensive confidence scoring system for AI data extraction. The system calculates detailed confidence scores for each extracted attribute and an overall extraction confidence score, stores them in both the database and JSON mappings, and displays them in the review page UI.

Complete Confidence Calculation Flow

High-Level Flow

Each step feeds the next:

  1. LLM Extraction
  2. Parse LLM Response (extract LLM confidence + parsing confidence)
  3. OCR Agreement Calculation (compare extracted values with OCR text)
  4. OCR Confidence Calculation (evaluate OCR quality for matched text)
  5. Final Score Calculation (weighted combination of all components)
  6. Overall Confidence Calculation (average across all schema fields)
  7. Storage (JSON mapping + database)
  8. UI Display (review page widget)

Confidence Score Components

Each extracted attribute receives a ConfidenceBreakdown containing five distinct scores:

1. LLM Score (0.0-1.0)

  • Source: Confidence value directly from the LLM JSON response
  • Extraction: Parsed during JSON parsing from the confidence property
  • Meaning: Represents the LLM's own confidence in its extraction

2. Parsing Score (0.0-1.0)

  • Calculation: Determined during JSON parsing based on format validation success
  • Algorithm:
    • Dates: 1.0 if matches strict format (yyyy-MM-dd), 0.5 if parsed leniently
    • DateTimes: 1.0 if matches ISO 8601 format, 0.5 if parsed leniently
    • Numbers: 1.0 if parses successfully, 0.0 otherwise
    • Strings/Booleans: 1.0 if parses successfully, 0.0 otherwise
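
The parsing rules above can be sketched as follows. This is a minimal illustration, not the PR's implementation: the function names, the list of lenient fallback formats, and the 0.0 result for a date that cannot be parsed at all are assumptions.

```python
from datetime import datetime

# Hypothetical lenient formats; the source does not list the real fallbacks.
LENIENT_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%b %d, %Y")


def date_parsing_score(raw: str) -> float:
    """1.0 for strict yyyy-MM-dd, 0.5 for a lenient parse, 0.0 otherwise (assumed)."""
    text = raw.strip()
    try:
        parsed = datetime.strptime(text, "%Y-%m-%d")
        # strptime also accepts "2024-1-5", so require the zero-padded form.
        if parsed.strftime("%Y-%m-%d") == text:
            return 1.0
    except ValueError:
        pass
    for fmt in LENIENT_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return 0.5
        except ValueError:
            continue
    return 0.0


def number_parsing_score(raw: str) -> float:
    """1.0 if the value parses as a number, 0.0 otherwise."""
    try:
        float(raw)
        return 1.0
    except ValueError:
        return 0.0
```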

3. OCR Agreement (0.0-1.0)

  • Purpose: Measures how well extracted values match the OCR text
  • Calculation: Uses type-specific matching algorithms (detailed below)

4. OCR Confidence (0.0-1.0)

  • Purpose: Measures OCR quality for the matched text
  • Source: Uses symbol-level confidence from Google Cloud Vision
  • Calculation: Type-specific calculation methods (detailed below)

5. Final Score (0.0-1.0)

  • Purpose: Weighted combination of all components
  • Formula: Varies based on OCR agreement strength (detailed below)

OCR Agreement Calculation by Data Type

String Type

Algorithm:

  1. Normalize both extracted value and OCR text (lowercase, trim, normalize whitespace)
  2. Exact match check: If normalized OCR contains normalized extracted value → Score = 1.0
  3. Fuzzy match: Use Fuzz.PartialRatio to find best substring match
  4. Threshold check: If fuzzy score (expressed on the 0.0-1.0 scale) ≥ 0.75 → Return fuzzy score, else 0.0

Example:

  • Extracted: "Invoice Number"
  • OCR: "Document: Invoice Number 12345"
  • Result: Exact match found → Score = 1.0
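
A rough sketch of the string agreement steps follows. The source uses Fuzz.PartialRatio; here a sliding-window comparison built on the standard library's difflib stands in for it, so the fuzzy scores will not be identical to the real ones. All function names are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, trim, and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def partial_ratio(needle: str, haystack: str) -> float:
    """Best similarity (0.0-1.0) between needle and any same-length window of haystack.

    Stand-in for Fuzz.PartialRatio, not a reimplementation of it.
    """
    if not needle or not haystack:
        return 0.0
    if len(needle) >= len(haystack):
        return SequenceMatcher(None, needle, haystack).ratio()
    best = 0.0
    for start in range(len(haystack) - len(needle) + 1):
        window = haystack[start:start + len(needle)]
        best = max(best, SequenceMatcher(None, needle, window).ratio())
    return best


def string_agreement(extracted: str, ocr_text: str, threshold: float = 0.75) -> float:
    value, ocr = normalize(extracted), normalize(ocr_text)
    if not value:
        return 0.0
    if value in ocr:  # exact (substring) match
        return 1.0
    score = partial_ratio(value, ocr)
    return score if score >= threshold else 0.0
```

Calling string_agreement("Invoice Number", "Document: Invoice Number 12345") takes the exact-match branch, as in the example above.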

Number Type (Long/Double)

Algorithm:

  1. Parse extracted value (remove currency symbols, commas)
  2. Extract all numbers from OCR using regex: [$€£]?\s*[-+]?\d{1,3}(?:,?\d{3})*(?:\.\d+)?
  3. Exact match: If any OCR number matches within tolerance (0.01) → Score = 1.0
  4. Partial match: Find closest number and calculate relative error:
    • relativeError = |extracted - closest| / max(|extracted|, |closest|)
  5. Graduated scoring:
    • relativeError < 0.01 (1%) → Score = 0.9
    • relativeError < 0.05 (5%) → Score = 0.8
    • relativeError < 0.1 (10%) → Score = 0.5
    • relativeError ≥ 0.1 → Score = 0.0
  6. Threshold check: Only return score if ≥ 0.5

Example:

  • Extracted: 1234.56
  • OCR: "1234.65" (decimal digits swapped)
  • Relative error: 0.09 / 1234.65 = 0.000073 (0.0073%)
  • Result: Score = 0.9 (excellent match)
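
The number agreement steps can be sketched as below, under a few assumptions. The names are hypothetical, and the pattern is a slightly stricter variant of the one listed in step 2: with a backtracking regex engine such as Python's re, the listed pattern's greedy leading \d{1,3} would split an unformatted value like 1234.65 into 123 and 4.65, so the sketch accepts either comma-grouped digits or a plain run of digits.

```python
import re

# Variant of the documented pattern (see the note above); an assumption, not the PR's regex.
NUMBER_PATTERN = re.compile(r"[$€£]?\s*[-+]?(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?")


def parse_number(text: str) -> float:
    """Drop currency symbols, commas, and whitespace, then parse."""
    return float(re.sub(r"[^\d.+-]", "", text))


def number_agreement(extracted: float, ocr_text: str, tolerance: float = 0.01) -> float:
    candidates = [parse_number(m.group()) for m in NUMBER_PATTERN.finditer(ocr_text)]
    if not candidates:
        return 0.0
    closest = min(candidates, key=lambda n: abs(n - extracted))
    if abs(closest - extracted) <= tolerance:  # exact match within tolerance
        return 1.0
    relative_error = abs(extracted - closest) / max(abs(extracted), abs(closest))
    if relative_error < 0.01:
        return 0.9
    if relative_error < 0.05:
        return 0.8
    if relative_error < 0.1:
        return 0.5
    return 0.0
```

Because every graduated score is either 0.0 or at least 0.5, the final threshold check in step 6 needs no separate branch in this sketch.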

Date Type

Algorithm:

  1. Normalize extracted date to yyyy-MM-dd format
  2. Extract all dates from OCR using multiple patterns:
    • MM/DD/YYYY or DD/MM/YYYY
    • YYYY/MM/DD
    • Month name formats (e.g., "Jan 15, 2024")
  3. Normalize all OCR dates to yyyy-MM-dd
  4. Exact match: If normalized OCR date equals normalized extracted date → Score = 1.0
  5. Component similarity: Compare year, month, day components:
    • Calculate similarity: (matchingComponents / 3)
    • If at least 2 of the 3 components match (similarity ≥ 2/3 ≈ 0.67) → Return similarity score
  6. No match → Score = 0.0

Example:

  • Extracted: "2024-01-15"
  • OCR: "01/15/2024"
  • Normalized OCR: "2024-01-15"
  • Result: Exact match → Score = 1.0
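
A minimal sketch of the date agreement logic follows. The candidate pattern, the format list, and the choice to try MM/DD/YYYY before DD/MM/YYYY are all assumptions, since the source does not say how ambiguous dates are resolved.

```python
import re
from datetime import date, datetime

# Assumed formats; month-first is tried before day-first, which is a guess.
FORMATS = ("%m/%d/%Y", "%d/%m/%Y", "%Y/%m/%d", "%b %d, %Y")
CANDIDATE = re.compile(r"\d{1,4}/\d{1,2}/\d{1,4}|[A-Za-z]{3,9} \d{1,2}, \d{4}")


def parse_ocr_dates(ocr_text: str) -> list[date]:
    """Collect every substring that parses with one of the assumed formats."""
    found = []
    for match in CANDIDATE.finditer(ocr_text):
        for fmt in FORMATS:
            try:
                found.append(datetime.strptime(match.group(), fmt).date())
                break
            except ValueError:
                continue
    return found


def date_agreement(extracted: date, ocr_text: str) -> float:
    best = 0.0
    for candidate in parse_ocr_dates(ocr_text):
        if candidate == extracted:
            return 1.0
        matching = sum((
            candidate.year == extracted.year,
            candidate.month == extracted.month,
            candidate.day == extracted.day,
        ))
        best = max(best, matching / 3)
    # Two matching components out of three is the minimum accepted similarity.
    return best if best >= 2 / 3 else 0.0
```

The comparison uses the exact fraction 2/3 rather than the rounded 0.67, so a two-component match is not rejected by rounding.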

DateTime Type

Algorithm:
Similar to Date, but includes time components (hour, minute, second). Requires higher precision (threshold = 0.75).

Boolean Type

Algorithm:

  1. Normalize both values (lowercase, trim)
  2. Exact match only: If normalized values match → Score = 1.0
  3. No partial matching → Score = 0.0 if not exact

OCR Confidence Calculation by Data Type

String Type

Algorithm:

  1. Threshold check: If agreement score < 0.75 (and not exact match) → Return 0.0
  2. Find OCR word matching the extracted value:
    • Strategy 1: Exact word match → Use word confidence directly
    • Strategy 2: Word-by-word matching (for multi-word strings)
  3. For word-by-word matching:
    • Tokenize extracted value into words
    • For each word:
      • Step 1: Try exact match in OCR words
      • Step 2: If no exact match, try fuzzy matching using Fuzz.Ratio (threshold ≥ 0.75)
      • Step 3: If multiple fuzzy matches, use context disambiguation (previous/next words)
      • Step 4: For fuzzy matches, use symbol-level matching
  4. Symbol-level matching (for fuzzy matches):
    • Compare characters position-by-position
    • Only matching characters contribute to confidence
    • Average confidence of matching character symbols
  5. Return average confidence across all words

Example:

  • Extracted: "test"
  • OCR words: ["test"] with confidence 0.95
  • Result: Exact match → Confidence = 0.95

Number Type (Long/Double)

Algorithm:

  1. Threshold check: If agreement score < 0.5 → Return 0.0
  2. Find OCR word matching the matched text:
    • Try exact match of normalized number
    • Fallback: Extract numeric components (e.g., "1234.65" → ["1234", "65"])
    • Find all matching components in OCR words
  3. For exact matches: Use word confidence directly
  4. For partial matches: Use symbol-level matching (digits only):
    • Filter to digits only (ignore commas, periods, currency symbols)
    • Compare digits position-by-position
    • Only matching digits contribute to confidence
    • Average confidence of matching digit symbols
  5. If multiple components found: Combine all components for symbol-level matching

Example:

  • Extracted: "1234.56"
  • OCR: "1,234.65" (split into ["1", "234", "65"])
  • Matched: "1234.65" (closest number)
  • Components: ["1234", "65"] → Both found in OCR
  • Symbol-level: Compare "123456" vs "123465" → Positions 0-3 match → Average confidence of matching digits

Date Type

Algorithm:

  1. Only calculate for exact matches (agreement score = 1.0)
  2. Find OCR word matching the matched date text:
    • Strategy 1: Try exact match as single word
    • Strategy 2: Word-by-word matching (tokenize by spaces)
  3. For word-by-word matching:
    • Match each component (year, month, day) individually
    • Use context disambiguation if multiple matches exist
    • Average confidence across all matched components
  4. Return average confidence

Example:

  • Extracted: "2024-01-15"
  • OCR: "2024 01 15" (spaces between components)
  • Components: ["2024", "01", "15"]
  • Match each component → Average their confidences
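
The component-averaging idea can be sketched as follows. The OCR word representation (text, confidence), the function name, and the 0.0 result for non-exact agreement or for no matched component are assumptions; context disambiguation is omitted, so the first OCR word with matching text is used.

```python
def date_ocr_confidence(
    matched_text: str,
    ocr_words: list[tuple[str, float]],
    agreement: float,
) -> float:
    """Average OCR confidence of the words that make up the matched date text."""
    if agreement != 1.0:  # only exact agreement is scored
        return 0.0
    # Strategy 1: the whole date appears as a single OCR word.
    for text, confidence in ocr_words:
        if text == matched_text:
            return confidence
    # Strategy 2: match each space-separated component individually.
    confidences = []
    for component in matched_text.split():
        for text, confidence in ocr_words:
            if text == component:
                confidences.append(confidence)
                break
    return sum(confidences) / len(confidences) if confidences else 0.0
```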

DateTime Type

Algorithm:
Similar to Date, but includes time components (hour, minute, second) in component matching.

Boolean Type

Algorithm:

  1. Only calculate for exact matches (agreement score = 1.0)
  2. Find exact OCR word match
  3. Return word confidence directly

Final Score Calculation

The final confidence score uses a weighted formula that adapts based on OCR agreement strength:

Formula Selection Logic

if (ocrAgreement == 0.0 && ocrConfidence == 0.0) {
    // No OCR data available
    score = (0.9 × LLM) + (0.1 × Parsing)
}
else if (ocrAgreement >= 0.8) {
    // Strong OCR agreement - trust OCR more
    score = (0.35 × LLM) + (0.25 × OCR Agreement) + (0.25 × OCR Confidence) + (0.15 × Parsing)
}
else {
    // Weak OCR agreement - trust LLM more
    score = (0.65 × LLM) + (0.15 × OCR Agreement) + (0.15 × OCR Confidence) + (0.05 × Parsing)
}

Rationale

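  • No OCR: Relies primarily on LLM confidence (90% weight), with parsing contributing the remaining 10%
  • Strong OCR agreement (≥0.8): Balanced weighting across all components (LLM 35%, OCR agreement and OCR confidence 25% each, parsing 15%)
  • Weak OCR agreement (<0.8): Heavily weighted toward LLM (65%), since the OCR text does not corroborate the extracted value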

All scores are clamped to [0.0, 1.0] range.
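
For reference, the weighted formula and the clamping step translate directly into a small function. The function and parameter names are hypothetical; the weights and branch conditions are the ones listed above.

```python
def clamp(value: float) -> float:
    """Restrict a score to the [0.0, 1.0] range."""
    return max(0.0, min(1.0, value))


def final_score(llm: float, parsing: float, ocr_agreement: float, ocr_confidence: float) -> float:
    if ocr_agreement == 0.0 and ocr_confidence == 0.0:
        # No OCR data available
        score = 0.9 * llm + 0.1 * parsing
    elif ocr_agreement >= 0.8:
        # Strong OCR agreement
        score = 0.35 * llm + 0.25 * ocr_agreement + 0.25 * ocr_confidence + 0.15 * parsing
    else:
        # Weak OCR agreement
        score = 0.65 * llm + 0.15 * ocr_agreement + 0.15 * ocr_confidence + 0.05 * parsing
    return clamp(score)
```

With llm = 0.9 and parsing = 1.0, the three branches give 0.91 (no OCR), 0.9525 (agreement 1.0, OCR confidence 0.95), and 0.80 (agreement 0.5, OCR confidence 0.6).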

Overall Extraction Confidence

Algorithm:

  1. For each field in the extraction schema:
    • If field has extracted values → Use its FinalScore from confidence breakdown
    • If field has no extracted values → Use 0.0
  2. Calculate average: overallConfidence = average(FinalScore for all schema fields)

Purpose: Provides a single confidence metric representing the quality of the entire extraction across all attributes.
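
A short sketch of the averaging rule, assuming the final scores are available as a mapping from field name to FinalScore (the function name and data shapes are hypothetical). Fields missing from the mapping count as 0.0, which is why an extraction that skips fields scores lower overall.

```python
def overall_confidence(schema_fields: list[str], final_scores: dict[str, float]) -> float:
    """Average FinalScore over every schema field; fields without values count as 0.0."""
    if not schema_fields:
        return 0.0  # assumption: an empty schema yields 0.0
    total = sum(final_scores.get(field, 0.0) for field in schema_fields)
    return total / len(schema_fields)
```

For a schema of three fields where only two were extracted with scores 0.9 and 0.6, the result is (0.9 + 0.6 + 0.0) / 3 = 0.5.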

Symbol-Level Matching Algorithm

Used for partial/fuzzy matches to calculate OCR confidence more accurately.

Process

  1. Filter characters: Extract only relevant characters (digits for numbers, all characters for strings)
  2. Position-by-position comparison:
    • Compare targetChars[i] with ocrChars[i]
    • If characters match → Add corresponding symbol confidence to collection
    • If characters don't match → Skip (don't add to confidence)
  3. Calculate average: confidence = average(matching symbol confidences)

Example: Number Partial Match

Extracted: "1234" → digits: ['1', '2', '3', '4']
OCR: "1235" → digits: ['1', '2', '3', '5']
OCR symbols: [('1', 0.98), ('2', 0.96), ('3', 0.94), ('5', 0.92)]

Position 0: '1' == '1' → Match → Add 0.98
Position 1: '2' == '2' → Match → Add 0.96
Position 2: '3' == '3' → Match → Add 0.94
Position 3: '4' != '5' → No match → Skip

Matching confidences: [0.98, 0.96, 0.94]
Final confidence: (0.98 + 0.96 + 0.94) / 3 = 0.96
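
The same process as a small function, given as a sketch. The representation of OCR symbols as (character, confidence) pairs, the function name, and the 0.0 result when no position matches are assumptions.

```python
def symbol_level_confidence(
    target: str,
    ocr_symbols: list[tuple[str, float]],
    digits_only: bool = False,
) -> float:
    """Average the confidence of OCR symbols that equal the target at the same position."""
    target_chars = [c for c in target if c.isdigit()] if digits_only else list(target)
    symbols = [s for s in ocr_symbols if s[0].isdigit()] if digits_only else list(ocr_symbols)
    # zip stops at the shorter sequence, so trailing unmatched characters are ignored.
    matched = [conf for expected, (char, conf) in zip(target_chars, symbols) if expected == char]
    return sum(matched) / len(matched) if matched else 0.0
```

Called with target "1234" and the OCR symbols from the example, it averages 0.98, 0.96, and 0.94.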

Component Handling for Numbers

When numbers are split by formatting (commas, spaces, decimals), the system handles multiple components.

Algorithm

  1. Extract numeric components: Split on non-digits (e.g., "1234.65" → ["1234", "65"])
  2. Find matching components: Search for each component in OCR words
  3. Combine components: If multiple components found:
    • Concatenate digits in order
    • Concatenate symbol confidences in order
    • Create virtual OCR word with combined data
  4. Symbol-level matching: Use combined components for position-by-position comparison

Example

Extracted: "1234.56"
OCR words: ["1", "234", "65"] (split by comma and decimal)
Components: ["1234", "65"]
Matching: "234" and "65" found
Combined OCR: "23465" (digits from both components)
Combined symbols: [('2', 0.96), ('3', 0.94), ('4', 0.92), ('6', 0.93), ('5', 0.91)]

Symbol-level matching against "123456" (from extracted value)
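
The combining step can be sketched as below. How OCR words are selected as matches is left to the caller, since the source does not spell it out. The sketch assumes all three OCR words ("1", "234", "65") are combined, giving "123465" as the virtual word, and the 0.98 confidence for the leading "1" is an illustrative value not taken from the example. All names are hypothetical.

```python
import re

Symbol = tuple[str, float]  # (character, OCR confidence)


def numeric_components(text: str) -> list[str]:
    """Split on non-digits, e.g. "1234.65" -> ["1234", "65"]."""
    return re.findall(r"\d+", text)


def combine_words(words: list[list[Symbol]]) -> list[Symbol]:
    """Concatenate the digit symbols of the matched OCR words, in order."""
    combined: list[Symbol] = []
    for word in words:
        combined.extend((char, conf) for char, conf in word if char.isdigit())
    return combined


def combined_confidence(extracted: str, matched_words: list[list[Symbol]]) -> float:
    """Position-by-position match of the extracted digits against the virtual OCR word."""
    target = "".join(numeric_components(extracted))
    virtual_word = combine_words(matched_words)
    matched = [conf for digit, (char, conf) in zip(target, virtual_word) if digit == char]
    return sum(matched) / len(matched) if matched else 0.0


ocr_words = [
    [("1", 0.98)],
    [("2", 0.96), ("3", 0.94), ("4", 0.92)],
    [("6", 0.93), ("5", 0.91)],
]
score = combined_confidence("1234.56", ocr_words)
```

Here "123456" is compared with "123465": positions 0-3 match, so the result is the average of 0.98, 0.96, 0.94, and 0.92.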

Thresholds and Configuration

Agreement Thresholds

  • String: 0.75 (75% fuzzy match required)
  • Number: 0.5 (minimum agreement score; corresponds to a relative error below 10%)
  • Date: 0.67 (2 out of 3 components must match)
  • DateTime: 0.75 (higher precision required)

Number Relative Error Thresholds

  • Excellent: < 0.01 (1% error) → Score = 0.9
  • Good: < 0.05 (5% error) → Score = 0.8
  • Acceptable: < 0.1 (10% error) → Score = 0.5
  • Unacceptable: ≥ 0.1 → Score = 0.0
