Overview
This PR introduces a comprehensive confidence scoring system
for AI data extraction. The system calculates detailed confidence scores for
each extracted attribute and an overall extraction confidence score, stores
them in both the database and JSON mappings, and displays them in the review
page UI.
Complete Confidence Calculation Flow
High-Level Flow
1. LLM Extraction
↓
2. Parse LLM Response (extract LLM confidence + parsing confidence)
↓
3. OCR Agreement Calculation (compare extracted values with OCR text)
↓
4. OCR Confidence Calculation (evaluate OCR quality for matched text)
↓
5. Final Score Calculation (weighted combination of all components)
↓
6. Overall Confidence Calculation (average across all schema fields)
↓
7. Storage (JSON mapping + database)
↓
8. UI Display (review page widget)
Confidence Score Components
Each extracted attribute receives a ConfidenceBreakdown containing five distinct scores:
1. LLM Score (0.0-1.0)
Source: Confidence value taken directly from the LLM JSON response
Extraction: Read from the confidence property while parsing the JSON response
Meaning: Represents the LLM's own confidence in its extraction
2. Parsing Score (0.0-1.0)
3. OCR Agreement (0.0-1.0)
4. OCR Confidence (0.0-1.0)
Purpose: Measures OCR quality for the matched text
Source: Uses symbol-level confidence from Google Cloud Vision
Calculation: Type-specific calculation methods (detailed below)
5. Final Score (0.0-1.0)
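For reviewers who want a concrete picture of the breakdown, here is a minimal Python sketch of a container holding the five scores. The field names and the range check are assumptions made for illustration; the actual ConfidenceBreakdown type in this PR may be shaped differently.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ConfidenceBreakdown:
    """Illustrative container for the five per-attribute scores (names assumed)."""

    llm_score: float
    parsing_score: float
    ocr_agreement: float
    ocr_confidence: float
    final_score: float

    def __post_init__(self) -> None:
        # Every score is documented as lying in the 0.0-1.0 range.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} must be in [0.0, 1.0], got {value}")
```

Rejecting out-of-range values at construction time is one possible design; clamping them instead would be equally plausible.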
OCR Agreement Calculation by Data Type
String Type
Algorithm:
Normalize both the extracted value and the OCR text (lowercase, trim, normalize whitespace)
Exact match check: If normalized OCR contains normalized extracted value → Score = 1.0
Fuzzy match: Use Fuzz.PartialRatio to find the best substring match (score expressed on a 0.0-1.0 scale)
Threshold check: If fuzzy score ≥ 0.75 → Return fuzzy score, else 0.0
Example:
Extracted: "Invoice Number"
OCR: "Document: Invoice Number 12345"
Result: Exact match found → Score = 1.0
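The string steps can be sketched in Python as follows. This is an approximation under stated assumptions: `partial_ratio` is a hypothetical stand-in built on the standard library's `difflib.SequenceMatcher`, not the Fuzz.PartialRatio call used in the PR, so fuzzy scores will not be numerically identical.

```python
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 0.75  # string agreement threshold from this description


def normalize(text: str) -> str:
    """Lowercase, trim, and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def partial_ratio(needle: str, haystack: str) -> float:
    """Best similarity of `needle` against any same-length window of `haystack`.

    Stand-in for Fuzz.PartialRatio (assumption), returning a 0.0-1.0 value.
    """
    if not needle or not haystack:
        return 0.0
    if len(needle) > len(haystack):
        needle, haystack = haystack, needle
    width = len(needle)
    best = 0.0
    for start in range(len(haystack) - width + 1):
        window = haystack[start:start + width]
        best = max(best, SequenceMatcher(None, needle, window).ratio())
    return best


def string_agreement(extracted: str, ocr_text: str) -> float:
    value, ocr = normalize(extracted), normalize(ocr_text)
    if not value:
        return 0.0
    if value in ocr:  # exact (substring) match
        return 1.0
    score = partial_ratio(value, ocr)
    return score if score >= FUZZY_THRESHOLD else 0.0
```

With the example above, `string_agreement("Invoice Number", "Document: Invoice Number 12345")` takes the exact-match branch and returns 1.0.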
Number Type (Long/Double)
Algorithm:
Parse extracted value (remove currency symbols, commas)
Extract all numbers from OCR using regex:
  [$€£]?\s*[-+]?\d{1,3}(?:,?\d{3})*(?:\.\d+)?
Exact match: If any OCR number matches within tolerance (0.01) → Score = 1.0
Partial match: Find closest number and calculate relative error (in the example below, |extracted − OCR| / |OCR|)
Graduated scoring:
  relativeError < 0.01 (1%) → Score = 0.9
  relativeError < 0.05 (5%) → Score = 0.8
  relativeError < 0.1 (10%) → Score = 0.5
  relativeError ≥ 0.1 → Score = 0.0
Threshold check: Only return score if ≥ 0.5
Example:
Extracted: 1234.56
OCR: "1234.65" (decimal digits swapped)
Relative error: 0.09 / 1234.65 = 0.000073 (0.0073%)
Result: Score = 0.9 (not within the 0.01 tolerance, but relative error is below 1%)
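A minimal Python sketch of the number agreement steps, for illustration only. Assumptions: the 0.01 tolerance is treated as an absolute difference, and the relative error is divided by the OCR value as in the example above. The sketch uses its own simplified number pattern rather than the regex quoted above, because under Python's `re` that pattern splits an ungrouped value such as "1234.65" into "123" and "4.65".

```python
import re
from typing import Optional

# Simplified pattern (assumption): comma-grouped or plain digits, optional decimals.
NUMBER_RE = re.compile(r"[-+]?(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?")
TOLERANCE = 0.01  # assumed to be an absolute tolerance


def parse_number(text: str) -> Optional[float]:
    """Strip currency symbols, commas, and spaces, then parse as float."""
    cleaned = re.sub(r"[^\d.+-]", "", text)
    try:
        return float(cleaned)
    except ValueError:
        return None


def number_agreement(extracted: str, ocr_text: str) -> float:
    target = parse_number(extracted)
    if target is None:
        return 0.0
    candidates = [float(m.replace(",", "")) for m in NUMBER_RE.findall(ocr_text)]
    if not candidates:
        return 0.0
    closest = min(candidates, key=lambda c: abs(c - target))
    diff = abs(closest - target)
    if diff <= TOLERANCE:
        return 1.0
    if closest == 0:
        return 0.0
    relative_error = diff / abs(closest)
    if relative_error < 0.01:
        return 0.9
    if relative_error < 0.05:
        return 0.8
    if relative_error < 0.1:
        return 0.5
    return 0.0
```

For the example above, `number_agreement("1234.56", "Total: 1234.65")` misses the exact-match tolerance (difference 0.09) and falls into the first graduated band, returning 0.9.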
Date Type
Algorithm:
Normalize extracted date to yyyy-MM-dd format
Extract all dates from OCR using multiple patterns:
Normalize all OCR dates to yyyy-MM-dd
Exact match: If normalized OCR date equals normalized extracted date → Score = 1.0
Component similarity: Compare year, month, day components:
No match → Score = 0.0
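The date steps might look roughly like this in Python. Much of it is assumed: the two patterns (ISO and US-style) are illustrative choices because the PR's pattern list is not reproduced here, and the component-similarity rule (two of three components matching yields 0.67, based on the Date threshold of 0.67) is a guess at the scoring.

```python
import re
from typing import List, Tuple

YearMonthDay = Tuple[int, int, int]

# Illustrative patterns only (assumption): yyyy-MM-dd and MM/dd/yyyy.
ISO_RE = re.compile(r"\b(\d{4})-(\d{1,2})-(\d{1,2})\b")
US_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def find_dates(text: str) -> List[YearMonthDay]:
    """Return every recognized date as a normalized (year, month, day) tuple."""
    dates = [(int(y), int(m), int(d)) for y, m, d in ISO_RE.findall(text)]
    dates += [(int(y), int(m), int(d)) for m, d, y in US_RE.findall(text)]
    return dates


def date_agreement(extracted_iso: str, ocr_text: str) -> float:
    targets = find_dates(extracted_iso)
    if not targets:
        return 0.0
    target = targets[0]
    best = 0
    for candidate in find_dates(ocr_text):
        best = max(best, sum(a == b for a, b in zip(target, candidate)))
    if best == 3:
        return 1.0   # exact match
    if best == 2:
        return 0.67  # assumed component-similarity score
    return 0.0
```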
Example:
DateTime Type
Algorithm: Similar to Date, but includes time components (hour, minute, second). Requires higher precision (threshold = 0.75).
Boolean Type
Algorithm:
Normalize both values (lowercase, trim)
Exact match only: If normalized values match → Score = 1.0
No partial matching → Score = 0.0 if not exact
OCR Confidence Calculation by Data Type
String Type
Algorithm:
Threshold check: If agreement score < 0.75 (and not exact match) → Return 0.0
Find OCR word matching the extracted value:
For word-by-word matching:
Symbol-level matching (for fuzzy matches):
  Compare characters position-by-position
  Only matching characters contribute to confidence
  Average confidence of matching character symbols
Return average confidence across all words
Example:
Number Type (Long/Double)
Algorithm:
Threshold check: If agreement score < 0.5 → Return 0.0
Find OCR word matching the matched text:
  Try exact match of normalized number
  Fallback: Extract numeric components (e.g., "1234.65" → ["1234", "65"])
  Find all matching components in OCR words
For exact matches: Use word confidence directly
For partial matches: Use symbol-level matching (digits only):
  Filter to digits only (ignore commas, periods, currency symbols)
  Compare digits position-by-position
  Only matching digits contribute to confidence
  Average confidence of matching digit symbols
If multiple components found: Combine all components for symbol-level matching
Example:
Extracted: "1234.56"
OCR: "1,234.65" (split into ["1", "234", "65"])
Matched: "1234.65" (closest number)
Components: ["1234", "65"] → Both found in OCR
Symbol-level: Compare "123456" vs "123465" → Positions 0-3 match → Average confidence of matching digits
Date Type
Algorithm:
Only calculate for exact matches (agreement score = 1.0)
Find OCR word matching the matched date text:
For word-by-word matching:
  Match each component (year, month, day) individually
  Use context disambiguation if multiple matches exist
  Average confidence across all matched components
Return average confidence
Example:
Extracted: "2024-01-15"
OCR: "2024 01 15" (spaces between components)
Components: ["2024", "01", "15"]
Match each component → Average their confidences
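A small Python sketch of the component averaging for dates. It is simplified under assumptions: OCR words are modelled as hypothetical (text, confidence) pairs, the first matching word wins (no context disambiguation), and unmatched components are skipped rather than penalized.

```python
from typing import List, Tuple

OcrWord = Tuple[str, float]  # (word text, word confidence); shape assumed


def date_ocr_confidence(
    components: List[str],
    ocr_words: List[OcrWord],
    agreement_score: float,
) -> float:
    # OCR confidence is only calculated for exact agreement.
    if agreement_score != 1.0:
        return 0.0
    matched = []
    for component in components:
        # First word whose text equals the component (no disambiguation here).
        confidence = next((c for text, c in ocr_words if text == component), None)
        if confidence is not None:
            matched.append(confidence)
    return sum(matched) / len(matched) if matched else 0.0
```

For the example above, with word confidences 0.99, 0.95, and 0.97 for "2024", "01", and "15", the result is their mean, 0.97.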
DateTime Type
Algorithm: Similar to Date, but includes time components (hour, minute, second) in component matching.
Boolean Type
Algorithm:
Only calculate for exact matches (agreement score = 1.0)
Find exact OCR word match
Return word confidence directly
Final Score Calculation
The final confidence score uses a weighted formula that
adapts based on OCR agreement strength:
if (ocrAgreement == 0.0 && ocrConfidence == 0.0) {
    // No OCR data available
    score = (0.9 * llm) + (0.1 * parsing);
}
else if (ocrAgreement >= 0.8) {
    // Strong OCR agreement - trust OCR more
    score = (0.35 * llm) + (0.25 * ocrAgreement) + (0.25 * ocrConfidence) + (0.15 * parsing);
}
else {
    // Weak OCR agreement - trust LLM more
    score = (0.65 * llm) + (0.15 * ocrAgreement) + (0.15 * ocrConfidence) + (0.05 * parsing);
}
Rationale
No OCR: Relies primarily on LLM confidence (90% weight)
Strong OCR agreement (≥0.8): Balanced weighting across all components
Weak OCR agreement (<0.8): Heavily weighted toward LLM (65%) since OCR is unreliable
All scores are clamped to [0.0, 1.0] range.
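To make the arithmetic easy to check, here is a runnable Python rendering of the weighted formula with the clamp applied. The function and parameter names are illustrative, not the identifiers used in the PR.

```python
def clamp(value: float) -> float:
    """Restrict a score to the [0.0, 1.0] range."""
    return max(0.0, min(1.0, value))


def final_score(llm: float, parsing: float,
                ocr_agreement: float, ocr_confidence: float) -> float:
    if ocr_agreement == 0.0 and ocr_confidence == 0.0:
        # No OCR data available
        score = 0.9 * llm + 0.1 * parsing
    elif ocr_agreement >= 0.8:
        # Strong OCR agreement: trust OCR more
        score = (0.35 * llm + 0.25 * ocr_agreement
                 + 0.25 * ocr_confidence + 0.15 * parsing)
    else:
        # Weak OCR agreement: trust LLM more
        score = (0.65 * llm + 0.15 * ocr_agreement
                 + 0.15 * ocr_confidence + 0.05 * parsing)
    return clamp(score)
```

As a worked case, an attribute with LLM 0.9, parsing 1.0, OCR agreement 1.0, and OCR confidence 0.96 lands in the strong-agreement branch: 0.315 + 0.25 + 0.24 + 0.15 = 0.955. The weights in each branch sum to 1.0, so in-range inputs always yield an in-range score and the clamp only matters for out-of-range inputs.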
Overall Confidence Calculation
Algorithm:
For each field in the extraction schema:
Calculate average: overallConfidence = average(FinalScore for all schema fields)
Purpose: Provides a single confidence metric representing the quality of the entire extraction across all attributes.
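The averaging step could be sketched in Python as below. How schema fields without an extracted value are treated is not spelled out here, so the `missing_score` parameter (defaulting to 0.0) is an assumption, as are the function and field names.

```python
from typing import Dict, List


def overall_confidence(
    schema_fields: List[str],
    final_scores: Dict[str, float],
    missing_score: float = 0.0,  # assumed score for fields with no extraction
) -> float:
    """Average the final score over every field in the schema."""
    if not schema_fields:
        return 0.0
    total = sum(final_scores.get(name, missing_score) for name in schema_fields)
    return total / len(schema_fields)
```

Averaging over the schema rather than over the extracted attributes means a field the extraction missed entirely pulls the overall score down under this assumption.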
Symbol-Level Matching Algorithm
Used for partial/fuzzy matches to calculate OCR confidence
more accurately.
Process
Filter characters: Extract only relevant characters (digits for numbers, all characters for strings)
Position-by-position comparison:
  Compare targetChars[i] with ocrChars[i]
  If characters match → Add corresponding symbol confidence to collection
  If characters don't match → Skip (don't add to confidence)
Calculate average: confidence = average(matching symbol confidences)
Example: Number Partial Match
Extracted: "1234" → digits: ['1', '2', '3', '4']
OCR: "1235" → digits: ['1', '2', '3', '5']
OCR symbols: [('1', 0.98), ('2', 0.96), ('3', 0.94), ('5', 0.92)]
Position 0: '1' == '1' → Match → Add 0.98
Position 1: '2' == '2' → Match → Add 0.96
Position 2: '3' == '3' → Match → Add 0.94
Position 3: '4' != '5' → No match → Skip
Matching confidences: [0.98, 0.96, 0.94]
Final confidence: (0.98 + 0.96 + 0.94) / 3 = 0.96
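The process and example above can be reproduced with a short Python sketch. The function name, the `digits_only` flag, and the (character, confidence) tuple shape are illustrative assumptions; returning 0.0 when nothing matches is also assumed.

```python
from typing import List, Tuple

Symbol = Tuple[str, float]  # (character, symbol confidence); shape assumed


def symbol_level_confidence(
    target: str,
    ocr_symbols: List[Symbol],
    digits_only: bool = False,
) -> float:
    # Filter characters: digits for numbers, all characters for strings.
    if digits_only:
        target_chars = [c for c in target if c.isdigit()]
        symbols = [(c, conf) for c, conf in ocr_symbols if c.isdigit()]
    else:
        target_chars = list(target)
        symbols = list(ocr_symbols)
    # Position-by-position comparison: only matching positions contribute.
    matching = [conf for t, (c, conf) in zip(target_chars, symbols) if t == c]
    return sum(matching) / len(matching) if matching else 0.0


ocr = [("1", 0.98), ("2", 0.96), ("3", 0.94), ("5", 0.92)]
confidence = symbol_level_confidence("1234", ocr, digits_only=True)
```

Here `confidence` averages the first three symbol confidences, (0.98 + 0.96 + 0.94) / 3, which is 0.96 up to floating-point rounding.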
Component Handling for Numbers
When numbers are split by formatting (commas, spaces,
decimals), the system handles multiple components.
Algorithm
Extract numeric components: Split on non-digits (e.g., "1234.65" → ["1234", "65"])
Find matching components: Search for each component in OCR words
Combine components: If multiple components found:
  Concatenate digits in order
  Concatenate symbol confidences in order
  Create virtual OCR word with combined data
Symbol-level matching: Use combined components for position-by-position comparison
Example
Extracted: "1234.56"
OCR words: ["1", "234", "65"] (split by comma and decimal)
Components: ["1234", "65"]
Matching: "234" and "65" found
Combined OCR: "23465" (digits from both components)
Combined symbols: [('2', 0.96), ('3', 0.94), ('4', 0.92), ('6', 0.93), ('5', 0.91)]
Symbol-level matching against "123456" (from extracted value)
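The "combine components" step can be illustrated with a brief Python sketch. It covers only the concatenation into a virtual OCR word; how matching OCR words are selected is left out, and the function name and data shapes are assumptions.

```python
from typing import List, Tuple

Symbol = Tuple[str, float]  # (character, symbol confidence); shape assumed


def combine_components(matched_words: List[List[Symbol]]) -> Tuple[str, List[Symbol]]:
    """Concatenate digits and their confidences, in order, into one virtual word."""
    symbols = [s for word in matched_words for s in word if s[0].isdigit()]
    digits = "".join(ch for ch, _ in symbols)
    return digits, symbols


# The two OCR words reported as matching in the example above.
word_234 = [("2", 0.96), ("3", 0.94), ("4", 0.92)]
word_65 = [("6", 0.93), ("5", 0.91)]
combined_digits, combined_symbols = combine_components([word_234, word_65])
```

`combined_digits` is "23465" and `combined_symbols` carries the five confidences in the same order, ready for position-by-position comparison.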
Thresholds and Configuration
Agreement Thresholds
String: 0.75 (75% fuzzy match required)
Number: 0.5 (minimum agreement score, reached when relative error is below 10%)
Date: 0.67 (2 out of 3 components must match)
DateTime: 0.75 (higher precision required)
Number Relative Error Thresholds
Excellent: < 0.01 (1% error) → Score = 0.9
Good: < 0.05 (5% error) → Score = 0.8
Acceptable: < 0.1 (10% error) → Score = 0.5
Unacceptable: ≥ 0.1 → Score = 0.0