TalentID Global — PDF Parsing - Smedley Group Playbook

Overview

TalentID Global parses karting championship PDF result books into a versioned JSON format.

Output directory: data/out-json/v2/ (tabula engine) or data/out-json/v3/ (Claude Vision engine)
Entry point: python src/unified_parser.py [--file PATH | --championship FOLDER_OR_NUM]
Tests: pytest tests/test_output_json.py

CLI

# Parse a single file
python src/unified_parser.py --file data/in-pdf/01_FIAAcademy_FAC/2024/240425_R_FAC_DAR_ACA.pdf

# Parse an entire championship folder (full name or bare number)
python src/unified_parser.py --championship 01_FIAAcademy_FAC
python src/unified_parser.py --championship 5

# Parse all championships
python src/unified_parser.py

# Use the Claude Vision engine (outputs v3 JSON)
python src/unified_parser.py --engine claude --championship 01_FIAAcademy_FAC

Already-parsed files are skipped. Delete the output JSON to force a reparse. When using --engine claude, run only one process at a time — concurrent parses cause rate-limit contention and may corrupt output files.
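
The skip rule can be pictured with a minimal sketch. It assumes the output path mirrors the input path under the output root, with the JSON sitting flat in the year folder. `output_path_for` and `needs_parse` are hypothetical helper names for illustration, not functions taken from unified_parser.py.

```python
from pathlib import Path

IN_ROOT = Path("data/in-pdf")
OUT_ROOT = Path("data/out-json/v2")  # assumed default; the Claude engine writes under v3/


def output_path_for(pdf_path: Path, in_root: Path = IN_ROOT, out_root: Path = OUT_ROOT) -> Path:
    """Map <in_root>/<championship>/<year>/<code>.pdf to <out_root>/<championship>/<year>/<code>.json."""
    relative = pdf_path.relative_to(in_root)
    return (out_root / relative).with_suffix(".json")


def needs_parse(pdf_path: Path, in_root: Path = IN_ROOT, out_root: Path = OUT_ROOT) -> bool:
    """A PDF is parsed only when its output JSON does not exist yet."""
    return not output_path_for(pdf_path, in_root, out_root).exists()
```

Under this sketch, deleting the output JSON is what makes `needs_parse` return True again, which matches the "delete to force a reparse" behaviour described above.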

JSON Schemas

v2 — Tabula Engine

Output mirrors data/in-pdf/ under data/out-json/v2/, with JSONs sitting flat inside the year folder (no per-event subfolder).

Session fields: rawSessionName, name, sessionNumber, type, laps (session lap count), results
Result fields: pos, name, nationality, laps, bestLapTime, bestLapNumber, delta, idealLapTime, gapToLeader, gapToNext, raceTime, bestS1/S2/S3, lapData
Lap fields: lapNumber, lapTime, s1, s2, s3, timeOfDay (omitted when empty)

v2.1 — Classification Extension

Output written to data/out-json/v2.1/ in parallel with v2/. Adds session-level fields:

Field	Type	Notes
`maxLaps`	integer	Total laps in the session
`sectorAnalysis`	object	Replaces `results` and `laps` keys
`notes`	string[]	Stewards’ decision strings stripped from footnote rows
`classification`	object[]	Race-type sessions only (`qualifyingHeat`, `preFinal`, `final`)

ClassificationEntry fields: pos, kartNo (integer string, e.g. "509"), driver, laps, gap, gapNoPenalties, penalty, posNoPenalties, positionsGained, firstLapOffset

v3 — Claude Vision Engine

Output at data/out-json/v3/<championship>/<year>/<code>.json. Version field: "version": "3.0".

{
  "version": "3.0",
  "event": {
    "code": "240425_R_FAC_DAR_ACA",
    "date": "2024-04-25",
    "championship": "FAC",
    "track": "DAR",
    "category": "ACA"
  },
  "sessions": [
    {
      "rawSessionName": "Academy Final",
      "name": "Final",
      "sessionNumber": 1,
      "type": "final",
      "maxLaps": 18,
      "notes": [],
      "results": [
        {
          "pos": 1,
          "kartNo": "509",
          "name": "Smith, John",
          "nationality": "United Kingdom",
          "status": "classified",
          "laps": 18,
          "bestLapTime": 52.341,
          "bestLapNumber": 7,
          "delta": 0.0,
          "gapToLeader": 0.0,
          "raceTime": 987.654,
          "firstLapOffset": 0.0,
          "lapData": [
            { "lapNumber": 1, "lapTime": 58.123, "s1": null, "s2": null, "s3": null },
            { "lapNumber": 2, "lapTime": 52.900, "s1": 18.1, "s2": 16.4, "s3": 18.4 }
          ]
        }
      ]
    }
  ]
}

Valid status values: classified, notClassified, DNF, DSQ, retired
Valid type values: practice, qualifying, qualifyingHeat, preFinal, final, race

pos is omitted entirely for non-classified drivers. Do not use "pos": null — the field must be absent.
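
As an illustration of the "absent, not null" rule, here is a minimal sketch using plain dicts. `result_to_dict` is a hypothetical helper and the second driver is made-up sample data; the real writer works from Pydantic models, and how it omits the key is not documented here.

```python
import json
from typing import Optional


def result_to_dict(name: str, kart_no: str, status: str, pos: Optional[int] = None) -> dict:
    """Build a result entry; the `pos` key is left out entirely when there is no position."""
    entry = {"kartNo": kart_no, "name": name, "status": status}
    if pos is not None:
        entry = {"pos": pos, **entry}  # keep pos first, as in the schema example
    return entry


classified = result_to_dict("Smith, John", "509", "classified", pos=1)
retired = result_to_dict("Doe, Jane", "512", "DNF")  # illustrative driver, no position

print(json.dumps(retired))  # the serialised entry contains no "pos" key at all
```

With Pydantic models, serialising with `exclude_none=True` is one common way to get a similar effect; whether the v3 writer does this is an assumption.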

Architecture

Tabula Pipeline (v2 / v2.1)

The tabula pipeline reads PDF pages using tabula-py in stream mode. Championship-specific subclasses of the base parser handle layout differences in column detection, driver name extraction, and sector analysis. Coordinate configuration lives in config/areas/*.yaml, one file per championship. The Championship dataclass loads all 16 championships from YAML at import time via Championship.from_yaml().

Key files:

File	Purpose
`src/unified_parser.py`	Entry point and CLI
`src/championship_areas.py`	Loads championship YAML configs at import
`config/areas/*.yaml`	Per-championship tabula coordinate definitions
`tests/test_output_json.py`	v2 test suite (parametrised over all output JSONs)
`tests/test_output_json_v21.py`	v2.1 test suite (11 tests)

Claude Vision Pipeline (v3)

PDF → PageRenderer (renders pages to PNG base64)
    → PageClassifier (detects: EVENT_METADATA / RESULTS / LAP_ANALYSIS / SKIP)
    → ClaudeExtractor (sends image to Claude API via tool use)
    → ResponseValidator (validates + builds Pydantic v3 models)
    → SessionMerger
        Pass 1: collect RESULTS sessions
        Pass 2: merge lapData from LAP_ANALYSIS pages by kartNo
        Pass 3: compute raceTime, idealLapTime, maxLaps, firstLapOffset
    → JSONWriter (writes to data/out-json/v3/)
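
The merge passes above can be sketched roughly as follows. This is not the code in merger.py: the function names, the dict-based data shapes, the choice to return None when a lap time is missing, and taking maxLaps as the largest lap count are all assumptions made for illustration.

```python
from typing import List, Optional


def merge_lap_data(results: List[dict], lap_pages: List[dict]) -> None:
    """Pass 2: attach lapData from LAP_ANALYSIS entries to results, matched by kartNo."""
    laps_by_kart = {entry["kartNo"]: entry["lapData"] for entry in lap_pages}
    for result in results:
        result["lapData"] = laps_by_kart.get(result["kartNo"], [])


def race_time(lap_data: List[dict]) -> Optional[float]:
    """Pass 3 (part): sum the lap times; None when there are no laps or a lap time is missing."""
    times = [lap.get("lapTime") for lap in lap_data]
    if not times or any(t is None for t in times):
        return None
    return round(sum(times), 3)


def compute_derived(session: dict) -> None:
    """Pass 3 (part): fill raceTime per result and maxLaps per session (assumed: largest lap count)."""
    for result in session["results"]:
        result["raceTime"] = race_time(result["lapData"])
    session["maxLaps"] = max((len(r["lapData"]) for r in session["results"]), default=0)


session = {"results": [{"kartNo": "509"}, {"kartNo": "512"}]}
lap_pages = [
    {"kartNo": "509", "lapData": [{"lapNumber": 1, "lapTime": 58.123},
                                  {"lapNumber": 2, "lapTime": 52.9}]},
]
merge_lap_data(session["results"], lap_pages)
compute_derived(session)
```

Matching on kartNo rather than driver name avoids spelling differences between RESULTS and LAP_ANALYSIS pages, which is presumably why the pipeline merges by kart number.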

Key files:

File	Purpose
`src/claude_parser.py`	Top-level orchestrator
`src/pdf_parsing/claude/page_classifier.py`	Page type detection
`src/pdf_parsing/claude/extractor.py`	Claude API calls via tool use
`src/pdf_parsing/claude/validator.py`	Response validation; builds Pydantic models
`src/pdf_parsing/claude/merger.py`	Session merging and derived field computation
`src/pdf_parsing/claude/writer.py`	v3 JSON output
`src/pdf_parsing/models/v3.py`	Pydantic v3 model classes

API configuration:

Page type	`max_tokens`	Rationale
`EVENT_METADATA`	1024	Cover page metadata only
`RESULTS`	8192	34+ driver pages overflow 4096
`LAP_ANALYSIS`	8192	Dense lap tables; 4096 causes truncated responses

Model: claude-sonnet-4-6.
Rate limiting: max_retries=5, time.sleep(1) between pages (~60 req/min).
Estimated cost: ~$3.70–4.50 per PDF.

Page cache (resume on interruption): Each successful API response is cached immediately to data/out-json/v3/.page-cache/<championship>/<year>/<code>/<page_num:03d>.json. On restart, cached pages are loaded from disk, so only uncached pages incur cost. The cache is never deleted automatically. To re-run pipeline logic without hitting the API, delete only the output JSON, then reparse. The parser logs "Resuming: N pages already cached" at startup.
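
A minimal sketch of the page-cache idea, assuming one JSON file per page at the path pattern given above. `cache_path` and `extract_page` are hypothetical names, and `call_api` stands in for whatever the extractor actually does to reach the Claude API.

```python
import json
from pathlib import Path
from typing import Callable

CACHE_ROOT = Path("data/out-json/v3/.page-cache")


def cache_path(championship: str, year: str, code: str, page_num: int,
               root: Path = CACHE_ROOT) -> Path:
    """Build <root>/<championship>/<year>/<code>/<page_num:03d>.json."""
    return root / championship / year / code / f"{page_num:03d}.json"


def extract_page(path: Path, call_api: Callable[[], dict]) -> dict:
    """Return the cached response if present; otherwise call the API and cache the result at once."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    response = call_api()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(response), encoding="utf-8")
    return response
```

Because the cache is written before any merging or validation, a later change to pipeline logic can be re-run against cached pages without paying for the API call again.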

Key Design Decisions

Decision	Rationale
Race time from lap times, not sector sums	Sectors are optional; lap time is mandatory. One missing sector no longer nullifies the entire race time.
Sector baseline = laps 2–N	Lap 1 (out-lap) is structurally different across championships — some omit S1. Excluding it makes sector analysis consistent.
Per-sector all-or-nothing validity	If any lap 2–N is missing a sector reading, that sector’s best is null. Prevents `idealLapTime` mixing partial and complete data.
`idealLapTime` capped at `bestLapTime`	Sectors can come from different laps; their sum can exceed the best single lap after incidents. Cap prevents misleading values.
Consensus-based ESK column correction	S1/S2 swap applied only when all first laps show the shift pattern, avoiding false positives.
NaN placed last in sort order	`nan < x` always returns `False` in Python. Two-tuple key `(isnan, value)` is deterministic.
No per-event subfolder in v2 output	Reduces folder depth; all events for a year sit flat in the year directory.
`_JSON_VERSION` constant	Single place to bump version; drives both folder name and JSON field.
Nationality as full country name	`pycountry` `.name` / `.common_name` is the canonical reference; more readable than ISO codes.
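
The sector-related decisions in the table combine roughly as in the sketch below. The function names are hypothetical, and returning None for idealLapTime whenever any sector best is null is an assumption about how the all-or-nothing rule is applied.

```python
from typing import Dict, List, Optional

SECTORS = ("s1", "s2", "s3")


def best_sectors(lap_data: List[dict]) -> Dict[str, Optional[float]]:
    """Best time per sector over laps 2..N; None if any of those laps lacks that sector."""
    baseline = [lap for lap in lap_data if lap["lapNumber"] >= 2]  # lap 1 is excluded
    best: Dict[str, Optional[float]] = {}
    for key in SECTORS:
        values = [lap.get(key) for lap in baseline]
        complete = bool(values) and all(v is not None for v in values)
        best[key] = min(values) if complete else None
    return best


def ideal_lap_time(lap_data: List[dict], best_lap_time: float) -> Optional[float]:
    """Sum of the best sectors, capped at bestLapTime; None when any sector best is null."""
    best = best_sectors(lap_data)
    if any(v is None for v in best.values()):
        return None
    return min(round(sum(best.values()), 3), best_lap_time)


laps = [
    {"lapNumber": 1, "lapTime": 58.123, "s1": None, "s2": None, "s3": None},
    {"lapNumber": 2, "lapTime": 52.9, "s1": 18.1, "s2": 16.4, "s3": 18.4},
    {"lapNumber": 3, "lapTime": 52.8, "s1": 18.0, "s2": 16.5, "s3": 18.3},
]
ideal = ideal_lap_time(laps, best_lap_time=52.8)
```

In this sample the best sectors come from different laps (s1 and s3 from lap 3, s2 from lap 2), and the missing sectors on lap 1 do not invalidate anything because lap 1 sits outside the baseline.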

Championship-Specific Parsing Rules

Column detection always uses header name, not position — column count and positions vary per championship and even per event.

	FAC	ESK	IES	BKC	FWC
Position column	First token of `Rnk`	First token of `Rnk`	Unnamed first column	First token of `Rnk`	First token of `Rnk No`
Kart number	`No.`	`No.`	`No.`	`No.`	Second token of `Rnk No`
Positions gained	Second token of `Rnk`	Second token of `Rnk`	Separate `Rnk` column	Second token of `Rnk`	Not present
Official time	`Gap` from leader	`Gap` from leader	`Gap` from leader	Direct `Time` column	`Gap` from leader
Laps column	Clean	Merged into `Equipment`	Clean	Clean	Clean
Penalty format	`+30.000`	`+10.000`	`+5.000`	`+5.00`	`+10.000`
DNF format	`"X Laps"` in Gap	`"X Laps"` in Gap	`"X Laps"` in Gap	`"X Laps"` in Gap	`"X Laps"` or `"Retired"`
Not Classified	No	No	No	Literal row + DSQ below	Literal row + `"Retired"`
Footnote pattern	`"No.NNN ..."`	`"No.NNN ..."`	`"No.NNN ..."`	`"No.NNN ..."`	`"NoNNN ..."` (no dot)

Shared rules (all championships):

DNF / non-finisher: Gap contains "Lap" or "Retired" → firstLapOffset = null
Footnote row: first cell matches ^No\.?\d+ → strip row
"Not Classified" literal row: stop processing; skip all rows below
Penalty: strip leading +, parse as float; default 0.0 if empty; set offset null if unparseable
Leader gap blank → treat as 0.0; all other positions with blank gap → None (parsing failure)
kartNo normalisation: pandas reads numeric columns as floats — always normalise with str(int(float(raw))) → "509"
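
Several of the shared rules above are small enough to sketch directly. The helper names are hypothetical and cells are assumed to arrive as strings (or, for kartNo, as whatever pandas produced); only the `str(int(float(raw)))` expression and the `^No\.?\d+` pattern come from the rules themselves.

```python
import re
from typing import Optional

FOOTNOTE_RE = re.compile(r"^No\.?\d+")


def normalise_kart_no(raw) -> str:
    """pandas may yield 509.0 for a numeric column; normalise to the string "509"."""
    return str(int(float(raw)))


def parse_penalty(raw: Optional[str]) -> Optional[float]:
    """Strip a leading '+', parse as float; empty gives 0.0; unparseable gives None."""
    if raw is None or not raw.strip():
        return 0.0
    try:
        return float(raw.strip().lstrip("+"))
    except ValueError:
        return None  # caller is expected to null firstLapOffset in this case


def is_footnote_row(first_cell: str) -> bool:
    """Footnote rows start with 'No.NNN' or 'NoNNN' (FWC omits the dot)."""
    return bool(FOOTNOTE_RE.match(first_cell))


def is_non_finisher(gap: str) -> bool:
    """A Gap containing 'Lap' or 'Retired' marks a DNF / non-finisher."""
    return "Lap" in gap or "Retired" in gap
```

Keeping the penalty parser's three outcomes distinct (a value, 0.0, or None) lets the caller tell "no penalty" apart from "could not read the penalty".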

Test Suites

File	Covers
`tests/test_output_json.py`	v2: parametrised over every `*.json` under `data/out-json/v2/`
`tests/test_output_json_v21.py`	v2.1: 11 tests
`tests/test_output_json_v3.py`	v3: parametrised over `data/out-json/v3/**/*.json`
`tests/test_v3_models.py`	Unit tests for Pydantic v3 models
`tests/test_v3_writer.py`	Unit tests for `JSONWriter`

The sector sum check (test_sector_sum_equals_lap_time) is skipped when lapTime > sector_sum × 1.5 — neutralised/pit laps where sectors cover only the on-track portion. The _NEUTRALISED_LAP_RATIO = 1.5 constant in the test guards against false failures.
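
The skip condition can be sketched as below. Only the `_NEUTRALISED_LAP_RATIO = 1.5` constant and the `lapTime > sector_sum × 1.5` comparison come from the description above; the function name, the tolerance value, and skipping laps with missing sectors are assumptions.

```python
from typing import Optional

_NEUTRALISED_LAP_RATIO = 1.5
TOLERANCE = 0.002  # assumed rounding tolerance in seconds; not taken from the test suite


def sector_sum_check(lap: dict) -> Optional[bool]:
    """Return True/False for pass/fail, or None when the check is skipped."""
    sectors = [lap.get("s1"), lap.get("s2"), lap.get("s3")]
    if any(s is None for s in sectors):
        return None  # nothing to compare against
    sector_sum = sum(sectors)
    if lap["lapTime"] > sector_sum * _NEUTRALISED_LAP_RATIO:
        return None  # neutralised or pit lap: sectors cover only the on-track portion
    return abs(lap["lapTime"] - sector_sum) <= TOLERANCE
```

A lap of 120.0 s with sectors summing to 52.9 s is skipped under this rule, while a lap of 53.5 s with the same sectors is checked and fails.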

The v3 test suite requires at least one parsed JSON to exist in data/out-json/v3/. It skips cleanly if the directory is empty.

Gotchas

tabula stream mode column drift: On continuation pages, S3 values can land in Unnamed columns instead of Sector 3. The fix in clean_df_columns scans Unnamed columns as a fallback — do not remove it.

Cached output files: unified_parser.py skips existing JSONs. Always delete the output file before reparsing to see code changes take effect.

_canonical_session_name must stay in sync: This function exists in both claude_parser.py and merger.py. Both must apply the same normalisation or lap data silently fails to merge. Strips Results, Lap Time Analysis, Lap Analysis, and category prefixes (e.g. Academy , KZ ).

NaN sort behaviour: nan < x always returns False regardless of x. Always use (math.isnan(v), v) as the sort key when NaN is a possible value; plain min() / sort() with NaN produces undefined ordering.

Kart number exactly 6 chars: 'No.211' has 6 characters. The standard x[6:] slice returns ''. The driver name may appear in the adjacent Sector 2 column; the fallback to self.S2KEY handles this.

MM:SS lap times: Lap times over 60 s are displayed as MM:SS.sss (e.g. 1:08.783). System prompt rule 12 instructs Claude to convert to decimal seconds. If a wet-weather session shows systematic sector sum mismatches, check whether lapTime values are near 1.0x instead of 6x.

Multi-column lap truncation: LAP_ANALYSIS pages use a 3-column layout. Drivers whose lap block starts near the bottom of a column can have continuation rows dropped. Affected drivers have short raceTime and suppressed firstLapOffset. A full reparse is required to fix this.

firstLapOffset reliability guard: If the leader’s lap count differs from the modal lap count by even 1, all firstLapOffset values for that session are suppressed. The most common cause is multi-column lap truncation above.

Session order is reverse-chronological in FAC PDFs: claude_parser.py reverses the sessions list before numbering so session numbers ascend in race-day order. Different PDF ordering in other championships will produce inverted numbers.

Combined QP overview page: FAC events include a Qualifying Practice (QP) session aggregating both QPS1 and QPS2. _is_standings_session in claude_parser.py detects '(qp)' in the lowercased name and skips it. Do not remove this check.
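
For the NaN sort behaviour noted above, a small demonstration of the `(math.isnan(v), v)` key. The sample values are illustrative and `nan_last` is a hypothetical name for the key function.

```python
import math

values = [52.9, float("nan"), 52.341, 58.123]


def nan_last(v: float):
    """Sort key: real numbers first in ascending order, NaN after them."""
    return (math.isnan(v), v)


ordered = sorted(values, key=nan_last)  # NaN ends up in the final slot
best = min(values, key=nan_last)        # NaN can never win the minimum
```

The first tuple element decides every comparison between a NaN and a real number, so the NaN value itself is only ever compared against other NaNs.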

Match-rate guard: Before computing firstLapOffset, the parser checks how many classification driver names appear in sector analysis. If fewer than 50% match, firstLapOffset is suppressed for all entries. This catches old FAC multi-heat combined tables (38 drivers vs a 12-driver per-heat sector analysis).

Old per-event subfolder JSONs: These still exist alongside new flat-structure v2 JSONs and have stale data. Tests pick up both. Consider deleting the subfolder files and reparsing cleanly.

Nationality failures (FAC 2015–2021): OCR-based nationality detection fails for these older PDFs. Affects ~20 events. Not yet investigated. In v3, <UNKNOWN> nationality is returned by Claude when it cannot read a nationality flag; the test suite skips any nationality value starting with <.
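
The match-rate guard described above amounts to a simple ratio test. This sketch assumes exact string matching on driver names and uses hypothetical function names; the parser may normalise names before comparing.

```python
from typing import Iterable, List

MIN_MATCH_RATE = 0.5  # below this, firstLapOffset is suppressed for all entries


def match_rate(classification_names: List[str], sector_names: Iterable[str]) -> float:
    """Fraction of classification drivers that also appear in sector analysis."""
    if not classification_names:
        return 0.0
    sector_set = set(sector_names)
    matched = sum(1 for name in classification_names if name in sector_set)
    return matched / len(classification_names)


def should_compute_first_lap_offset(classification_names: List[str],
                                    sector_names: Iterable[str]) -> bool:
    """True when at least half of the classification names are found in sector analysis."""
    return match_rate(classification_names, sector_names) >= MIN_MATCH_RATE


# Combined 38-driver table against a 12-driver per-heat sector analysis (placeholder names).
combined = [f"Driver {i}" for i in range(38)]
per_heat = combined[:12]
```

For the 38 vs 12 case the rate is 12/38, roughly 0.32, so the guard suppresses firstLapOffset for the whole session.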

Next Steps

Claude Vision pipeline:

1. Parse remaining FAC 2024 PDFs: 240620_R_FAC_SVK_ACA.pdf and 240801_R_FAC_KRI_ACA.pdf
2. Verify multi-column lap truncation fix after next reparse (system prompt rule 11 strengthened)
3. Confirm wet-session sector sum mismatches resolved after MM:SS conversion fix (system prompt rule 12)

v2.1 pipeline:

4. Investigate firstLapOffset outliers in firstLapOffset_outliers.csv: 284 entries outside [-5, 15]; large outliers (101.446, -32.382) pending investigation
5. Investigate 25 failing test_classification_driver_count_matches_sector_analysis cases (driver count off by 1–4)
6. Implement classification extraction for ESK, BKC, FWC. _extract_classification_df is already written; calibrate results_area in each YAML first using scripts/probe_results_area.py
7. Add results_area coordinates to ESK, BKC, FWC YAMLs (probe script currently set to ESK)

General:

8. Delete old per-event subfolder JSONs and reparse all championships cleanly
9. Investigate FIA Academy nationality detection failure for 2015–2021 events
