
Overview

TalentID Global parses karting championship PDF result books into a versioned JSON format.
  • Output directory: data/out-json/v2/ (tabula engine) or data/out-json/v3/ (Claude Vision engine)
  • Entry point: python src/unified_parser.py [--file PATH | --championship FOLDER_OR_NUM]
  • Tests: pytest tests/test_output_json.py

CLI

# Parse a single file
python src/unified_parser.py --file data/in-pdf/01_FIAAcademy_FAC/2024/240425_R_FAC_DAR_ACA.pdf

# Parse an entire championship folder (full name or bare number)
python src/unified_parser.py --championship 01_FIAAcademy_FAC
python src/unified_parser.py --championship 5

# Parse all championships
python src/unified_parser.py

# Use the Claude Vision engine (outputs v3 JSON)
python src/unified_parser.py --engine claude --championship 01_FIAAcademy_FAC
Already-parsed files are skipped. Delete the output JSON to force a reparse. When using --engine claude, run only one process at a time — concurrent parses cause rate-limit contention and may corrupt output files.
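The --championship flag accepts either a full folder name or a bare number. A minimal sketch of how such a value could be resolved to a folder is shown here; resolve_championship and the second folder name are hypothetical and are not taken from unified_parser.py.

```python
from typing import Iterable, Optional


def resolve_championship(arg: str, folders: Iterable[str]) -> Optional[str]:
    """Map a --championship value to a folder name (hypothetical helper).

    Accepts a full folder name ("01_FIAAcademy_FAC") or a bare number ("1"),
    which is compared against the numeric prefix before the first underscore.
    """
    names = list(folders)
    if arg in names:
        return arg
    if arg.isdigit():
        wanted = int(arg)
        for name in names:
            prefix = name.split("_", 1)[0]
            if prefix.isdigit() and int(prefix) == wanted:
                return name
    return None


# "05_Example_XYZ" is a made-up folder name used only for illustration.
folders = ["01_FIAAcademy_FAC", "05_Example_XYZ"]
print(resolve_championship("5", folders))
```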

JSON Schemas

v2 — Tabula Engine

Output mirrors data/in-pdf/ under data/out-json/v2/, with JSONs sitting flat inside the year folder (no per-event subfolder).
  • Session fields: rawSessionName, name, sessionNumber, type, laps (session lap count), results
  • Result fields: pos, name, nationality, laps, bestLapTime, bestLapNumber, delta, idealLapTime, gapToLeader, gapToNext, raceTime, bestS1/S2/S3, lapData
  • Lap fields: lapNumber, lapTime, s1, s2, s3, timeOfDay (omitted when empty)
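The mirroring rule and the skip-if-already-parsed behaviour can be sketched as follows. The helper names v2_output_path and needs_parse are illustrative assumptions, not the parser's actual functions.

```python
from pathlib import Path

IN_ROOT = Path("data/in-pdf")
OUT_ROOT = Path("data/out-json/v2")


def v2_output_path(pdf_path: Path) -> Path:
    """Mirror an input PDF path under the v2 output root (hypothetical helper).

    The JSON sits flat in the year folder, so only the root and the
    file extension change.
    """
    relative = pdf_path.relative_to(IN_ROOT)
    return (OUT_ROOT / relative).with_suffix(".json")


def needs_parse(pdf_path: Path) -> bool:
    """An existing output JSON means the file is skipped."""
    return not v2_output_path(pdf_path).exists()


pdf = Path("data/in-pdf/01_FIAAcademy_FAC/2024/240425_R_FAC_DAR_ACA.pdf")
print(v2_output_path(pdf).as_posix())
```

Under this sketch, deleting the output JSON is the only way to make needs_parse return True again, which matches the reparse advice in the CLI section.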

v2.1 — Classification Extension

Output written to data/out-json/v2.1/ in parallel with v2/. Adds session-level fields:
| Field | Type | Notes |
| --- | --- | --- |
| maxLaps | integer | Total laps in the session |
| sectorAnalysis | object | Replaces results and laps keys |
| notes | string[] | Stewards’ decision strings stripped from footnote rows |
| classification | object[] | Race-type sessions only (qualifyingHeat, preFinal, final) |
ClassificationEntry fields: pos, kartNo (integer string, e.g. "509"), driver, laps, gap, gapNoPenalties, penalty, posNoPenalties, positionsGained, firstLapOffset

v3 — Claude Vision Engine

Output at data/out-json/v3/<championship>/<year>/<code>.json. Version field: "version": "3.0".
{
  "version": "3.0",
  "event": {
    "code": "240425_R_FAC_DAR_ACA",
    "date": "2024-04-25",
    "championship": "FAC",
    "track": "DAR",
    "category": "ACA"
  },
  "sessions": [
    {
      "rawSessionName": "Academy Final",
      "name": "Final",
      "sessionNumber": 1,
      "type": "final",
      "maxLaps": 18,
      "notes": [],
      "results": [
        {
          "pos": 1,
          "kartNo": "509",
          "name": "Smith, John",
          "nationality": "United Kingdom",
          "status": "classified",
          "laps": 18,
          "bestLapTime": 52.341,
          "bestLapNumber": 7,
          "delta": 0.0,
          "gapToLeader": 0.0,
          "raceTime": 987.654,
          "firstLapOffset": 0.0,
          "lapData": [
            { "lapNumber": 1, "lapTime": 58.123, "s1": null, "s2": null, "s3": null },
            { "lapNumber": 2, "lapTime": 52.900, "s1": 18.1, "s2": 16.4, "s3": 18.4 }
          ]
        }
      ]
    }
  ]
}
Valid status values: classified, notClassified, DNF, DSQ, retired
Valid type values: practice, qualifying, qualifyingHeat, preFinal, final, race
pos is omitted entirely for non-classified drivers. Do not use "pos": null — the field must be absent.
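To illustrate the "absent, not null" rule for pos, here is a minimal sketch that uses plain dictionaries rather than the project's Pydantic models. build_result is a hypothetical helper, and the real writer may enforce this differently.

```python
import json
from typing import Any, Dict, Optional

VALID_STATUS = {"classified", "notClassified", "DNF", "DSQ", "retired"}


def build_result(name: str, status: str, pos: Optional[int] = None) -> Dict[str, Any]:
    """Build a result entry, leaving out "pos" unless the driver is classified."""
    if status not in VALID_STATUS:
        raise ValueError(f"unknown status: {status!r}")
    entry: Dict[str, Any] = {}
    if status == "classified" and pos is not None:
        entry["pos"] = pos  # only classified drivers carry a position
    entry["name"] = name
    entry["status"] = status
    return entry


print(json.dumps(build_result("Smith, John", "classified", 1)))
print(json.dumps(build_result("Smith, John", "DNF")))
```

The second call serialises without any "pos" key, which is what the schema requires for non-classified drivers.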

Architecture

Tabula Pipeline (v2 / v2.1)

The tabula pipeline reads PDF pages using tabula-py in stream mode. Championship-specific subclasses of the base parser handle layout differences in column detection, driver name extraction, and sector analysis. Coordinate configuration lives in config/areas/*.yaml, one file per championship. The Championship dataclass loads all 16 championships from YAML at import time via Championship.from_yaml().
Key files:
| File | Purpose |
| --- | --- |
| src/unified_parser.py | Entry point and CLI |
| src/championship_areas.py | Loads championship YAML configs at import |
| config/areas/*.yaml | Per-championship tabula coordinate definitions |
| tests/test_output_json.py | v2 test suite (parametrised over all output JSONs) |
| tests/test_output_json_v21.py | v2.1 test suite (11 tests) |

Claude Vision Pipeline (v3)

PDF → PageRenderer (renders pages to PNG base64)
    → PageClassifier (detects: EVENT_METADATA / RESULTS / LAP_ANALYSIS / SKIP)
    → ClaudeExtractor (sends image to Claude API via tool use)
    → ResponseValidator (validates + builds Pydantic v3 models)
    → SessionMerger
        Pass 1: collect RESULTS sessions
        Pass 2: merge lapData from LAP_ANALYSIS pages by kartNo
        Pass 3: compute raceTime, idealLapTime, maxLaps, firstLapOffset
    → JSONWriter (writes to data/out-json/v3/)
Key files:
| File | Purpose |
| --- | --- |
| src/claude_parser.py | Top-level orchestrator |
| src/pdf_parsing/claude/page_classifier.py | Page type detection |
| src/pdf_parsing/claude/extractor.py | Claude API calls via tool use |
| src/pdf_parsing/claude/validator.py | Response validation; builds Pydantic models |
| src/pdf_parsing/claude/merger.py | Session merging and derived field computation |
| src/pdf_parsing/claude/writer.py | v3 JSON output |
| src/pdf_parsing/models/v3.py | Pydantic v3 model classes |
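The SessionMerger passes in the pipeline above can be sketched roughly as below, assuming results and lap pages are plain dictionaries keyed by kartNo. merge_lap_data is a hypothetical function, and only the raceTime part of Pass 3 is shown, following the "race time from lap times" design decision; the real merger.py works on Pydantic models and computes more fields.

```python
from typing import Any, Dict, List


def merge_lap_data(
    results: List[Dict[str, Any]],
    lap_pages: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Attach lapData from LAP_ANALYSIS entries to results, matched by kartNo."""
    by_kart = {r["kartNo"]: r for r in results}

    # Pass 2: merge lap rows by kartNo; entries with no matching result are ignored.
    for entry in lap_pages:
        target = by_kart.get(entry["kartNo"])
        if target is None:
            continue
        target.setdefault("lapData", []).extend(entry["lapData"])

    # Pass 3 (partial): order laps and derive raceTime as the sum of lap times.
    for result in results:
        laps = sorted(result.get("lapData", []), key=lambda lap: lap["lapNumber"])
        result["lapData"] = laps
        times = [lap["lapTime"] for lap in laps]
        complete = bool(times) and all(t is not None for t in times)
        result["raceTime"] = round(sum(times), 3) if complete else None
    return results


results = [{"kartNo": "509"}, {"kartNo": "512"}]  # "512" is illustrative
lap_pages = [
    {"kartNo": "509", "lapData": [{"lapNumber": 2, "lapTime": 52.900}]},
    {"kartNo": "509", "lapData": [{"lapNumber": 1, "lapTime": 58.123}]},
    {"kartNo": "999", "lapData": [{"lapNumber": 1, "lapTime": 60.0}]},
]
merged = merge_lap_data(results, lap_pages)
```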
API configuration:
| Page type | max_tokens | Rationale |
| --- | --- | --- |
| EVENT_METADATA | 1024 | Cover page metadata only |
| RESULTS | 8192 | 34+ driver pages overflow 4096 |
| LAP_ANALYSIS | 8192 | Dense lap tables; 4096 causes truncated responses |
Model: claude-sonnet-4-6.
Rate limiting: max_retries=5, time.sleep(1) between pages (~60 req/min).
Estimated cost: ~$3.70–4.50 per PDF.
Page cache (resume on interruption): Each successful API response is cached immediately to data/out-json/v3/.page-cache/<championship>/<year>/<code>/<page_num:03d>.json. On restart, cached pages are loaded from disk, so only uncached pages incur cost. The cache is never deleted automatically. To re-run pipeline logic without hitting the API, delete only the output JSON, then reparse. The parser logs Resuming: N pages already cached at startup.
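The page-cache layout and resume behaviour can be sketched as below. cache_path and load_or_extract are hypothetical names, and the extract callable stands in for the real Claude API call; the actual implementation may differ.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict

CACHE_ROOT = Path("data/out-json/v3/.page-cache")


def cache_path(championship: str, year: int, code: str, page_num: int,
               root: Path = CACHE_ROOT) -> Path:
    """Build the per-page cache path with a zero-padded page number."""
    return root / championship / str(year) / code / f"{page_num:03d}.json"


def load_or_extract(path: Path, extract: Callable[[], Dict[str, Any]]) -> Dict[str, Any]:
    """Return the cached response if present; otherwise call extract() and cache it."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    response = extract()  # stand-in for the API call
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(response), encoding="utf-8")
    return response
```

Because the response is written as soon as extract() returns, an interrupted run loses at most the page that was in flight.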

Key Design Decisions

| Decision | Rationale |
| --- | --- |
| Race time from lap times, not sector sums | Sectors are optional; lap time is mandatory. One missing sector no longer nullifies the entire race time. |
| Sector baseline = laps 2–N | Lap 1 (out-lap) is structurally different across championships; some omit S1. Excluding it makes sector analysis consistent. |
| Per-sector all-or-nothing validity | If any lap 2–N is missing a sector reading, that sector’s best is null. Prevents idealLapTime mixing partial and complete data. |
| idealLapTime capped at bestLapTime | Sectors can come from different laps; their sum can exceed the best single lap after incidents. Cap prevents misleading values. |
| Consensus-based ESK column correction | S1/S2 swap applied only when all first laps show the shift pattern, avoiding false positives. |
| NaN placed last in sort order | nan < x always returns False in Python. Two-tuple key (isnan, value) is deterministic. |
| No per-event subfolder in v2 output | Reduces folder depth; all events for a year sit flat in the year directory. |
| _JSON_VERSION constant | Single place to bump version; drives both folder name and JSON field. |
| Nationality as full country name | pycountry .name / .common_name is the canonical reference; more readable than ISO codes. |
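The three sector-related decisions (baseline of laps 2–N, all-or-nothing validity, and the bestLapTime cap) combine as in the sketch below. The function names are hypothetical, and returning None for idealLapTime when any sector best is null is an assumption rather than documented behaviour.

```python
from typing import Any, Dict, List, Optional


def best_sector(laps: List[Dict[str, Any]], key: str) -> Optional[float]:
    """Best value for one sector over laps 2..N, or None if any reading is missing."""
    baseline = [lap for lap in laps if lap["lapNumber"] >= 2]  # lap 1 excluded
    values = [lap.get(key) for lap in baseline]
    if not values or any(v is None for v in values):
        return None  # all-or-nothing: one gap invalidates the sector
    return min(values)


def ideal_lap_time(laps: List[Dict[str, Any]], best_lap_time: float) -> Optional[float]:
    """Sum of best sectors, capped at the best single lap time."""
    bests = [best_sector(laps, key) for key in ("s1", "s2", "s3")]
    if any(b is None for b in bests):
        return None  # assumption: no ideal lap without all three sector bests
    return min(round(sum(bests), 3), best_lap_time)


laps = [
    {"lapNumber": 1, "lapTime": 58.123, "s1": None, "s2": None, "s3": None},
    {"lapNumber": 2, "lapTime": 52.900, "s1": 18.1, "s2": 16.4, "s3": 18.4},
    {"lapNumber": 3, "lapTime": 52.800, "s1": 18.0, "s2": 16.5, "s3": 18.3},
]
```

Lap 1 has no sector readings here, yet it does not invalidate anything because it lies outside the baseline.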

Championship-Specific Parsing Rules

Column detection always uses header name, not position — column count and positions vary per championship and even per event.
| Rule | FAC | ESK | IES | BKC | FWC |
| --- | --- | --- | --- | --- | --- |
| Position column | First token of Rnk | First token of Rnk | Unnamed first column | First token of Rnk | First token of Rnk No |
| Kart number | No. | No. | No. | No. | Second token of Rnk No |
| Positions gained | Second token of Rnk | Second token of Rnk | Separate Rnk column | Second token of Rnk | Not present |
| Official time | Gap from leader | Gap from leader | Gap from leader | Direct Time column | Gap from leader |
| Laps column | Clean | Merged into Equipment | Clean | Clean | Clean |
| Penalty format | +30.000 | +10.000 | +5.000 | +5.00 | +10.000 |
| DNF format | "X Laps" in Gap | "X Laps" in Gap | "X Laps" in Gap | "X Laps" in Gap | "X Laps" or "Retired" |
| Not Classified | No | No | No | Literal row + DSQ below | Literal row + "Retired" |
| Footnote pattern | "No.NNN ..." | "No.NNN ..." | "No.NNN ..." | "No.NNN ..." | "NoNNN ..." (no dot) |
Shared rules (all championships):
  • DNF / non-finisher: Gap contains "Lap" or "Retired" → firstLapOffset = null
  • Footnote row: first cell matches ^No\.?\d+ → strip row
  • "Not Classified" literal row: stop processing; skip all rows below
  • Penalty: strip leading +, parse as float; default 0.0 if empty; set offset null if unparseable
  • Leader gap blank → treat as 0.0; all other positions with blank gap → None (parsing failure)
  • kartNo normalisation: pandas reads numeric columns as floats, so always normalise with str(int(float(raw))) → "509"
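Several of the shared rules above are small enough to sketch directly. The helper names are hypothetical; only the regex, the str(int(float(raw))) expression, and the default values come from the rules themselves.

```python
import re
from typing import Optional, Union

FOOTNOTE_RE = re.compile(r"^No\.?\d+")


def normalise_kart_no(raw: Union[str, float, int]) -> str:
    """Turn pandas-style floats such as 509.0 into the string "509"."""
    return str(int(float(raw)))


def is_footnote_row(first_cell: str) -> bool:
    """True when the first cell looks like "No.211 ..." or "No211 ..."."""
    return bool(FOOTNOTE_RE.match(first_cell.strip()))


def parse_penalty(raw: Optional[str]) -> Optional[float]:
    """Strip the leading '+', default to 0.0 when empty, None when unparseable.

    A None return signals the caller to null the offset for that entry.
    """
    if raw is None or not raw.strip():
        return 0.0
    try:
        return float(raw.strip().lstrip("+"))
    except ValueError:
        return None


def is_non_finisher(gap: str) -> bool:
    """Gap cells containing "Lap" or "Retired" mark DNF / non-finishers."""
    return "Lap" in gap or "Retired" in gap
```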

Test Suites

| File | Covers |
| --- | --- |
| tests/test_output_json.py | v2: parametrised over every *.json under data/out-json/v2/ |
| tests/test_output_json_v21.py | v2.1: 11 tests |
| tests/test_output_json_v3.py | v3: parametrised over data/out-json/v3/**/*.json |
| tests/test_v3_models.py | Unit tests for Pydantic v3 models |
| tests/test_v3_writer.py | Unit tests for JSONWriter |
The sector sum check (test_sector_sum_equals_lap_time) is skipped when lapTime > sector_sum × 1.5 — neutralised/pit laps where sectors cover only the on-track portion. The _NEUTRALISED_LAP_RATIO = 1.5 constant in the test guards against false failures.
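The skip condition can be written as a small predicate. This is a sketch only: should_check_sector_sum is a hypothetical name, and skipping laps with missing sectors is an assumption added so the example is self-contained.

```python
from typing import Any, Dict

_NEUTRALISED_LAP_RATIO = 1.5


def should_check_sector_sum(lap: Dict[str, Any]) -> bool:
    """Decide whether a lap is eligible for the sector-sum comparison."""
    sectors = [lap.get(key) for key in ("s1", "s2", "s3")]
    if lap.get("lapTime") is None or any(s is None for s in sectors):
        return False  # assumption: incomplete laps are not compared
    # Skip neutralised/pit laps, where lapTime far exceeds the on-track sectors.
    return lap["lapTime"] <= sum(sectors) * _NEUTRALISED_LAP_RATIO


normal_lap = {"lapNumber": 2, "lapTime": 52.900, "s1": 18.1, "s2": 16.4, "s3": 18.4}
pit_lap = {"lapNumber": 9, "lapTime": 120.0, "s1": 18.1, "s2": 16.4, "s3": 18.4}
out_lap = {"lapNumber": 1, "lapTime": 58.123, "s1": None, "s2": None, "s3": None}
```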
The v3 test suite requires at least one parsed JSON to exist in data/out-json/v3/. It skips cleanly if the directory is empty.

Gotchas

tabula stream mode column drift: On continuation pages, S3 values can land in Unnamed columns instead of Sector 3. The fix in clean_df_columns scans Unnamed columns as a fallback — do not remove it.
Cached output files: unified_parser.py skips existing JSONs. Always delete the output file before reparsing to see code changes take effect.
_canonical_session_name must stay in sync: This function exists in both claude_parser.py and merger.py. Both must apply the same normalisation or lap data silently fails to merge. It strips Results, Lap Time Analysis, Lap Analysis, and category prefixes (e.g. "Academy ", "KZ ").
NaN sort behaviour: nan < x always returns False regardless of x. Always use (math.isnan(v), v) as the sort key when NaN is a possible value. Plain min() / sort() with NaN gives results that depend on input order.
Kart number exactly 6 chars: 'No.211' has 6 characters. The standard x[6:] slice returns ''. The driver name may appear in the adjacent Sector 2 column; the fallback to self.S2KEY handles this.
MM:SS lap times: Lap times over 60 s are displayed as MM:SS.sss (e.g. 1:08.783). System prompt rule 12 instructs Claude to convert to decimal seconds. If a wet-weather session shows systematic sector sum mismatches, check whether lapTime values are near 1.0x instead of 6x.
Multi-column lap truncation: LAP_ANALYSIS pages use a 3-column layout. Drivers whose lap block starts near the bottom of a column can have continuation rows dropped. Affected drivers have short raceTime and suppressed firstLapOffset. A full reparse is required to fix this.
firstLapOffset reliability guard: If the leader’s lap count differs from the modal lap count by even 1, all firstLapOffset values for that session are suppressed. The most common cause is the multi-column lap truncation described above.
Session order is reverse-chronological in FAC PDFs: claude_parser.py reverses the sessions list before numbering so session numbers ascend in race-day order. Different PDF ordering in other championships will produce inverted numbers.
Combined QP overview page: FAC events include a Qualifying Practice (QP) session aggregating both QPS1 and QPS2. _is_standings_session in claude_parser.py detects '(qp)' in the lowercased name and skips it. Do not remove this check.
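The NaN sort-key rule is easy to demonstrate with the standard library alone. nan_last_key is an illustrative name for the (math.isnan(v), v) key the gotcha recommends.

```python
import math
from typing import Tuple


def nan_last_key(value: float) -> Tuple[bool, float]:
    """Sort key that orders real numbers first and pushes NaN to the end."""
    return (math.isnan(value), value)


times = [52.9, float("nan"), 52.341, 58.123]

ordered = sorted(times, key=nan_last_key)  # NaN ends up last
best = min(times, key=nan_last_key)        # NaN can never win

# Without the key, the result depends on where NaN sits in the input:
first_nan = min([float("nan"), 1.0])   # NaN is kept, since 1.0 < nan is False
first_real = min([1.0, float("nan")])  # 1.0 is kept, since nan < 1.0 is False
```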
Match-rate guard: Before computing firstLapOffset, the parser checks how many classification driver names appear in sector analysis. If fewer than 50% match, firstLapOffset is suppressed for all entries. This catches old FAC multi-heat combined tables (38 drivers vs a 12-driver per-heat sector analysis).
Old per-event subfolder JSONs: These still exist alongside new flat-structure v2 JSONs and have stale data. Tests pick up both. Consider deleting the subfolder files and reparsing cleanly.
Nationality failures (FAC 2015–2021): OCR-based nationality detection fails for these older PDFs. Affects ~20 events. Not yet investigated.
In v3, <UNKNOWN> nationality is returned by Claude when it cannot read a nationality flag. The test suite skips any nationality value starting with <.
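The match-rate guard amounts to a ratio test. The sketch below assumes exact string matching on driver names; offsets_allowed is a hypothetical name and the real comparison may normalise names first.

```python
from typing import Iterable, List


def offsets_allowed(
    classification_names: List[str],
    sector_names: Iterable[str],
    threshold: float = 0.5,
) -> bool:
    """Return False when fewer than `threshold` of classification names match."""
    if not classification_names:
        return False
    sector_set = set(sector_names)
    matched = sum(1 for name in classification_names if name in sector_set)
    return matched / len(classification_names) >= threshold


# Placeholder names: a 38-driver combined table against a 12-driver heat.
combined = [f"Driver {i}" for i in range(38)]
heat = combined[:12]
print(offsets_allowed(combined, heat))
```

With 12 of 38 names matching (about 32%), the guard suppresses firstLapOffset for the whole session.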

Next Steps

Claude Vision pipeline:
  1. Parse remaining FAC 2024 PDFs — 240620_R_FAC_SVK_ACA.pdf and 240801_R_FAC_KRI_ACA.pdf
  2. Verify multi-column lap truncation fix after next reparse (system prompt rule 11 strengthened)
  3. Confirm wet-session sector sum mismatches resolved after MM:SS conversion fix (system prompt rule 12)
v2.1 pipeline:
  4. Investigate firstLapOffset outliers in firstLapOffset_outliers.csv: 284 entries outside [-5, 15]; large outliers (101.446, -32.382) pending investigation
  5. Investigate 25 failing test_classification_driver_count_matches_sector_analysis cases (driver count off by 1–4)
  6. Implement classification extraction for ESK, BKC, FWC. _extract_classification_df is already written; calibrate results_area in each YAML first using scripts/probe_results_area.py
  7. Add results_area coordinates to ESK, BKC, FWC YAMLs (probe script currently set to ESK)
General:
  8. Delete old per-event subfolder JSONs and reparse all championships cleanly
  9. Investigate FIA Academy nationality detection failure for 2015–2021 events
Last modified on June 23, 2026