Overview
TalentID Global parses karting championship PDF result books into a versioned JSON format.
- Output directory: data/out-json/v2/ (tabula engine) or data/out-json/v3/ (Claude Vision engine)
- Entry point: python src/unified_parser.py [--file PATH | --championship FOLDER_OR_NUM]
- Tests: pytest tests/test_output_json.py
CLI
# Parse a single file
python src/unified_parser.py --file data/in-pdf/01_FIAAcademy_FAC/2024/240425_R_FAC_DAR_ACA.pdf
# Parse an entire championship folder (full name or bare number)
python src/unified_parser.py --championship 01_FIAAcademy_FAC
python src/unified_parser.py --championship 5
# Parse all championships
python src/unified_parser.py
# Use the Claude Vision engine (outputs v3 JSON)
python src/unified_parser.py --engine claude --championship 01_FIAAcademy_FAC
Already-parsed files are skipped. Delete the output JSON to force a reparse. When using --engine claude, run only one process at a time — concurrent parses cause rate-limit contention and may corrupt output files.
JSON Schemas
v2 — Tabula Engine
Output mirrors data/in-pdf/ under data/out-json/v2/, with JSONs sitting flat inside the year folder (no per-event subfolder).
Session fields: rawSessionName, name, sessionNumber, type, laps (session lap count), results
Result fields: pos, name, nationality, laps, bestLapTime, bestLapNumber, delta, idealLapTime, gapToLeader, gapToNext, raceTime, bestS1/S2/S3, lapData
Lap fields: lapNumber, lapTime, s1, s2, s3, timeOfDay (omitted when empty)
v2.1 — Classification Extension
Output written to data/out-json/v2.1/ in parallel with v2/. Adds session-level fields:
| Field | Type | Notes |
|---|---|---|
| maxLaps | integer | Total laps in the session |
| sectorAnalysis | object | Replaces results and laps keys |
| notes | string[] | Stewards’ decision strings stripped from footnote rows |
| classification | object[] | Race-type sessions only (qualifyingHeat, preFinal, final) |
ClassificationEntry fields: pos, kartNo (integer string, e.g. "509"), driver, laps, gap, gapNoPenalties, penalty, posNoPenalties, positionsGained, firstLapOffset
v3 — Claude Vision Engine
Output at data/out-json/v3/<championship>/<year>/<code>.json. Version field: "version": "3.0".
{
"version": "3.0",
"event": {
"code": "240425_R_FAC_DAR_ACA",
"date": "2024-04-25",
"championship": "FAC",
"track": "DAR",
"category": "ACA"
},
"sessions": [
{
"rawSessionName": "Academy Final",
"name": "Final",
"sessionNumber": 1,
"type": "final",
"maxLaps": 18,
"notes": [],
"results": [
{
"pos": 1,
"kartNo": "509",
"name": "Smith, John",
"nationality": "United Kingdom",
"status": "classified",
"laps": 18,
"bestLapTime": 52.341,
"bestLapNumber": 7,
"delta": 0.0,
"gapToLeader": 0.0,
"raceTime": 987.654,
"firstLapOffset": 0.0,
"lapData": [
{ "lapNumber": 1, "lapTime": 58.123, "s1": null, "s2": null, "s3": null },
{ "lapNumber": 2, "lapTime": 52.900, "s1": 18.1, "s2": 16.4, "s3": 18.4 }
]
}
]
}
]
}
Valid status values: classified, notClassified, DNF, DSQ, retired
Valid type values: practice, qualifying, qualifyingHeat, preFinal, final, race
pos is omitted entirely for non-classified drivers. Do not use "pos": null — the field must be absent.
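A minimal sketch of the "absent, not null" rule for pos, using plain dicts rather than the project's Pydantic models. The function name result_to_dict and its parameters are hypothetical and only illustrate the serialisation rule.

```python
from typing import Any, Optional


def result_to_dict(name: str, status: str, pos: Optional[int] = None) -> dict[str, Any]:
    """Build a result entry; 'pos' is left out entirely when there is no position."""
    entry: dict[str, Any] = {}
    if pos is not None:
        entry["pos"] = pos  # only classified drivers carry a position
    entry["name"] = name
    entry["status"] = status
    return entry


classified = result_to_dict("Smith, John", "classified", pos=1)
retired = result_to_dict("Doe, Jane", "DNF")  # no "pos" key at all
```

With Pydantic models, a comparable effect is usually obtained by excluding None values at dump time, but whether the writer does that is not stated here.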
Architecture
Tabula Pipeline (v2 / v2.1)
The tabula pipeline reads PDF pages using tabula-py in stream mode. Championship-specific subclasses of the base parser handle layout differences in column detection, driver name extraction, and sector analysis.
Coordinate configuration lives in config/areas/*.yaml — one file per championship. The Championship dataclass loads all 16 championships from YAML at import time via Championship.from_yaml().
Key files:
| File | Purpose |
|---|---|
| src/unified_parser.py | Entry point and CLI |
| src/championship_areas.py | Loads championship YAML configs at import |
| config/areas/*.yaml | Per-championship tabula coordinate definitions |
| tests/test_output_json.py | v2 test suite (parametrised over all output JSONs) |
| tests/test_output_json_v21.py | v2.1 test suite (11 tests) |
Claude Vision Pipeline (v3)
PDF → PageRenderer (renders pages to PNG base64)
→ PageClassifier (detects: EVENT_METADATA / RESULTS / LAP_ANALYSIS / SKIP)
→ ClaudeExtractor (sends image to Claude API via tool use)
→ ResponseValidator (validates + builds Pydantic v3 models)
→ SessionMerger
Pass 1: collect RESULTS sessions
Pass 2: merge lapData from LAP_ANALYSIS pages by kartNo
Pass 3: compute raceTime, idealLapTime, maxLaps, firstLapOffset
→ JSONWriter (writes to data/out-json/v3/)
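Passes 2 and 3 of the merger can be sketched roughly as follows. This is an illustration under assumptions, not the code in merger.py: the input shapes (a list of result dicts, and one dict per LAP_ANALYSIS page mapping kartNo to lap entries) and the name merge_lap_data are hypothetical.

```python
def merge_lap_data(results: list[dict], lap_pages: list[dict]) -> list[dict]:
    """Attach lap data to results by kartNo, then derive raceTime from lap times."""
    laps_by_kart: dict[str, list[dict]] = {}
    for page in lap_pages:
        for kart_no, laps in page.items():
            # A driver's laps may continue across several pages.
            laps_by_kart.setdefault(kart_no, []).extend(laps)

    for result in results:
        laps = sorted(laps_by_kart.get(result["kartNo"], []),
                      key=lambda lap: lap["lapNumber"])
        result["lapData"] = laps
        # Race time is the sum of lap times; None when no laps were merged.
        result["raceTime"] = round(sum(lap["lapTime"] for lap in laps), 3) if laps else None
    return results
```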
Key files:
| File | Purpose |
|---|---|
| src/claude_parser.py | Top-level orchestrator |
| src/pdf_parsing/claude/page_classifier.py | Page type detection |
| src/pdf_parsing/claude/extractor.py | Claude API calls via tool use |
| src/pdf_parsing/claude/validator.py | Response validation; builds Pydantic models |
| src/pdf_parsing/claude/merger.py | Session merging and derived field computation |
| src/pdf_parsing/claude/writer.py | v3 JSON output |
| src/pdf_parsing/models/v3.py | Pydantic v3 model classes |
API configuration:
| Page type | max_tokens | Rationale |
|---|---|---|
| EVENT_METADATA | 1024 | Cover page metadata only |
| RESULTS | 8192 | 34+ driver pages overflow 4096 |
| LAP_ANALYSIS | 8192 | Dense lap tables; 4096 causes truncated responses |
Model: claude-sonnet-4-6. Rate limiting: max_retries=5, time.sleep(1) between pages (~60 req/min). Estimated cost: ~$3.70–4.50 per PDF.
Page cache (resume on interruption):
Each successful API response is cached immediately to data/out-json/v3/.page-cache/<championship>/<year>/<code>/<page_num:03d>.json. On restart, cached pages are loaded from disk — only uncached pages incur cost. The cache is never deleted automatically.
To re-run pipeline logic without hitting the API: delete only the output JSON, then reparse. The parser logs Resuming: N pages already cached at startup.
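The cache layout described above can be sketched like this. The helper names cache_path and load_or_fetch, and the fetch callback, are assumptions for illustration; only the directory layout and the zero-padded page number come from the description.

```python
import json
from pathlib import Path

CACHE_ROOT = Path("data/out-json/v3/.page-cache")


def cache_path(championship: str, year, code: str, page_num: int,
               root: Path = CACHE_ROOT) -> Path:
    """<root>/<championship>/<year>/<code>/<page_num:03d>.json"""
    return root / championship / str(year) / code / f"{page_num:03d}.json"


def load_or_fetch(path: Path, fetch):
    """Return the cached response if present; otherwise call fetch() and cache it."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    response = fetch()  # hypothetical callable wrapping the API request
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(response), encoding="utf-8")
    return response
```

Writing the cache entry right after each successful call is what makes an interrupted run resumable: only pages without a file on disk trigger a new request.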
Key Design Decisions
| Decision | Rationale |
|---|---|
| Race time from lap times, not sector sums | Sectors are optional; lap time is mandatory. One missing sector no longer nullifies the entire race time. |
| Sector baseline = laps 2–N | Lap 1 (out-lap) is structurally different across championships — some omit S1. Excluding it makes sector analysis consistent. |
| Per-sector all-or-nothing validity | If any lap 2–N is missing a sector reading, that sector’s best is null. Prevents idealLapTime mixing partial and complete data. |
| idealLapTime capped at bestLapTime | Sectors can come from different laps; their sum can exceed the best single lap after incidents. Cap prevents misleading values. |
| Consensus-based ESK column correction | S1/S2 swap applied only when all first laps show the shift pattern, avoiding false positives. |
| NaN placed last in sort order | nan < x always returns False in Python. Two-tuple key (isnan, value) is deterministic. |
| No per-event subfolder in v2 output | Reduces folder depth; all events for a year sit flat in the year directory. |
| _JSON_VERSION constant | Single place to bump version; drives both folder name and JSON field. |
| Nationality as full country name | pycountry .name / .common_name is the canonical reference; more readable than ISO codes. |
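Three of the sector-related decisions (laps 2–N baseline, per-sector all-or-nothing validity, and the bestLapTime cap) combine roughly as in the sketch below. The function names and the lap dict shape (lapNumber, s1, s2, s3) are assumptions modelled on the JSON fields, not the project's actual code.

```python
from typing import Optional


def best_sector(laps: list[dict], key: str) -> Optional[float]:
    """Best value of one sector over laps 2..N; None if any of those laps lacks it."""
    values = [lap.get(key) for lap in laps if lap["lapNumber"] >= 2]
    if not values or any(v is None for v in values):
        return None  # all-or-nothing: one missing reading invalidates the sector
    return min(values)


def ideal_lap_time(laps: list[dict], best_lap_time: float) -> Optional[float]:
    """Sum of best sectors, capped at the best single lap; None if a sector is invalid."""
    bests = [best_sector(laps, key) for key in ("s1", "s2", "s3")]
    if any(b is None for b in bests):
        return None
    return round(min(sum(bests), best_lap_time), 3)
```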
Championship-Specific Parsing Rules
Column detection always uses header name, not position — column count and positions vary per championship and even per event.
| | FAC | ESK | IES | BKC | FWC |
|---|---|---|---|---|---|
| Position column | First token of Rnk | First token of Rnk | Unnamed first column | First token of Rnk | First token of Rnk No |
| Kart number | No. | No. | No. | No. | Second token of Rnk No |
| Positions gained | Second token of Rnk | Second token of Rnk | Separate Rnk column | Second token of Rnk | Not present |
| Official time | Gap from leader | Gap from leader | Gap from leader | Direct Time column | Gap from leader |
| Laps column | Clean | Merged into Equipment | Clean | Clean | Clean |
| Penalty format | +30.000 | +10.000 | +5.000 | +5.00 | +10.000 |
| DNF format | "X Laps" in Gap | "X Laps" in Gap | "X Laps" in Gap | "X Laps" in Gap | "X Laps" or "Retired" |
| Not Classified | No | No | No | Literal row + DSQ below | Literal row + "Retired" |
| Footnote pattern | "No.NNN ..." | "No.NNN ..." | "No.NNN ..." | "No.NNN ..." | "NoNNN ..." (no dot) |
Shared rules (all championships):
- DNF / non-finisher: Gap contains "Lap" or "Retired" → firstLapOffset = null
- Footnote row: first cell matches ^No\.?\d+ → strip row
- "Not Classified" literal row: stop processing; skip all rows below
- Penalty: strip leading +, parse as float; default 0.0 if empty; set offset null if unparseable
- Leader gap blank → treat as 0.0; all other positions with blank gap → None (parsing failure)
kartNo normalisation: pandas reads numeric columns as floats — always normalise with str(int(float(raw))) → "509"
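The shared rules and the kartNo normalisation can be sketched as small helpers. The helper names are hypothetical, cells are assumed to arrive as strings, and non-blank gaps are assumed to be plain decimal seconds; the real parser may differ on all three points.

```python
import re
from typing import Optional

# Footnote rows start with "No.NNN" or "NoNNN" (FWC omits the dot).
FOOTNOTE_RE = re.compile(r"^No\.?\d+")


def is_footnote_row(first_cell: str) -> bool:
    return bool(FOOTNOTE_RE.match(first_cell.strip()))


def is_non_finisher(gap: str) -> bool:
    # "X Laps" or "Retired" in the Gap column: firstLapOffset must be null.
    return "Lap" in gap or "Retired" in gap


def parse_penalty(raw: Optional[str]) -> Optional[float]:
    """Empty cell gives 0.0; '+5.000' gives 5.0; unparseable gives None."""
    text = (raw or "").strip()
    if not text:
        return 0.0
    try:
        return float(text.lstrip("+"))
    except ValueError:
        return None  # caller should then set the offset to null


def parse_gap(raw: Optional[str], is_leader: bool) -> Optional[float]:
    """Blank gap is 0.0 for the leader and None (parsing failure) for anyone else."""
    text = (raw or "").strip()
    if not text:
        return 0.0 if is_leader else None
    try:
        return float(text.lstrip("+"))  # assumption: decimal seconds
    except ValueError:
        return None


def normalise_kart_no(raw) -> str:
    # pandas may hand back 509.0 or "509.0"; the JSON wants "509".
    return str(int(float(raw)))
```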
Test Suites
| File | Covers |
|---|---|
| tests/test_output_json.py | v2 — parametrised over every *.json under data/out-json/v2/ |
| tests/test_output_json_v21.py | v2.1 — 11 tests |
| tests/test_output_json_v3.py | v3 — parametrised over data/out-json/v3/**/*.json |
| tests/test_v3_models.py | Unit tests for Pydantic v3 models |
| tests/test_v3_writer.py | Unit tests for JSONWriter |
The sector sum check (test_sector_sum_equals_lap_time) is skipped when lapTime > sector_sum × 1.5 — neutralised/pit laps where sectors cover only the on-track portion. The _NEUTRALISED_LAP_RATIO = 1.5 constant in the test guards against false failures.
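The skip condition can be expressed as a small predicate. Only the constant name and the 1.5 ratio come from the test suite as described; the function name and the lap dict shape are assumptions.

```python
_NEUTRALISED_LAP_RATIO = 1.5


def sector_sum_check_applies(lap: dict) -> bool:
    """True when a lap should be held to 'sector sum equals lap time'."""
    sectors = [lap.get(key) for key in ("s1", "s2", "s3")]
    if any(s is None for s in sectors):
        return False  # nothing to compare without all three sectors
    # Laps far longer than their sector sum are treated as neutralised/pit laps.
    return lap["lapTime"] <= sum(sectors) * _NEUTRALISED_LAP_RATIO
```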
The v3 test suite requires at least one parsed JSON to exist in data/out-json/v3/. It skips cleanly if the directory is empty.
Gotchas
tabula stream mode column drift: On continuation pages, S3 values can land in Unnamed columns instead of Sector 3. The fix in clean_df_columns scans Unnamed columns as a fallback — do not remove it.
Cached output files: unified_parser.py skips existing JSONs. Always delete the output file before reparsing to see code changes take effect.
_canonical_session_name must stay in sync: This function exists in both claude_parser.py and merger.py. Both must apply the same normalisation or lap data silently fails to merge. Strips Results, Lap Time Analysis, Lap Analysis, and category prefixes (e.g. Academy , KZ ).
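For orientation, a normalisation of this kind might look like the sketch below. It is not the project's implementation: the token lists contain only the examples named above, and treating category names as leading prefixes is an assumption.

```python
_PAGE_TOKENS = ("Lap Time Analysis", "Lap Analysis", "Results")
_CATEGORY_PREFIXES = ("Academy ", "KZ ")  # examples only; the real list may be longer


def canonical_session_name(raw: str) -> str:
    """Reduce a page title to a session name that matches across page types."""
    name = raw.strip()
    for prefix in _CATEGORY_PREFIXES:
        if name.startswith(prefix):
            name = name[len(prefix):]
            break
    for token in _PAGE_TOKENS:
        name = name.replace(token, "")
    return " ".join(name.split())  # collapse leftover whitespace
```

Keeping a single shared copy and importing it from both modules would remove the need to keep two definitions in sync, though that refactor is only a suggestion.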
NaN sort behaviour: nan < x always returns False regardless of x, and so does nan > x. Always use (math.isnan(v), v) as the sort key when NaN is a possible value; plain min() / sort() with NaN gives results that depend on input order.
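A short demonstration of the two-tuple key, using only the standard library. The helper name nan_last is illustrative.

```python
import math


def nan_last(value: float) -> tuple:
    """Sort key: real numbers first in ascending order, NaN values after them."""
    return (math.isnan(value), value)


times = [52.9, float("nan"), 52.341]
ordered = sorted(times, key=nan_last)   # NaN ends up in the last slot
fastest = min(times, key=nan_last)      # NaN can no longer win a min()
```

The first tuple element decides every comparison between a number and NaN, so the NaN value itself is never compared against a real number.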
Kart number exactly 6 chars: 'No.211' has 6 characters. The standard x[6:] slice returns ''. The driver name may appear in the adjacent Sector 2 column — the fallback to self.S2KEY handles this.
MM:SS lap times: Lap times over 60 s are displayed as MM:SS.sss (e.g. 1:08.783, which is 68.783 s). System prompt rule 12 instructs Claude to convert to decimal seconds. If a wet-weather session shows systematic sector sum mismatches, check whether lapTime values come out as roughly 1.0x (the minutes part read as seconds) instead of 6x seconds.
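The conversion itself is simple arithmetic. In the pipeline it is delegated to Claude via the system prompt, so the helper below is only a reference sketch (with a hypothetical name) that can be used to sanity-check suspicious values.

```python
def to_seconds(text: str) -> float:
    """Convert 'M:SS.sss' or plain 'SS.sss' to decimal seconds."""
    minutes, sep, seconds = text.strip().rpartition(":")
    total = float(seconds)
    if sep:  # a colon was present, so there is a minutes part
        total += int(minutes) * 60
    return round(total, 3)
```

For example, to_seconds("1:08.783") gives 68.783, while a value stored as about 1.08 would indicate the misread described above.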
Multi-column lap truncation: LAP_ANALYSIS pages use a 3-column layout. Drivers whose lap block starts near the bottom of a column can have continuation rows dropped. Affected drivers have short raceTime and suppressed firstLapOffset. A full reparse is required to fix this.
firstLapOffset reliability guard: If the leader’s lap count differs from the modal lap count by even 1, all firstLapOffset values for that session are suppressed. The most common cause is multi-column lap truncation above.
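A sketch of the guard, assuming lap counts are available as a list of integers per session. The function name is hypothetical, and ties for the modal value are resolved by first occurrence, which may differ from the real code.

```python
from collections import Counter


def first_lap_offsets_reliable(lap_counts: list[int], leader_laps: int) -> bool:
    """False when the leader's lap count differs from the modal lap count."""
    if not lap_counts:
        return False
    modal_laps = Counter(lap_counts).most_common(1)[0][0]
    return leader_laps == modal_laps
```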
Session order is reverse-chronological in FAC PDFs: claude_parser.py reverses the sessions list before numbering so session numbers ascend in race-day order. Different PDF ordering in other championships will produce inverted numbers.
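The reverse-then-number step can be sketched as follows; the function name and the session dict shape are assumptions.

```python
def number_sessions(sessions_in_pdf_order: list[dict]) -> list[dict]:
    """Reverse the PDF order, then assign sessionNumber 1..N in race-day order."""
    race_day_order = list(reversed(sessions_in_pdf_order))
    return [{**session, "sessionNumber": index}
            for index, session in enumerate(race_day_order, start=1)]
```

Because the reversal is unconditional, a PDF that already lists sessions chronologically would come out numbered backwards, which is the failure mode noted above.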
Combined QP overview page: FAC events include a Qualifying Practice (QP) session aggregating both QPS1 and QPS2. _is_standings_session in claude_parser.py detects '(qp)' in the lowercased name and skips it. Do not remove this check.
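The check amounts to a substring test on the lowercased name. The sketch below uses a hypothetical function name, and the session names in the usage lines are invented examples rather than names taken from real PDFs.

```python
def looks_like_combined_qp(session_name: str) -> bool:
    """True for the aggregated Qualifying Practice overview, e.g. '... (QP)'."""
    return "(qp)" in session_name.lower()


skip_overview = looks_like_combined_qp("Qualifying Practice (QP)")    # True
keep_split = looks_like_combined_qp("Qualifying Practice (QPS1)")     # False
```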
Match-rate guard: Before computing firstLapOffset, the parser checks how many classification driver names appear in sector analysis. If fewer than 50% match, firstLapOffset is suppressed for all entries. This catches old FAC multi-heat combined tables (38 drivers vs a 12-driver per-heat sector analysis).
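A sketch of the guard, assuming names have already been normalised to comparable strings. The function name and the threshold parameter are illustrative; only the 50% cut-off comes from the description.

```python
def match_rate_ok(classification_names: list[str], sector_names: list[str],
                  threshold: float = 0.5) -> bool:
    """False when fewer than `threshold` of classified drivers appear in sector analysis."""
    if not classification_names:
        return False
    known = set(sector_names)
    matched = sum(1 for name in classification_names if name in known)
    return matched / len(classification_names) >= threshold
```

In the combined-table case, 12 matches out of 38 classified drivers is about 32%, so the guard suppresses firstLapOffset for the whole session.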
Old per-event subfolder JSONs: These still exist alongside new flat-structure v2 JSONs and have stale data. Tests pick up both. Consider deleting the subfolder files and reparsing cleanly.
Nationality failures (FAC 2015–2021): OCR-based nationality detection fails for these older PDFs. Affects ~20 events. Not yet investigated. In v3, <UNKNOWN> nationality is returned by Claude when it cannot read a nationality flag — the test suite skips any nationality value starting with <.
Next Steps
Claude Vision pipeline:
1. Parse remaining FAC 2024 PDFs: 240620_R_FAC_SVK_ACA.pdf and 240801_R_FAC_KRI_ACA.pdf
2. Verify multi-column lap truncation fix after next reparse (system prompt rule 11 strengthened)
3. Confirm wet-session sector sum mismatches resolved after MM:SS conversion fix (system prompt rule 12)
v2.1 pipeline:
4. Investigate firstLapOffset outliers in firstLapOffset_outliers.csv — 284 entries outside [-5, 15]; large outliers (101.446, -32.382) pending investigation
5. Investigate 25 failing test_classification_driver_count_matches_sector_analysis cases (driver count off by 1–4)
6. Implement classification extraction for ESK, BKC, FWC — _extract_classification_df already written; calibrate results_area in each YAML first using scripts/probe_results_area.py
7. Add results_area coordinates to ESK, BKC, FWC YAMLs (probe script currently set to ESK)
General:
8. Delete old per-event subfolder JSONs and reparse all championships cleanly
9. Investigate FIA Academy nationality detection failure for 2015–2021 events