Let me trace through the relevant pipeline code.
Let me also check the Lua writer and talent converter.
Here's the full flow:
There are two sources, with a fallback chain:
(rankings_fetcher.py:130-154) When fetching character rankings from WarcraftLogs, each ranking entry may include a talentImportCode field in combatantInfo. The _extract_talent_data function checks for it there first, then falls back to a top-level talentImportCode on the entry. This is the Blizzard-format base64 string the WoW client can consume.
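A minimal sketch of that two-step lookup, assuming each ranking entry is a plain dict. The function name extract_import_code is hypothetical and is not the real _extract_talent_data:

```python
from typing import Any, Optional


def extract_import_code(entry: dict[str, Any]) -> Optional[str]:
    """Prefer combatantInfo.talentImportCode, then the top-level field."""
    combatant_info = entry.get("combatantInfo") or {}
    # Defensive assumption: combatantInfo might be missing or not a dict.
    if isinstance(combatant_info, dict):
        code = combatant_info.get("talentImportCode")
        if code:
            return code
    # Fallback: a top-level talentImportCode on the entry itself.
    return entry.get("talentImportCode") or None
```

An entry carrying both fields resolves to the combatantInfo value, and an entry with neither yields None.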
Separately, each entry also has a talents array (list of {talentID, points} dicts). This gets hashed via talent_converter.py:create_talent_hash into a deterministic string like 12345:1|67890:2|... for grouping identical builds — this hash is not a valid import code.
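To illustrate the hashing idea, here is a sketch that assumes the hash is just the sorted talentID:points pairs joined with a pipe. The real create_talent_hash may order or format things differently; talent_hash below is a stand-in name:

```python
def talent_hash(talents: list[dict[str, int]]) -> str:
    """Build a deterministic grouping key from {talentID, points} dicts."""
    # Sorting makes the key independent of the order the API returned the talents in
    # (sorting is an assumption about how determinism is achieved).
    pairs = sorted((t["talentID"], t["points"]) for t in talents)
    return "|".join(f"{talent_id}:{points}" for talent_id, points in pairs)
```

Two players with the same talents in a different order get the same key, which is all the grouping step needs.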
(talent_extractor.py) If a player's ranking entry has no talentImportCode but does have a report_code, the aggregator can fetch the actual combat report and extract the talent code from the fight data (reportData.report.fights[].talentImportCode). This happens in talent_aggregator.py:53-59, but only for the most popular build, and only if none of its players had an import code.
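Roughly, the extraction from a fetched report could look like the sketch below. It assumes the response has already been parsed into a dict rooted at reportData; import_code_from_report is a hypothetical name and the real talent_extractor.py may handle more cases:

```python
from typing import Any, Optional


def import_code_from_report(payload: dict[str, Any]) -> Optional[str]:
    """Return the first non-empty fights[].talentImportCode in a report payload."""
    report = (payload.get("reportData") or {}).get("report") or {}
    for fight in report.get("fights") or []:
        code = fight.get("talentImportCode")
        if code:
            return code
    # No fight carried an import code.
    return None
```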
talent_aggregator.py groups players by their talent_hash (not the import code), counts occurrences, and picks the most popular. It then grabs the talent_import_code from any player in that group who has one. If nobody in the top build has a code, it triggers the fallback extractor. Builds without any import code are skipped entirely (line 69).
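The grouping step can be sketched like this, assuming each player is a dict with a talent_hash and an optional talent_import_code. The dict shape, the popularity-as-fraction definition, and the name pick_top_build are illustrative assumptions:

```python
from collections import Counter
from typing import Optional


def pick_top_build(players: list[dict]) -> Optional[dict]:
    """Group by talent_hash, pick the most common build, borrow any import code."""
    counts = Counter(p["talent_hash"] for p in players)
    if not counts:
        return None
    top_hash, top_count = counts.most_common(1)[0]
    # Take the import code from any player in the winning group that has one.
    code = next(
        (
            p["talent_import_code"]
            for p in players
            if p["talent_hash"] == top_hash and p.get("talent_import_code")
        ),
        None,
    )
    return {
        "talent_hash": top_hash,
        "sample_size": len(players),
        "popularity": top_count / len(players),  # assumed definition
        "talent_import_code": code,  # None means the fallback extractor is needed
    }
```

When talent_import_code comes back as None, that is the point where the report-based fallback would kick in.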
At the very end of the pipeline run — __main__.py:152:
This happens once, after all specs and both content types (raid + M+) are fully processed. lua_writer.py takes the list of SpecBuildResult objects, groups them by spec_id, and writes AzerothsTopTalent/TalentData.lua with entries keyed by numeric specID, each containing a talentCode string, sampleSize, and popularity per content type.
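For a sense of what the writer produces, here is a rough sketch that renders result dicts into Lua source. The global table name TalentData, the dict keys, and the exact layout are assumptions; the real lua_writer.py works on SpecBuildResult objects and may format the file differently:

```python
def render_talent_data(results: list[dict]) -> str:
    """Render result dicts as a Lua table keyed by numeric specID."""
    by_spec: dict[int, dict[str, dict]] = {}
    for result in results:
        by_spec.setdefault(result["spec_id"], {})[result["content_type"]] = result

    lines = ["TalentData = {"]  # hypothetical global name
    for spec_id in sorted(by_spec):
        lines.append(f"    [{spec_id}] = {{")
        for content_type, result in sorted(by_spec[spec_id].items()):
            lines.append(f"        {content_type} = {{")
            lines.append(f'            talentCode = "{result["talent_code"]}",')
            lines.append(f"            sampleSize = {result['sample_size']},")
            lines.append(f"            popularity = {result['popularity']},")
            lines.append("        },")
        lines.append("    },")
    lines.append("}")
    return "\n".join(lines) + "\n"
```

Because the whole file is regenerated from the result list, calling this after every spec is cheap and always leaves a valid file.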
Good concern. The pipeline processes ~40 specs × 2 content types, making hundreds of API calls. If it dies at spec 35, you lose everything.
A few options, from simplest to most involved:
1. JSON checkpoint file (recommended)
After each spec+content_type is processed, append the SpecBuildResult to a JSON file on disk. On restart, load existing results and skip already-completed specs. This gives you:
2. Write Lua incrementally
Rewrite the Lua after each spec completes. Downside: you're rewriting the whole file each time, but the file is small so that's fine. Upside: you always have a usable (partial) output.
3. SQLite
Full relational store for rankings + builds. Overkill for this use case: sqlite3 ships with Python, so it isn't a new dependency, but you'd be adding schema management for what's essentially a list of ~80 results.
My recommendation: option 1 + 2 combined. Save a JSON cache as the source of truth for resume capability, and also rewrite the Lua after each spec so you always have a usable output. The JSON cache acts as your checkpoint — on startup, load it, figure out what's already done, and skip those specs. Something like:
data_pipeline/
  cache/
    results.json      # incremental checkpoint, list of SpecBuildResult dicts
AzerothsTopTalent/
  TalentData.lua      # rewritten after each spec completes
The key pieces would be:
- Serialize SpecBuildResult to/from JSON (add to_dict/from_dict to models)
- After each _process_spec, append to the cache and rewrite Lua
- On startup, skip any (spec_id, content_type) already done
- A --fresh flag to ignore the cache and start over

Want me to plan out the implementation?
Let me dig into the parts I haven't read yet — the models and config — to understand the serialization surface and where the output path lives.
Good, I've got the full picture now. The models are simple frozen dataclasses, 39 specs total, and the main loop iterates spec-by-spec with two content types each. Let me ask a clarifying question:
Got it — the cache will store its config context and auto-invalidate when zone IDs or top_n change.
Good. Now let me lay out the approaches:
Approach 1: A single .pipeline_cache.json at the project root. Structure:
How it works:
- On startup, load the cache; if its config doesn't match current settings, discard it.
- Skip (spec_id, content_type) pairs already completed.
- Rewrite TalentData.lua after each spec so you always have a usable partial output.
- A --fresh flag to ignore/delete the cache.

Pros: Simple, no new dependencies, human-readable, easy to debug.
Cons: Not atomic (crash mid-write could corrupt), but easily mitigated with write-to-temp-then-rename.
Approach 2: A .pipeline_cache.db with a results table.
Pros: Atomic writes built-in, queryable.
Cons: Overkill — we're storing ~80 rows of simple data. Adds sqlite3 usage complexity for no real gain. Harder to inspect.
Approach 3: Write cache/250_raid.json, cache/250_mythicplus.json, etc. Resume = glob for existing files.
Pros: No corruption risk (each file is independent), trivially parallel-safe.
Cons: Lots of small files, more filesystem clutter, harder to inspect overall state at a glance.
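For comparison, resuming from per-result files would be something like the sketch below. It assumes file names follow the specid_contenttype.json pattern from the example; completed_keys is a hypothetical helper:

```python
from pathlib import Path


def completed_keys(cache_dir: Path) -> set[tuple[int, str]]:
    """Derive finished (spec_id, content_type) pairs from the file names alone."""
    done: set[tuple[int, str]] = set()
    for path in cache_dir.glob("*_*.json"):
        spec_id, _, content_type = path.stem.partition("_")
        # Ignore anything that doesn't look like <digits>_<content_type>.json.
        if spec_id.isdigit() and content_type:
            done.add((int(spec_id), content_type))
    return done
```

The directory listing is the checkpoint here, which is why there is no single file to corrupt, and also why the overall state is harder to read at a glance.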
My recommendation is Approach 1 — a single JSON file with atomic writes (temp + rename). It's the right level of complexity for ~80 result entries. Does this direction feel right, or do you prefer one of the others?
Here's the design. I'll present it section by section.
.pipeline_cache.json at project root. Structure:
The config block stores the parameters that affect results. On startup, if current config doesn't match, the cache is discarded and the user is informed.
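As a sketch of the shape and the validity check: the key names (raid_zone_id, mythicplus_zone_id, top_n, and the result fields) and all values below are placeholders chosen for illustration, not the pipeline's real ones.

```python
# Hypothetical cache layout; every key name and value here is a placeholder.
example_cache = {
    "config": {"raid_zone_id": 1, "mythicplus_zone_id": 2, "top_n": 100},
    "results": [
        {
            "spec_id": 250,
            "content_type": "raid",
            "talent_code": "<import code>",
            "sample_size": 100,
            "popularity": 0.5,
        },
    ],
}


def cache_matches(cache: object, current_config: dict) -> bool:
    """True only when the stored config equals the current config exactly."""
    return isinstance(cache, dict) and cache.get("config") == current_config
```

Comparing the whole config dict means any new parameter added to it later invalidates old caches automatically.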
cache.py
A new data_pipeline/cache.py with three responsibilities:
load_cache(config) -> tuple[list[SpecBuildResult], set[tuple[int, str]]] — Loads the cache file, validates config, returns existing results and the set of (spec_id, content_type) keys already done. Returns empty if cache is missing, corrupt, or config-mismatched.
save_cache(config, results) — Serializes results + config to JSON. Writes to a temp file first, then os.replace() to the final path (atomic on POSIX).
clear_cache() — Deletes the cache file.
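A minimal sketch of those three functions. To keep it self-contained it stores plain dicts rather than SpecBuildResult objects and takes an extra path parameter; both are assumptions for illustration, as are the result key names:

```python
import json
import os
import tempfile
from pathlib import Path

CACHE_PATH = Path(".pipeline_cache.json")


def load_cache(config: dict, path: Path = CACHE_PATH) -> tuple[list[dict], set[tuple[int, str]]]:
    """Return (results, done_keys); empty if missing, corrupt, or config-mismatched."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
        if data["config"] != config:
            return [], set()
        results = list(data["results"])
        done = {(r["spec_id"], r["content_type"]) for r in results}
    except (OSError, ValueError, KeyError, TypeError):
        return [], set()
    return results, done


def save_cache(config: dict, results: list[dict], path: Path = CACHE_PATH) -> None:
    """Write to a temp file in the same directory, then swap it into place."""
    fd, tmp_name = tempfile.mkstemp(dir=path.resolve().parent, suffix=".tmp")
    try:
        # os.fdopen takes ownership of fd; the with-block closes it exactly once.
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump({"config": config, "results": results}, handle, indent=2)
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temp file; it is already gone if os.replace succeeded.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


def clear_cache(path: Path = CACHE_PATH) -> None:
    """Delete the cache file if it exists."""
    path.unlink(missing_ok=True)
```

Creating the temp file next to the target keeps the rename on one filesystem, which is what the atomicity of the swap depends on.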
Serialization uses simple to_dict/from_dict classmethods added to SpecBuildResult and TalentBuild in models.py.
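Something along these lines, with the caveat that the field names below are guesses for illustration; the real dataclasses in models.py may carry different fields:

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TalentBuild:
    # Hypothetical fields, chosen to mirror the Lua output.
    talent_code: str
    sample_size: int
    popularity: float

    def to_dict(self) -> dict:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> "TalentBuild":
        return cls(**data)


@dataclass(frozen=True)
class SpecBuildResult:
    # Hypothetical fields; the nesting of TalentBuild is an assumption.
    spec_id: int
    content_type: str
    build: TalentBuild

    def to_dict(self) -> dict:
        # asdict recurses into the nested TalentBuild dataclass.
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> "SpecBuildResult":
        return cls(
            spec_id=data["spec_id"],
            content_type=data["content_type"],
            build=TalentBuild.from_dict(data["build"]),
        )
```

Frozen dataclasses compare by value, so a round trip through JSON can be checked with a plain equality test.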
In __main__.py, the run() function changes to:
Key change: _process_spec currently returns results for both content types at once. We need to either split it so we can cache per content_type, or cache at the spec level. Since the inner loop is already for content_type, zone in ..., the cleanest approach is to move the content_type loop out of _process_spec and into run(), so each iteration processes one (spec, content_type) and can be individually cached and skipped.
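In outline, the restructured loop could look like this sketch. The callables process_one and on_result are hypothetical stand-ins for the per-(spec, content_type) processing and for the save-cache-plus-rewrite-Lua step:

```python
from typing import Callable, Iterable, Optional


def run_incremental(
    spec_ids: Iterable[int],
    content_types: Iterable[str],
    process_one: Callable[[int, str], Optional[dict]],
    results: list[dict],
    done: set[tuple[int, str]],
    on_result: Callable[[list[dict]], None],
) -> list[dict]:
    """Process one (spec, content_type) at a time, skipping cached pairs."""
    content_types = list(content_types)
    for spec_id in spec_ids:
        for content_type in content_types:
            key = (spec_id, content_type)
            if key in done:
                continue  # already in the cache from an earlier run
            result = process_one(spec_id, content_type)
            if result is None:
                continue  # no usable build; nothing to checkpoint
            results.append(result)
            done.add(key)
            on_result(results)  # checkpoint after every completed pair
    return results
```

With the content_type loop at this level, a crash loses at most the pair that was in flight.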
--fresh flag: Calls clear_cache() before starting, forces a full run.
--class / --spec / --encounter filters work as before; they just narrow which specs are processed. Cached results for other specs are preserved.
Add .pipeline_cache.json to .gitignore.
Does this design look right? Anything you'd change before I write it up and move to implementation planning?
Design doc committed. Now let me invoke the writing-plans skill to create the implementation plan.
I'm using the writing-plans skill to create the implementation plan.
Let me check a few more things before writing the plan.
Now I have everything I need. Let me write the implementation plan.
Plan complete and saved to docs/plans/2026-02-16-incremental-cache-plan.md. Two execution options:
1. Subagent-Driven (this session) — I dispatch a fresh subagent per task, review between tasks, fast iteration
2. Parallel Session (separate) — Open a new session with executing-plans, batch execution with checkpoints
Which approach?
I'm using Subagent-Driven Development to execute this plan.
Let me read the prompt templates and extract the tasks.
Now let me set up the task tracking and extract the plan context.
Let me grab the base SHA before we start, then dispatch the first implementer.
Dispatching implementer for Task 1.
Task 1 implemented. Let me dispatch the spec reviewer.
Spec review passed. Now dispatching code quality reviewer.
Task 1 passed both reviews. Moving to Task 2.
Dispatching implementer for Task 2.
Task 2 implemented. Dispatching spec reviewer.
Spec review passed with a minor bug noted in the error handler. Let me fix that before the code quality review — it's a real issue.
Fixed the fd close bug. Now dispatching code quality reviewer for Task 2.
Code quality review approved. The reviewer noted two improvements: test isolation with tmp_path and docstrings. The test isolation point is valid — let me make a quick fix before moving on.
All 9 tests pass with isolated temp directories. Let me amend and move on.
Dispatching implementer for Task 3 — the main loop refactor.
Task 3 implemented. Dispatching spec reviewer.
Spec review passed. Dispatching code quality reviewer.