see docs/plans/2026-03-07-india-trade-revamp.md. start a new project in this folder (call it india-trade-analysis). use uv. start with the scraper. prefer to not use selenium if possible. https://tradestat.commerce.gov.in/eidb/commodity_wise_expor
Let me explore the website to understand how it works, so we can avoid Selenium/Playwright if possible.
Let me dig deeper into the site's actual API by examining the page source and JavaScript.
use cdp
The TRADESTAT page is already open. Let me inspect the form structure and network behavior.
This is a standard HTML form with a CSRF token — no need for Selenium/Playwright! We can use httpx with a session. Let me submit the form via the browser first to see the response format.
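As a rough illustration of the "form plus CSRF token" pattern (not the code run in this session), the token can be pulled out of the form HTML with the standard library alone. This sketch assumes the hidden field is named `_token`; the sample markup is made up.

```python
from html.parser import HTMLParser


class TokenFinder(HTMLParser):
    """Collects the value of the first <input name="_token"> on the page."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        if tag != "input" or self.token is not None:
            return
        fields = dict(attrs)
        if fields.get("name") == "_token":
            self.token = fields.get("value")


def extract_token(html: str) -> str:
    """Return the CSRF token, or raise if the form does not contain one."""
    finder = TokenFinder()
    finder.feed(html)
    if finder.token is None:
        raise ValueError("no _token field found in form HTML")
    return finder.token


# Hypothetical form markup, for illustration only.
sample = (
    '<form method="post">'
    '<input type="hidden" name="_token" value="abc123">'
    '<select name="year"></select>'
    "</form>"
)
print(extract_token(sample))
```

The extracted value would then be sent back as a form field alongside the session cookie on the POST.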
The results are showing! Let me see the table data:
It's paginated, showing 10 of 98 entries. Now let me check if we can do this purely with HTTP requests — let me try a direct POST with httpx:
TLS issue. Let me check if curl works and what TLS the server needs:
Curl works fine. The server uses TLS 1.3. The issue might be Python 3.14 SSL. Let me try with requests and a specific Python version:
All 98 rows in a single response — no pagination needed with HTTP requests. Now let me also check the country-wise endpoint:
Now let me get the full country list:
Now let me verify the country-wise endpoint works with direct HTTP too:
The server is finicky with SSL. The first attempt worked but now it's rejecting. Let me try using curl_cffi which provides browser-like TLS fingerprints:
curl_cffi works reliably. Now let me verify the country-wise bilateral endpoint too:
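A hedged sketch of what a curl_cffi request with browser impersonation might look like. The form field names other than `_token` are hypothetical placeholders, and the URL is left to the caller; this is not the exact code from the session.

```python
def build_form(token: str, country_code: str, year: str) -> dict[str, str]:
    """Assemble the POST body. Only _token is a known field name; the
    others are hypothetical placeholders for the real form fields."""
    return {"_token": token, "country": country_code, "year": year}


def make_session():
    """Create a curl_cffi session that mimics Chrome's TLS fingerprint."""
    # Imported lazily so build_form() is usable without the dependency.
    from curl_cffi import requests

    return requests.Session(impersonate="chrome")


def post_form(session, url: str, form: dict[str, str]) -> str:
    """POST the form on an existing session so cookies and token match."""
    response = session.post(url, data=form, timeout=30)
    response.raise_for_status()
    return response.text
```

Using one session for both the initial GET and the POST matters here, because the CSRF token is only valid together with the session cookie it was issued with.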
The server is rate-limiting. Let me test with a delay between requests:
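One way to pace requests, sketched under assumptions: a fixed pause between consecutive requests plus exponential backoff on failure. The 2-second delay and retry count are illustrative guesses, and `fetch` is a hypothetical callable standing in for the real HTTP call.

```python
import time
from typing import Callable, Iterable


def fetch_all(
    jobs: Iterable[str],
    fetch: Callable[[str], str],
    delay: float = 2.0,
    retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, str]:
    """Fetch each job in turn, pausing between requests and backing off on errors."""
    results: dict[str, str] = {}
    for i, job in enumerate(jobs):
        if i:
            sleep(delay)  # fixed pause between consecutive requests
        for attempt in range(retries + 1):
            try:
                results[job] = fetch(job)
                break
            except Exception:
                if attempt == retries:
                    raise  # out of retries: surface the error to the caller
                sleep(delay * 2 ** (attempt + 1))  # back off before retrying
    return results
```

Passing `sleep` as a parameter keeps the function testable without real waiting.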
Now let me check the more efficient endpoint — commodity_wise_all_countries_export:
Good — this endpoint takes an HS code, year, and value unit, and returns all countries for that commodity. Let me test it:
yo chill dawg. what does the plan say regarding data to collect?
You're right, let me step back. The plan says:
Data to collect:
Target schema:
| Column | Type |
|---|---|
| hs_code | int8 (1–96) |
| commodity | category |
| country | category |
| financial_year | category |
| value_usd_mn | float32 |
Strategy: Country-first iteration → for each country, get all HS2 commodities for that country-year. ~3,514 requests (251 countries × 7 years × 2 trade types).
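To make the size of the country-first grid concrete, the request list can be enumerated as a Cartesian product. The identifiers below are placeholders, not the real country codes or financial-year labels.

```python
from itertools import product

# Placeholder identifiers; the real country codes and year labels are not shown here.
countries = [f"country_{i}" for i in range(251)]
years = [f"year_{i}" for i in range(7)]
trade_types = ["export", "import"]

# One task per (trade type, year, country) combination.
tasks = list(product(trade_types, years, countries))
print(len(tasks))  # 251 * 7 * 2
```

Each tuple corresponds to a single POST to the country-wise endpoint.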
We've confirmed the key endpoint: country_wise_all_commodities_{export,import} works via plain HTTP POST with curl_cffi (no Selenium needed).
Want me to now set up the uv project and build the scraper?
yes start by setting up the project. i can tweak the specifics
Now let me get the full country list from the browser and build the config:
Good. Now let me build the project files:
Now let's test it:
It works! But the 419 errors (CSRF token expiry) show we need to refresh the token more aggressively. The retry logic handles it, but let me optimize — refresh the token after every POST since the server issues a new one in each response:
Now update the caller to use the returned token:
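A minimal sketch of the rotation idea, not the project's actual code: each POST returns HTML that carries the next token, and the caller threads it into the following request. The regex assumes `name` comes directly before `value` in the hidden input, and `post` is a hypothetical callable.

```python
import re
from typing import Callable, Iterable

# Assumes the attribute order name="_token" value="..." in the hidden input.
TOKEN_RE = re.compile(r'name="_token"\s+value="([^"]+)"')


def scrape_all(
    jobs: Iterable[str],
    token: str,
    post: Callable[[str, str], str],
) -> dict[str, str]:
    """POST each job with the current token, then adopt the token in the response."""
    pages: dict[str, str] = {}
    for job in jobs:
        html = post(job, token)
        pages[job] = html
        match = TOKEN_RE.search(html)
        if match:
            token = match.group(1)  # rotate: the next request uses the fresh token
    return pages
```

If a response has no token (for example an error page), the previous token is kept and the retry logic can decide whether to re-fetch the form.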
Let's re-run the test to confirm the token rotation eliminates the 419s:
Process already exited and log is empty. Something wrong with the module path. Let me debug:
Now let's re-run the test:
No more 419 errors with token rotation. Let's verify the output:
Everything looks clean. Let me check the final project structure:
Here's what's set up:
Project: india-trade-analysis/
india-trade-analysis/
├── scraper/
│   ├── __init__.py
│   ├── __main__.py      # python -m scraper entry point
│   ├── config.py        # 251 countries, 7 years, endpoints, form fields
│   └── scrape.py        # HTTP scraper using curl_cffi (no Selenium)
├── data/
│   ├── raw/             # for failed.json
│   └── processed/       # exports.parquet, imports.parquet
├── analysis/            # (empty, ready for notebooks)
├── pyproject.toml       # uv project, deps: curl-cffi, lxml, pandas, pyarrow
└── .python-version      # 3.13
Key design choices:
curl_cffi with Chrome TLS impersonation: no Selenium/Playwright needed. The site is just a standard HTML form with CSRF tokens.
Token rotation: each response carries a fresh _token; we extract and reuse it, eliminating 419 errors.
Usage:
The test run (6 requests) completed in ~20s with 422 rows, all correct dtypes, no failures. Ready for you to tweak and run the full scrape.
what are you doing with the raw country level data?
Looking at the plan:
Schema:
hs_code,commodity,country,financial_year,value_usd_mn
The country_wise_all_commodities endpoint returns a table with columns:
S.No. | HSCode | Commodity | <prev_year> | <curr_year> | %Growth
Right now parse_table takes column index 4 (the current year value) and throws away the previous year's value. That's wasteful — we're making 3,514 requests but only keeping half the data the server gives us.
We could grab both year columns from each response, which would give us overlapping coverage and let us cross-validate values. But it also means deduplication logic.
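A sketch of what capturing both year columns could involve, assuming each table row has already been split into six cell strings in the column order listed above. The function names, year labels, and sample values are all hypothetical.

```python
def rows_to_records(rows, country, prev_year, curr_year):
    """Turn each table row into up to two records, one per year column."""
    records = []
    for cells in rows:
        _sno, hs_code, commodity, prev_val, curr_val, _growth = cells
        for year, raw in ((prev_year, prev_val), (curr_year, curr_val)):
            if not raw.strip():
                continue  # blank cell: no value reported for that year
            records.append(
                {
                    "hs_code": int(hs_code),
                    "commodity": commodity,
                    "country": country,
                    "financial_year": year,
                    "value_usd_mn": float(raw.replace(",", "")),
                }
            )
    return records


def dedupe(records):
    """Keep the first record seen for each (hs_code, country, financial_year)."""
    seen = {}
    for record in records:
        key = (record["hs_code"], record["country"], record["financial_year"])
        seen.setdefault(key, record)
    return list(seen.values())


# Made-up rows from two overlapping requests for the same country.
newer = rows_to_records(
    [["1", "09", "COFFEE, TEA, MATE AND SPICES", "10.50", "1,200.25", "14.29"]],
    "EXAMPLELAND", "2022-2023", "2023-2024",
)
older = rows_to_records(
    [["1", "09", "COFFEE, TEA, MATE AND SPICES", "8.00", "10.50", "31.25"]],
    "EXAMPLELAND", "2021-2022", "2022-2023",
)
merged = dedupe(newer + older)
print(len(merged))
```

The overlapping year appears in both responses, so comparing the two values before dropping one would give the cross-validation mentioned above.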
Alternatively — the simpler question: should I just keep doing what we're doing (one value per request, matching the plan's schema), or do you want to capture both years from each response?
what are you doing with the raw country level data? how is the scraper set up to track progress? are we throwing away the raw country level data after converting to parquet?
Good catches — three real problems with the current setup:
No raw data saved. We parse the HTML table in-memory and discard the response. If the parsing logic has a bug (wrong column index, edge case HTML), there's no way to re-parse without re-scraping all 3,514 pages.
No progress tracking / checkpointing. Everything accumulates in a list[dict] in memory. If the process dies at request 2,000, all 2,000 results are lost. There's no resume — you start over.
Single write at the end. save_results() only runs after the entire loop finishes. A 3,514-request scrape at ~2s each is ~2 hours — one crash and it's all gone.
What I'd suggest instead:
Save raw HTML to data/raw/{trade_type}/{year}/{country}.html, which lets us re-parse later without re-scraping.
Want me to rework the scraper with this approach?
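For concreteness, a minimal sketch of that raw-file layout with resume-by-existence, under assumptions: `fetch` is a hypothetical callable returning the response HTML, and a file on disk is treated as proof that the request is done.

```python
from pathlib import Path
from typing import Callable


def raw_path(root: str, trade_type: str, year: str, country: str) -> Path:
    """Location of the saved response for one (trade type, year, country)."""
    return Path(root) / trade_type / year / f"{country}.html"


def fetch_once(
    root: str,
    trade_type: str,
    year: str,
    country: str,
    fetch: Callable[[str, str, str], str],
) -> bool:
    """Fetch and store one page unless it is already on disk.

    Returns True if a request was made, False if the page was skipped.
    """
    path = raw_path(root, trade_type, year, country)
    if path.exists():
        return False  # already scraped in an earlier run
    html = fetch(trade_type, year, country)
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(html, encoding="utf-8")
    tmp.replace(path)  # rename last, so a crash never leaves a partial .html file
    return True
```

With this shape, re-running the scraper after a crash only requests the pages that are still missing, and parsing to parquet becomes a separate step over the saved files.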
yes - please plan the hardening of the scraper and make it really modular. i think an approach where i can specify country + year and it gets dispatched as a task could work? brainstorm with me