Set Up Modular Scraper for India Trade Data

Summary

see docs/plans/2026-03-07-india-trade-revamp.md. start a new project in this folder (call it india-trade-analysis). use uv. start with the scraper. prefer to not use selenium if possible. https://tradestat.commerce.gov.in/eidb/commodity_wise_expor

Read 2026-03-07-india-trade-revamp.md

Read SKILL.md

Now let me look at the website to understand its structure. The user wants me to avoid Selenium if possible. Let me first check the URL they provided and see if we can use httpx/requests instead of Playwright/Selenium.

Let me read the skill for tavily-extract to get the page content, or let me use the web-browser skill to inspect the page. Actually, let me first try to extract content from the URL to understand the form structure.

Let me explore the website to understand how it works, so we can avoid Selenium/Playwright if possible.

Read SKILL.md

I can see the site structure. The form has Year, Commodity, HSCode, and Values dropdowns. The page uses JS-driven form submission. Let me try to see the actual network requests by looking at the page source, specifically the JavaScript. Let me try to extract the page source and JS files to understand the API endpoints.

Let me use the web-browser skill or directly try making HTTP requests to see if we can avoid Selenium/Playwright.

Let me dig deeper into the site's actual API by examining the page source and JavaScript.

Read SKILL.md

use cdp

Read SKILL.md

The TRADESTAT page is already open. Let me inspect the form structure and network behavior.

Now I can see the form structure. It's a standard POST form with:

_token (CSRF token)
EidbYearCwe (year value like "2024")
comType (radio: "all")
EidbComLevelCwe (digit level: "2", "4", "6", "8")
Eidb_hscodeCwe (specific HS code)
Eidb_ReportCwe (value unit: "2" for USD Million, "1" for ₹ Crore, "3" for Quantity)

Since it uses a CSRF token and POSTs to the same URL, I could potentially use httpx or requests instead of Playwright by extracting the token from a GET request first, then submitting the form data directly. Let me test submitting through the browser to see what the response looks like and whether a direct HTTP approach would work.

This is a standard HTML form with a CSRF token — no need for Selenium/Playwright! We can use httpx with a session. Let me submit the form via the browser first to see the response format.

The results are showing! Let me see the table data:

It's paginated, showing 10 of 98 entries. Now let me check if we can do this purely with HTTP requests — let me try a direct POST with httpx:

># Try a direct httpx request to see if we can skip the browser
python3 -c "
import httpx

# First GET to obtain session cookie + CSRF token
client = httpx.Client(follow_redirects=True, timeout=30)
r = client.get('https://tradestat.commerce.gov.in/eidb/commodity_wise_export')
print('Status:', r.status_code)
print('Cookies:', dict(client.cookies))

# Extract CSRF token
import re
m = re.search(r'name=\"_token\" value=\"([^\"]+)\"', r.text)
token = m.group(1) if m else 'NOT FOUND'
print('Token:', token)

# Now POST
data = {
    '_token': token,
    'EidbYearCwe': '2024',
    'comType': 'all',
    'EidbComLevelCwe': '2',
    'Eidb_ReportCwe': '2',
}
r2 = client.post('https://tradestat.commerce.gov.in/eidb/commodity_wise_export', data=data)
print('POST Status:', r2.status_code)
print('Has table:', '<table' in r2.text)
# Find number of rows
rows = re.findall(r'<tr>', r2.text)
print('Rows in response:', len(rows))
# Extract first few rows
tables = re.findall(r'<tbody>(.*?)</tbody>', r2.text, re.DOTALL)
if tables:
    trs = re.findall(r'<tr>(.*?)</tr>', tables[0], re.DOTALL)
    print(f'Data rows: {len(trs)}')
    # Show first row
    if trs:
        tds = re.findall(r'<td[^>]*>(.*?)</td>', trs[0], re.DOTALL)
        print('First row:', [td.strip() for td in tds])
" 2>&1

>uv run --with httpx --with lxml python3 -c "
import httpx, re

# First GET to obtain session cookie + CSRF token
client = httpx.Client(follow_redirects=True, timeout=30)
r = client.get('https://tradestat.commerce.gov.in/eidb/commodity_wise_export')
print('GET Status:', r.status_code)
print('Cookies:', dict(client.cookies))

# Extract CSRF token
m = re.search(r'name=\"_token\" value=\"([^\"]+)\"', r.text)
token = m.group(1) if m else 'NOT FOUND'
print('Token:', token[:20], '...')

# Now POST
data = {
    '_token': token,
    'EidbYearCwe': '2024',
    'comType': 'all',
    'EidbComLevelCwe': '2',
    'Eidb_ReportCwe': '2',
}
r2 = client.post('https://tradestat.commerce.gov.in/eidb/commodity_wise_export', data=data)
print('POST Status:', r2.status_code)
print('Has table:', '<table' in r2.text)

# Parse with lxml
from lxml import html as lxml_html
doc = lxml_html.fromstring(r2.text)
rows = doc.xpath('//table//tbody//tr')
print(f'Data rows: {len(rows)}')
if rows:
    for row in rows[:3]:
        cells = [td.text_content().strip() for td in row.xpath('td')]
        print(cells)
    print('...')
    # Last row (total)
    cells = [td.text_content().strip() for td in rows[-1].xpath('td')]
    print('Last row:', cells)
" 2>&1

TLS issue. Let me check if curl works and what TLS the server needs:

Curl works fine. The server uses TLS 1.3. The issue might be Python 3.14 SSL. Let me try with requests and a specific Python version:

># Try with requests and a user-agent header
uv run --python 3.12 --with requests --with lxml python3 -c "
import requests, re

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
})

# GET
r = session.get('https://tradestat.commerce.gov.in/eidb/commodity_wise_export', timeout=30)
print('GET Status:', r.status_code)
print('Cookies:', dict(session.cookies))

# Extract CSRF token
m = re.search(r'name=\"_token\" value=\"([^\"]+)\"', r.text)
token = m.group(1)
print('Token:', token[:20], '...')

# POST
data = {
    '_token': token,
    'EidbYearCwe': '2024',
    'comType': 'all',
    'EidbComLevelCwe': '2',
    'Eidb_ReportCwe': '2',
}
r2 = session.post('https://tradestat.commerce.gov.in/eidb/commodity_wise_export', data=data, timeout=30)
print('POST Status:', r2.status_code)
print('Response length:', len(r2.text))
print('Has table:', '<table' in r2.text)

from lxml import html
doc = html.fromstring(r2.text)
rows = doc.xpath('//table//tbody//tr')
print(f'Data rows: {len(rows)}')
if rows:
    for row in rows[:3]:
        cells = [td.text_content().strip() for td in row.xpath('td')]
        print(cells)
    print('...')
    cells = [td.text_content().strip() for td in rows[-1].xpath('td')]
    print('Last row:', cells)
" 2>&1

All 98 rows in a single response — no pagination needed with HTTP requests. Now let me also check the country-wise endpoint:

Now let me get the full country list:

>CDP="/Users/lakshyaag/.pi/agent/git/github.com/pasky/chrome-cdp-skill/skills/chrome-cdp/scripts/cdp.mjs"
$CDP eval C9A48549 'JSON.stringify(Array.from(document.querySelectorAll("#EidbCntcwace option")).map(o => ({value: o.value, text: o.textContent.trim()})))' 2>&1 | python3 -c "
import json, sys
data = json.loads(sys.stdin.read())
print(f'Total countries: {len(data)}')
for d in data[:5]:
    print(d)
print('...')
for d in data[-5:]:
    print(d)
" 2>&1

>CDP="/Users/lakshyaag/.pi/agent/git/github.com/pasky/chrome-cdp-skill/skills/chrome-cdp/scripts/cdp.mjs"
# Save full country list
$CDP eval C9A48549 'JSON.stringify(Array.from(document.querySelectorAll("#EidbCntcwace option")).map(o => ({value: o.value, text: o.textContent.trim()})))' 2>&1 > /tmp/countries.json

# Also check the rest of the form - HS level selector and value unit
$CDP eval C9A48549 'document.querySelector("form").outerHTML.substring(3500, 7000)' 2>&1

>CDP="/Users/lakshyaag/.pi/agent/git/github.com/pasky/chrome-cdp-skill/skills/chrome-cdp/scripts/cdp.mjs"
# Get the form fields - selects and their names
$CDP eval C9A48549 'JSON.stringify(Array.from(document.querySelectorAll("form select")).map(s => ({name: s.name, id: s.id, options: Array.from(s.options).map(o => ({value: o.value, text: o.textContent.trim()}))})))' 2>&1 | python3 -c "
import json, sys
data = json.loads(sys.stdin.read())
for sel in data:
    print(f'\n--- {sel[\"name\"]} (id={sel[\"id\"]}) ---')
    # For country, just show count
    if len(sel['options']) > 20:
        print(f'  {len(sel[\"options\"])} options')
        print(f'  First: {sel[\"options\"][0]}')
        print(f'  Last: {sel[\"options\"][-1]}')
    else:
        for o in sel['options']:
            print(f'  {o}')
" 2>&1

Now let me verify the country-wise endpoint works with direct HTTP too:

>uv run --python 3.12 --with requests --with lxml python3 -c "
import requests, re
from lxml import html

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
})

# GET country-wise all commodities export
url = 'https://tradestat.commerce.gov.in/eidb/country_wise_all_commodities_export'
r = session.get(url, timeout=30)
m = re.search(r'name=\"_token\" value=\"([^\"]+)\"', r.text)
token = m.group(1)

# POST for USA (value='403,U S A'), year 2024, HS2, USD Million
data = {
    '_token': token,
    'EidbYearcwace': '2024',
    'EidbCntcwace': '403,U S A',
    'EidbReportcwace': '2',
    'EidbComLevelcwace': '2',
}
r2 = session.post(url, data=data, timeout=30)
print('POST Status:', r2.status_code)

doc = html.fromstring(r2.text)
rows = doc.xpath('//table//tbody//tr')
print(f'Data rows: {len(rows)}')
# Print headers
headers = doc.xpath('//table//thead//th')
print('Headers:', [h.text_content().strip() for h in headers])
# Print first 3 rows
for row in rows[:3]:
    cells = [td.text_content().strip() for td in row.xpath('td')]
    print(cells)
print('...')
# Last row
cells = [td.text_content().strip() for td in rows[-1].xpath('td')]
print('Last:', cells)
" 2>&1

It seems the server is rejecting connections from Python. The first request worked but subsequent ones fail. Maybe there's rate limiting, or the server's SSL is finicky. Let me try with curl subprocess approach, or use pycurl, or try with a custom SSL context.

Actually wait - the first test with requests worked fine (Python 3.12 with requests and lxml). But this second one failed. Let me check - maybe the server has some aggressive connection reset for repeated requests. Or maybe I need to use a different approach for TLS.

Let me try with curl_cffi which provides browser-like TLS fingerprints, or try subprocess with curl.

The server is finicky with SSL. The first attempt worked but now it's rejecting. Let me try using curl_cffi which provides browser-like TLS fingerprints:

># Try with curl_cffi which impersonates browser TLS
uv run --python 3.12 --with curl_cffi --with lxml python3 -c "
from curl_cffi import requests
import re
from lxml import html

session = requests.Session(impersonate='chrome')

# GET commodity-wise export
url = 'https://tradestat.commerce.gov.in/eidb/commodity_wise_export'
r = session.get(url, timeout=30)
print('GET Status:', r.status_code)

# Extract CSRF token
m = re.search(r'name=\"_token\" value=\"([^\"]+)\"', r.text)
token = m.group(1)
print('Token:', token[:20], '...')

# POST
data = {
    '_token': token,
    'EidbYearCwe': '2024',
    'comType': 'all',
    'EidbComLevelCwe': '2',
    'Eidb_ReportCwe': '2',
}
r2 = session.post(url, data=data, timeout=30)
print('POST Status:', r2.status_code)

doc = html.fromstring(r2.text)
rows = doc.xpath('//table//tbody//tr')
print(f'Data rows: {len(rows)}')
for row in rows[:3]:
    cells = [td.text_content().strip() for td in row.xpath('td')]
    print(cells)
" 2>&1

curl_cffi works reliably. Now let me verify the country-wise bilateral endpoint too:

>uv run --python 3.12 --with curl_cffi --with lxml python3 -c "
from curl_cffi import requests
import re
from lxml import html

session = requests.Session(impersonate='chrome')

# Country-wise all commodities - bilateral data
url = 'https://tradestat.commerce.gov.in/eidb/country_wise_all_commodities_export'
r = session.get(url, timeout=30)
m = re.search(r'name=\"_token\" value=\"([^\"]+)\"', r.text)
token = m.group(1)

# POST for USA
data = {
    '_token': token,
    'EidbYearcwace': '2024',
    'EidbCntcwace': '403,U S A',
    'EidbReportcwace': '2',
    'EidbComLevelcwace': '2',
}
r2 = session.post(url, data=data, timeout=30)
print('Status:', r2.status_code)

doc = html.fromstring(r2.text)
headers = doc.xpath('//table//thead//th')
print('Headers:', [h.text_content().strip() for h in headers])

rows = doc.xpath('//table//tbody//tr')
print(f'Data rows: {len(rows)}')
for row in rows[:5]:
    cells = [td.text_content().strip() for td in row.xpath('td')]
    print(cells)
print('...')
cells = [td.text_content().strip() for td in rows[-1].xpath('td')]
print('Last:', cells)
" 2>&1

The server is rate-limiting. Let me test with a delay between requests:

>uv run --python 3.12 --with curl_cffi --with lxml python3 -c "
from curl_cffi import requests
import re, time
from lxml import html

session = requests.Session(impersonate='chrome')

# Test with delay and retry
url = 'https://tradestat.commerce.gov.in/eidb/country_wise_all_commodities_export'
for attempt in range(3):
    try:
        r = session.get(url, timeout=30)
        print(f'GET attempt {attempt+1}: Status {r.status_code}')
        break
    except Exception as e:
        print(f'GET attempt {attempt+1} failed: {e}')
        time.sleep(3)
        session = requests.Session(impersonate='chrome')
else:
    # All attempts failed; stop here so r is never used while undefined
    raise SystemExit('GET failed after 3 attempts')

m = re.search(r'name=\"_token\" value=\"([^\"]+)\"', r.text)
token = m.group(1)
print('Token OK')

time.sleep(2)

data = {
    '_token': token,
    'EidbYearcwace': '2024',
    'EidbCntcwace': '403,U S A',
    'EidbReportcwace': '2',
    'EidbComLevelcwace': '2',
}
r2 = session.post(url, data=data, timeout=30)
print('POST Status:', r2.status_code)

doc = html.fromstring(r2.text)
headers = doc.xpath('//table//thead//th')
print('Headers:', [h.text_content().strip() for h in headers])
rows = doc.xpath('//table//tbody//tr')
print(f'Data rows: {len(rows)}')
for row in rows[:3]:
    cells = [td.text_content().strip() for td in row.xpath('td')]
    print(cells)
print('...')
cells = [td.text_content().strip() for td in rows[-1].xpath('td')]
print('Last:', cells)
" 2>&1

Hmm, that's only 32 rows for the country-wise endpoint (USA). This is the "country-wise all commodities" report which seems to only show commodities that had trade with that specific country. But I'm only getting 32 HS2 items out of 98. Wait, looking more closely - those are HS2 codes but they jump from 13 to 17 to 19. So this report only shows HS codes where trade exists.

Actually, I notice the headers show 2023-2024 and 2024-2025 - it's showing two years in one response. That's different from what I expected.

I'm realizing I might be looking at the wrong endpoint for what we need. The "Country-wise All Commodities" endpoint should give us bilateral data with exports and imports per country, HS2 code, and year, but the USA data is only returning 32 rows when there should be significantly more. Let me dig into the actual values to see what's happening - some entries appear to show zero values which might be filtering them out.

I should explore the other available endpoints to find the most efficient approach. The commodity_wise_all_countries_export endpoint could work as an alternative, giving me all countries for a specific commodity. Now I'm weighing the trade-offs between three main strategies: querying by commodity alone (7 requests), by country and year (1,757 requests), or by commodity and year (~686 requests).

Since I also need import data, the numbers double, making the commodity-based approach significantly more efficient at around 1,372 total requests versus over 3,500 for the country-based approach. Let me verify that endpoint works as expected.
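The request counts being compared can be checked with a few lines of arithmetic. This is only a sketch: it assumes 98 HS2 chapters (the row count returned by the commodity-wise page), 251 countries, and 7 financial years, and the variable names are illustrative.

```python
# Request-count comparison for the two iteration strategies.
# Assumed inputs: 98 HS2 chapters, 251 countries, 7 financial years.
HS2_CHAPTERS = 98
COUNTRIES = 251
YEARS = 7
TRADE_TYPES = 2  # exports and imports

commodity_first = HS2_CHAPTERS * YEARS * TRADE_TYPES
country_first = COUNTRIES * YEARS * TRADE_TYPES

print(f"commodity-first: {commodity_first:,} requests")
print(f"country-first:   {country_first:,} requests")
print(f"ratio: {country_first / commodity_first:.2f}x")
```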

But I'm noticing something odd—the USA test returned only 32 rows for 2024, which seems surprisingly low for all commodities exported by a single country. I should investigate whether that's a data limitation or if I'm hitting the wrong endpoint.

Actually, looking back at the commodity-wise results, I got 98 rows for all HS2 chapters, so it appears the country-wise endpoint filters out zero-trade commodities while the commodity-wise one includes all chapters. That's workable for what I need.

Now I have enough clarity on the approach: I'll use CSRF tokens via GET requests, leverage curl_cffi for browser impersonation to handle TLS, respect the server's rate limiting with delays, and build around the commodity-wise endpoints since they're more efficient for gathering export and import values across all countries and years.

Now let me check the more efficient endpoint — commodity_wise_all_countries_export:

Good — this endpoint takes an HS code, year, and value unit, and returns all countries for that commodity. Let me test it:

>sleep 3
uv run --python 3.12 --with curl_cffi --with lxml python3 -c "
from curl_cffi import requests
import re
from lxml import html

session = requests.Session(impersonate='chrome')

url = 'https://tradestat.commerce.gov.in/eidb/commodity_wise_all_countries_export'
r = session.get(url, timeout=30)
m = re.search(r'name=\"_token\" value=\"([^\"]+)\"', r.text)
token = m.group(1)

import time; time.sleep(1)

# POST for HS code 27 (mineral fuels/petroleum), year 2024, USD Million
data = {
    '_token': token,
    'Eidbhscode_cmace': '27',
    'EidbYear_cmace': '2024',
    'EidbReport_cmace': '2',
}
r2 = session.post(url, data=data, timeout=30)
print('Status:', r2.status_code)

doc = html.fromstring(r2.text)
headers = doc.xpath('//table//thead//th')
print('Headers:', [h.text_content().strip() for h in headers])
rows = doc.xpath('//table//tbody//tr')
print(f'Data rows: {len(rows)}')
for row in rows[:5]:
    cells = [td.text_content().strip() for td in row.xpath('td')]
    print(cells)
print('...')
cells = [td.text_content().strip() for td in rows[-1].xpath('td')]
print('Last:', cells)
" 2>&1

Interesting. The commodity_wise_all_countries_export endpoint gives us all countries for a given HS code and year. But it has 8 columns including both values in USD and quantity. The header structure shows it gives both 2023-2024 and 2024-2025 data in one response.

Wait, looking at the headers more carefully:

S.No.
Country / Region
Values in US $ Million (2023-2024)
Values in Quantity (2024-2025)
%Growth
2023-2024 (quantity?)
2024-2025 (quantity?)
%Growth (quantity?)

That's a confusing structure with what looks like sub-headers. Let me check the actual columns more carefully. The data shows 8 cells per row, and looking at the first row, I'm seeing values like '1', 'AFGHANISTAN', '0.19', '0.40' — so it seems like the columns are mixing value and quantity data across different time periods in a way that's not immediately clear from the headers.

With 8 columns of actual data but 10 headers listed, this is definitely a multi-row header situation. I think the real structure is probably: S.No., Country, USD Value 2023-2024, USD Value 2024-2025, USD %Growth, Quantity 2023-2024, Quantity 2024-2025, and Quantity %Growth. That would align the 8 data columns with a logical grouping of financial and quantity metrics across the two time periods.
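If that reading of the header is right, each 8-cell row could be mapped onto named columns as below. This is a sketch under that assumption: the column names and the `parse_row` and `to_number` helpers are hypothetical, and the layout should be checked against the actual sub-header row before relying on it.

```python
# Hypothetical column names for the guessed 8-column layout.
COLUMNS = [
    "s_no",
    "country",
    "usd_mn_prev",      # Values in US $ Million, 2023-2024
    "usd_mn_curr",      # Values in US $ Million, 2024-2025
    "usd_growth_pct",
    "qty_prev",         # Quantity, 2023-2024
    "qty_curr",         # Quantity, 2024-2025
    "qty_growth_pct",
]
NUMERIC = set(COLUMNS) - {"country"}


def to_number(text):
    """Convert a cell like '1,234.5' to float; blank or non-numeric cells become None."""
    cleaned = text.replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def parse_row(cells):
    """Map one row of cell strings onto the assumed column names."""
    if len(cells) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} cells, got {len(cells)}")
    row = dict(zip(COLUMNS, (c.strip() for c in cells)))
    for name in NUMERIC:
        row[name] = to_number(row[name])
    return row
```

Raising on an unexpected cell count makes a header or layout change fail loudly instead of silently shifting values into the wrong column.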

yo chill dawg. what does the plan say regarding data to collect?

The user wants me to re-read the plan and focus on what data we actually need to collect, rather than exploring every endpoint. Let me re-check the plan.

From the plan:

Scope:

Scrape EIDB (Annual, HS2, country-commodity bilateral) for FY 2018-19 → 2024-25
Store data efficiently in Parquet

Key data needed:

HS2-level commodity data by country (bilateral)
Both exports and imports
FY 2018-19 → 2024-25 (7 years)
Values in USD Million

Now I'm looking at the schema I need — HS2 codes as integers, commodity and country as categories for compression, financial year as a category, and values stored as float32. The strategy is to iterate through each country and pull all HS2 commodities for that country-year combination, which means about 3,318 API requests total across 237 countries, 7 years, and both export/import directions. The endpoint I should use is country_wise_all_commodities.

I'm noticing there are actually three endpoints available — the country-first approach I planned, plus a commodity-first alternative that would cut requests down to around 1,372. The commodity-first method would be more efficient, but I'll stick with the country-first strategy as outlined in the plan.

You're right, let me step back. The plan says:

Data to collect:

EIDB Annual, HS2-level, country-commodity bilateral for FY 2018-19 → 2024-25
Both exports and imports
Values in USD Million

Target schema:

Column	Type
`hs_code`	int8 (1–96)
`commodity`	category
`country`	category
`financial_year`	category
`value_usd_mn`	float32

Strategy: Country-first iteration → for each country, get all HS2 commodities for that country-year. ~3,514 requests (251 countries × 7 years × 2 trade types); the plan's ~3,318 estimate assumed 237 countries.
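The country-first iteration can be sketched as a generator of (trade type, financial year, country) tuples. The `financial_years` and `iter_tasks` names are illustrative, and the country list here is a placeholder rather than the site's real dropdown values.

```python
from itertools import product


def financial_years(first_start, last_start):
    """Labels such as '2018-19' for each financial year starting in the given range."""
    return [f"{y}-{(y + 1) % 100:02d}" for y in range(first_start, last_start + 1)]


def iter_tasks(countries, years, trade_types=("export", "import")):
    """Yield one (trade_type, year, country) tuple per request."""
    yield from product(trade_types, years, countries)


years = financial_years(2018, 2024)                  # FY 2018-19 through 2024-25
countries = [f"country_{i}" for i in range(251)]     # placeholder names
tasks = list(iter_tasks(countries, years))
print(years)
print(len(tasks))
```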

We've confirmed the key endpoint: country_wise_all_commodities_{export,import} works via plain HTTP POST with curl_cffi (no Selenium needed).

Want me to now set up the uv project and build the scraper?

yes start by setting up the project. i can tweak the specifics

Now let me get the full country list from the browser and build the config:

>CDP="/Users/lakshyaag/.pi/agent/git/github.com/pasky/chrome-cdp-skill/skills/chrome-cdp/scripts/cdp.mjs"
# Need to navigate back to the country_wise page first
$CDP nav C9A48549 'https://tradestat.commerce.gov.in/eidb/country_wise_all_commodities_export' 2>&1
sleep 3
$CDP eval C9A48549 'JSON.stringify(Array.from(document.querySelectorAll("#EidbCntcwace option")).map(o => ({value: o.value, text: o.textContent.trim()})))' 2>&1 > /tmp/countries_full.json
python3 -c "import json; d=json.load(open('/tmp/countries_full.json')); print(f'{len(d)} countries'); print(d[:3]); print(d[-3:])"

Good. Now let me build the project files:

Created __init__.py

Now let's test it:

It works! But the 419 errors (CSRF token expiry) show we need to refresh the token more aggressively. The retry logic handles it, but let me optimize — refresh the token after every POST since the server issues a new one in each response:

Edited scrape.py

Now update the caller to use the returned token:

Edited scrape.py

Let's re-run the test to confirm the token rotation eliminates the 419s:

The process already exited and the log is empty. Something is wrong with the module path. Let me debug:

Now let's re-run the test:

No more 419 errors with token rotation. Let's verify the output:

>cd /Users/lakshyaag/Projects/Personal/India-Trade-Revamp/india-trade-analysis && uv run python -c "
import pandas as pd
df = pd.read_parquet('data/processed/exports.parquet')
print(df.dtypes)
print()
print(df.shape)
print()
print(df.head(10).to_string())
print()
print('Countries:', df['country'].unique().tolist())
print('Years:', df['financial_year'].unique().tolist())
print('HS codes range:', df['hs_code'].min(), '-', df['hs_code'].max())
print('Value range:', df['value_usd_mn'].min(), '-', df['value_usd_mn'].max())
print('Nulls:', df['value_usd_mn'].isna().sum())
" 2>&1

Everything looks clean. Let me check the final project structure:

Here's what's set up:

Project: india-trade-analysis/

india-trade-analysis/
├── scraper/
│   ├── __init__.py
│   ├── __main__.py          # python -m scraper entry point
│   ├── config.py            # 251 countries, 7 years, endpoints, form fields
│   └── scrape.py            # HTTP scraper using curl_cffi (no Selenium)
├── data/
│   ├── raw/                 # for failed.json
│   └── processed/           # exports.parquet, imports.parquet
├── analysis/                # (empty, ready for notebooks)
├── pyproject.toml           # uv project, deps: curl-cffi, lxml, pandas, pyarrow
└── .python-version          # 3.13


Key design choices:

curl_cffi with Chrome TLS impersonation — no Selenium/Playwright needed. The site is just a standard HTML form with CSRF tokens
Token rotation — each POST response contains a fresh _token; we extract and reuse it, eliminating 419 errors
Retry with session reset — on connection errors (the server is flaky), creates a fresh session + re-fetches CSRF
1.5s delay between requests to be polite
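The token rotation and retry choices above might look roughly like this. It is a sketch, not the contents of scrape.py: `extract_token`, `post_with_rotation`, `make_session`, and the retry and delay parameters are hypothetical, and the session is assumed to be any object with `get` and `post` methods returning a response with `.status_code` and `.text` (as curl_cffi and requests sessions do).

```python
import re
import time

TOKEN_RE = re.compile(r'name="_token" value="([^"]+)"')


def extract_token(page_html):
    """Return the CSRF token embedded in a page, or None if absent."""
    match = TOKEN_RE.search(page_html)
    return match.group(1) if match else None


def post_with_rotation(make_session, url, fields, session=None, token=None,
                       retries=3, delay=1.5):
    """POST the form, returning (html, session, next_token).

    On a connection error or a non-200 status (such as 419) the session is
    rebuilt and the token is re-fetched with a GET before the next attempt.
    """
    for attempt in range(retries):
        try:
            if session is None or token is None:
                session = make_session()
                token = extract_token(session.get(url, timeout=30).text)
            response = session.post(url, data={"_token": token, **fields}, timeout=30)
            if response.status_code == 200:
                # Reuse the fresh token from the response when there is one.
                return response.text, session, extract_token(response.text) or token
        except Exception as exc:  # the sketch treats every error as retryable
            print(f"attempt {attempt + 1} failed: {exc}")
        session, token = None, None
        time.sleep(delay)
    raise RuntimeError(f"giving up on {url} after {retries} attempts")
```

The caller passes the returned session and token into the next call, so a healthy run needs only one GET per session.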

Usage:

uv run python -m scraper.scrape --test    # 3 countries × 2 years × export
uv run python -m scraper.scrape           # full scrape (251 × 7 × 2 = 3,514 requests)
uv run python -m scraper.scrape --retry   # retry failures from failed.json


The test run (6 requests) completed in ~20s with 422 rows, all correct dtypes, no failures. Ready for you to tweak and run the full scrape.

what are you doing with the raw country level data?

Looking at the plan:

Schema: hs_code, commodity, country, financial_year, value_usd_mn

The country_wise_all_commodities endpoint returns a table with columns:

Right now parse_table takes column index 4 (the current year value) and throws away the previous year's value. That's wasteful — we're making 3,514 requests but only keeping half the data the server gives us.

We could grab both year columns from each response, which would give us overlapping coverage and let us cross-validate values. But it also means deduplication logic.
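The deduplication could be as simple as keying every observation by (country, HS code, financial year) and flagging disagreements. A sketch, with hypothetical names (`merge_observations`, the tuple layout) and an arbitrary tolerance:

```python
def merge_observations(observations, tolerance=0.01):
    """Collapse duplicate (country, hs_code, year) observations.

    `observations` is an iterable of (country, hs_code, year, value) tuples.
    Returns (merged, conflicts): `merged` maps each key to the first value
    seen; `conflicts` lists keys whose later values differ from it by more
    than `tolerance`.
    """
    merged = {}
    conflicts = []
    for country, hs_code, year, value in observations:
        key = (country, hs_code, year)
        if key not in merged:
            merged[key] = value
        elif abs(merged[key] - value) > tolerance:
            conflicts.append((key, merged[key], value))
    return merged, conflicts
```

Each response would contribute two observations per row (previous year and current year), so most keys would be seen twice across adjacent requests.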

Alternatively — the simpler question: should I just keep doing what we're doing (one value per request, matching the plan's schema), or do you want to capture both years from each response?

what are you doing with the raw country level data? how is the scraper set up to track progress? are we throwing away the raw country level data after converting to parquet?

The user is raising valid concerns:

Raw data preservation - Currently we're not saving any raw HTML or intermediate data. We parse the table in memory and only save the final parquet. If something goes wrong with parsing, we lose everything.
Progress tracking - Currently there's no checkpointing. If the scrape crashes at request 2000/3514, we start from scratch. There's no way to resume.
Data pipeline - We're going straight from HTTP response → parse → in-memory list → parquet at the end. If the process dies, all in-memory data is lost.

These are real architectural problems that need addressing before we jump into solutions.

Good catches — three real problems with the current setup:

No raw data saved. We parse the HTML table in-memory and discard the response. If the parsing logic has a bug (wrong column index, edge case HTML), there's no way to re-parse without re-scraping all 3,514 pages.
No progress tracking / checkpointing. Everything accumulates in a list[dict] in memory. If the process dies at request 2,000, all 2,000 results are lost. There's no resume — you start over.
Single write at the end. save_results() only runs after the entire loop finishes. A 3,514-request scrape at ~2s each is ~2 hours — one crash and it's all gone.

What I'd suggest instead:

Save raw HTML per response → data/raw/{trade_type}/{year}/{country}.html — lets us re-parse later without re-scraping
Append to per-(trade_type, year) CSV/parquet after each country — incremental writes, crash-safe
Track completed (trade_type, year, country) tuples in a simple JSON or SQLite progress file — on restart, skip already-done combos
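A rough sketch of the raw-HTML and progress-file ideas, assuming a JSON progress file and the directory layout suggested above. The `raw_path` and `save_raw` helpers, the `ProgressLog` class, and the file names are hypothetical.

```python
import json
from pathlib import Path


def raw_path(root, trade_type, year, country):
    """Path for one raw response: <root>/<trade_type>/<year>/<country>.html."""
    safe_country = country.replace("/", "_").replace(" ", "_")
    return Path(root) / trade_type / year / f"{safe_country}.html"


def save_raw(root, trade_type, year, country, page_html):
    """Write the raw HTML to disk and return where it went."""
    path = raw_path(root, trade_type, year, country)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(page_html, encoding="utf-8")
    return path


class ProgressLog:
    """Tracks completed (trade_type, year, country) combos in a JSON file."""

    def __init__(self, path):
        self.path = Path(path)
        self.done = set()
        if self.path.exists():
            self.done = {tuple(item) for item in json.loads(self.path.read_text())}

    def is_done(self, trade_type, year, country):
        return (trade_type, year, country) in self.done

    def mark_done(self, trade_type, year, country):
        self.done.add((trade_type, year, country))
        # Write to a temp file first so a crash mid-write cannot corrupt the log.
        tmp = self.path.with_name(self.path.name + ".tmp")
        tmp.write_text(json.dumps(sorted(self.done)))
        tmp.replace(self.path)
```

On restart the scrape loop would call `is_done` before each request and skip any combination already recorded.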

Want me to rework the scraper with this approach?

yes - please plan the hardening of the scraper and make it really modular. i think an approach where i can specify country + year and it gets dispatched as a task could work? brainstorm with me