A step-by-step guide to building a local-first HTML parser for Amazon product pages and embedded reviews.
We'll create a small program that:
Fetches an Amazon product page once and saves it as a local HTML fixture
Parses product details (title, brand, rating, price, categories, histogram)
Extracts embedded customer reviews from the same page
Outputs structured AmazonProductData — no browser required
Sample product: Nike Hyperwarm Balaclava · Sample fixture: tests/fixtures/b0959jt4pv_product.html
Local-first rule: HTML is fetched once and saved to disk. All parser development runs against that saved file — never re-fetch during iteration.
api.py + httpx → (fetch once) → tests/fixtures/asin_product.html
        ↓
amazon_parse.py
        ↓
product_parser.py → models/product.py
review_parser.py → models/review.py
        ↓
AmazonProductData
Layer → Role
api.py → Live HTML fetch via httpx (no browser unless Amazon blocks requests)
scripts/fetch_fixture.py → CLI to download and save HTML fixture
amazon_parse.py → Thin parse entrypoint
parsers/ → BeautifulSoup parsers + shared selectors and text helpers
models/ → Strict Pydantic schemas
main.py → Demo CLI — accepts a product URL, parses fixture (or live) and prints summary
tests/ → Offline pytest suite against saved HTML
Windows:
Go to python.org/downloads
Click the yellow button that says "Download Python 3.12.x"
Run the downloaded .exe file
IMPORTANT: Check the box that says "Add Python to PATH" at the bottom
Click "Install Now" and wait for it to finish
macOS:
Go to python.org/downloads
Click the yellow button for "Download Python 3.12.x"
Open the downloaded .pkg file and follow the installer
Ubuntu / Debian Linux:
sudo apt update
sudo apt install python3.12 python3.12-venv
Fedora Linux:
sudo dnf install python3.12
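Verify it worked: Open a terminal and type:
python3.12 --version
You should see: Python 3.12.x
On Windows the python.org installer does not create a python3.12 command; use python --version or py -3.12 --version instead.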
uv is a fast tool that installs Python libraries. We use it instead of pip.
Windows (PowerShell):
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
macOS / Linux:
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
After installing, close and reopen your terminal (or run the source line above to activate immediately).
Verify:
uv --version
Should show something like uv 0.x.x
Linux only — pyright dependency: pyright ships with a bundled Node binary that requires libatomic, which is absent on minimal Ubuntu/Debian installs. If uv run pyright fails with libatomic.so.1: cannot open shared object file, install it:
sudo apt install libatomic1
Open your terminal, navigate to the project folder, then run:
cd amazon.com
uv sync
This reads pyproject.toml and installs:
httpx — makes web requests
beautifulsoup4 + lxml — parses HTML
pydantic — handles data cleanly
Dev tools (installed automatically): pytest, ruff, pyright (strict mode).
Quality checks:
uv run pytest          # all tests must pass
uv run ruff check .    # lint
uv run pyright         # strict type check
Rule: Only fetch when the fixture is missing or Amazon markup has changed. Never re-fetch during parser iteration.
# Default sample product (Nike Balaclava)
uv run python -m scripts.fetch_fixture
# Any product or product-reviews URL (ASIN is extracted automatically)
uv run python -m scripts.fetch_fixture "https://www.amazon.com/dp/B001GAOTSW"
# Custom output path
uv run python -m scripts.fetch_fixture "https://www.amazon.com/dp/B0959JT4PV" \
  --output tests/fixtures/b0959jt4pv_product.html
What happens:
extract_asin_from_url() pulls the 10-char ASIN from /dp/ASIN or /product-reviews/ASIN
AmazonClient.fetch_product_page() GETs the normalized /dp/{ASIN} URL with browser-like headers
Raw HTML is written to tests/fixtures/{asin_lower}_product.html
Important: The dedicated /product-reviews/{ASIN} page often redirects to Amazon Sign-In over plain HTTP. Reviews are parsed from the embedded review block on the product page, not a separate reviews URL.
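The first two steps above can be sketched with the standard library alone. The function names extract_asin_from_url and normalize_product_url come from this guide, but the regular expression, the fixture_name helper, and the error handling below are assumptions for illustration; the project's parsers/url_utils.py may differ.

```python
import re

# Assumed to match AMAZON_BASE_URL in constants.py.
AMAZON_BASE_URL = "https://www.amazon.com"

# Assumption: an ASIN is 10 uppercase alphanumeric characters that follow
# /dp/ or /product-reviews/ and end at a slash, query, fragment, or end of URL.
_ASIN_PATTERN = re.compile(r"/(?:dp|product-reviews)/([A-Z0-9]{10})(?=[/?#]|$)")


def extract_asin_from_url(url: str) -> str:
    """Return the ASIN found in a product or product-reviews URL."""
    match = _ASIN_PATTERN.search(url)
    if match is None:
        raise ValueError(f"No ASIN found in URL: {url}")
    return match.group(1)


def normalize_product_url(url: str) -> str:
    """Rewrite a supported Amazon URL to the /dp/{ASIN} form."""
    return f"{AMAZON_BASE_URL}/dp/{extract_asin_from_url(url)}"


def fixture_name(url: str) -> str:
    """Hypothetical helper: build the {asin_lower}_product.html filename."""
    return f"{extract_asin_from_url(url).lower()}_product.html"
```

With this sketch, a /product-reviews/B0959JT4PV URL and a long SEO-style /dp/B0959JT4PV URL both normalize to the same product URL, so they share one fixture file.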
Amazon product pages are large (~2 MB). Inspect the saved HTML locally:
uv run python -c "
from bs4 import BeautifulSoup
html = open('tests/fixtures/b0959jt4pv_product.html').read()
soup = BeautifulSoup(html, 'lxml')
print('title:', soup.select_one('#productTitle').get_text(strip=True)[:60])
print('reviews:', len(soup.select('[data-hook=\"review\"]')))
"
Area → Selector / attribute → Parsed by
Title → #productTitle → product_parser.py
ASIN → input#ASIN → product_parser.py
Brand → #bylineInfo → product_parser.py
Rating → [data-hook='average-star-rating'] → product_parser.py
Review count → #acrCustomerReviewText → product_parser.py
Rating histogram → a[aria-label] inside histogram widget → product_parser.py
Detail bullets → #detailBullets_feature_div li → product_parser.py
Review blocks → [data-hook='review'] → review_parser.py
Review title → [data-hook='reviewTitle'] → review_parser.py
Review body → [data-hook='reviewText'] → review_parser.py
Workflow: Open the fixture, search for data-hook attributes (stable across redesigns), add selectors to parsers/selectors.py, implement parsing logic, run tests.
uv run python main.py "https://www.amazon.com/dp/B0959JT4PV"
What happens:
Extract ASIN from the product URL
If tests/fixtures/{asin}_product.html exists → parse locally
Else → live fetch via AmazonClient.scrape_product()
Print product summary + first 3 reviews
Parsed from local fixture: b0959jt4pv_product.html
============================================================
Title: Nike unisex-adult mens Balaclava
ASIN: B0959JT4PV
Brand: Nike
Rating: 4.7 (8302 reviews)
...
Reviews parsed from page: 13
- 5.0★ | Great Fit and Quality – Highly Recommend
I recently bought this Nike ski mask...
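The fixture-or-live decision that main.py makes can be sketched as a small function. This is an illustration under assumptions: load_product_html is a hypothetical name, and the live path is passed in as a callable so the example stays self-contained, whereas the guide's main.py calls AmazonClient.scrape_product().

```python
from collections.abc import Callable
from pathlib import Path


def load_product_html(
    asin: str,
    fixtures_dir: Path,
    fetch_live: Callable[[str], str],
) -> tuple[str, str]:
    """Return (html, source), preferring the saved fixture over a live fetch."""
    fixture_path = fixtures_dir / f"{asin.lower()}_product.html"
    if fixture_path.exists():
        # Local-first: never touch the network when a snapshot is on disk.
        return fixture_path.read_text(encoding="utf-8"), f"fixture:{fixture_path.name}"
    # No fixture on disk: fall back to the injected live fetcher.
    return fetch_live(asin), "live"
```

Injecting the fetcher also makes the local-first rule easy to test: a test can pass a callable that raises, which proves the fixture path never reaches the network.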
uv run python -c "
from api import AmazonClient
url = 'https://www.amazon.com/Nike-Hyperwarm-Hydropull-Hood-Balaclava/dp/B0959JT4PV'
with AmazonClient() as client:
    data = client.scrape_product(url)
print(data.product.title, '|', len(data.reviews), 'reviews')
"
product URL → normalize_product_url() → httpx GET → parse_amazon_page() → AmazonProductData
uv run pytest -v
Test file → What it checks
test_text_utils.py → Rating, date, ASIN, histogram parsing
test_product_parser.py → ASIN, title, brand, rating, histogram from fixture
test_review_parser.py → ≥10 reviews, fields populated
test_amazon_parse.py → End-to-end parse_amazon_page()
test_url_utils.py → Review URL to product URL normalization
uv run python -c "
from amazon_parse import parse_amazon_page
html = open('tests/fixtures/b0959jt4pv_product.html').read()
data = parse_amazon_page(html)
p, rs = data.product, data.reviews
assert p.asin == 'B0959JT4PV'
assert p.average_rating == 4.7
assert len(rs) >= 10
assert rs[0].body
print('OK:', p.title, '|', len(rs), 'reviews')
"
pyproject.toml
[project]
name = "amazon-com"
version = "0.1.0"
description = "Amazon product and review scraper"
requires-python = ">=3.12"
dependencies = [
"beautifulsoup4>=4.15.0",
"httpx>=0.28.1",
"lxml>=6.1.1",
"pydantic>=2.13.4",
]
[dependency-groups]
dev = ["pyright>=1.1.410", "pytest>=9.0.3", "ruff>=0.15.16"]
constants.py
from pathlib import Path

BASE_DIR = Path(__file__).parent
FIXTURES_DIR = BASE_DIR / "tests" / "fixtures"
AMAZON_BASE_URL = "https://www.amazon.com"

# Browser-like headers sent with every request.
DEFAULT_HEADERS: dict[str, str] = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
}

# Note: api.py also imports DEFAULT_REQUEST_TIMEOUT_SECONDS from this module;
# its definition is not shown in this listing.
amazon_parse.py
from models.response import AmazonProductData
from parsers.amazon_parser import parse_amazon_product_page


def parse_amazon_page(
    html: str,
    *,
    source_url: str | None = None,
) -> AmazonProductData:
    return parse_amazon_product_page(html, source_url=source_url)
api.py
import httpx

from amazon_parse import parse_amazon_page
from constants import DEFAULT_HEADERS, DEFAULT_REQUEST_TIMEOUT_SECONDS
from models.response import AmazonProductData
from parsers.url_utils import normalize_product_url


class AmazonClient:
    def __init__(self) -> None:
        self.client: httpx.Client = httpx.Client(
            headers=DEFAULT_HEADERS,
            timeout=DEFAULT_REQUEST_TIMEOUT_SECONDS,
            follow_redirects=True,
        )

    # Context-manager support, needed for `with AmazonClient() as client:`.
    def __enter__(self) -> "AmazonClient":
        return self

    def __exit__(self, *exc_info: object) -> None:
        self.client.close()

    def fetch_product_page(self, product_url: str) -> str:
        # Normalizing here lets callers pass any supported URL form.
        normalized_url = normalize_product_url(product_url)
        response = self.client.get(normalized_url)
        response.raise_for_status()
        return response.text

    def scrape_product(self, product_url: str) -> AmazonProductData:
        normalized_url = normalize_product_url(product_url)
        html = self.fetch_product_page(normalized_url)
        return parse_amazon_page(html, source_url=normalized_url)
parsers/selectors.py (excerpt)
PRODUCT_TITLE = "#productTitle"
ASIN_INPUT = "input#ASIN"
BRAND = "#bylineInfo"
REVIEW_COUNT = "#acrCustomerReviewText, [data-hook='total-review-count']"
DETAIL_BULLETS = "#detailBullets_feature_div li"
HISTOGRAM_LINKS = '[data-csa-c-content-id="customerReviews-histogram"] a[aria-label]'
REVIEW_BLOCKS = '[data-hook="review"]'
REVIEW_TITLE = '[data-hook="reviewTitle"]'
REVIEW_BODY = '[data-hook="reviewText"]'
VERIFIED_BADGE = '[data-hook="avp-badge"]'
All selectors live in one file. When Amazon changes markup, update parsers/selectors.py first — not scattered across parser modules.
All structured output lives in Pydantic models under models/:
File → Models
models/product.py → ProductDetails, ProductImage, RatingHistogram
models/review.py → Review, ReviewAuthor
models/response.py → AmazonProductData (root: product + reviews)
AmazonProductData(
    product=ProductDetails(
        asin="B0959JT4PV",
        title="Nike unisex-adult mens Balaclava",
        brand="Nike",
        average_rating=4.7,
        total_review_count=8302,
        rating_histogram=RatingHistogram(
            five_star_percent=86,
            four_star_percent=7,
            three_star_percent=4,
            two_star_percent=0,
            one_star_percent=3,
        ),
    ),
    reviews=[
        Review(
            review_id="R2EX7RQABO7WG5",
            title="Great Fit and Quality – Highly Recommend",
            rating=5.0,
            body="I recently bought this Nike ski mask...",
            verified_purchase=True,
            author=ReviewAuthor(name="ryaine rose"),
        ),
        # ~13 reviews embedded on product page
    ],
)
Export to JSON: data.model_dump(mode="json")
Read tests/fixtures/{asin}_product.html — search for the field that failed
Add or update the selector in parsers/selectors.py
Edit product_parser.py or review_parser.py (or text_utils.py for text cleanup)
Run uv run pytest tests/test_product_parser.py -v (or the relevant test file)
Run uv run pyright and uv run ruff check . — strict typing is required
Repeat until tests pass and field values look correct
Do not edit the fixture during parser iteration unless Amazon markup genuinely changed and you need a fresh snapshot.
Three-step process
Fetch
uv run python -m scripts.fetch_fixture "https://www.amazon.com/dp/NEWASIN1234"
Add test assertions (copy pattern from test_product_parser.py)
Run tests
uv run pytest -v
Issue → Fix
ModuleNotFoundError: constants → Run via uv run (pytest pythonpath = ["."] is configured)
Fixture not found in main.py → Run uv run python -m scripts.fetch_fixture <url> or check that the ASIN matches the filename {asin_lower}_product.html
Empty reviews → Search the fixture for [data-hook="review"]; update selectors.py
reviewTitle not found → Amazon uses camelCase data-hook="reviewTitle", not review-title
Review body has "Brief content visible..." → Use clean_review_body() in text_utils.py
price is None → Price selectors vary by listing type; add a selector in selectors.py
Live fetch returns sign-in page → Amazon bot detection; use the saved fixture or add session cookies/headers
/product-reviews/ URL fails → Expected; parse reviews from the product page /dp/{ASIN} instead
libatomic.so.1: cannot open shared object file → Pyright's bundled Node needs libatomic1 on minimal Ubuntu/Debian: sudo apt install libatomic1
Pyright errors on HttpUrl → Use TypeAdapter(HttpUrl).validate_python(url_str)
Tests fail after selector change → Update test assertions, or re-fetch if the markup changed
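For the "Brief content visible..." case above, a cleanup helper might look like the sketch below. The name clean_review_body comes from this guide, but the body is an assumption: the list of noise phrases and the trailing "Read more" handling are guesses at what the expander widget leaves behind, so compare them against your own fixture.

```python
import re

# Assumed accessibility strings injected around expandable review text.
_EXPANDER_NOISE = (
    "Brief content visible, double tap to read full content.",
    "Full content visible, double tap to read brief content.",
)
# Assumed trailing link text left by the expander.
_TRAILING_READ_MORE = re.compile(r"\s*Read more\s*$")


def clean_review_body(text: str) -> str:
    """Strip expander noise and collapse whitespace in a review body."""
    for phrase in _EXPANDER_NOISE:
        text = text.replace(phrase, " ")
    text = _TRAILING_READ_MORE.sub("", text)
    # Collapse the runs of whitespace left behind by the removals.
    return " ".join(text.split())
```

Keeping this in text_utils.py, rather than inside review_parser.py, means the cleanup can be unit-tested with plain strings and no HTML.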
Never fetch during parser iteration — only read tests/fixtures/*.html
Use api.py for all live fetches — single place for headers and URL normalization
Keep models in models/ — no inline dicts in parser code
Keep selectors in parsers/selectors.py — one file to update when Amazon changes markup
Prefer data-hook and aria-label over hashed CSS classes
Strict typing required — pyright mode is strict
No browser unless necessary — httpx works for product pages
One parser per concern — product details and reviews are separate modules
Your folder should look like this:
amazon.com/
├── index.html # This guide
├── TUTORIAL.md
├── api.py # AmazonClient — httpx fetch + scrape_product()
├── amazon_parse.py # parse_amazon_page(html) entrypoint
├── constants.py # AMAZON_BASE_URL, DEFAULT_HEADERS, FIXTURES_DIR
├── main.py # Demo CLI
├── pyproject.toml
├── scripts/
│ └── fetch_fixture.py # CLI: download HTML → tests/fixtures/
├── models/
│ ├── product.py # ProductDetails, RatingHistogram
│ ├── review.py # Review, ReviewAuthor
│ └── response.py # AmazonProductData
├── parsers/
│ ├── selectors.py # ★ Update selectors here
│ ├── text_utils.py # Text/date/rating/ASIN helpers
│ ├── url_utils.py # URL normalization
│ ├── product_parser.py # Product DOM → ProductDetails
│ ├── review_parser.py # Review DOM → list[Review]
│ └── amazon_parser.py # Orchestrator
└── tests/
    ├── conftest.py
    ├── fixtures/
    │   └── b0959jt4pv_product.html
    ├── test_text_utils.py
    ├── test_product_parser.py
    ├── test_review_parser.py
    ├── test_amazon_parse.py
    └── test_url_utils.py
You're done! Parse anytime with:
uv run python main.py "https://www.amazon.com/dp/B0959JT4PV"
Verify offline with:
uv run pytest