Proxyrack - June 24, 2026

How to Scrape Amazon Product Pages and Reviews

Parse Amazon Product Pages

A step-by-step guide to building a local-first HTML parser for Amazon product pages and embedded reviews.


What We're Building

We'll create a small program that:

  1. Fetches an Amazon product page once and saves it as a local HTML fixture

  2. Parses product details (title, brand, rating, price, categories, histogram)

  3. Extracts embedded customer reviews from the same page

  4. Outputs structured AmazonProductData — no browser required

Sample product: Nike Hyperwarm Balaclava  ·  Sample fixture: tests/fixtures/b0959jt4pv_product.html

Local-first rule: HTML is fetched once and saved to disk. All parser development runs against that saved file — never re-fetch during iteration.


Architecture Overview

api.py + httpx → fetch once → tests/fixtures/asin_product.html

        ↓

amazon_parse.py

        ↓

product_parser.py → models/product.py

review_parser.py → models/review.py

        ↓

AmazonProductData

Layer → Role

api.py → Live HTML fetch via httpx (no browser unless Amazon blocks requests)

scripts/fetch_fixture.py → CLI to download and save HTML fixture

amazon_parse.py → Thin parse entrypoint

parsers/ → BeautifulSoup parsers + shared selectors and text helpers

models/ → Strict Pydantic schemas

main.py → Demo CLI — accepts a product URL, parses fixture (or live) and prints summary

tests/ → Offline pytest suite against saved HTML


Step 1: Install Python 3.12

Python Installation

Windows:

  1. Go to python.org/downloads

  2. Click the yellow button that says "Download Python 3.12.x"

  3. Run the downloaded .exe file

  4. IMPORTANT: Check the box that says "Add Python to PATH" at the bottom

  5. Click "Install Now" and wait for it to finish

macOS:

  1. Go to python.org/downloads

  2. Click the yellow button for "Download Python 3.12.x"

  3. Open the downloaded .pkg file and follow the installer

Ubuntu / Debian Linux:

sudo apt update
sudo apt install python3.12 python3.12-venv

Fedora Linux:

sudo dnf install python3.12

Verify it worked: Open a terminal and type:

python3.12 --version

You should see: Python 3.12.x


Step 2: Install uv (Package Manager)

Install uv

uv is a fast tool that installs Python libraries. We use it instead of pip.

Windows (PowerShell):

powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

macOS / Linux:

curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

After installing, close and reopen your terminal (or run the source line above to activate immediately).

Verify:

uv --version

Should show something like uv 0.x.x

Linux only — pyright dependency: pyright ships with a bundled Node binary that requires libatomic, which is absent on minimal Ubuntu/Debian installs. If uv run pyright fails with libatomic.so.1: cannot open shared object file, install it:

sudo apt install libatomic1


Step 3: Install Dependencies

Let uv install everything

Open your terminal, navigate to the project folder, then run:

cd amazon.com
uv sync

This reads pyproject.toml and installs:

  • httpx: makes the HTTP requests

  • beautifulsoup4 + lxml: parse the HTML

  • pydantic: validates the parsed fields and serializes them to structured output

Dev tools (installed automatically): pytest, ruff, pyright (strict mode).

Quality checks:

uv run pytest          # all tests must pass
uv run ruff check .    # lint
uv run pyright         # strict type check


Step 4: Fetch HTML Once

Download and save a product page

Rule: Only fetch when the fixture is missing or Amazon markup has changed. Never re-fetch during parser iteration.

# Default sample product (Nike Balaclava)
uv run python -m scripts.fetch_fixture

# Any product or product-reviews URL (ASIN is extracted automatically)
uv run python -m scripts.fetch_fixture "https://www.amazon.com/dp/B001GAOTSW"

# Custom output path
uv run python -m scripts.fetch_fixture "https://www.amazon.com/dp/B0959JT4PV" \
  --output tests/fixtures/b0959jt4pv_product.html

What happens:

  1. extract_asin_from_url() pulls the 10-char ASIN from /dp/ASIN or /product-reviews/ASIN

  2. AmazonClient.fetch_product_page() GETs the normalized /dp/{ASIN} URL with browser-like headers

  3. Raw HTML is written to tests/fixtures/{asin_lower}_product.html
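
The three steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the function names echo the ones mentioned in this guide, but the regular expression, the `fixture_path` helper, and the default directory are assumptions made for the example.

```python
from __future__ import annotations

import re
from pathlib import Path

# Assumed pattern: a 10-character alphanumeric ASIN after /dp/ or /product-reviews/.
ASIN_PATTERN = re.compile(
    r"/(?:dp|product-reviews)/([A-Z0-9]{10})(?:[/?#]|$)", re.IGNORECASE
)


def extract_asin_from_url(url: str) -> str:
    """Pull the ASIN out of a product or product-reviews URL."""
    match = ASIN_PATTERN.search(url)
    if match is None:
        raise ValueError(f"No ASIN found in URL: {url}")
    return match.group(1).upper()


def normalize_product_url(url: str) -> str:
    """Rewrite any supported URL to the canonical /dp/{ASIN} form."""
    return f"https://www.amazon.com/dp/{extract_asin_from_url(url)}"


def fixture_path(url: str, fixtures_dir: Path = Path("tests/fixtures")) -> Path:
    """Hypothetical helper: where the saved HTML for this URL would live."""
    return fixtures_dir / f"{extract_asin_from_url(url).lower()}_product.html"
```

Normalizing first means a long marketing URL and a /product-reviews/ URL for the same item both map to one fixture file.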

Important: The dedicated /product-reviews/{ASIN} page often redirects to Amazon Sign-In when requested with a plain HTTP client (no logged-in session). Reviews are therefore parsed from the review block embedded in the product page, not from a separate reviews URL.
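
One way to notice such a redirect before saving a useless fixture is to check the final URL after redirects. The sketch below is an assumption-laden example: `looks_like_signin_redirect` is a hypothetical helper, and the `/ap/signin` path prefix is based on how Amazon sign-in URLs commonly look, not on anything this project ships.

```python
from __future__ import annotations

from urllib.parse import urlsplit

# Assumption: Amazon sign-in pages are served under this path prefix.
SIGNIN_PATH_PREFIX = "/ap/signin"


def looks_like_signin_redirect(final_url: str) -> bool:
    """Return True when the post-redirect URL points at the sign-in flow."""
    return urlsplit(final_url).path.startswith(SIGNIN_PATH_PREFIX)
```

With httpx and follow_redirects=True, the value to pass in would be str(response.url), which holds the URL the client ended up on.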


Step 5: Inspect the Fixture

Probe the saved HTML before writing selectors

Amazon product pages are large (~2 MB). Inspect the saved HTML locally:

uv run python -c "
from bs4 import BeautifulSoup
html = open('tests/fixtures/b0959jt4pv_product.html', encoding='utf-8').read()
soup = BeautifulSoup(html, 'lxml')
print('title:', soup.select_one('#productTitle').get_text(strip=True)[:60])
print('reviews:', len(soup.select('[data-hook=\"review\"]')))
"

Key DOM areas in the fixture

Area | Selector / attribute | Parsed by

Title | #productTitle | product_parser.py

ASIN | input#ASIN | product_parser.py

Brand | #bylineInfo | product_parser.py

Rating | [data-hook='average-star-rating'] | product_parser.py

Review count | #acrCustomerReviewText | product_parser.py

Rating histogram | a[aria-label] inside histogram widget | product_parser.py

Detail bullets | #detailBullets_feature_div li | product_parser.py

Review blocks | [data-hook='review'] | review_parser.py

Review title | [data-hook='reviewTitle'] | review_parser.py

Review body | [data-hook='reviewText'] | review_parser.py

Workflow: Open the fixture, search for data-hook attributes (these tend to survive redesigns better than hashed CSS classes), add selectors to parsers/selectors.py, implement the parsing logic, then run the tests.
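
Once a selector returns the right element, its text still has to be turned into a number. The helpers below are a rough sketch of that step. The function names and the example strings ("4.7 out of 5 stars", "8,302 ratings") are assumptions about what the rating and review-count elements contain; check them against your own fixture.

```python
from __future__ import annotations

import re


def parse_rating(text: str) -> float | None:
    """Extract the numeric rating from text such as '4.7 out of 5 stars'."""
    match = re.search(r"(\d+(?:\.\d+)?)\s+out of\s+5", text)
    return float(match.group(1)) if match else None


def parse_review_count(text: str) -> int | None:
    """Extract the first integer, ignoring thousands separators."""
    match = re.search(r"\d[\d,]*", text)
    return int(match.group(0).replace(",", "")) if match else None
```

Returning None instead of raising keeps a missing field from aborting the whole parse; the model layer can then decide whether the field is required.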


Step 6: Run the Parser

Parse from local fixture (preferred)

uv run python main.py "https://www.amazon.com/dp/B0959JT4PV"

What happens:

  1. Extract ASIN from the product URL

  2. If tests/fixtures/{asin}_product.html exists → parse locally

  3. Else → live fetch via AmazonClient.scrape_product()

  4. Print product summary + first 3 reviews
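
The fixture-or-live decision in steps 2 and 3 might look roughly like this. `choose_source` is a hypothetical name used only for this sketch; the real main.py may structure the check differently.

```python
from __future__ import annotations

from pathlib import Path


def choose_source(asin: str, fixtures_dir: Path) -> tuple[str, Path | None]:
    """Prefer the saved fixture; fall back to a live fetch only if it is missing."""
    fixture = fixtures_dir / f"{asin.lower()}_product.html"
    if fixture.is_file():
        return ("fixture", fixture)
    return ("live", None)
```

Keeping this check in one small function makes the local-first rule easy to test without touching the network.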

Expected Output

Parsed from local fixture: b0959jt4pv_product.html
============================================================
Title: Nike unisex-adult mens Balaclava
ASIN: B0959JT4PV
Brand: Nike
Rating: 4.7 (8302 reviews)
...
Reviews parsed from page: 13
- 5.0★ | Great Fit and Quality – Highly Recommend
  I recently bought this Nike ski mask...

Live scrape (integration)

uv run python -c "
from api import AmazonClient
url = 'https://www.amazon.com/Nike-Hyperwarm-Hydropull-Hood-Balaclava/dp/B0959JT4PV'
with AmazonClient() as client:
    data = client.scrape_product(url)
print(data.product.title, '|', len(data.reviews), 'reviews')
"

product URL → normalize_product_url() → httpx GET → parse_amazon_page() → AmazonProductData


Step 7: Verify Output

Run tests (offline, no network)

uv run pytest -v

Test file | What it checks

test_text_utils.py | Rating, date, ASIN, histogram parsing

test_product_parser.py | ASIN, title, brand, rating, histogram from fixture

test_review_parser.py | ≥10 reviews, fields populated

test_amazon_parse.py | End-to-end parse_amazon_page()

test_url_utils.py | Review URL → product URL normalization

Quick sanity check

uv run python -c "
from amazon_parse import parse_amazon_page
html = open('tests/fixtures/b0959jt4pv_product.html', encoding='utf-8').read()
data = parse_amazon_page(html)
p, rs = data.product, data.reviews
assert p.asin == 'B0959JT4PV'
assert p.average_rating == 4.7
assert len(rs) >= 10
assert rs[0].body
print('OK:', p.title, '|', len(rs), 'reviews')
"

Key Project Files

pyproject.toml

[project]
name = "amazon-com"
version = "0.1.0"
description = "Amazon product and review scraper"
requires-python = ">=3.12"
dependencies = [
    "beautifulsoup4>=4.15.0",
    "httpx>=0.28.1",
    "lxml>=6.1.1",
    "pydantic>=2.13.4",
]

[dependency-groups]
dev = ["pyright>=1.1.410", "pytest>=9.0.3", "ruff>=0.15.16"]

constants.py

from pathlib import Path

BASE_DIR = Path(__file__).parent
FIXTURES_DIR = BASE_DIR / "tests" / "fixtures"

AMAZON_BASE_URL = "https://www.amazon.com"

DEFAULT_HEADERS: dict[str, str] = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
}

# api.py imports this name. The value below is an assumption; tune it as needed.
DEFAULT_REQUEST_TIMEOUT_SECONDS = 30.0

amazon_parse.py

from models.response import AmazonProductData
from parsers.amazon_parser import parse_amazon_product_page

def parse_amazon_page(
    html: str,
    *,
    source_url: str | None = None,
) -> AmazonProductData:
    return parse_amazon_product_page(html, source_url=source_url)

api.py

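from types import TracebackType
from typing import Self

import httpx

from amazon_parse import parse_amazon_page
from constants import DEFAULT_HEADERS, DEFAULT_REQUEST_TIMEOUT_SECONDS
from models.response import AmazonProductData
from parsers.url_utils import normalize_product_url

class AmazonClient:
    def __init__(self) -> None:
        # One shared client so headers, timeout, and redirects are set in one place.
        self.client: httpx.Client = httpx.Client(
            headers=DEFAULT_HEADERS,
            timeout=DEFAULT_REQUEST_TIMEOUT_SECONDS,
            follow_redirects=True,
        )

    # Context-manager support, needed for `with AmazonClient() as client:`.
    def __enter__(self) -> Self:
        return self

    def __exit__(
        self,
        exc_type: type[BaseException] | None,
        exc: BaseException | None,
        tb: TracebackType | None,
    ) -> None:
        self.client.close()

    def fetch_product_page(self, product_url: str) -> str:
        # Always request the canonical /dp/{ASIN} URL.
        normalized_url = normalize_product_url(product_url)
        response = self.client.get(normalized_url)
        response.raise_for_status()
        return response.text

    def scrape_product(self, product_url: str) -> AmazonProductData:
        normalized_url = normalize_product_url(product_url)
        html = self.fetch_product_page(normalized_url)
        return parse_amazon_page(html, source_url=normalized_url)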

parsers/selectors.py (excerpt)

PRODUCT_TITLE = "#productTitle"
ASIN_INPUT = "input#ASIN"
BRAND = "#bylineInfo"
REVIEW_COUNT = "#acrCustomerReviewText, [data-hook='total-review-count']"
DETAIL_BULLETS = "#detailBullets_feature_div li"
HISTOGRAM_LINKS = '[data-csa-c-content-id="customerReviews-histogram"] a[aria-label]'

REVIEW_BLOCKS = '[data-hook="review"]'
REVIEW_TITLE = '[data-hook="reviewTitle"]'
REVIEW_BODY = '[data-hook="reviewText"]'
VERIFIED_BADGE = '[data-hook="avp-badge"]'

All selectors live in one file. When Amazon changes markup, update parsers/selectors.py first — not scattered across parser modules.
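
The HISTOGRAM_LINKS selector above returns anchors whose aria-label carries the percentage for each star level. A possible way to turn those labels into numbers is sketched below. The label wording ("86 percent of reviews have 5 stars") is an assumption about the fixture's markup, and `parse_histogram_labels` is a hypothetical helper name.

```python
from __future__ import annotations

import re

# Assumed label format: "<percent> percent of reviews have <n> stars".
_HISTOGRAM_LABEL = re.compile(
    r"(\d{1,3})\s*percent of reviews have\s*(\d)\s*stars?", re.IGNORECASE
)


def parse_histogram_labels(labels: list[str]) -> dict[int, int]:
    """Map star level (1-5) to percentage, skipping labels that do not match."""
    histogram: dict[int, int] = {}
    for label in labels:
        match = _HISTOGRAM_LABEL.search(label)
        if match:
            histogram[int(match.group(2))] = int(match.group(1))
    return histogram
```

Reading aria-label text keeps the parser independent of the generated class names on the histogram bars.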


Output Data Shape

All structured output lives in Pydantic models under models/:

File | Models

models/product.py | ProductDetails, ProductImage, RatingHistogram

models/review.py | Review, ReviewAuthor

models/response.py | AmazonProductData (root: product + reviews)

AmazonProductData(
    product=ProductDetails(
        asin="B0959JT4PV",
        title="Nike unisex-adult mens Balaclava",
        brand="Nike",
        average_rating=4.7,
        total_review_count=8302,
        rating_histogram=RatingHistogram(
            five_star_percent=86,
            four_star_percent=7,
            three_star_percent=4,
            two_star_percent=0,
            one_star_percent=3,
        ),
    ),
    reviews=[
        Review(
            review_id="R2EX7RQABO7WG5",
            title="Great Fit and Quality – Highly Recommend",
            rating=5.0,
            body="I recently bought this Nike ski mask...",
            verified_purchase=True,
            author=ReviewAuthor(name="ryaine rose"),
        ),
        # ~13 reviews embedded on product page
    ],
)

Export to JSON: data.model_dump(mode="json")
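
The dict returned by model_dump(mode="json") can be serialized with the standard json module. The sketch below uses a hand-written dict as a stand-in for that return value, so it runs without Pydantic; `to_json_text` and `sample_payload` are hypothetical names for this example only.

```python
from __future__ import annotations

import json


def to_json_text(payload: dict[str, object]) -> str:
    """Serialize the dumped model; ensure_ascii=False keeps characters like '–' readable."""
    return json.dumps(payload, ensure_ascii=False, indent=2)


# Stand-in for data.model_dump(mode="json"), trimmed to a few fields.
sample_payload: dict[str, object] = {
    "product": {
        "asin": "B0959JT4PV",
        "average_rating": 4.7,
        "total_review_count": 8302,
    },
    "reviews": [
        {
            "review_id": "R2EX7RQABO7WG5",
            "title": "Great Fit and Quality – Highly Recommend",
            "rating": 5.0,
        }
    ],
}
```

To save the result, write the returned string to a file opened with encoding="utf-8" so non-ASCII review text round-trips cleanly.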


Parser Iteration Loop

Improve parsers without fetching again

  1. Read tests/fixtures/{asin}_product.html — search for the field that failed

  2. Add or update the selector in parsers/selectors.py

  3. Edit product_parser.py or review_parser.py (or text_utils.py for text cleanup)

  4. Run uv run pytest tests/test_product_parser.py -v (or the relevant test file)

  5. Run uv run pyright and uv run ruff check . — strict typing is required

  6. Repeat until tests pass and field values look correct

Do not edit the fixture during parser iteration unless Amazon markup genuinely changed and you need a fresh snapshot.


Adding a New Product Fixture

Three-step process

  1. Fetch
    uv run python -m scripts.fetch_fixture "https://www.amazon.com/dp/NEWASIN1234"

  2. Add test assertions (copy pattern from test_product_parser.py)

  3. Run tests

    uv run pytest -v


Troubleshooting

Common Problems & Fixes

Issue | Fix

ModuleNotFoundError: constants | Run via uv run (pytest pythonpath = ["."] is configured)

Fixture not found in main.py | Run uv run python -m scripts.fetch_fixture <url> or check that the ASIN matches the filename {asin_lower}_product.html

Empty reviews | Search the fixture for [data-hook="review"]; update selectors.py

reviewTitle not found | Amazon uses camelCase data-hook="reviewTitle", not review-title

Review body has "Brief content visible..." | Use clean_review_body() in text_utils.py

price is None | Price selectors vary by listing type; add a selector in selectors.py

Live fetch returns sign-in page | Amazon bot detection; use the saved fixture or add session cookies/headers

/product-reviews/ URL fails | Expected; parse reviews from the product page /dp/{ASIN} instead

libatomic.so.1: cannot open shared object file | Pyright's bundled Node needs libatomic1 on minimal Ubuntu/Debian: sudo apt install libatomic1

Pyright errors on HttpUrl | Use TypeAdapter(HttpUrl).validate_python(url_str)

Tests fail after selector change | Update test assertions, or re-fetch if the markup changed
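
For the "Brief content visible..." row, a cleanup helper could look something like the following. This is only a guess at what clean_review_body() does: the exact toggle phrases are assumptions based on text commonly seen in Amazon review markup, so compare them with your fixture before relying on the pattern.

```python
from __future__ import annotations

import re

# Assumed expander text that Amazon injects around long review bodies.
_TOGGLE_TEXT = re.compile(
    r"(?:Brief|Full) content visible, double tap to read (?:full|brief) content\.?",
    re.IGNORECASE,
)


def clean_review_body(raw: str) -> str:
    """Drop the expander text and collapse runs of whitespace."""
    without_toggle = _TOGGLE_TEXT.sub(" ", raw)
    return " ".join(without_toggle.split())
```

Collapsing whitespace at the end also removes the stray newlines that get_text() tends to leave between nested spans.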


Design Rules

  1. Never fetch during parser iteration — only read tests/fixtures/*.html

  2. Use api.py for all live fetches — single place for headers and URL normalization

  3. Keep models in models/ — no inline dicts in parser code

  4. Keep selectors in parsers/selectors.py — one file to update when Amazon changes markup

  5. Prefer data-hook and aria-label over hashed CSS classes

  6. Strict typing required — pyright mode is strict

  7. No browser unless necessary — httpx works for product pages

  8. One parser per concern — product details and reviews are separate modules


Final File Checklist

Your folder should look like this:

amazon.com/
├── index.html                # This guide
├── TUTORIAL.md
├── api.py                    # AmazonClient: httpx fetch + scrape_product()
├── amazon_parse.py           # parse_amazon_page(html) entrypoint
├── constants.py              # AMAZON_BASE_URL, DEFAULT_HEADERS, FIXTURES_DIR
├── main.py                   # Demo CLI
├── pyproject.toml
├── scripts/
│   └── fetch_fixture.py      # CLI: download HTML → tests/fixtures/
├── models/
│   ├── product.py            # ProductDetails, RatingHistogram
│   ├── review.py             # Review, ReviewAuthor
│   └── response.py           # AmazonProductData
├── parsers/
│   ├── selectors.py          # ★ Update selectors here
│   ├── text_utils.py         # Text/date/rating/ASIN helpers
│   ├── url_utils.py          # URL normalization
│   ├── product_parser.py     # Product DOM → ProductDetails
│   ├── review_parser.py      # Review DOM → list[Review]
│   └── amazon_parser.py      # Orchestrator
└── tests/
    ├── conftest.py
    ├── fixtures/
    │   └── b0959jt4pv_product.html
    ├── test_text_utils.py
    ├── test_product_parser.py
    ├── test_review_parser.py
    ├── test_amazon_parse.py
    └── test_url_utils.py

You're done! Parse anytime with:

uv run python main.py "https://www.amazon.com/dp/B0959JT4PV"

Verify offline with:

uv run pytest
