Proxyrack - May 26, 2026

Building a Naver Search HTML Parser (Agent Tutorial)


This tutorial walks an AI coding agent through building a local-first HTML parser for Naver search results. The browser is used once to fetch a snapshot; all parser development runs against saved HTML on disk.

Target URL: https://www.naver.com/ (entry point)

Sample parsed page: Naver unified search for shoes

Output: Structured JSON (products, blogs, videos, web, images, dictionary, related keywords)

Architecture overview

flowchart LR
    A[api.py + browser] -->|fetch once| B[index.html]
    B --> C[split_html.py]
    C --> D[html_pieces/]
    D --> E[parser/ modules]
    B --> E
    E --> F[output.json]

Layer           Role
api.py          Safe rendered HTML fetch via headless Brave (CDP)
models/         Pydantic schemas (Product, BlogPost, SearchResult, …)
split_html.py   Splits DOM sections for incremental parser work
parser/         Section-specific BeautifulSoup parsers
main.py         CLI orchestrator: parse local HTML → JSON

Prerequisites

cd /path/to/naver
uv sync          # or: pip install -e .

Dependencies: beautifulsoup4, pydantic, browser stack (httpx, websocket-client, psutil).

Brave must be available at the path in constants.py (DEBUG_BROWSER_PATH).

Step 1 — Fetch HTML once (browser via api.py)

Rule: Only fetch when index.html is missing. Never re-fetch during parser iteration.

# First time: fetch search results (recommended — rich parseable data)
python fetch_html.py --query shoes

# Or fetch homepage only
python fetch_html.py --home

# Force re-fetch (rare)
python fetch_html.py --query shoes --force

What happens:

  1. get_browser_instance() starts Brave with remote debugging.

  2. GetNaverSearchHtml() or GetNaverHomeHtml() in api.py calls render_html().

  3. Full <body> outer HTML is written to index.html.

Agent note: After this step, close the browser loop. All further work uses index.html.
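The fetch-once rule can be expressed as a small guard. The following is a minimal sketch, not the tutorial's fetch_html.py: `ensure_snapshot` and its `fetch` callable are hypothetical names, with `fetch` standing in for whatever function calls into api.py and returns the HTML string.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_snapshot(path: Path, fetch: Callable[[], str], force: bool = False) -> bool:
    """Write the snapshot only if it is missing (or force=True).

    Returns True when a fetch actually happened and False when the
    existing file was reused. `fetch` is a hypothetical stand-in for
    the browser call in api.py and must return the HTML as a string.
    """
    if path.exists() and not force:
        return False  # reuse the saved snapshot; no browser needed
    html = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return True


if __name__ == "__main__":
    # Demo with a fake fetcher so no browser is started.
    with tempfile.TemporaryDirectory() as tmp:
        snapshot = Path(tmp) / "index.html"
        print(ensure_snapshot(snapshot, lambda: "<body>demo</body>"))  # first call fetches
        print(ensure_snapshot(snapshot, lambda: "<body>demo</body>"))  # second call reuses the file
```

Returning a boolean makes it easy for a caller to log whether the browser was touched during a run.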

Step 2 — Split HTML into DOM pieces

Naver search pages are large (roughly 13,000 lines or more). Split them by major section before writing selectors.

python split_html.py

This creates html_pieces/ with files like:

File pattern               Section
*_powerlink.html           파워링크 ads (#power_link_body)
*_shopping_price.html      네이버 가격비교 (#shp_tre_root)
*_plus_store.html          네이버플러스 스토어 (#shs_lis_root)
*_ugc_reviews.html         Cafe/blog UGC blocks
*_web_results.html         Organic web results
*_images.html              Image grid
*_dictionary.html          영어사전 (ldc_btm)
*_related_keywords.html    연관 검색어 sidebar

html_pieces/manifest.json maps section names → filenames.

Agent workflow: Open one piece at a time in the editor, identify stable selectors, implement the matching parser module, run main.py, inspect counts.
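To open one piece at a time, an agent can resolve a section name through the manifest. This sketch assumes manifest.json is a flat JSON object of the form {section name: filename}; the tutorial only says it maps names to filenames, so the exact shape is an assumption, and `load_manifest` and `load_piece` are hypothetical helper names.

```python
import json
import tempfile
from pathlib import Path


def load_manifest(pieces_dir: Path) -> dict:
    """Read manifest.json (assumed shape: {section name: filename})."""
    with open(pieces_dir / "manifest.json", encoding="utf-8") as f:
        return json.load(f)


def load_piece(pieces_dir: Path, section: str) -> str:
    """Return the HTML of one section; raise KeyError listing known sections."""
    manifest = load_manifest(pieces_dir)
    if section not in manifest:
        raise KeyError(f"unknown section {section!r}; known: {sorted(manifest)}")
    return (pieces_dir / manifest[section]).read_text(encoding="utf-8")


if __name__ == "__main__":
    # Build a throwaway pieces directory so the demo needs no real snapshot.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "01_powerlink.html").write_text("<div id='power_link_body'></div>", encoding="utf-8")
        (root / "manifest.json").write_text(json.dumps({"powerlink": "01_powerlink.html"}), encoding="utf-8")
        print(load_piece(root, "powerlink"))
```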

Step 3 — Data models (models/)

All structured output lives in Pydantic models:

  • models/listing.py — Product, BlogPost, Video, WebResult, ImageResult, DictionaryEntry, RelatedKeyword, SearchResult

  • models/search_meta.py — SearchMeta, NaverParseResult

Example product fields:

Product(
    title="발편한 가벼운 신발 포레스트 메리노울 운동화",
    price=116900,
    discount_rate=20,
    store_name="르무통",
    source_section="plus_store",  # or powerlink | price_comparison
    badges=["플러스세일", "공식"],
)
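A Pydantic model that accepts the example above might look like the following. The field names come from the example; which fields are optional, and the use of `Literal` to restrict source_section, are assumptions rather than the actual contents of models/listing.py.

```python
from typing import List, Literal, Optional

from pydantic import BaseModel, Field


class Product(BaseModel):
    """Sketch of a product record; field names follow the example above."""

    title: str
    price: Optional[int] = None            # price in KRW, None if not found
    discount_rate: Optional[int] = None    # percent, e.g. 20 for 20%
    store_name: Optional[str] = None
    # Restricting the section to known values catches typos at parse time.
    source_section: Literal["plus_store", "powerlink", "price_comparison"]
    badges: List[str] = Field(default_factory=list)


if __name__ == "__main__":
    product = Product(
        title="발편한 가벼운 신발 포레스트 메리노울 운동화",
        price=116900,
        discount_rate=20,
        store_name="르무통",
        source_section="plus_store",
        badges=["플러스세일", "공식"],
    )
    print(product.title, product.price)
```

Making price and discount_rate optional lets a parser emit a record even when Naver omits a field, rather than dropping the whole card.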

Step 4 — Parser modules (parser/)

Each file handles one Naver section:

Module            Parses
meta.py           Search query, tab, URL from #nx_query
powerlink.py      #power_link_body ul.lst_type > li
shopping.py       #shp_tre_root, #shs_lis_root product cards
ugc.py            [data-meta-ssuid="review"] cafe/blog posts
videos.py         Inline video + 네이버 클립 carousel
web.py            [data-meta-ssuid="web"] organic links
images.py         [data-meta-area="urB_imM"] image grid
dictionary.py     [data-meta-area="ldc_btm"] + Merriam-Webster web block
related.py        .related_srch .lst_related_srch
utils.py          Price parsing (117,900원, 2.8만), URL helpers
naver_parser.py   Orchestrator + JSON export

Shared helpers in parser/utils.py:

  • parse_korean_int("2.8만") → 28000

  • parse_price("117,900") → 117900

  • strip_mark_tags() — removes <mark> highlight wrappers
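The helpers listed above could be implemented along these lines. This is a sketch under assumptions, not the contents of parser/utils.py: the return types, the handling of 천 and 억 in addition to 만, and the string-in/string-out signature of strip_mark_tags are all guesses.

```python
import re
from decimal import Decimal
from typing import Optional

# Korean magnitude suffixes: 천 = 1,000, 만 = 10,000, 억 = 100,000,000
_UNITS = {"천": 10**3, "만": 10**4, "억": 10**8}
_NUMBER_RE = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s*([천만억])?")


def parse_korean_int(text: str) -> Optional[int]:
    """Parse counts such as '2.8만' or '1,234' into an int; None if no number."""
    match = _NUMBER_RE.search(text)
    if match is None:
        return None
    # Decimal avoids float rounding surprises when scaling values like 2.8.
    number = Decimal(match.group(1).replace(",", ""))
    multiplier = _UNITS.get(match.group(2), 1)
    return int(number * multiplier)


def parse_price(text: str) -> Optional[int]:
    """Parse prices such as '117,900원' by keeping only the digits."""
    digits = re.sub(r"\D", "", text)
    return int(digits) if digits else None


def strip_mark_tags(html: str) -> str:
    """Drop <mark> and </mark> wrappers but keep the highlighted text."""
    return re.sub(r"</?mark\b[^>]*>", "", html, flags=re.IGNORECASE)


if __name__ == "__main__":
    print(parse_korean_int("2.8만"))                   # 28000
    print(parse_price("117,900원"))                    # 117900
    print(strip_mark_tags("<mark>shoes</mark> sale"))  # shoes sale
```

Returning None instead of raising keeps a single malformed card from aborting the whole parse.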

Step 5 — Run the scraper (local HTML only)

# Default: parse full index.html, auto-split if needed
python main.py

# Parse from pre-split pieces (faster iteration)
python main.py --pieces

# Re-split then parse
python main.py --split

# Custom output path
python main.py --output results/shoes.json

Expected output (sample shoes snapshot):

Query: 'shoes'
Products: 22
Blog posts: 4
Videos: 22
Images: 7
Web results: 9
Dictionary: 2
Related keywords: 6
Saved JSON -> output.json
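The main.py flags used in this step could be wired up with argparse roughly as follows. This is a sketch, not the tutorial's main.py; the default output path matches the sample run, while the help strings and the `build_arg_parser` name are assumptions.

```python
import argparse


def build_arg_parser() -> argparse.ArgumentParser:
    """Build a parser for the three flags shown in this step."""
    parser = argparse.ArgumentParser(
        description="Parse a saved Naver snapshot into JSON (no network access)."
    )
    parser.add_argument(
        "--pieces", action="store_true",
        help="parse from html_pieces/ instead of the full index.html",
    )
    parser.add_argument(
        "--split", action="store_true",
        help="re-run the split step before parsing",
    )
    parser.add_argument(
        "--output", default="output.json",
        help="path of the JSON file to write",
    )
    return parser


if __name__ == "__main__":
    args = build_arg_parser().parse_args()
    print(args.pieces, args.split, args.output)
```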

Step 6 — Verify JSON output

output.json structure:

{
  "meta": {
    "query": "shoes",
    "search_url": "https://search.naver.com/search.naver?query=shoes",
    "tab": "전체"
  },
  "products": [ ... ],
  "blog_posts": [ ... ],
  "videos": [ ... ],
  "images": [ ... ],
  "web_results": [ ... ],
  "dictionary": [ ... ],
  "related_keywords": [ ... ]
}

Quick sanity check:

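python -c "
import json
# Read as UTF-8 explicitly: the output contains Korean text
with open('output.json', encoding='utf-8') as f:
    d = json.load(f)
assert d['meta']['query'], 'meta.query is empty'
assert len(d['products']) > 0, 'no products parsed'
print('OK:', {k: len(v) if isinstance(v, list) else v for k, v in d.items()})
"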

Agent iteration loop (recommended)

When improving parsers without fetching again:

  1. Read the relevant file in html_pieces/ (use manifest.json).

  2. Edit the matching module under parser/.

  3. Run python main.py --pieces.

  4. Compare counts and spot-check output.json.

  5. Repeat until all sections extract real values.
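Comparing counts between iterations is easier with a small diff helper. The sketch below assumes the output.json layout shown in Step 6 (list-valued top-level keys plus a meta object); `section_counts` and `diff_counts` are hypothetical names.

```python
from typing import Dict


def section_counts(result: dict) -> Dict[str, int]:
    """Count items in every list-valued top-level key (products, videos, ...)."""
    return {key: len(value) for key, value in result.items() if isinstance(value, list)}


def diff_counts(before: dict, after: dict) -> Dict[str, int]:
    """Return {section: after - before} for sections whose count changed."""
    old, new = section_counts(before), section_counts(after)
    return {
        key: new.get(key, 0) - old.get(key, 0)
        for key in sorted(set(old) | set(new))
        if new.get(key, 0) != old.get(key, 0)
    }


if __name__ == "__main__":
    previous = {"meta": {"query": "shoes"}, "products": [{}, {}], "videos": [{}]}
    current = {"meta": {"query": "shoes"}, "products": [{}, {}, {}], "videos": [{}]}
    print(diff_counts(previous, current))  # {'products': 1}
```

In practice an agent would copy output.json aside before editing a parser module, rerun, load both files with json.load, and pass them to diff_counts; a negative value flags a regression.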

Key DOM selectors reference

PowerLink ads

#power_link_body ul.lst_type > li.lst
.lnk_tit          /* title */
.link_desc        /* description */
a.site            /* store name */
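As an illustration of how these selectors might be applied with BeautifulSoup, here is a minimal sketch. The HTML fragment is made up to match the selectors and real Naver markup will differ; `PowerLinkAd` is a hypothetical dataclass standing in for the Pydantic model, and `parse_powerlink` is not the tutorial's powerlink.py.

```python
from dataclasses import dataclass
from typing import List, Optional

from bs4 import BeautifulSoup

# Made-up fragment shaped like the selectors above; real Naver markup will differ.
SAMPLE_HTML = """
<div id="power_link_body">
  <ul class="lst_type">
    <li class="lst">
      <a class="lnk_tit" href="https://example.com/a">Running <mark>shoes</mark></a>
      <div class="link_desc">Lightweight daily trainers</div>
      <a class="site">Example Store</a>
    </li>
    <li class="lst"><a class="lnk_tit" href="https://example.com/b">Title only</a></li>
  </ul>
</div>
"""


@dataclass
class PowerLinkAd:  # hypothetical stand-in for a model in models/
    title: str
    url: Optional[str]
    description: Optional[str]
    store_name: Optional[str]


def clean_text(node) -> str:
    """Collapse whitespace; <mark> wrappers vanish because only text is kept."""
    return " ".join(node.get_text().split())


def text_or_none(parent, selector: str) -> Optional[str]:
    found = parent.select_one(selector)
    return clean_text(found) if found else None


def parse_powerlink(html: str) -> List[PowerLinkAd]:
    soup = BeautifulSoup(html, "html.parser")
    ads = []
    for item in soup.select("#power_link_body ul.lst_type > li.lst"):
        title_link = item.select_one(".lnk_tit")
        if title_link is None:
            continue  # skip rows without a title
        ads.append(PowerLinkAd(
            title=clean_text(title_link),
            url=title_link.get("href"),
            description=text_or_none(item, ".link_desc"),
            store_name=text_or_none(item, "a.site"),
        ))
    return ads


if __name__ == "__main__":
    for ad in parse_powerlink(SAMPLE_HTML):
        print(ad)
```

Treating every field except the title as optional means a missing description or store name yields None instead of an exception.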

Shopping cards (price comparison & plus store)

#shp_tre_root li.q86P3e7M
#shs_lis_root li.K70iQ12A
span.Q9_4wzl0      /* price */
span.rjwcz7fY      /* discount % */
span.mlLzqQ3t      /* review score */
a.iMhVFYLc         /* store name */

Web results (fender renderer)

[data-meta-ssuid="web"] .fds-web-doc-root
a[data-heatmap-target=".link"] span.sds-comps-text-type-headline1

Dictionary

[data-meta-area="ldc_btm"] mark
[data-meta-area="ldc_btm"] [data-audioid]
span.kTlmTmRATGOguYiV24_u   /* Korean meaning */

Related keywords

.related_srch .lst_related_srch li.item a.keyword

File map

naver/
├── api.py                 # Browser HTML fetch (use once)
├── fetch_html.py          # CLI: save index.html
├── split_html.py          # CLI: index.html → html_pieces/
├── main.py                # CLI: parse → output.json
├── constants.py           # Paths, URLs, browser config
├── index.html             # Saved snapshot (do not re-fetch)
├── html_pieces/           # Split DOM chunks + manifest.json
├── output.json            # Parsed result
├── models/
│   ├── listing.py
│   ├── search_meta.py
│   └── browser.py
└── parser/
    ├── naver_parser.py
    ├── meta.py
    ├── powerlink.py
    ├── shopping.py
    ├── ugc.py
    ├── videos.py
    ├── web.py
    ├── images.py
    ├── dictionary.py
    ├── related.py
    └── utils.py

Troubleshooting

Issue                    Fix
Missing index.html       Run python fetch_html.py --query shoes once
Empty products           Check the shopping pieces in html_pieces/ (*_shopping_price.html, *_plus_store.html); Naver periodically changes its obfuscated class names, so update the selectors
Browser fails to start   Verify the Brave path in constants.py; port 9999 must be free
Dictionary empty         Ensure the ldc_btm piece exists; the word is in a mark element inside the title span
Duplicate videos         Dedup runs on video_url; clip + inline may share URLs
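The dedup mentioned in the "Duplicate videos" row can be sketched as an order-preserving filter keyed on video_url. The `Video` dataclass here is a hypothetical stand-in for the model in models/listing.py, and keeping the first occurrence is an assumption about the desired behavior.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Video:  # hypothetical stand-in for the Video model in models/listing.py
    title: str
    video_url: str


def dedupe_by_url(videos: Iterable[Video]) -> List[Video]:
    """Keep the first video seen for each video_url, preserving order."""
    seen = set()
    unique = []
    for video in videos:
        if video.video_url in seen:
            continue  # same URL already emitted (e.g. clip and inline overlap)
        seen.add(video.video_url)
        unique.append(video)
    return unique


if __name__ == "__main__":
    items = [
        Video("inline", "https://example.com/v/1"),
        Video("clip", "https://example.com/v/1"),
        Video("other", "https://example.com/v/2"),
    ]
    print([v.title for v in dedupe_by_url(items)])  # ['inline', 'other']
```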

Design rules (for agents)

  1. Never fetch in the parser — only read index.html or html_pieces/.

  2. Use api.py for all live fetches; the rendered-browser path is less likely to hit bot blocks than raw httpx requests.

  3. Keep models in models/ — no inline dicts in parser code.

  4. One module per DOM section — keeps diffs small and testable.

  5. Prefer data-meta-* attributes over hashed CSS classes where available; they survive redesigns better.

Next steps (optional extensions)

  • Add CLI --query to re-fetch + parse in one command (still one fetch per invocation).

  • Parse pagination / main2 sidebar sections (rsk_top, stX_cpT).

  • Export to CSV per section.

  • Unit tests using html_pieces/ fixtures (no browser required).

Tutorial complete. Run python main.py to produce output.json from the local Naver HTML snapshot.
