This tutorial walks an AI coding agent through building a local-first HTML parser for Naver search results. The browser is used once to fetch a snapshot; all parser development runs against saved HTML on disk.
Target URL: https://www.naver.com/ (entry point)
Sample parsed page: Naver unified search for shoes
Output: Structured JSON (products, blogs, videos, web, images, dictionary, related keywords)
flowchart LR
A[api.py + browser] -->|fetch once| B[index.html]
B --> C[split_html.py]
C --> D[html_pieces/]
D --> E[parser/ modules]
B --> E
E --> F[output.json]

| Layer | Role |
| --- | --- |
| api.py | Safe rendered HTML fetch via headless Brave (CDP) |
| models/ | Pydantic schemas (Product, BlogPost, SearchResult, …) |
| split_html.py | Splits DOM sections for incremental parser work |
| parser/ | Section-specific BeautifulSoup parsers |
| main.py | CLI orchestrator: parse local HTML → JSON |
cd /path/to/naver
uv sync # or: pip install -e .
Dependencies: beautifulsoup4, pydantic, browser stack (httpx, websocket-client, psutil).
Brave must be available at the path in constants.py (DEBUG_BROWSER_PATH).
Rule: Only fetch when index.html is missing. Never re-fetch during parser iteration.
# First time: fetch search results (recommended — rich parseable data)
python fetch_html.py --query shoes
# Or fetch homepage only
python fetch_html.py --home
# Force re-fetch (rare)
python fetch_html.py --query shoes --force

What happens:
get_browser_instance() starts Brave with remote debugging.
GetNaverSearchHtml() or GetNaverHomeHtml() in api.py calls render_html().
Full <body> outer HTML is written to index.html.
Agent note: After this step, close the browser loop. All further work uses index.html.
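The fetch-once rule can be enforced in code rather than by convention. The sketch below is a minimal guard, not the actual contents of fetch_html.py: `ensure_snapshot` and its `fetch` argument are hypothetical names, and `fetch` stands in for whatever callable wraps the one-time browser fetch.

```python
from pathlib import Path
from typing import Callable


def ensure_snapshot(path: Path, fetch: Callable[[], str], force: bool = False) -> bool:
    """Write the snapshot only if it is missing, or if force is set.

    Returns True when a fetch actually happened and False when the
    saved file was reused.
    """
    if path.exists() and not force:
        return False  # reuse the saved HTML; the browser is never started
    html = fetch()  # hypothetical callable wrapping the one-time browser fetch
    path.write_text(html, encoding="utf-8")
    return True
```

Calling `ensure_snapshot(Path("index.html"), fetch)` twice in a row triggers at most one fetch, which mirrors the behaviour described for `--force` being the only way to refresh the file.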
Naver search pages are large (roughly 13,000 lines or more). Split them by major section before writing selectors.
python split_html.py
This creates html_pieces/ with files like:
| File pattern | Section |
| --- | --- |
| *_powerlink.html | Powerlink ads (파워링크; #power_link_body) |
| *_shopping_price.html | Naver price comparison (네이버 가격비교; #shp_tre_root) |
| *_plus_store.html | Naver Plus Store (네이버플러스 스토어; #shs_lis_root) |
| *_ugc_reviews.html | Cafe/blog UGC blocks |
| *_web_results.html | Organic web results |
| *_images.html | Image grid |
| *_dictionary.html | English dictionary (영어사전; ldc_btm) |
| *_related_keywords.html | Related search terms (연관 검색어) sidebar |
html_pieces/manifest.json maps section names → filenames.
Agent workflow: Open one piece at a time in the editor, identify stable selectors, implement the matching parser module, run main.py, inspect counts.
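A small helper makes it easy to load one piece by section name instead of hard-coding filenames. This is a sketch under one assumption: that manifest.json is a flat JSON object of the form `{"section": "filename"}`. The tutorial only says it maps section names to filenames, so the real layout may differ, and `load_piece` is a hypothetical name.

```python
import json
from pathlib import Path


def load_piece(pieces_dir: Path, section: str) -> str:
    """Return the HTML of one section, resolved through manifest.json."""
    manifest_path = pieces_dir / "manifest.json"
    # Assumed shape: {"powerlink": "01_powerlink.html", ...}
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    try:
        filename = manifest[section]
    except KeyError:
        known = ", ".join(sorted(manifest))
        raise KeyError(f"unknown section {section!r}; manifest has: {known}") from None
    return (pieces_dir / filename).read_text(encoding="utf-8")
```

The error message lists the known sections, which helps when a section was renamed after a re-split.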
All structured output lives in Pydantic models:
models/listing.py — Product, BlogPost, Video, WebResult, ImageResult, DictionaryEntry, RelatedKeyword, SearchResult
models/search_meta.py — SearchMeta, NaverParseResult
Example product fields:
Product(
    title="발편한 가벼운 신발 포레스트 메리노울 운동화",
    price=116900,
    discount_rate=20,
    store_name="르무통",
    source_section="plus_store",  # or powerlink | price_comparison
    badges=["플러스세일", "공식"],
)
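For orientation, a model accepting the fields in the example above might look roughly like this. It is an illustrative subset, not the contents of models/listing.py: the field names come from the example, while the types, optionality and defaults are assumptions.

```python
from typing import List, Optional

from pydantic import BaseModel, Field


class Product(BaseModel):
    """Illustrative subset of a product schema; the real model may differ."""

    title: str
    price: Optional[int] = None          # assumed: integer amount, None when missing
    discount_rate: Optional[int] = None  # assumed: whole-number percent
    store_name: Optional[str] = None
    source_section: str                  # plus_store | powerlink | price_comparison
    badges: List[str] = Field(default_factory=list)
```

Making the numeric fields optional lets a parser emit a product even when a card has no price or discount, rather than dropping the whole record.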
Each file handles one Naver section:
| Module | Parses |
| --- | --- |
| meta.py | Search query, tab, URL from #nx_query |
| powerlink.py | #power_link_body ul.lst_type > li |
| shopping.py | #shp_tre_root, #shs_lis_root product cards |
| ugc.py | [data-meta-ssuid="review"] cafe/blog posts |
| videos.py | Inline video + Naver Clip (네이버 클립) carousel |
| web.py | [data-meta-ssuid="web"] organic links |
| images.py | [data-meta-area="urB_imM"] image grid |
| dictionary.py | [data-meta-area="ldc_btm"] + Merriam-Webster web block |
| related.py | .related_srch .lst_related_srch |
| utils.py | Price parsing (117,900원, 2.8만), URL helpers |
| naver_parser.py | Orchestrator + JSON export |
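One way to picture the orchestrator is as a registry of section parsers that all receive the same HTML. The sketch below is an assumption about the shape, not the code in naver_parser.py: `parse_all` and `SectionParser` are hypothetical names, and catching per-section errors is a design choice of this sketch.

```python
from typing import Any, Callable, Dict, List

# Each section parser takes raw HTML and returns a list of parsed items.
SectionParser = Callable[[str], List[Any]]


def parse_all(html: str, parsers: Dict[str, SectionParser]) -> Dict[str, List[Any]]:
    """Run every section parser on the same HTML.

    A failing section is reported and left empty so the remaining
    sections are still parsed.
    """
    result: Dict[str, List[Any]] = {}
    for name, parse in parsers.items():
        try:
            result[name] = parse(html)
        except Exception as exc:  # one broken selector should not hide the rest
            print(f"[warn] {name} parser failed: {exc}")
            result[name] = []
    return result
```

With this layout, adding a section means registering one more entry, which fits the one-module-per-section structure in the table above.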
Shared helpers in parser/utils.py:
parse_korean_int("2.8만") → 28000
parse_price("117,900") → 117900
strip_mark_tags() — removes <mark> highlight wrappers
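The helpers above could be implemented along the following lines. This is a string-based sketch, not the code in parser/utils.py: the handling of the 천 (thousand) and 억 (hundred million) suffixes and the regex-based `<mark>` removal are assumptions, and the real helpers may work on BeautifulSoup nodes instead.

```python
import re
from decimal import Decimal
from typing import Optional

# Korean magnitude suffixes: 천 = 1,000, 만 = 10,000, 억 = 100,000,000
_UNITS = {"천": 1_000, "만": 10_000, "억": 100_000_000}
_NUMBER = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s*([천만억]?)")


def parse_korean_int(text: str) -> Optional[int]:
    """Convert strings such as '2.8만' or '1,204' to an integer."""
    match = _NUMBER.search(text)
    if not match:
        return None
    # Decimal avoids float rounding surprises for values like 2.8 * 10000.
    number = Decimal(match.group(1).replace(",", ""))
    multiplier = _UNITS.get(match.group(2), 1)
    return int(number * multiplier)


def parse_price(text: str) -> Optional[int]:
    """Convert '117,900', '117,900원' or '2.8만' to an integer amount."""
    return parse_korean_int(text.replace("원", ""))


def strip_mark_tags(html: str) -> str:
    """Drop <mark> wrappers but keep the highlighted text."""
    return re.sub(r"</?mark\b[^>]*>", "", html, flags=re.IGNORECASE)
```

Returning `None` for unparseable input keeps a missing price distinguishable from a price of zero.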
# Default: parse full index.html, auto-split if needed
python main.py
# Parse from pre-split pieces (faster iteration)
python main.py --pieces
# Re-split then parse
python main.py --split
# Custom output path
python main.py --output results/shoes.json

Expected output (sample shoes snapshot):
Query: 'shoes'
Products: 22
Blog posts: 4
Videos: 22
Images: 7
Web results: 9
Dictionary: 2
Related keywords: 6
Saved JSON -> output.json

output.json structure:
{
  "meta": {
    "query": "shoes",
    "search_url": "https://search.naver.com/search.naver?query=shoes",
    "tab": "전체"
  },
  "products": [ ... ],
  "blog_posts": [ ... ],
  "videos": [ ... ],
  "images": [ ... ],
  "web_results": [ ... ],
  "dictionary": [ ... ],
  "related_keywords": [ ... ]
}

Quick sanity check:
python -c "
import json
d = json.load(open('output.json'))
assert d['meta']['query']
assert len(d['products']) > 0
print('OK:', {k: len(v) if isinstance(v, list) else v for k, v in d.items()})
"
When improving parsers without fetching again:
Read the relevant file in html_pieces/ (use manifest.json).
Edit the matching module under parser/.
Run python main.py --pieces.
Compare counts and spot-check output.json.
Repeat until all sections extract real values.
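Comparing counts between iterations is easier with a small diff than by eye. The sketch below assumes you keep a copy of the previous output.json before each run; `section_counts` and `diff_counts` are hypothetical helpers, not part of the project.

```python
import json
from pathlib import Path
from typing import Dict


def section_counts(path: Path) -> Dict[str, int]:
    """Count items in every list-valued top-level key of an output file."""
    data = json.loads(path.read_text(encoding="utf-8"))
    return {key: len(value) for key, value in data.items() if isinstance(value, list)}


def diff_counts(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Return after-minus-before for each section whose count changed."""
    changes: Dict[str, int] = {}
    for key in sorted(set(before) | set(after)):
        delta = after.get(key, 0) - before.get(key, 0)
        if delta != 0:
            changes[key] = delta
    return changes
```

An empty result means the edit changed no counts, so a spot-check of field values in output.json is the next thing to look at.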
#power_link_body ul.lst_type > li.lst
  .lnk_tit /* title */
  .link_desc /* description */
  a.site /* store name */
#shp_tre_root li.q86P3e7M
#shs_lis_root li.K70iQ12A
  span.Q9_4wzl0 /* price */
  span.rjwcz7fY /* discount % */
  span.mlLzqQ3t /* review score */
  a.iMhVFYLc /* store name */
[data-meta-ssuid="web"] .fds-web-doc-root
a[data-heatmap-target=".link"] span.sds-comps-text-type-headline1
[data-meta-area="ldc_btm"] mark
[data-meta-area="ldc_btm"] [data-audioid]
span.kTlmTmRATGOguYiV24_u /* Korean meaning */
.related_srch .lst_related_srch li.item a.keyword
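As an illustration of how these selectors translate into parser code, here is a minimal sketch for the Powerlink block using BeautifulSoup. The selectors are the ones listed above; `PowerlinkAd`, `parse_powerlink` and `text_of` are hypothetical names, and a dataclass stands in for the project's Pydantic models to keep the example short.

```python
from dataclasses import dataclass
from typing import List, Optional

from bs4 import BeautifulSoup


@dataclass
class PowerlinkAd:
    """Stand-in for the project's model; field names are assumptions."""

    title: Optional[str]
    description: Optional[str]
    store_name: Optional[str]


def text_of(node, selector: str) -> Optional[str]:
    """Return the stripped text of the first match, or None if absent."""
    found = node.select_one(selector)
    return found.get_text(" ", strip=True) if found else None


def parse_powerlink(html: str) -> List[PowerlinkAd]:
    soup = BeautifulSoup(html, "html.parser")
    ads = []
    for item in soup.select("#power_link_body ul.lst_type > li.lst"):
        ads.append(
            PowerlinkAd(
                title=text_of(item, ".lnk_tit"),
                description=text_of(item, ".link_desc"),
                store_name=text_of(item, "a.site"),
            )
        )
    return ads
```

Because `text_of` returns `None` for a missing element, a card without a description or store link still yields a record instead of raising.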
naver/
├── api.py # Browser HTML fetch (use once)
├── fetch_html.py # CLI: save index.html
├── split_html.py # CLI: index.html → html_pieces/
├── main.py # CLI: parse → output.json
├── constants.py # Paths, URLs, browser config
├── index.html # Saved snapshot (do not re-fetch)
├── html_pieces/ # Split DOM chunks + manifest.json
├── output.json # Parsed result
├── models/
│ ├── listing.py
│ ├── search_meta.py
│ └── browser.py
└── parser/
    ├── naver_parser.py
    ├── meta.py
    ├── powerlink.py
    ├── shopping.py
    ├── ugc.py
    ├── videos.py
    ├── web.py
    ├── images.py
    ├── dictionary.py
    ├── related.py
    └── utils.py

| Issue | Fix |
| --- | --- |
| Missing index.html | Run python fetch_html.py --query shoes once |
| Empty products | Check html_pieces/*_shopping*.html; Naver periodically changes its obfuscated class names, so update the selectors |
| Browser fails to start | Verify the Brave path in constants.py; port 9999 must be free |
| Dictionary empty | Ensure the ldc_btm piece exists; the word is in a mark element inside the title span |
| Duplicate videos | Deduplication runs on video_url; clip and inline videos may share URLs |
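The deduplication mentioned in the last row can be expressed as an order-preserving filter on a key. This is a generic sketch, not the project's implementation: `dedupe_by` is a hypothetical helper, and in the video case the key would be each item's `video_url`.

```python
from typing import Callable, Hashable, Iterable, List, TypeVar

T = TypeVar("T")


def dedupe_by(items: Iterable[T], key: Callable[[T], Hashable]) -> List[T]:
    """Keep the first item for each key value, preserving input order."""
    seen = set()
    unique: List[T] = []
    for item in items:
        marker = key(item)
        if marker in seen:
            continue  # a later duplicate (e.g. clip + inline with the same URL)
        seen.add(marker)
        unique.append(item)
    return unique
```

For example, `dedupe_by(videos, key=lambda v: v.video_url)` keeps whichever of the inline or clip entries was parsed first.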
Never fetch in the parser — only read index.html or html_pieces/.
Use api.py for all live fetches: a rendered browser session is less likely to be blocked as a bot than raw httpx requests.
Keep models in models/ — no inline dicts in parser code.
One module per DOM section — keeps diffs small and testable.
Prefer data-meta-* attributes over hashed CSS classes where available, since they survive redesigns better.
Add CLI --query to re-fetch + parse in one command (still one fetch per invocation).
Parse pagination / main2 sidebar sections (rsk_top, stX_cpT).
Export to CSV per section.
Unit tests using html_pieces/ fixtures (no browser required).
Tutorial complete. Run python main.py to produce output.json from the local Naver HTML snapshot.