Web Crawling

🛡️ Methodology Checklist

Lightweight crawl: wget --spider -r --no-parent [URL] (sends requests but saves nothing), or collect passively through the Burp proxy
Check /robots.txt and /sitemap.xml for hidden paths
Spider with Burp: keep passive crawling on while browsing, then run the Crawler for active discovery
Crawl with Hakrawler: echo [URL] | hakrawler -d 3
Extract all unique endpoints and parameters
Note forms, file uploads, API endpoints, and auth pages
Feed crawl results into active fuzzing (ffuf, gobuster)
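
The "extract all unique endpoints and parameters" step can be sketched as below. This is a minimal illustration, assuming the crawler output is a plain list of URLs, one per line; the function name `summarize_urls` and the sample URLs are hypothetical.

```python
from urllib.parse import urlsplit, parse_qsl


def summarize_urls(urls):
    """Return (sorted unique paths, sorted unique query parameter names)."""
    paths, params = set(), set()
    for raw in urls:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines in crawler output
        parts = urlsplit(raw)
        paths.add(parts.path or "/")
        # keep_blank_values so that "?debug=" still counts as a parameter
        for name, _ in parse_qsl(parts.query, keep_blank_values=True):
            params.add(name)
    return sorted(paths), sorted(params)


if __name__ == "__main__":
    # Hypothetical crawl output; replace with lines read from a saved file.
    sample = [
        "http://target.example/search?q=test&page=2",
        "http://target.example/search?q=other",
        "http://target.example/api/v1/users?id=7",
        "http://target.example/",
    ]
    paths, params = summarize_urls(sample)
    print(paths)   # ['/', '/api/v1/users', '/search']
    print(params)  # ['id', 'page', 'q']
```

The two lists can then be written out as wordlists for path and parameter fuzzing.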

🎯 Operational Context

Use when: Active recon phase. Spider the web application to map all endpoints, JS files, hidden parameters, and forms before running vulnerability scanners.
Think Dumber First: Run hakrawler or katana against the target before a Burp active scan. Crawlers find endpoints that wordlist brute-forcing cannot, especially in JS-heavy SPAs. Always save the output for ffuf parameter fuzzing.
Skip when: The application uses heavy bot protection (e.g. a Cloudflare challenge). Pull known historical URLs passively with gau instead.

⚡ Tactical Cheatsheet

Command	Tactical Outcome
`pip3 install scrapy`	Install Scrapy framework
`wget -O ReconSpider.zip https://academy.hackthebox.com/storage/modules/144/ReconSpider.v1.2.zip && unzip ReconSpider.zip`	Download ReconSpider
`python3 ReconSpider.py http://[DOMAIN]`	Crawl target — outputs `results.json`
`jq -r '.emails' results.json`	Extract found email addresses
`jq -r '.comments' results.json`	Extract HTML comments (dev notes, API keys)
`grep "reports" results.json`	Search crawl results for specific keywords
`curl http://[TARGET_IP]/robots.txt`	Check robots.txt manually
`curl -v https://[TARGET_IP]/.well-known/openid-configuration`	Check OIDC discovery endpoint
`curl http://[TARGET_IP]/.well-known/security.txt`	Check security contact file
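
The same `results.json` triage can be scripted when the output needs de-duplication. The sketch below assumes the file has top-level `emails` and `comments` keys holding lists, as the jq commands above imply; `load_findings` and the sample data are hypothetical.

```python
import json


def load_findings(text, keys=("emails", "comments")):
    """Return the de-duplicated values for each key in a results.json document."""
    data = json.loads(text)
    findings = {}
    for key in keys:
        values = data.get(key, [])
        # dict.fromkeys de-duplicates while preserving first-seen order
        findings[key] = list(dict.fromkeys(values))
    return findings


if __name__ == "__main__":
    # Inline sample standing in for the contents of results.json.
    sample = json.dumps({
        "emails": ["admin@target.example", "admin@target.example"],
        "comments": ["<!-- TODO: remove staging link -->"],
    })
    for key, values in load_findings(sample).items():
        print(key, values)
```

In practice the text would come from `open("results.json").read()`.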

🔬 Deep Dive & Workflow

Crawling vs. Fuzzing

Crawling: Follows existing links found on pages — maps known structure
Fuzzing: Guesses paths using wordlists — finds hidden/unlisted content
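
A small sketch of the difference, assuming a single page of HTML and a tiny wordlist (both hypothetical). The regex-based link extraction is deliberately crude and only meant to show where each candidate set comes from.

```python
import re
from urllib.parse import urljoin


def crawl_candidates(base_url, html):
    """URLs a crawler would learn from links present in the page."""
    hrefs = re.findall(r'href=["\']([^"\']+)["\']', html)
    return {urljoin(base_url, h) for h in hrefs}


def fuzz_candidates(base_url, wordlist):
    """URLs a fuzzer would guess from a wordlist, whether linked or not."""
    return {urljoin(base_url, w.strip().lstrip("/")) for w in wordlist if w.strip()}


if __name__ == "__main__":
    base = "http://target.example/"
    page = '<a href="/about">About</a> <a href="/login">Login</a>'
    words = ["admin", "backup", "login"]
    crawled = crawl_candidates(base, page)
    guessed = fuzz_candidates(base, words)
    # Paths only a wordlist would reach: not linked anywhere on the page.
    print(sorted(guessed - crawled))
```

The set difference is why the two techniques are run together: each covers paths the other cannot see.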

Crawling Strategies

Breadth-First (Wide): Explore all links at current depth before going deeper — good for quick site map
Depth-First (Deep): Follow one chain as far as possible before backtracking — good for reaching deep content
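
The two strategies differ only in which queued page is visited next. The sketch below runs both over a hypothetical in-memory link graph rather than a live site; `crawl_order` and the sample paths are illustrative.

```python
from collections import deque


def crawl_order(links, start, strategy="bfs"):
    """Return the order in which pages are visited for a given strategy."""
    frontier = deque([start])
    seen = set()
    order = []
    while frontier:
        # BFS takes the oldest queued page, DFS the most recently queued one.
        page = frontier.popleft() if strategy == "bfs" else frontier.pop()
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        children = links.get(page, [])
        if strategy == "dfs":
            # Reverse so the first link on the page is followed first.
            children = reversed(children)
        frontier.extend(c for c in children if c not in seen)
    return order


if __name__ == "__main__":
    site = {
        "/": ["/a", "/b"],
        "/a": ["/a/1"],
        "/b": ["/b/1"],
        "/a/1": ["/a/1/x"],
    }
    print(crawl_order(site, "/", "bfs"))  # ['/', '/a', '/b', '/a/1', '/b/1', '/a/1/x']
    print(crawl_order(site, "/", "dfs"))  # ['/', '/a', '/a/1', '/a/1/x', '/b', '/b/1']
```

Breadth-first reaches every top-level section early, while depth-first reaches `/a/1/x` before it has even looked at `/b`.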

High-Value Data to Extract

Data Type	Pentest Value
Links (internal/external)	Maps site structure, finds hidden subdomains in links
HTML Comments	Developer notes, TODOs, credentials, infrastructure hints
Metadata	Page authors, dates → valid usernames, software versions
Sensitive Files	`.bak`, `.old`, `.conf`, `web.config`, `access_log`
JS Files	API keys, endpoints, logic flaws
External Files	PDFs/docs → FOCA metadata analysis
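
Two of the rows above, HTML comments and sensitive files, can be pulled from a fetched page with the standard library alone. This is a minimal sketch: the class and function names are hypothetical, and the suffix list simply mirrors the table.

```python
from html.parser import HTMLParser

SENSITIVE_SUFFIXES = (".bak", ".old", ".conf")
SENSITIVE_NAMES = ("web.config", "access_log")


class PageHarvester(HTMLParser):
    """Collect HTML comments and href/src values from one page."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self.links = []

    def handle_comment(self, data):
        self.comments.append(data.strip())

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def is_sensitive(link):
    """True if the final path segment looks like a backup or config file."""
    filename = link.split("?", 1)[0].rstrip("/").rsplit("/", 1)[-1].lower()
    return filename.endswith(SENSITIVE_SUFFIXES) or filename in SENSITIVE_NAMES


def harvest(html):
    parser = PageHarvester()
    parser.feed(html)
    parser.close()
    return parser.comments, [link for link in parser.links if is_sensitive(link)]


if __name__ == "__main__":
    page = (
        "<!-- TODO: remove test creds -->"
        '<a href="/config.php.bak">old</a>'
        '<script src="/static/app.js"></script>'
    )
    print(harvest(page))  # (['TODO: remove test creds'], ['/config.php.bak'])
```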

robots.txt Intelligence

robots.txt tells well-behaved crawlers which paths to skip. It is not an access control, and the paths it lists often point straight at sensitive areas.

Disallow: /admin/ → target for manual access or dir brute-force
Disallow: /backup/ → potential backup files
Sitemap: → feed to the crawler for a complete URL list
Allow: inside a Disallow parent → reveals public sub-folders

Directives:

Directive	Pentest Value
`Disallow`	Hidden paths — immediate priority targets
`Sitemap`	Full URL list — feed directly to crawler
`Crawl-delay`	Non-standard pacing hint; a high value may indicate a rate-sensitive server
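
Pulling these directives out of a fetched robots.txt takes only a few lines. The sketch below is deliberately simplified: it ignores `User-agent` grouping and collects every rule in the file, which is usually what matters for target discovery. `parse_robots` and the sample file are hypothetical.

```python
def parse_robots(text):
    """Collect Disallow, Allow, Sitemap and Crawl-delay values from robots.txt."""
    rules = {"disallow": [], "allow": [], "sitemap": [], "crawl-delay": []}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        # Split on the first colon only, so sitemap URLs stay intact.
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field in rules and value:
            rules[field].append(value)
    return rules


if __name__ == "__main__":
    sample = """\
User-agent: *
Disallow: /admin/
Disallow: /backup/   # old dumps
Allow: /admin/public/
Crawl-delay: 10
Sitemap: https://target.example/sitemap.xml
"""
    for field, values in parse_robots(sample).items():
        print(field, values)
```

The `disallow` list can go straight into a manual review queue and the `sitemap` list into the crawler's seed URLs.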

.well-known URIs (RFC 8615)

Standardized directory at /.well-known/ for site-wide metadata, including security and authentication configuration.

URI	Value
`security.txt`	Security contact addresses, disclosure policy, bug bounty scope (RFC 9116)
`openid-configuration`	Critical — auth/token endpoints, JWKS URI, supported scopes
`change-password`	Redirects to the password change page; quick way to locate credential management functionality
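
Once an `openid-configuration` document has been fetched, the fields worth recording can be filtered out as sketched below. The field names are the standard OpenID Connect discovery keys; `summarize_oidc` and the sample document are hypothetical.

```python
import json

# Standard OpenID Connect discovery fields of most interest during recon.
INTERESTING = (
    "issuer",
    "authorization_endpoint",
    "token_endpoint",
    "userinfo_endpoint",
    "jwks_uri",
    "scopes_supported",
)


def summarize_oidc(text):
    """Return only the recon-relevant fields present in a discovery document."""
    doc = json.loads(text)
    return {key: doc[key] for key in INTERESTING if key in doc}


if __name__ == "__main__":
    # Hypothetical discovery document; a real one comes from the curl command.
    sample = json.dumps({
        "issuer": "https://sso.target.example",
        "token_endpoint": "https://sso.target.example/oauth/token",
        "jwks_uri": "https://sso.target.example/.well-known/jwks.json",
        "scopes_supported": ["openid", "profile", "email"],
        "claims_supported": ["sub", "email"],
    })
    for key, value in summarize_oidc(sample).items():
        print(f"{key}: {value}")
```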

Correlation Examples

Comment: “legacy file server” + Crawler finds /files/ → check for directory listing
Metadata author: “jsmith” + login page found → valid username for brute-force
Crawler finds config.php.bak → source code disclosure

CPTS Tip: Always keep a proxy-based crawler (ZAP Spider / Burp Spider) running in the background while manually browsing. It records every request you trigger and picks up links in the HTML that are never shown on the page.

🛠️ Troubleshooting & Edge Cases

Problem	Cause	Fix
katana finds no JS endpoints	SPA with dynamic loading	Use `-js-crawl` flag; also capture with Burp proxy while manually browsing
Crawl misses authenticated areas	Session not passed	Use `-H 'Cookie: session=...'` or `-config` with auth headers
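hakrawler hits rate limit	Too many concurrent requests	Reduce with `-t 5` (5 threads); hakrawler's documented flags do not appear to include a request delay, so if pacing is still needed switch to a crawler with rate limiting (e.g. katana `-rate-limit`)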
Crawl returns only root URL	robots.txt disallows all	Read `robots.txt` for disallowed paths — these are targets; crawl disallowed paths manually
JS file endpoints return 404	Relative paths or old JS	Resolve relative to base URL; check if JS file itself is from an older bundle
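
For the last row, relative endpoints pulled out of a JS file are normally resolved against the URL of the page that loads the script, not the script's own location. A minimal sketch with `urllib.parse.urljoin`, where `resolve_endpoints` and the sample paths are hypothetical:

```python
from urllib.parse import urljoin


def resolve_endpoints(page_url, endpoints):
    """Resolve endpoint strings found in JS against the loading page's URL."""
    return [urljoin(page_url, endpoint) for endpoint in endpoints]


if __name__ == "__main__":
    page = "http://target.example/app/index.html"
    found = ["api/users", "/api/users", "../v2/items", "https://cdn.example/x.js"]
    for url in resolve_endpoints(page, found):
        print(url)
    # http://target.example/app/api/users
    # http://target.example/api/users
    # http://target.example/v2/items
    # https://cdn.example/x.js
```

If the page sets a `<base href>`, that value should be used as the base instead of the page URL.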

📝 Reporting Trigger

Finding Title: Web Application Endpoint Enumeration via Active Crawling
Impact: Discovery of undocumented endpoints, hidden admin paths, and unauthenticated API routes that expand the exploitable attack surface beyond what is visible in the primary application flow.
Root Cause: The application lacks a comprehensive endpoint inventory. Development and debug endpoints were not removed from the production deployment.
Recommendation: Implement an API gateway with explicit route allowlisting. Remove debug/dev endpoints before production deployment. Run regular automated crawl audits as part of the SDLC.
