🛡️ Methodology Checklist
- Passive crawl: `wget --spider -r --no-parent [URL]` or Burp passive
- Check `/robots.txt` and `/sitemap.xml` for hidden paths
- Spider with Burp: enable passive spidering + Crawler for active
- Crawl with Hakrawler: `echo [URL] | hakrawler -d 3`
- Extract all unique endpoints and parameters
- Note forms, file uploads, API endpoints, and auth pages
- Feed crawl results into active fuzzing (ffuf, gobuster)
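
The "extract all unique endpoints and parameters" step can be sketched in a few lines of Python. This is a minimal illustration, not part of any crawler: `summarize_crawl` and the sample URLs are hypothetical, and the input is assumed to be one absolute URL per line, as most crawlers emit.

```python
from urllib.parse import urlsplit, parse_qsl


def summarize_crawl(urls):
    """Return (sorted unique paths, sorted unique query parameter names)."""
    paths, params = set(), set()
    for raw in urls:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines in crawler output
        parts = urlsplit(raw)
        paths.add(parts.path or "/")
        # keep_blank_values so "?debug=" still counts as a parameter
        for name, _ in parse_qsl(parts.query, keep_blank_values=True):
            params.add(name)
    return sorted(paths), sorted(params)


# Hypothetical crawler output, one URL per line
sample = [
    "http://target.example/index.php?id=1",
    "http://target.example/index.php?id=2&debug=",
    "http://target.example/api/v1/users?page=3",
    "http://target.example/",
]
paths, params = summarize_crawl(sample)
print(paths)
print(params)
```

The two lists map onto the last checklist item: paths become candidates for directory fuzzing, parameter names become candidates for parameter fuzzing.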
🎯 Operational Context
Use when: Active recon phase. Spider the web application to map all endpoints, JS files, hidden parameters, and forms before running vulnerability scanners.
Think Dumber First: Run `hakrawler` or `katana` against the target before a Burp active scan. Crawlers find endpoints that wordlist brute-forcing cannot, especially in JS-rendered SPAs. Always save the output for ffuf parameter fuzzing.
Skip when: The application sits behind heavy bot protection (e.g. a Cloudflare challenge). Fall back to passive URL collection with `gau` instead.
⚡ Tactical Cheatsheet
| Command | Tactical Outcome |
|---|---|
| `pip3 install scrapy` | Install the Scrapy framework |
| `wget -O ReconSpider.zip https://academy.hackthebox.com/storage/modules/144/ReconSpider.v1.2.zip && unzip ReconSpider.zip` | Download and unpack ReconSpider |
| `python3 ReconSpider.py http://[DOMAIN]` | Crawl target → writes `results.json` |
| `jq -r '.emails' results.json` | Extract found email addresses |
| `jq -r '.comments' results.json` | Extract HTML comments (dev notes, API keys) |
| `grep "reports" results.json` | Search crawl results for specific keywords |
| `curl http://[TARGET_IP]/robots.txt` | Check robots.txt manually |
| `curl -v https://[TARGET_IP]/.well-known/openid-configuration` | Check OIDC discovery endpoint |
| `curl http://[TARGET_IP]/.well-known/security.txt` | Check security contact file |
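
When `jq` is not available, the same extraction can be approximated in Python. The sketch below assumes `results.json` is a single JSON object whose `emails` and `comments` keys hold lists of strings (the two keys queried in the table above). `load_findings` and the sample data are hypothetical.

```python
import json


def load_findings(raw_json, keys=("emails", "comments")):
    """Return {key: de-duplicated sorted list} for the requested top-level keys."""
    data = json.loads(raw_json)
    findings = {}
    for key in keys:
        values = data.get(key, [])  # a missing key yields an empty list
        findings[key] = sorted(set(values))
    return findings


# Hypothetical file contents; in practice: raw = open("results.json").read()
raw = json.dumps({
    "emails": ["admin@target.example", "admin@target.example"],
    "comments": ["<!-- TODO: remove /dev-api before launch -->"],
})
findings = load_findings(raw)
for key, values in findings.items():
    print(key, values)
```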
🔬 Deep Dive & Workflow
Crawling vs. Fuzzing
- Crawling: Follows existing links found on pages → maps known structure
- Fuzzing: Guesses paths using wordlists → finds hidden/unlisted content
Crawling Strategies
- Breadth-First (Wide): Explore all links at the current depth before going deeper. Good for a quick site map
- Depth-First (Deep): Follow one chain as far as possible before backtracking. Good for reaching deep content
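
The difference between the two strategies comes down to how the crawl frontier is consumed. A minimal sketch over an in-memory link graph follows; `LINKS` is a made-up site and `crawl` is an illustrative helper, not any tool's actual implementation.

```python
from collections import deque

# Hypothetical link graph: page -> links found on that page
LINKS = {
    "/": ["/about", "/products"],
    "/about": ["/about/team"],
    "/products": ["/products/1"],
    "/about/team": [],
    "/products/1": [],
}


def crawl(start, links, breadth_first=True):
    """Return pages in visit order; only the frontier discipline differs."""
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier:
        # BFS takes the oldest queued page, DFS the newest
        page = frontier.popleft() if breadth_first else frontier.pop()
        order.append(page)
        children = links.get(page, [])
        # Reverse for DFS so the first link on a page is followed first
        for link in (children if breadth_first else reversed(children)):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


print(crawl("/", LINKS, breadth_first=True))
print(crawl("/", LINKS, breadth_first=False))
```

Breadth-first visits both top-level sections before any child page, while depth-first finishes the `/about` chain before touching `/products`.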
High-Value Data to Extract
| Data Type | Pentest Value |
|---|---|
| Links (internal/external) | Maps site structure, finds hidden subdomains in links |
| HTML Comments | Developer notes, TODOs, credentials, infrastructure hints |
| Metadata | Page authors, dates → valid usernames, software versions |
| Sensitive Files | `.bak`, `.old`, `.conf`, `web.config`, `access_log` |
| JS Files | API keys, endpoints, logic flaws |
| External Files | PDFs/docs → FOCA metadata analysis |
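
As one example of pulling a data type from the table out of raw page source, HTML comments can be collected with Python's standard-library parser. This is a sketch only: `CommentCollector`, `extract_comments`, and the sample page are hypothetical names and data.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collect the text of every HTML comment in a page."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # data is the text between "<!--" and "-->"
        self.comments.append(data.strip())


def extract_comments(html):
    collector = CommentCollector()
    collector.feed(html)
    collector.close()
    return collector.comments


# Hypothetical page source
page = """
<html><body>
<!-- TODO: move uploads off the legacy file server -->
<p>Welcome</p>
<!-- build 2.4.1 -->
</body></html>
"""
comments = extract_comments(page)
print(comments)
```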
robots.txt Intelligence
`robots.txt` tells well-behaved crawlers which paths not to crawl. It is advisory only, not access control, so its entries often point directly at sensitive areas.
- `Disallow: /admin/` → target for manual access or dir brute-force
- `Disallow: /backup/` → potential backup files
- `Sitemap:` → feed to the crawler for a complete URL list
- `Allow:` inside a `Disallow` parent → reveals public sub-folders
Directives:
| Directive | Pentest Value |
|---|---|
| `Disallow` | Hidden paths → immediate priority targets |
| `Sitemap` | Full URL list → feed directly to crawler |
| `Crawl-delay` | Possible hint that the server is sensitive to load; throttle scans accordingly |
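
The directives above are simple `name: value` lines, so a fetched `robots.txt` can be triaged with a short script. The sketch below deliberately ignores `User-agent` grouping and just buckets values by directive. `parse_robots` and the sample body are hypothetical.

```python
def parse_robots(text):
    """Group robots.txt values by directive name (lowercased)."""
    found = {"disallow": [], "allow": [], "sitemap": [], "crawl-delay": []}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        name, value = line.split(":", 1)  # split once so URLs keep their colons
        name, value = name.strip().lower(), value.strip()
        if name in found and value:
            found[name].append(value)
    return found


# Hypothetical robots.txt body
sample = """\
User-agent: *
Disallow: /admin/
Disallow: /backup/   # old dumps
Allow: /admin/public/
Crawl-delay: 10
Sitemap: https://target.example/sitemap.xml
"""
robots = parse_robots(sample)
print(robots["disallow"])
print(robots["sitemap"])
```

The `disallow` list feeds manual checks and directory brute-forcing; the `sitemap` list goes to the crawler as seed URLs.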
.well-known URIs (RFC 8615)
Standardized path prefix at `/.well-known/` for site-wide metadata, including security contacts and authentication configuration.
| URI | Value |
|---|---|
| `security.txt` | Security contact emails, disclosure policy and bug bounty scope |
| `openid-configuration` | Critical: auth/token endpoints, JWKS URI, supported scopes |
| `change-password` | Quick pointer to the password change functionality |
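
The `openid-configuration` response is a JSON document, so the high-value fields can be filtered out programmatically. In this sketch the field names are the standard OpenID Connect Discovery metadata names; `summarize_oidc` and the sample response are hypothetical.

```python
import json

# Field names defined by OpenID Connect Discovery
INTERESTING = (
    "issuer",
    "authorization_endpoint",
    "token_endpoint",
    "userinfo_endpoint",
    "jwks_uri",
    "scopes_supported",
)


def summarize_oidc(raw_json):
    """Return only the discovery fields useful for mapping the auth flow."""
    config = json.loads(raw_json)
    return {key: config[key] for key in INTERESTING if key in config}


# Hypothetical response body from /.well-known/openid-configuration
raw = json.dumps({
    "issuer": "https://target.example",
    "authorization_endpoint": "https://target.example/oauth2/authorize",
    "token_endpoint": "https://target.example/oauth2/token",
    "jwks_uri": "https://target.example/oauth2/jwks.json",
    "scopes_supported": ["openid", "profile", "email"],
    "response_types_supported": ["code"],
})
summary = summarize_oidc(raw)
for key, value in summary.items():
    print(f"{key}: {value}")
```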
Correlation Examples
- Comment: "legacy file server" + crawler finds `/files/` → check for directory listing
- Metadata author: "jsmith" + login page found → valid username for brute-force
- Crawler finds `config.php.bak` → source code disclosure
CPTS Tip: Always keep a passive crawler (ZAP Spider / Burp Spider) running in the background while browsing manually. It records every request you trigger and also picks up links in the HTML that you never clicked.
🛠️ Troubleshooting & Edge Cases
| Problem | Cause | Fix |
|---|---|---|
| katana finds no JS endpoints | SPA with dynamic loading | Use the `-js-crawl` flag; also capture traffic with the Burp proxy while manually browsing |
| Crawl misses authenticated areas | Session not passed | Use `-H 'Cookie: session=...'` or `-config` with auth headers |
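| hakrawler hits rate limit | Too many concurrent requests | Reduce concurrency with `-t 5` (5 threads). For a fixed pause between requests, confirm the option exists in `hakrawler -help` first; katana offers `-delay` (seconds) and `-rate-limit` as an alternative |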
| Crawl returns only root URL | robots.txt disallows all | Read robots.txt for disallowed paths; these are targets. Crawl the disallowed paths manually |
| JS file endpoints return 404 | Relative paths or old JS | Resolve relative to the base URL; check whether the JS file itself comes from an older bundle |
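
For the last row, it helps to resolve each extracted path against more than one candidate base, because a relative URL passed to `fetch()` resolves against the page that loaded the script, not against the script's own location. A minimal sketch with hypothetical URLs and a hypothetical helper `resolve_endpoints`:

```python
from urllib.parse import urljoin


def resolve_endpoints(base_url, found_paths):
    """Resolve each path found in a JS file against the chosen base URL."""
    return [urljoin(base_url, path) for path in found_paths]


# Hypothetical values: a bundle location, the page that loads it,
# and path strings pulled out of the bundle
js_url = "https://target.example/static/js/app.js"
page_url = "https://target.example/dashboard/"
paths = ["api/users", "/api/v2/orders", "../export"]

print(resolve_endpoints(js_url, paths))
print(resolve_endpoints(page_url, paths))
```

Root-relative paths such as `/api/v2/orders` resolve identically either way. Only the document-relative ones change with the base, and those are usually the ones returning 404.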
📋 Reporting Trigger
- Finding Title: Web Application Endpoint Enumeration via Active Crawling
- Impact: Discovery of undocumented endpoints, hidden admin paths, and unauthenticated API routes that expand the exploitable attack surface beyond what is visible in the primary application flow.
- Root Cause: The application lacks a comprehensive endpoint inventory. Development and debug endpoints were not removed from the production deployment.
- Recommendation: Implement an API gateway with explicit route allowlisting. Remove debug/dev endpoints before production deployment. Run regular automated crawl audits as part of the SDLC.