🛡️ Methodology Checklist

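  • Quick crawl: wget --spider -r --no-parent [URL] (this sends live requests), or Burp passive crawling for a fully passive option
  • Check /robots.txt and /sitemap.xml for hidden paths
  • Spider with Burp: keep passive crawling on through the proxy and run the Crawler for active discovery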
  • Crawl with Hakrawler: echo [URL] | hakrawler -d 3
  • Extract all unique endpoints and parameters
  • Note forms, file uploads, API endpoints, and auth pages
  • Feed crawl results into active fuzzing (ffuf, gobuster)
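The "extract all unique endpoints and parameters" step can be sketched in a few lines of Python. This is a minimal sketch, assuming the crawler output is a plain list of URLs (one per line, as hakrawler prints them); the function name summarize_endpoints and the example.test URLs are hypothetical.

```python
from urllib.parse import urlsplit, parse_qsl


def summarize_endpoints(urls):
    """Return (sorted unique endpoints, sorted unique parameter names)."""
    endpoints, params = set(), set()
    for raw in urls:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines in crawler output
        parts = urlsplit(raw)
        # Endpoint = scheme + host + path, with the query string removed.
        endpoints.add(f"{parts.scheme}://{parts.netloc}{parts.path or '/'}")
        # Keep parameter names even when their value is empty.
        for name, _value in parse_qsl(parts.query, keep_blank_values=True):
            params.add(name)
    return sorted(endpoints), sorted(params)


if __name__ == "__main__":
    # Hypothetical crawl output; replace with lines read from a saved file.
    sample = [
        "http://example.test/search?q=a&page=2",
        "http://example.test/search?q=b",
        "http://example.test/login",
    ]
    found_endpoints, found_params = summarize_endpoints(sample)
    print("\n".join(found_endpoints))
    print("\n".join(found_params))
```

The two lists map onto the next checklist step: endpoints become fuzzing targets, parameter names become a target-specific wordlist.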

🎯 Operational Context

Use when: Active recon phase. Spider the web application to map all endpoints, JS files, hidden parameters, and forms before running vulnerability scanners.

Think Dumber First: Run hakrawler or katana against the target before a Burp active scan. Crawlers find endpoints that wordlist brute-forcing cannot, especially in JS-heavy SPAs. Always save the output for ffuf parameter fuzzing.

Skip when: The application uses heavy bot protection (e.g. a Cloudflare challenge). Fall back to a passive crawl via gau instead.


⚡ Tactical Cheatsheet

Command | Tactical Outcome
pip3 install scrapy | Install Scrapy framework
wget -O ReconSpider.zip https://academy.hackthebox.com/storage/modules/144/ReconSpider.v1.2.zip && unzip ReconSpider.zip | Download ReconSpider
python3 ReconSpider.py http://[DOMAIN] | Crawl target; outputs results.json
jq -r '.emails' results.json | Extract found email addresses
jq -r '.comments' results.json | Extract HTML comments (dev notes, API keys)
grep "reports" results.json | Search crawl results for specific keywords
curl http://[TARGET_IP]/robots.txt | Check robots.txt manually
curl -v https://[TARGET_IP]/.well-known/openid-configuration | Check OIDC discovery endpoint
curl http://[TARGET_IP]/.well-known/security.txt | Check security contact file
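The jq and grep lookups above can be combined into one keyword search that also reports which key each hit came from. This is a minimal sketch, assuming results.json is a single JSON object whose values are lists of strings (the emails and comments keys come from the commands above; everything else, including the function name search_results, is hypothetical).

```python
import json


def search_results(results, keyword):
    """Return {key: [matching strings]} for every list-valued key."""
    keyword = keyword.lower()
    hits = {}
    for key, values in results.items():
        if not isinstance(values, list):
            continue  # ignore anything that is not a list of findings
        matched = [v for v in values
                   if isinstance(v, str) and keyword in v.lower()]
        if matched:
            hits[key] = matched
    return hits


if __name__ == "__main__":
    # Inline sample standing in for a real results.json; values are made up.
    sample = json.loads(
        '{"emails": ["admin@example.test"],'
        ' "comments": ["<!-- TODO: move reports dir -->"]}'
    )
    print(search_results(sample, "reports"))
```

For a real run, load the file with json.load(open("results.json")) and pass the result in place of the sample.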

🔬 Deep Dive & Workflow

Crawling vs. Fuzzing

  • Crawling: Follows existing links found on pages; maps known structure
  • Fuzzing: Guesses paths using wordlists; finds hidden/unlisted content

Crawling Strategies

  • Breadth-First (Wide): Explore all links at the current depth before going deeper; good for a quick site map
  • Depth-First (Deep): Follow one chain as far as possible before backtracking; good for reaching deep content
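The difference between the two strategies is only which end of the frontier the next page is taken from. The sketch below illustrates this on a toy link graph held in memory; the site layout and the function name crawl_order are hypothetical, and no HTTP requests are made.

```python
from collections import deque


def crawl_order(links, start, breadth_first=True):
    """Return the visit order; `links` maps a page to the pages it links to."""
    frontier = deque([start])
    seen = set()
    order = []
    while frontier:
        # Queue (oldest first) gives breadth-first; stack (newest first) gives depth-first.
        page = frontier.popleft() if breadth_first else frontier.pop()
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        children = links.get(page, [])
        # For depth-first, push in reverse so the first link is followed first.
        for child in (children if breadth_first else reversed(children)):
            if child not in seen:
                frontier.append(child)
    return order


if __name__ == "__main__":
    # Hypothetical site structure.
    site = {
        "/": ["/a", "/b"],
        "/a": ["/a/1"],
        "/b": ["/b/1"],
    }
    print(crawl_order(site, "/", breadth_first=True))
    print(crawl_order(site, "/", breadth_first=False))
```

Breadth-first lists both top-level sections before any sub-page, while depth-first exhausts /a before touching /b.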

High-Value Data to Extract

Data Type | Pentest Value
Links (internal/external) | Maps site structure, finds hidden subdomains in links
HTML Comments | Developer notes, TODOs, credentials, infrastructure hints
Metadata | Page authors, dates → valid usernames, software versions
Sensitive Files | .bak, .old, .conf, web.config, access_log
JS Files | API keys, endpoints, logic flaws
External Files | PDFs/docs → FOCA metadata analysis
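The Sensitive Files row lends itself to a quick filter over crawl output. This is a minimal sketch that only uses the file patterns listed in the table; the function name flag_sensitive and the sample URLs are hypothetical, and the pattern list should be extended per engagement.

```python
from urllib.parse import urlsplit

# Patterns taken from the "Sensitive Files" row above.
SENSITIVE_SUFFIXES = (".bak", ".old", ".conf")
SENSITIVE_NAMES = {"web.config", "access_log"}


def flag_sensitive(urls):
    """Return the URLs whose file name matches a sensitive pattern."""
    flagged = []
    for url in urls:
        # Last path segment, ignoring query string and letter case.
        name = urlsplit(url).path.rsplit("/", 1)[-1].lower()
        if name.endswith(SENSITIVE_SUFFIXES) or name in SENSITIVE_NAMES:
            flagged.append(url)
    return flagged


if __name__ == "__main__":
    crawled = [
        "http://example.test/config.php.bak",
        "http://example.test/index.html",
        "http://example.test/logs/access_log",
    ]
    for hit in flag_sensitive(crawled):
        print(hit)
```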

robots.txt Intelligence

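robots.txt is advisory: it asks well-behaved crawlers such as Googlebot not to fetch certain paths, and it enforces nothing. The paths an owner wants kept out of search results often point straight at sensitive areas.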

  • Disallow: /admin/ → target for manual access or dir brute-force
  • Disallow: /backup/ → potential backup files
  • Sitemap: → feed to the crawler for a complete URL list
  • Allow: inside a Disallow parent → reveals public sub-folders

Directives:

Directive | Pentest Value
Disallow | Hidden paths; immediate priority targets
Sitemap | Full URL list; feed directly to crawler
Crawl-delay | Possible sign of a load-sensitive server or rate limiting; throttle the crawl accordingly
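Pulling these directives out of a downloaded robots.txt is a short parsing job. This is a minimal sketch, assuming the file body has already been fetched (for example with the curl command from the cheatsheet) and is passed in as text; the function name parse_robots is hypothetical, and User-agent grouping is deliberately ignored because every listed path is of interest during recon.

```python
def parse_robots(text):
    """Collect Disallow/Allow paths, Sitemap URLs, and Crawl-delay values."""
    found = {"disallow": [], "allow": [], "sitemap": [], "crawl-delay": []}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        # Split on the first colon only, so Sitemap URLs stay intact.
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        # Skip empty values (e.g. a bare "Disallow:") and duplicates.
        if field in found and value and value not in found[field]:
            found[field].append(value)
    return found


if __name__ == "__main__":
    sample = (
        "User-agent: *\n"
        "Disallow: /admin/\n"
        "Allow: /admin/public/\n"
        "Sitemap: http://example.test/sitemap.xml\n"
    )
    for field, values in parse_robots(sample).items():
        print(field, values)
```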

.well-known URIs (RFC 8615)

Standardized directory at /.well-known/ for site-wide metadata, including security-relevant files.

URI | Value
security.txt | Security contact emails, disclosure policy, bug bounty scope
openid-configuration | Critical: auth/token endpoints, JWKS URI, supported scopes
change-password | Redirects to the password change page; quick way to locate password management functionality
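The openid-configuration response is plain JSON, so the high-value fields can be pulled out directly. This is a minimal sketch, assuming the response body was saved from the curl command in the cheatsheet; the key names are the standard OpenID Connect discovery metadata fields, while the function name summarize_oidc and the sample values are hypothetical.

```python
import json

# Standard OIDC discovery fields that matter most during recon.
INTERESTING_KEYS = [
    "issuer",
    "authorization_endpoint",
    "token_endpoint",
    "userinfo_endpoint",
    "jwks_uri",
    "scopes_supported",
]


def summarize_oidc(document):
    """Return only the recon-relevant fields from a discovery document."""
    config = json.loads(document)
    return {key: config[key] for key in INTERESTING_KEYS if key in config}


if __name__ == "__main__":
    # Made-up discovery document for illustration.
    sample = json.dumps({
        "issuer": "https://example.test",
        "token_endpoint": "https://example.test/oauth/token",
        "jwks_uri": "https://example.test/jwks.json",
        "scopes_supported": ["openid", "profile"],
    })
    for key, value in summarize_oidc(sample).items():
        print(f"{key}: {value}")
```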

Correlation Examples

  • Comment: “legacy file server” + Crawler finds /files/ → check for directory listing
  • Metadata author: “jsmith” + login page found → valid username for brute-force
  • Crawler finds config.php.bak → source code disclosure

CPTS Tip: Always run a passive crawler (ZAP Spider / Burp Spider) in the background while manually browsing. It records every request you click through, plus links present in the HTML that you never visit.


🛠️ Troubleshooting & Edge Cases

Problem | Cause | Fix
katana finds no JS endpoints | SPA with dynamic loading | Use the -js-crawl flag; also capture with Burp proxy while manually browsing
Crawl misses authenticated areas | Session not passed | Use -H 'Cookie: session=...' or -config with auth headers
hakrawler hits rate limit | Too many concurrent requests | Reduce concurrency with -t 5 (5 threads); if that is not enough, confirm which throttling options your version supports in its help output, or switch to katana, which provides -delay (in seconds) and -rate-limit
Crawl returns only root URL | robots.txt disallows all | Read robots.txt for the disallowed paths (these are targets) and crawl them manually
JS file endpoints return 404 | Relative paths or old JS | Resolve paths relative to the base URL; check whether the JS file itself comes from an older bundle
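For the last row, resolving relative paths by hand is error-prone; urllib.parse.urljoin applies the same rules a browser uses. This is a minimal sketch, assuming the paths were already extracted from a JS file and that the page which loaded the script is the correct base; the function name resolve_endpoints and the URLs are hypothetical.

```python
from urllib.parse import urljoin


def resolve_endpoints(base_url, paths):
    """Resolve paths found in JS against the page that loaded the script."""
    return [urljoin(base_url, path) for path in paths]


if __name__ == "__main__":
    page = "http://example.test/app/index.html"
    extracted = ["api/users", "/api/v1/login", "../config.json"]
    for url in resolve_endpoints(page, extracted):
        print(url)
```

If the resolved URLs still return 404, the second cause in the table (a stale bundle) becomes the more likely explanation.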

📝 Reporting Trigger

Finding Title: Web Application Endpoint Enumeration via Active Crawling

Impact: Discovery of undocumented endpoints, hidden admin paths, and unauthenticated API routes that expand the exploitable attack surface beyond what is visible in the primary application flow.

Root Cause: The application lacks a comprehensive endpoint inventory. Development and debug endpoints were not removed from the production deployment.

Recommendation: Implement an API gateway with explicit route allowlisting. Remove debug/dev endpoints before production deployment. Run regular automated crawl audits as part of the SDLC.