Hunting for Truffles

So recently, a friend of mine on Discord forced me to join a random server for web devs. I've had experiences where these 'web dev' servers are actually filled with people who have done only one project in their entire life (a portfolio, built with Hugo and hosted on Vercel) and then start to brag about how much they know about the web. I still joined because of that guy's insistence.

And one day, a guy came in, said he was creating a personal website and posted the address. At the time I was writing a simple CLI tool in Node.js for extracting OSINT on domains, what a coincidence. So to flex, I used the tool to say what the person was hosting their site on, their DNS service and so on. But instead of saying anything like 'cool bro' or 'that's impressive', the person just said, 'due to security reasons i'm not interested in saying where i'm hosting my website...'. C'mon, I don't know if this person knows what they're doing, but this is basic, openly available information, and if they don't know what info is private and what is not, they are really gonna mess up.

So I challenged them: their entire infrastructure is insecure and they don't know what they're doing, because with my tool I can extract most of what their infra looks like and how many vulns are in it. At that moment my tool could only extract basic stuff, so this was now a prestige issue, and I was looking to make it as sophisticated as possible.

So at first it was a single-file CLI script which took a target URL, sent an HTTP request, and output the response text. It used built-in Node modules like node:http and node:https to perform basic GET requests and readline to handle the user input sequentially, then just did a console.log of the response from the target server. And that was the MVP.

Next came the arrays of ports to be scanned, the RegEx sigs for secrets hunting, all the browser profiles, and a collection of tech stack sigs for a feature that identifies the technologies used on the target server. I knew it was a bit (read: a fucking lot) cluttered, but I didn't really care. After this I just stopped working on it for some time and went back to a different project, a chat/communication system I was developing called LRIS (Live Relayed Interaction System).

And after 3 days of inactivity, I returned. Now my motivation was not to show that person that their infra is screwed, but to make a really sophisticated and well-engineered reconnaissance tool. So I coded subdomain enumeration: the tool takes the domain it is given and combs through HackerTarget, crt.sh and the Wayback Machine for every subdomain of that domain. I was nervous about the accuracy of these zero-cost, sign-up-free services compared to something more dedicated like Shodan or Censys, but when I tested it, the speed and accuracy genuinely surprised me; it was almost always accurate and worked well. The only wrinkle is that crt.sh, one of the most comprehensive certificate search services, is annoyingly rate-limited. Not blaming them: if a service is completely free and this good, you really have to rate-limit to stop abuse.
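To give an idea of the crt.sh part, here is a minimal sketch, not TrufflePigg's actual code. It assumes the JSON output of crt.sh, where each certificate entry carries a name_value field with newline-separated hostnames. The helper name extractSubdomains and the sample entries are made up for illustration.

```javascript
// Hypothetical helper: turn crt.sh-style JSON entries into a sorted list of
// unique subdomains. Assumes each entry has a `name_value` string that may
// hold several newline-separated hostnames (including wildcards).
function extractSubdomains(entries, domain) {
  const root = domain.toLowerCase();
  const found = new Set();

  for (const entry of entries) {
    if (!entry || typeof entry.name_value !== 'string') continue;

    for (const rawName of entry.name_value.split('\n')) {
      // Wildcard certificates ("*.dev.example.com") still reveal the parent name
      const name = rawName.trim().toLowerCase().replace(/^\*\./, '');
      if (name === root || name.endsWith('.' + root)) {
        found.add(name);
      }
    }
  }
  return [...found].sort();
}

// Sample data shaped like a crt.sh response (invented for this example)
const sampleEntries = [
  { name_value: 'www.example.com\nmail.example.com' },
  { name_value: '*.dev.example.com' },
  { name_value: 'WWW.example.com' },
  { name_value: 'notexample.com' }
];

console.log(extractSubdomains(sampleEntries, 'example.com'));
```

The Set takes care of duplicates across certificates, and the suffix check with a leading dot keeps look-alike names such as notexample.com out of the results.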

Then came the TCP/UDP scanning. By now I had given the tool two modes for versatility, a web mode and a non-web mode. Web mode, as the name suggests, is for web-related scanning, while non-web mode is for stuff like mail servers (forgot to say: the tool can find out whether a given domain has a connected mail server and what its address is), Redis and other database servers. The TCP/UDP scanning was the hardest part by far, up to that point. Instead of something like ping or HTTP libraries, I used the native Node.js modules node:net and node:dgram. It worked well, but WITH A LOT of false positives. So to cut those down, with some work, I built a kinda elaborate system of sig-based validation, stateful timing metrics, structural stability testing and heuristic pattern checks.

Standard port scanners flag a port as "open" if the initial TCP three-way handshake (SYN -> SYN-ACK -> ACK) completes. However, in modern networks, firewalls, load balancers, or software proxies often accept connections on any port (acting as a "catch-all" or "port spoofing" setup), leading to inaccurate results. TrufflePigg counters this by immediately transmitting protocol-compliant binary structures once a connection is established. A real service will process the packet and return a valid signature, whereas a dead socket or a basic firewall proxy will typically time out or drop the connection. In sniffer.js, I constructed Buffer objects corresponding to database protocols:

// From sniffer.js
const net = require('node:net');

const DB_PROBES = {
  // PostgreSQL SSLRequest packet (length 8, request code 80877103)
  5432: Buffer.from([0x00, 0x00, 0x00, 0x08, 0x04, 0xd2, 0x16, 0x2f]),
  27017: Buffer.from([ // MongoDB IsMaster OpCode Packet
    0x3f, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xd4, 0x07, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x61, 0x64, 0x6d, 0x69, 0x6e, 0x2e, 0x24, 0x63, 0x6d, 0x64, 0x00, 0x00,
    0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff, 0x13, 0x00
    // ... complete payload buffer
  ])
};

function probeDatabase(host, port, binaryProbePayload) {
  return new Promise((resolve) => {
    const socket = new net.Socket();
    let responseBuffer = Buffer.alloc(0);
    let connected = false;

    socket.setTimeout(4000);
    // setTimeout only emits 'timeout'; the socket has to be destroyed manually
    socket.on('timeout', () => socket.destroy());

    socket.connect(port, host, () => {
      connected = true;
      // Actively transmit the protocol buffer to prompt a state change
      socket.write(binaryProbePayload);
    });

    socket.on('data', (chunk) => {
      responseBuffer = Buffer.concat([responseBuffer, chunk]);
      // If we grab enough signature content, drop early to save network cycles
      if (responseBuffer.length > 256) socket.destroy();
    });

    socket.on('close', () => {
      if (responseBuffer.length > 0) {
        // Keep only printable ASCII characters to clean the banner
        const readableBanner = responseBuffer.toString('utf8').replace(/[^\x20-\x7E]/g, '');
        resolve({ open: true, banner: readableBanner });
      } else if (connected) {
        // Handshake completed but the service stayed silent
        resolve({ open: true, banner: 'Empty signature context' });
      } else {
        // Timed out before the handshake finished, so the port is not open
        resolve({ open: false, banner: null });
      }
    });

    socket.on('error', () => resolve({ open: false, banner: null }));
  });
}
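Grabbing a banner is only half of the idea; the reply still has to be checked against what the real protocol would send. Here is a rough sketch of that validation step, not the tool's actual code. The names SIGNATURE_CHECKS and looksLikeRealService are hypothetical, and the checks assume that a PostgreSQL server answers an SSLRequest with a single 'S' or 'N' byte and that a MongoDB server answers a legacy OP_QUERY with an OP_REPLY message (opCode 1).

```javascript
// Hypothetical validators keyed by port. Each one checks the raw reply bytes
// against what the real protocol is expected to send back.
const SIGNATURE_CHECKS = {
  // PostgreSQL answers an SSLRequest with one byte: 'S' (TLS ok) or 'N' (no TLS)
  5432: (buf) => buf.length === 1 && (buf[0] === 0x53 || buf[0] === 0x4e),

  // MongoDB wire messages start with a 16-byte header: the first int32 is the
  // total message length, the little-endian int32 at offset 12 is the opCode
  // (1 = OP_REPLY)
  27017: (buf) =>
    buf.length >= 16 &&
    buf.readInt32LE(0) >= 16 &&
    buf.readInt32LE(12) === 1
};

function looksLikeRealService(port, responseBuffer) {
  const check = SIGNATURE_CHECKS[port];
  // No validator for this port (or no bytes at all): nothing can be confirmed
  if (!check || !Buffer.isBuffer(responseBuffer)) return false;
  return check(responseBuffer);
}

// A catch-all proxy that answers every port with an HTTP error fails the check
console.log(looksLikeRealService(5432, Buffer.from('HTTP/1.1 400 Bad Request')));
console.log(looksLikeRealService(5432, Buffer.from('N')));
```

With this in place, a port only counts as a verified service when the handshake completes and the reply has the right shape, which is what filters out the catch-all setups.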

Unlike TCP, UDP is entirely connectionless. There is no handshake mechanism to verify if a port exists or is active. If you send a UDP packet to an open port, the service may remain silent unless given highly structured instructions. If the port is closed, the host machine's OS kernel generally issues an ICMP Destination Unreachable packet back to the sender. TrufflePigg handles UDP checking by binding an active UDP socket client via dgram.createSocket('udp4'), sending specialized protocol blocks, and closely monitoring the loop for low-level socket system errors like ECONNREFUSED (which indicates the kernel received an ICMP unreachable message).

// Abstracted representation of UDP parsing from sniffer.js
const dgram = require('node:dgram');

function probeUdpPort(host, port, payloadBuffer, timeout = 3000) {
  return new Promise((resolve) => {
    const client = dgram.createSocket('udp4');
    let completed = false;
    let timer = null;

    // Settle exactly once: stop the timer and close the socket a single time
    const finish = (result) => {
      if (completed) return;
      completed = true;
      clearTimeout(timer);
      client.close();
      resolve(result);
    };

    // Capture standard data responses from the service
    client.on('message', (msg) => finish({ open: true, responded: true, data: msg }));

    // Capture ICMP unreachable signals surfaced by Node's internal libuv translation.
    // ECONNREFUSED implies the host is active but definitively rejects the service port
    client.on('error', (err) => finish({ open: false, responded: false, reason: err.code }));

    // If no response and no ICMP error, it could be open/filtered (typical UDP state)
    timer = setTimeout(() => {
      finish({ open: true, responded: false, reason: 'Timeout (No ICMP Refusal)' });
    }, timeout);

    client.send(payloadBuffer, 0, payloadBuffer.length, port, host, (err) => {
      if (err) finish({ open: false, error: err.message });
    });
  });
}
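As an example of what a "specialized protocol block" for UDP can look like, here is a sketch that builds a minimal DNS query, the kind of payload that makes a DNS server on port 53 actually answer. This is my illustration of the idea, not a payload copied from data.js; buildDnsQuery and isDnsReplyFor are hypothetical names.

```javascript
// Build a minimal DNS query (A record, recursion desired) to use as a UDP probe.
function buildDnsQuery(hostname, id = 0x1234) {
  const header = Buffer.alloc(12);
  header.writeUInt16BE(id, 0);     // transaction ID
  header.writeUInt16BE(0x0100, 2); // flags: standard query, recursion desired
  header.writeUInt16BE(1, 4);      // QDCOUNT: one question

  // QNAME: every label is prefixed with its length
  const labels = hostname.split('.').filter(Boolean).map((label) => {
    const bytes = Buffer.from(label, 'ascii');
    if (bytes.length > 63) throw new RangeError('DNS label longer than 63 bytes');
    return Buffer.concat([Buffer.from([bytes.length]), bytes]);
  });

  // Zero byte ends the name, then QTYPE = A (1) and QCLASS = IN (1)
  const tail = Buffer.from([0x00, 0x00, 0x01, 0x00, 0x01]);
  return Buffer.concat([header, ...labels, tail]);
}

// Only trust a reply that echoes our transaction ID and has the QR (response) bit set
function isDnsReplyFor(query, reply) {
  return reply.length >= 12 &&
    reply.readUInt16BE(0) === query.readUInt16BE(0) &&
    (reply[2] & 0x80) !== 0;
}

const probe = buildDnsQuery('example.com');
console.log(probe.length, probe.toString('hex'));
```

The resulting buffer can be passed as payloadBuffer to a UDP probe, and checking the echoed transaction ID gives a positive "open" signal instead of relying on silence.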

Then came the heuristics for detecting honeypot systems. The idea is to flag impossible network conditions, such as a single host having an extraordinarily high number of open ports or repeating the exact same service banner across multiple unrelated ports, and to explicitly match known honeypot software (like Cowrie or Dionaea).

// From scanner.js
function analyzeForHoneypot(openPorts, collectedBanners) {
  const anomalies = [];
  const bannerCounts = {};

  // 1. Density Filter: A normal host rarely exposes dozens of unique databases/infra streams
  if (openPorts.length > 12) {
    anomalies.push(`High Port Density Detection: ${openPorts.length} ports open.`);
  }

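  // 2. Structural Duplication Filter:
  // Detects if completely different services return identical raw string data.
  // Placeholder banners for silent ports are skipped so they don't count as
  // duplicates; this list has to match the strings the probes actually emit.
  const placeholders = new Set([
    'Connection accepted but payload returned empty string',
    'Empty signature context'
  ]);
  for (const banner of Object.values(collectedBanners)) {
    if (banner && !placeholders.has(banner)) {
      bannerCounts[banner] = (bannerCounts[banner] || 0) + 1;
    }
  }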

  for (const [banner, count] of Object.entries(bannerCounts)) {
    if (count > 3) {
      anomalies.push(`Honeypot Flag: Same signature data repeated across ${count} ports.`);
    }
  }

  return {
    isHoneypot: anomalies.length > 0,
    reasons: anomalies
  };
}

Now it was time for the part which took the longest and was the hardest of all, which I call the web application intelligence engine (long name ig). Rightfully, this is my favorite and the most engineered part of the tool. The first piece is tech stack fingerprinting, a full-fledged process for detecting the underlying Content Management System (CMS), frameworks and web server; it is an evolution of the old hoster look-up. Rather than relying on a single loose match, I coded the tool to run a heuristic evaluation that checks for matches across several sources of evidence.

First comes DOM analysis: the tool fetches the landing page (/), normalizes the text to lowercase and evaluates it against arrays of regular expressions (sig.dom). This layer targets hardcoded signatures that developers rarely clean up, including injected generator meta tags (e.g., <meta name="generator" content="WordPress">), specific path conventions (e.g., wp-content/themes or _next/static), and unique variables in global scripts (e.g., window.Vue or drupalSettings).

Next comes metadata and cookie auditing, where the engine processes the response headers together with any cookies the server sets. For headers, the tool inspects operational identifiers like X-Powered-By: PHP, X-AspNet-Version, or reverse-proxy footprints. For cookies, it inspects state tokens (like laravel_session, Django's csrftoken, or PHPSESSID).

// Validates if either the headers or cookie structures cross-match the platform signatures
const headerMatch = sig.headers && Object.entries(sig.headers).some(([key, regex]) => 
  headers[key] && regex.test(headers[key])
);

if (headerMatch || (sig.cookies && sig.cookies.some(re => cookies.some(c => re.test(c))))) {
  score++;
}

And if none of this leads to a conclusion, the tool falls back to active file probing (sig.files). This involves requesting explicit, non-indexed developer endpoints (e.g., /wp-login.php, /vendor/composer/, or /package.json).
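To make the layered scoring a bit more concrete, here is a simplified sketch of how such an evaluation can be wired together. The signature shape ({ dom, headers, cookies }) follows the sig.* names above, but the sample signatures, the page object and the function names scoreSignature and detectStack are my own assumptions for this example, not the real contents of data.js.

```javascript
// Assumed signature shape: { dom: RegExp[], headers: { name: RegExp }, cookies: RegExp[] }
function scoreSignature(sig, page) {
  let score = 0;
  const html = page.html.toLowerCase();

  // Layer 1: DOM patterns (one point per layer, not per match)
  if (sig.dom && sig.dom.some((re) => re.test(html))) score++;

  // Layer 2: response headers or cookies
  const headerMatch = Boolean(sig.headers) && Object.entries(sig.headers).some(
    ([key, re]) => typeof page.headers[key] === 'string' && re.test(page.headers[key])
  );
  const cookieMatch = Boolean(sig.cookies) && sig.cookies.some(
    (re) => page.cookies.some((cookie) => re.test(cookie))
  );
  if (headerMatch || cookieMatch) score++;

  return score;
}

// Rank every signature that reaches the threshold, best match first
function detectStack(signatures, page, threshold = 1) {
  return Object.entries(signatures)
    .map(([name, sig]) => ({ name, score: scoreSignature(sig, page) }))
    .filter((result) => result.score >= threshold)
    .sort((a, b) => b.score - a.score);
}

// Made-up signatures and page data for the example
const SAMPLE_SIGNATURES = {
  WordPress: { dom: [/wp-content\/themes/], cookies: [/^wordpress_/] },
  PHP: { headers: { 'x-powered-by': /php/i } },
  Laravel: { cookies: [/^laravel_session=/] }
};

const samplePage = {
  html: '<link rel="stylesheet" href="/WP-Content/themes/a/style.css">',
  headers: { 'x-powered-by': 'PHP/8.2' }, // Node lowercases header names
  cookies: ['wordpress_test_cookie=1']
};

console.log(detectStack(SAMPLE_SIGNATURES, samplePage));
```

Only when the resulting list is empty would the engine need to fall back to the noisier file probing.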

And to execute all this without getting rate-limited within 5 minutes, I added some mechanics to the engine. One is tstacksleep: instead of firing sequential checks at full speed, the engine waits a randomized interval between 300ms and 900ms before each attempt. This disrupts the uniform request pattern that automated defensive monitors typically flag, and it is customizable via a flag. The other is a verification function that uses lightweight HTTP HEAD requests rather than full GET operations, significantly reducing network overhead. Crucially, it looks for both 200 OK and 403 Forbidden status codes:

const req = protocol.request(url, options, (res) => {
  res.resume();
  // A 403 Forbidden usually means the path exists but access is denied (e.g. directory listing is off)
  resolve(res.statusCode === 200 || res.statusCode === 403); 
});

After this, I coded a new thing, an automated vulnerability scan, which can sometimes prove extremely useful. Here TrufflePigg attempts to automate the triage phase by mapping software banners directly to known public vulnerabilities via the National Vulnerability Database (NVD) API. The only wrinkle is the same as with crt.sh: the API is strictly rate-limited, so I coded a cool-down period for the feature. It's not a big issue, as nobody is gonna spam the command every 2 seconds. But it was worth it, because whenever I ran a scan, it showed me the CVEs linked to the detected banners. And don't worry, I never used them.
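A cool-down like that can be as simple as a rolling-window counter. Below is a minimal sketch under my own assumptions: createCooldown and buildNvdUrl are hypothetical names, the default of 5 calls per 30 seconds is what I understand NVD's limit for clients without an API key to be (check their docs before relying on it), and the URL uses the keywordSearch parameter of the NVD CVE API 2.0.

```javascript
// Hypothetical cool-down gate: allows at most `maxCalls` calls per rolling
// `windowMs`. The clock is injectable so the logic can be tested without waiting.
function createCooldown(maxCalls = 5, windowMs = 30000, now = Date.now) {
  const stamps = [];

  // Returns 0 when a call is allowed right now (and records it),
  // otherwise the number of milliseconds the caller should wait.
  return function msUntilAllowed() {
    const t = now();
    // Forget calls that have left the rolling window
    while (stamps.length > 0 && t - stamps[0] >= windowMs) stamps.shift();

    if (stamps.length < maxCalls) {
      stamps.push(t);
      return 0;
    }
    return windowMs - (t - stamps[0]);
  };
}

// Build the lookup URL for a software banner such as "nginx 1.18.0"
function buildNvdUrl(banner) {
  const url = new URL('https://services.nvd.nist.gov/rest/json/cves/2.0');
  url.searchParams.set('keywordSearch', banner);
  return url.toString();
}

const nvdGate = createCooldown();
const waitMs = nvdGate();
console.log(waitMs === 0 ? buildNvdUrl('nginx 1.18.0') : `cooling down for ${waitMs} ms`);
```

The caller asks the gate before every lookup and either fires the request or sleeps for the returned number of milliseconds.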

Then I coded a feature that fetches and parses robots.txt and sitemap.xml and gives the user the option to download them, because these files can reveal some very sensitive info if the developer has misconfigured them. Then I added a crawlSitemap function that recursively traverses sitemap indices (sitemapindex) up to 15 levels deep, extracting the explicit location endpoints (urlset). The tool then maps the discovered URLs against a keyword matrix (admin, login, api, dashboard, user, config, dev) to filter and isolate high-value assessment targets automatically.

if (allExtractedUrls.size > 0) {
  log(`          [!] SUCCESS: Extracted ${allExtractedUrls.size} unique URLs from sitemaps.`);
  
  const interestingKeywords = ['admin', 'login', 'api', 'dashboard', 'user', 'config', 'dev'];
  const juicyUrls = Array.from(allExtractedUrls).filter(url => 
    interestingKeywords.some(keyword => url.toLowerCase().includes(keyword))
  );
  // Instantly bubbles these high-value links to the top of the interface
}
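The recursive traversal itself can be sketched roughly like this. It is a simplified stand-in for crawlSitemap, not the real function: walkSitemap and parseSitemap are hypothetical names, the fetcher is passed in as a parameter, and the regex-based parsing ignores things like CDATA sections and XML entities that a proper XML parser would handle.

```javascript
// Pull every <loc> value out of a sitemap document and note whether the
// document is an index of further sitemaps or a plain URL set.
function parseSitemap(xml) {
  const locs = [...xml.matchAll(/<loc>\s*([^<\s]+)\s*<\/loc>/gi)].map((m) => m[1]);
  return { isIndex: /<sitemapindex[\s>]/i.test(xml), locs };
}

// fetchText(url) must return a Promise of the document text; it is injected
// so the walker can run against fake data instead of the network.
async function walkSitemap(url, fetchText, maxDepth = 15, seen = new Set(), out = new Set(), depth = 0) {
  // Stop on depth limit or on a sitemap we already visited (guards against loops)
  if (depth > maxDepth || seen.has(url)) return out;
  seen.add(url);

  let xml;
  try {
    xml = await fetchText(url);
  } catch {
    return out; // unreachable sitemap: skip it and keep what we have
  }

  const { isIndex, locs } = parseSitemap(xml);
  for (const loc of locs) {
    if (isIndex) {
      await walkSitemap(loc, fetchText, maxDepth, seen, out, depth + 1);
    } else {
      out.add(loc);
    }
  }
  return out;
}
```

The seen set matters as much as the depth limit, since a sitemap index that points back at itself would otherwise burn through all 15 levels for nothing.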

Now for my favorite of all, the Shadow Trust Analysis. The thing is, a significant portion of the modern web attack surface involves resources hosted outside the target's direct infrastructure. If a page loads a script from an expired domain or an abandoned bucket, whoever claims that domain or bucket can serve their own code to the site's visitors. Using cheerio, TrufflePigg parses all asset tags (script, link, img, iframe, form, a) to discover foreign host domains, ignoring domains present in a local whitelist (WHITELIST_DOMAINS). Then it queries DNS (dns.resolve4) to check whether the third-party domain no longer resolves (ENOTFOUND). If an asset points to AWS S3 infrastructure, it contacts the bucket directly to check if the response contains NoSuchBucket. That signature indicates the bucket no longer exists, so its name may be claimable and open to hijacking. It also generates variations of the resource provider's core domain name across common alternative TLDs to detect whether those look-alike domains are unprotected.

const isDead = await checkHostAvailability(hostname);

if (isDead) {
  log(`          [!!!] CRITICAL SHADOW-TRUST ALERT: Abandoned Dependency Found!`);
} else {
  if (hostname.endsWith('.amazonaws.com')) {
    const isBucketClaimable = await checkS3Claimable(hostname);
    if (isBucketClaimable) {
      log(`          [!!!] CRITICAL S3 HIJACK ALERT: Orphaned Bucket Detected!`);
    }
  }
}
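The two building blocks here, picking out the third-party hosts and checking whether a host is dead, can be sketched as follows. This version works on a plain list of URLs instead of a cheerio document, and findForeignHosts, isHostDead and the whitelist contents are all hypothetical stand-ins, so treat it as an illustration of the idea rather than the tool's code.

```javascript
const dns = require('node:dns');

// Made-up allow-list; the real WHITELIST_DOMAINS is not shown in this post
const EXAMPLE_WHITELIST = ['googleapis.com', 'cloudflare.com'];

function isSameOrSubdomain(host, domain) {
  return host === domain || host.endsWith('.' + domain);
}

// Resolve each asset URL against the page URL and keep only third-party hosts
function findForeignHosts(assetUrls, pageUrl, whitelist = EXAMPLE_WHITELIST) {
  const pageHost = new URL(pageUrl).hostname;
  const foreign = new Set();

  for (const raw of assetUrls) {
    let host;
    try {
      const resolved = new URL(raw, pageUrl);
      // Skip mailto:, javascript:, data: and friends
      if (resolved.protocol !== 'http:' && resolved.protocol !== 'https:') continue;
      host = resolved.hostname;
    } catch {
      continue; // not a parseable URL
    }
    if (isSameOrSubdomain(host, pageHost)) continue;
    if (whitelist.some((domain) => isSameOrSubdomain(host, domain))) continue;
    foreign.add(host);
  }
  return [...foreign];
}

// True when the name does not resolve at all (NXDOMAIN surfaces as ENOTFOUND).
// Other failures (timeouts, no A record) are not treated as "dead".
async function isHostDead(hostname) {
  try {
    await dns.promises.resolve4(hostname);
    return false;
  } catch (err) {
    return err.code === 'ENOTFOUND';
  }
}

const assets = ['/app.js', '//cdn.dead-vendor.test/x.js', 'https://fonts.googleapis.com/css', 'mailto:me@x.test'];
console.log(findForeignHosts(assets, 'https://example.com/'));
```

Each host that comes out of findForeignHosts would then go through isHostDead, and only a clean ENOTFOUND counts as an abandoned dependency.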

Then came the scrapeJsForSecrets function, where the scanner extracts all relative and absolute script paths linked within the source DOM. It downloads the text content of each source file via the proxy pipeline and scans it against regular expressions (SECRET_PATTERNS), checking for items like Google API keys, AWS credentials, and Stripe endpoints. And to eliminate junk data or placeholder variables, the tool verifies findings against isHighFidelitySecret(), which drops findings that have low Shannon entropy or sit near indicator strings such as example, test, or placeholder.

const uniqueFinds = [...new Set(finds.map(f => f[0]))];
for (const f of uniqueFinds) {
   if (isHighFidelitySecret(f, jsBody)) {
       const entropyScore = calculateShannonEntropy(f).toFixed(2);
       log(`              -> [${name}]: ${f} (Entropy: ${entropyScore})`);
       secretsFound++;
   }
}
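For anyone wondering what the entropy filter boils down to, here is a small sketch. It is my simplified take, not the bodies of calculateShannonEntropy or isHighFidelitySecret: the names shannonEntropy and looksLikeRealSecret, the 3.5 bits-per-character threshold and the 40-character context window are all assumptions chosen for the example.

```javascript
// Shannon entropy in bits per character: H = -sum(p * log2(p))
function shannonEntropy(str) {
  const chars = [...str];
  if (chars.length === 0) return 0;

  const counts = new Map();
  for (const ch of chars) counts.set(ch, (counts.get(ch) || 0) + 1);

  let bits = 0;
  for (const n of counts.values()) {
    const p = n / chars.length;
    bits -= p * Math.log2(p);
  }
  return bits;
}

const PLACEHOLDER_HINTS = ['example', 'test', 'placeholder'];

// Keep a finding only if it is random-looking AND not surrounded by hints
// that it is sample data. Threshold and radius are assumed values.
function looksLikeRealSecret(candidate, source, minEntropy = 3.5, radius = 40) {
  if (shannonEntropy(candidate) < minEntropy) return false;

  const at = source.indexOf(candidate);
  if (at === -1) return true; // no context available, entropy is all we have

  const context = source
    .slice(Math.max(0, at - radius), at + candidate.length + radius)
    .toLowerCase();
  return !PLACEHOLDER_HINTS.some((hint) => context.includes(hint));
}

console.log(shannonEntropy('aaaa'), shannonEntropy('abcd'));
```

A string of one repeated character scores 0 and four distinct characters score exactly 2 bits, while real keys usually land well above that, which is why a threshold separates them from filler values.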

Then there was breadcrumbScan, an optional, flag-enabled contextual crawl that follows application workflow transitions up to a user-configured limit (maxHops).

After this phase, I came to the final phase, or that's what I thought. This was the anti-detection phase: trying my best to make the tool as quiet as possible so it goes undetected by IDSs and WAFs. The first thing I did was implement some spoofing methods. The reason recon scripts, or bot scripts in general, get caught is that modern WAFs and IDSs are awfully good at checking whether the user (or the script) has 'natural' headers and request timing. And as y'all clearly know, a headless script scraping a page doesn't send a header like User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36; it sends something like User-Agent: axios/1.6.0 Node.js/v20.10.0. So I made TrufflePigg reference a pretty extensive pre-defined set of BROWSER_PROFILES. When a request initializes, it selects a browser profile at random and attaches the matching, granular metadata parameters to keep the request fingerprint consistent.

// Realized header map generated during request builds
const req = client.request({
  [HTTP2_HEADER_METHOD]: options.method || 'GET',
  [HTTP2_HEADER_PATH]: url.pathname + url.search,
  [HTTP2_HEADER_SCHEME]: url.protocol.replace(':', ''),
  [HTTP2_HEADER_AUTHORITY]: url.hostname,
  [HTTP2_HEADER_USER_AGENT]: profile.ua,
  
  // Client Hints injection mapping:
  'sec-ch-ua': profile.ch['sec-ch-ua'],
  'sec-ch-ua-mobile': profile.ch['sec-ch-ua-mobile'],
  'sec-ch-ua-platform': profile.ch['sec-ch-ua-platform'],
  
  ...getWafEvasionHeaders(), 
  'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'accept-language': 'en-US,en;q=0.5',
  ...options.headers
});
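For context, a profile table and its selection logic could look something like this. The two profiles below are illustrative values I filled in for the example (real ones should be copied from actual browsers), and pickProfile and buildProfileHeaders are hypothetical helper names, so don't read this as the real BROWSER_PROFILES from data.js.

```javascript
// Illustrative profiles only: the UA and its Client Hints must describe the
// same browser, otherwise the mismatch is itself a fingerprint.
const EXAMPLE_PROFILES = [
  {
    ua: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    ch: {
      'sec-ch-ua': '"Not_A Brand";v="8", "Chromium";v="120", "Google Chrome";v="120"',
      'sec-ch-ua-mobile': '?0',
      'sec-ch-ua-platform': '"Windows"'
    }
  },
  {
    // Firefox does not send Client Hints, so this profile carries none
    ua: 'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
    ch: {}
  }
];

// The random source is injectable so the choice can be made deterministic
function pickProfile(random = Math.random) {
  return EXAMPLE_PROFILES[Math.floor(random() * EXAMPLE_PROFILES.length)];
}

// Spread the hints instead of naming them one by one, so a profile without
// Client Hints does not end up with undefined header values
function buildProfileHeaders(profile) {
  return { 'user-agent': profile.ua, ...profile.ch };
}

console.log(buildProfileHeaders(pickProfile()));
```

Keeping the user agent and its hints in one object is the whole point: they get chosen together, so a Chrome user agent never ships with a missing or contradictory sec-ch-ua.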

The other common thing that detection systems check is whether the client speaks HTTP/1.1 or HTTP/2. The reason is that legacy HTTP/1.1 is the default for most scripts and HTTP libraries, while HTTP/2 is what almost all browsers use, so many detection systems treat HTTP/1.1 traffic as a bot signal and flag or block it. Because of that I used Node's native node:http2 module to blend in and establish long-lived, multiplexed sessions. Then I fine-tuned the session settings (enablePush: false, initialWindowSize: 6291456) to mimic the flow-control behavior of standard desktop browsers like Chromium.

The next thing I added was a bit more paranoid but kinda essential: routing all the traffic, except for the requests that talk to the external source APIs, through Tor or a list of proxies, which I did through the SocksClient library. If that is a bit too much, you can also use the --protate feature, which routes the traffic through a list of proxies; this is a faster option but inevitably less anonymous.

I then coded reverse-proxy header spoofing, where the tool calls getWafEvasionHeaders() to add fake forwarding headers to outgoing requests. By injecting loopback or private network addresses into headers like X-Forwarded-For or True-Client-IP, it attempts to pass traffic through upstream appliances without triggering block rules.

// Header generation logic referenced from utils.js
function getWafEvasionHeaders() {
  const spoofedIps = ['127.0.0.1', 'localhost', '::1', '10.0.0.1', '192.168.1.1'];
  const randomIp = spoofedIps[Math.floor(Math.random() * spoofedIps.length)];
  return {
    'X-Forwarded-For': randomIp,
    'X-Originating-IP': randomIp,
    'X-Remote-IP': randomIp,
    'True-Client-IP': randomIp,
    'X-Custom-IP-Authorization': randomIp
  };
}

Then I coded user-configurable request jitter: rather than sending requests at fixed intervals, scanner.js uses variable timing controls (the tstacksleep and sleep methods). During tech-stack analysis and endpoint discovery routines, the system pauses for a random duration within a configured range (e.g., between 300ms and 900ms) to resemble human browsing and stay under threshold-based rate limits.
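The jitter itself is tiny. Here is a sketch of the idea with made-up names (jitterMs, pause, runWithJitter); the real tstacksleep and sleep methods may well look different.

```javascript
// Random whole number of milliseconds in [min, max], both ends included
function jitterMs(min = 300, max = 900, random = Math.random) {
  if (min > max) [min, max] = [max, min];
  return Math.floor(min + random() * (max - min + 1));
}

function pause(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Run async tasks one after another with a random pause between them
async function runWithJitter(tasks, min = 300, max = 900) {
  const results = [];
  for (let i = 0; i < tasks.length; i++) {
    if (i > 0) await pause(jitterMs(min, max));
    results.push(await tasks[i]());
  }
  return results;
}

console.log(jitterMs(), jitterMs(), jitterMs());
```

Each task here is a function returning a Promise (for example one probe request), so the gaps between requests never repeat in a fixed rhythm.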

And finally (or so I thought) I was preparing it for release: first Codeberg, then GitHub, and also a GitHub package (npm). Why not npm itself? Because npm has this insanely annoying authentication system and is just not my thing. But then I re-checked the code, and damn, it was a 3000-something-line monolith. I was surprised but didn't think much about maintainability, as I could easily find anything from memory. So for a final test, I gave the code to some of my friends at the Hacker Webring. One person checked it first and said 'damn this could use a bit of separating', and I already knew I should do something about that. After some time, the maintainer of the ring, Dmi, shared their comment, '...it's definitely unmaintainable.', and Jak2k said, 'The code really looks messy and unmaintainable'. Then I was sure I needed to REALLY modularize it, and in 7-8 hours I had successfully split it into 6 parts:

TrufflePigg.js: The interface, argument parsing, and orchestration engine.
sniffer.js: Handles low-level TCP/UDP socket connections, banner grabbing, custom protocol payloads (SMB, DBs), and DNS/WHOIS interactions.
scanner.js: The brain of the tool. It processes the raw data, identifies honeypots, fuzzes directories, analyzes DOM elements, and hunts for secrets.
networker.js: Manages HTTP/2 connections, Tor routing, SOCKS5 proxy rotation, and TLS fingerprint evasion.
utils.js: Contains string manipulation, entropy calculations, similarity algorithms, and humanized request header generation.
data.js: Stores technology signatures, common ports, WAF identifiers, regex patterns for secrets, and byte-level payloads for infrastructure probes.

And with all that I PUSHED THE CODE. Nice.