Yes—but not with plain HTTP::Tiny (or curl) alone. If the page’s final HTML is produced by JavaScript, you need a real browser engine (headless Chromium/Firefox) to execute that JS and then extract from the rendered DOM. Perl can absolutely orchestrate that; the usual pattern is:

  1. Drive a headless browser (Chromium) from Perl
  2. Wait for the page to render
  3. Pull the final HTML (or run JS in-page)
  4. Apply your regex / extract links

Practical options from Perl

Option A: Perl drives headless Chromium via Playwright (recommended)

Playwright is modern and reliable for JS-heavy sites. From Perl you can either:

  • use a Perl wrapper module (one exists, but quality varies), or
  • call the Playwright CLI / a tiny Node helper from Perl (most robust in practice).

High-level shape (no extra “relative URL” stuff, same idea as your script):

  • launch Chromium headless
  • goto($url)
  • wait for networkidle (or a selector)
  • content() to get rendered HTML
  • run your regex on that HTML

Option B: Selenium (WebDriver) from Perl

This is the “classic” way. You run chromedriver + Selenium from Perl (Selenium::Remote::Driver), navigate, then fetch page source.

Pros: widely known. Cons: more moving parts (matching driver versions, etc.).
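
A minimal sketch of that flow, assuming Selenium::Chrome (bundled with the Selenium::Remote::Driver distribution) and a chromedriver on your PATH; the headless flag and the crude wait are placeholders you'd tune for your site and Chrome version:

use strict;
use warnings;
use Selenium::Chrome;   # spawns a local chromedriver for you

my $url = shift or die "Usage: $0 <url>\n";

# '--headless=new' suits recent Chrome; older versions want plain '--headless'
my $driver = Selenium::Chrome->new(
    extra_capabilities => {
        'goog:chromeOptions' => { args => [ '--headless=new', '--disable-gpu' ] },
    },
);

$driver->get($url);                   # navigate and let the page load
sleep 2;                              # crude: give client-side JS time to render (or poll find_element for a known selector)
my $html = $driver->get_page_source;  # the rendered DOM, not the raw HTTP response
print $html;

$driver->shutdown_binary;             # stop the chromedriver we spawned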

Option C: “HTML snapshot” services / prerender endpoints

Sometimes a site already serves a “rendered” version (for bots) or has an API. This can avoid headless browsing entirely—but it’s site-specific.


A minimal, realistic approach: Perl + Playwright (Node helper)

If you’re okay with having Node installed, you can keep Perl as the main tool and use a tiny helper to render:

render.mjs (Node + Playwright):

import { chromium } from 'playwright';

const url = process.argv[2];
if (!url) {
  console.error('Usage: render.mjs <url>');
  process.exit(2);
}

const browser = await chromium.launch({ headless: true });
const page = await browser.newPage({
  // A plain desktop UA; some sites serve stripped-down markup to obvious bots
  userAgent: 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
});

// 'networkidle' waits until the network goes quiet, so client-side JS has had a chance to render
await page.goto(url, { waitUntil: 'networkidle' });

// Dump the rendered DOM as HTML on stdout for the Perl side to capture
const html = await page.content();
console.log(html);
await browser.close();

Then your Perl script just does:

  • my $html = qx(node render.mjs "$url");
  • apply regex and print matches

This gives you “curl-like” behavior but with a real JS runtime.
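
For concreteness, here's a rough sketch of that Perl side. The CLI shape (script <url> <regex>, unique matches printed) follows your original script; the list form of open is used instead of qx to avoid shell-quoting surprises with the URL:

#!/usr/bin/env perl
use strict;
use warnings;

my ( $url, $regex ) = @ARGV;
die "Usage: $0 <url> <regex>\n" unless defined $url && defined $regex;

# Let the Node helper do the JS rendering; read the final HTML from its stdout.
open my $fh, '-|', 'node', 'render.mjs', $url
    or die "Cannot run node: $!\n";
my $html = do { local $/; <$fh> };
close $fh or die "render.mjs failed for $url\n";

# Apply the user-supplied regex and print unique matches (capture group 1 if present).
my %seen;
while ( $html =~ /$regex/g ) {
    my $match = defined $1 ? $1 : $&;
    print "$match\n" unless $seen{$match}++;
}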


What you should know / tradeoffs

  • Speed: headless browsers are slower than HTTP GET.
  • Detection: some “fancy” sites try to detect automation; Playwright/Selenium can still work, but sometimes you need extra steps (viewport, delays, stealthy settings).
  • Auth / cookies / consent banners: if the content appears only after a click or login, you’ll need to script that interaction.
  • Legality/ToS: some sites forbid automated scraping; worth checking.

If you tell me which direction you prefer:

  • Perl + Selenium
  • Perl + Playwright (Node helper)
  • Pure Perl module approach (if you really want “all Perl”)

…I’ll give you a drop-in script that matches your original CLI style: script <url> <regex> and prints unique matches, but backed by a headless browser for JS-rendered pages.