We host a set of “resource” pages – collections of useful links for our users. For years we’ve had a script run daily, looping through each link and sending a PHP Guzzle HEAD request to verify that each page on the resource sites is still active.
But over the past few years – I suspect as more and more sites adopt Cloudflare – sites have been returning 403s to the HEAD request, and it’s getting to the point where the check is nearly useless.
Is there a way to do this that isn’t going to get this traffic treated as malicious? I don’t need the content from the other sites – I just need to know whether the pages are in good working order.
Here’s the PHP code I’m using:
use GuzzleHttp\Client;

// Send a HEAD request with a browser-like User-Agent and read the status code.
$client = new Client(['http_errors' => false]); // inspect 4xx/5xx instead of throwing
$response = $client->head($encoded_link, [
    'headers' => ['User-Agent' => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'],
]);
$status = $response->getStatusCode();
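For what it’s worth, here’s the kind of fallback I’ve been considering – just a sketch, assuming Guzzle 7, with a hypothetical `checkLink` helper: retry with a normal GET when the HEAD is rejected (some servers refuse HEAD outright), and treat 403/429 as “inconclusive” rather than “dead”, since those are more likely bot challenges than missing pages.

```php
<?php
use GuzzleHttp\Client;

// Sketch: classify a link instead of trusting a single HEAD status.
// Only 404/410 are treated as "dead"; 403/429 are "inconclusive"
// (likely a Cloudflare/bot challenge, not a missing page).
function checkLink(Client $client, string $url): string
{
    $opts = ['http_errors' => false, 'timeout' => 10];
    $status = $client->head($url, $opts)->getStatusCode();
    if ($status === 403 || $status === 405) {
        // Some servers reject HEAD outright; retry with a real GET.
        $status = $client->get($url, $opts)->getStatusCode();
    }
    if ($status < 400) {
        return 'ok';
    }
    return in_array($status, [404, 410], true) ? 'dead' : 'inconclusive';
}
```

The idea is that “inconclusive” links get flagged for occasional manual review rather than being reported as broken, which at least stops Cloudflare-fronted sites from polluting the report.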