php – How to load only HTML (and skip media files)

January 13, 2024, 303 views

I'm optimizing my simple web crawler (currently built on PHP / curl_multi).

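The goal is to crawl an entire site intelligently while skipping non-HTML content. I tried setting CURLOPT_NOBODY so that only HEAD requests are sent, but that does not work for every site (some servers don't support HEAD), and it makes curl_exec stall for a long time (sometimes much longer than loading the page itself would take).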
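For reference, the HEAD-based probe described above might look roughly like the sketch below. The function names is_html_type and probe_is_html and the 5-second timeout are assumptions made for illustration, not part of the original crawler. The timeouts only bound how long a server that mishandles HEAD can stall the request; they do not make such a server answer correctly.

```php
<?php
// Hypothetical helper: true when a Content-Type value denotes an HTML document.
function is_html_type(?string $contentType): bool
{
    if ($contentType === null) {
        return false; // no usable Content-Type was reported
    }
    // Drop parameters such as "; charset=UTF-8" before comparing.
    $mime = strtolower(trim(explode(';', $contentType)[0]));
    return $mime === 'text/html' || $mime === 'application/xhtml+xml';
}

// Hypothetical helper: issues a HEAD request and checks the reported type.
function probe_is_html(string $url, int $timeoutSeconds = 5): bool
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true,            // HEAD instead of GET
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds, // assumed limit
        CURLOPT_TIMEOUT        => $timeoutSeconds, // caps a stalled HEAD
    ]);
    $ok   = curl_exec($ch);
    $type = $ok === false ? null : curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    return is_html_type(is_string($type) ? $type : null);
}
```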

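Is there another way to find out the page type without downloading the entire content, or to force cURL to abandon the download if the file is not HTML?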

(Writing my own HTTP client is not an option, because I intend to use cURL's cookie and SSL features later.)

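Best answer: I haven't tried it, but I came across CURLOPT_PROGRESSFUNCTION. I bet you could read the response incrementally to look for the Content-Type header, and then curl_close() the handle if you're not interested in what is being downloaded.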
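One way to sketch the "look at the headers, then drop the transfer" idea is with CURLOPT_HEADERFUNCTION, which is a different option from the one named in the answer: cURL passes each response header line to the callback, and the transfer is aborted when the callback returns a length other than the number of bytes it was given. The names header_line_is_html and fetch_if_html are hypothetical, and this is a minimal, untuned sketch rather than a tested crawler component.

```php
<?php
// Hypothetical helper. Returns true or false for a Content-Type header line,
// and null for any other header line.
function header_line_is_html(string $line): ?bool
{
    if (stripos($line, 'Content-Type:') !== 0) {
        return null; // not the header we care about
    }
    $value = strtolower(trim(substr($line, strlen('Content-Type:'))));
    return strpos($value, 'text/html') === 0
        || strpos($value, 'application/xhtml+xml') === 0;
}

// Hypothetical helper. Returns the body for HTML responses, null otherwise.
function fetch_if_html(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HEADERFUNCTION, function ($ch, string $line): int {
        // Returning anything other than strlen($line) makes cURL abort,
        // so the body of a non-HTML response is never downloaded.
        return header_line_is_html($line) === false ? 0 : strlen($line);
    });
    $body = curl_exec($ch); // false when the callback aborted the transfer
    return $body === false ? null : $body;
}
```

The same option can be set on each easy handle before it is added to a curl_multi handle. If redirects are followed, the callback also sees the headers of every intermediate response, so a redirect served with a non-HTML Content-Type would be cut off by this sketch.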

CURLOPT_PROGRESSFUNCTION: The name of a callback function
where the callback function takes three parameters. The first is the
cURL resource, the second is a file-descriptor resource, and the
third is length. Return the string containing the data.

http://www.php.net/manual/en/function.curl-setopt.php
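The description quoted above does not match the progress callback as the current PHP manual documents it. There the callback receives the cURL handle followed by four byte counters (expected download total, downloaded so far, expected upload total, uploaded so far), and returning a non-zero value aborts the transfer. It never receives the response data itself. A hedged sketch of using it as a guard follows. The function names and the 2,000,000-byte cap are hypothetical, and the sketch assumes that curl_getinfo() already reports the Content-Type while the callback is running, which should be verified against the PHP and libcurl versions in use.

```php
<?php
// Hypothetical policy: abort when the type is known and is not HTML,
// or when more than $maxBytes have been downloaded.
function should_abort_transfer(?string $contentType, int $bytesDownloaded, int $maxBytes): bool
{
    if ($bytesDownloaded > $maxBytes) {
        return true;
    }
    return $contentType !== null && stripos($contentType, 'html') === false;
}

// Hypothetical helper: installs the guard on an existing cURL handle.
function attach_progress_guard($ch, int $maxBytes = 2000000): void
{
    // Progress callbacks are disabled unless CURLOPT_NOPROGRESS is false.
    curl_setopt($ch, CURLOPT_NOPROGRESS, false);
    curl_setopt(
        $ch,
        CURLOPT_PROGRESSFUNCTION,
        function ($ch, $dlTotal, $dlNow, $ulTotal, $ulNow) use ($maxBytes): int {
            // Assumption: the Content-Type is available here once headers arrived.
            $type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
            $type = is_string($type) ? $type : null;
            // A non-zero return value asks cURL to abort the transfer.
            return should_abort_transfer($type, (int) $dlNow, $maxBytes) ? 1 : 0;
        }
    );
}
```

Calling attach_progress_guard($ch) before curl_exec($ch), or before adding the handle to a curl_multi handle, would install the guard. The size cap is included because it also catches very large responses that do declare an HTML type.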