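With some free time on my hands, I decided to write a program that quickly collects company information and exports it to an Excel sheet. So... hehe, I started digging into Tianyancha. Enough talk, let's get to it.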
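1. The Tianyancha site is at https://www.tianyancha.com. Search for a keyword there, for example 教育 (education), and the site reports 100000+ matching companies. Page through the results without logging in, though, and you can only view 2 pages before you are prompted to log in to see more. So log in (Tianyancha offers quick SMS login anyway) and start analyzing. Google Chrome is recommended: press F12 to open the developer tools, then press Ctrl+Shift+C and click the block of information we want to grab. Hehe, it turns out everything sits inside the node marked with the red box in the screenshot below!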
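Now that we know where the data lives, how do we pull the page down?
Java's standard library can handle this! The java.net.HttpURLConnection class can mimic a browser request. Straight to the code: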
package com.zsx.crawler.utils.TianYanChaCompanyCrawler;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class WebUtil {

    public static String getPageContent(String url) {
        StringBuilder sb = new StringBuilder();
        HttpURLConnection httpUrlConn = null;
        try {
            // Open the connection
            URL u = new URL(url);
            httpUrlConn = (HttpURLConnection) u.openConnection();
            httpUrlConn.setDoInput(true);
            httpUrlConn.setRequestMethod("GET");
            // Set request headers (the commented-out ones are optional).
            // If Accept-Encoding is enabled, the compressed response has to be decompressed manually.
            // httpUrlConn.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8");
            // httpUrlConn.setRequestProperty("Accept-Encoding","gzip, deflate, br");
            // httpUrlConn.setRequestProperty("Accept-Language", "zh-CN,zh;q=0.9");
            // httpUrlConn.setRequestProperty("Connection", "keep-alive");
            // httpUrlConn.setRequestProperty("Host", "www.tianyancha.com");
            // httpUrlConn.setRequestProperty("Referer", "https://www.tianyancha.com/");
            // httpUrlConn.setRequestProperty("Upgrade-Insecure-Requests", "1");
            // Replace the placeholder with the Cookie your browser sends to Tianyancha after login
            httpUrlConn.setRequestProperty("Cookie", "PASTE_YOUR_COOKIE_HERE");
            httpUrlConn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36");
            // Wrap the byte stream in a buffered UTF-8 reader; try-with-resources closes it
            try (BufferedReader br = new BufferedReader(
                    new InputStreamReader(httpUrlConn.getInputStream(), StandardCharsets.UTF_8))) {
                // Read the response line by line (lines are joined without separators)
                String data;
                while ((data = br.readLine()) != null) {
                    sb.append(data);
                    System.out.println(data);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Release the connection
            if (httpUrlConn != null) {
                httpUrlConn.disconnect();
            }
        }
        return sb.toString();
    }
}
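To look through the fetched HTML at leisure, it can help to dump it to a text file. The following is only a minimal sketch: the class name PageDump and the temp-file location are made up for illustration, and the sample string stands in for what WebUtil.getPageContent would return.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the original crawler.
public class PageDump {

    // Writes the HTML string to the given file as UTF-8, replacing any existing content.
    public static void save(String html, Path file) throws IOException {
        Files.write(file, html.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // In the real crawler this string would come from WebUtil.getPageContent(url).
        String html = "<html><body>sample</body></html>";
        Path out = Files.createTempFile("tianyancha-page", ".txt");
        save(html, out);
        System.out.println("Saved " + Files.size(out) + " bytes to " + out);
    }
}
```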
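Call getPageContent and print the return value, or save it to a .txt file, and you can see that the page source was fetched successfully. Analyzing that source turns up something nice: there are hidden elements (span tags with class "tt hidden"), and they hold the company information for that page as JSON. That makes things even simpler: pull the fields we want straight out of the JSON and save them to an Excel sheet. Haha.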
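The extraction idea can be tried offline first. The sketch below runs a regex with a capture group over a made-up HTML snippet; the sample markup and the class name HiddenSpanDemo are assumptions for illustration, and the real page may be structured differently. Pattern.DOTALL is included because `.` does not match line breaks by default, which matters if the HTML has not been flattened into one line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original crawler.
public class HiddenSpanDemo {

    // Group 1 captures whatever sits between the opening and closing span tags.
    // DOTALL lets '.' match line breaks as well.
    private static final Pattern HIDDEN_SPAN =
            Pattern.compile("<span class=\"tt hidden\">(.*?)</span>", Pattern.DOTALL);

    // Returns the trimmed body of every hidden span found in the HTML.
    public static List<String> extract(String html) {
        List<String> payloads = new ArrayList<>();
        Matcher m = HIDDEN_SPAN.matcher(html);
        while (m.find()) {
            payloads.add(m.group(1).trim());
        }
        return payloads;
    }

    public static void main(String[] args) {
        // Made-up HTML for illustration only.
        String html = "<div><span class=\"tt hidden\">{\"name\":\"A Co\"}</span></div>\n"
                + "<div><span class=\"tt hidden\">\n{\"name\":\"B Co\"}\n</span></div>";
        for (String json : extract(html)) {
            System.out.println(json);
        }
    }
}
```

Using group(1) avoids the manual substring arithmetic, so the pattern and the offset cannot drift apart.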
Here is the full code:
package com.zsx.crawler.utils.TianYanChaCompanyCrawler;

import org.json.JSONObject;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CrawlerCompanyUtil {

    public static void main(String[] args) {
        String web = WebUtil.getPageContent("https://www.tianyancha.com/search?key=%E6%95%99%E8%82%B2%E7%A7%91%E6%8A%80&base=bj");

        // Print the pagination footer of the result list
        Pattern pageCount = Pattern.compile("<div class=\"result-footer\">.*?</div>");
        Matcher matcher1 = pageCount.matcher(web);
        while (matcher1.find()) {
            System.out.println("Pages: " + matcher1.group());
        }

        // Each hidden span holds one company's data as JSON; group(1) is the span body
        Pattern companyInfo = Pattern.compile("<span class=\"tt hidden\">(.*?)</span>");
        Matcher matcher = companyInfo.matcher(web);
        while (matcher.find()) {
            JSONObject json = new JSONObject(matcher.group(1));
            // Company name
            String companyName = json.get("name").toString();
            // Legal representative
            String legalperson = json.get("legalPersonName").toString();
            // Registered capital
            String registeredfund = json.get("regCapital").toString();
            // Registration date (the key really is spelled "estiblishTime" in the page data)
            String registeredtime = json.get("estiblishTime").toString();
            // Phone list
            String phone = json.get("phoneList").toString();
            // Email list
            String email = json.get("emailList").toString();
            // Registered address
            String address = json.get("regLocation").toString();
            // Other details: business scope and the matched field
            String qita = "Business scope: " + json.get("businessScope") + "\n" + json.get("matchField");

            System.out.println(companyName);
            System.out.println(legalperson);
            System.out.println(registeredfund);
            System.out.println(registeredtime);
            System.out.println(phone);
            System.out.println(email);
            System.out.println(address);
            System.out.println(qita);
            System.out.println();
        }
    }
}
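The search URL in main is hard-coded with an already percent-encoded keyword. To search for other keywords, the URL could be built along these lines. This is a sketch only: the class name SearchUrlBuilder is made up, and treating `base` as a region code (bj presumably meaning Beijing) is an assumption read off the URL above, not something the site documents here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of the original crawler.
public class SearchUrlBuilder {

    // Percent-encodes the keyword as UTF-8 and appends it to the search URL.
    // Note that URLEncoder uses form encoding, so a space becomes '+'.
    public static String build(String keyword, String base) throws UnsupportedEncodingException {
        String encodedKey = URLEncoder.encode(keyword, "UTF-8");
        return "https://www.tianyancha.com/search?key=" + encodedKey + "&base=" + base;
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // The same four-character keyword as in the hard-coded URL, written as
        // Unicode escapes so the source file compiles under any encoding.
        String keyword = "\u6559\u80b2\u79d1\u6280";
        System.out.println(build(keyword, "bj"));
        // prints https://www.tianyancha.com/search?key=%E6%95%99%E8%82%B2%E7%A7%91%E6%8A%80&base=bj
    }
}
```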
Whether to store the data in a database or export it to an Excel sheet is up to you!
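For a quick export that Excel can open, one dependency-free option is a CSV file. To be clear, this swaps in CSV for a native Excel workbook; writing a real .xlsx would need a library such as Apache POI. The class name CsvExport, the column headers, and the sample row below are all made up for illustration.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original crawler.
public class CsvExport {

    // Quotes a field when it contains a comma, quote, or line break,
    // doubling any embedded quotes.
    public static String escape(String field) {
        if (field == null) {
            return "";
        }
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        String escaped = field.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    // Joins the escaped fields of one row with commas.
    public static String toCsvLine(List<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(escape(fields.get(i)));
        }
        return sb.toString();
    }

    // Writes all rows as UTF-8 with a leading byte order mark,
    // which Excel generally relies on to detect UTF-8.
    public static void write(Path file, List<List<String>> rows) throws IOException {
        try (Writer w = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            w.write('\uFEFF');
            for (List<String> row : rows) {
                w.write(toCsvLine(row));
                w.write("\r\n");
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Sample values only; real rows would come from the parsed JSON fields.
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("name", "legalPersonName", "regLocation"),
                Arrays.asList("A Co", "Zhang San", "Room 1, Building 2"));
        Path out = Files.createTempFile("companies", ".csv");
        write(out, rows);
        System.out.println("Wrote " + rows.size() + " rows to " + out);
    }
}
```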
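However, Tianyancha only lets ordinary users view the first 5 pages of results, so I went on to study the Qixinbao site, which the next post covers. The Qixinbao crawler post includes detailed code for exporting the data to an Excel sheet and crawls without that page limit. Link: Java Crawler for Qixinbao (Java爬虫启信宝)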