Abstract

Against the background of today's rapidly developing Internet, traditional search engines face a number of limitations; in particular, nearly three tenths of the pages on the Web carry duplicated information. To collect higher-quality data, and to reach not only the surface web but also the much larger deep web, developers turn to web crawler technology. A web crawler is a script or program that harvests World Wide Web resources according to actual needs, and it is also one of the core components of a general-purpose search engine. Crawlers have a wide range of applications; while they bring convenience, they can also be abused in ways that break the law. Classified by crawling process and internal structure, crawlers fall into four types, of which the focused crawler and the general-purpose crawler are the most commonly used. Python is particularly well suited to crawling: with a Python crawler we can efficiently collect the required data, build models, and analyze the results.

This thesis first studies the basic workflow of a typical Python web crawler, the robots exclusion protocol, and the ethical norms around anti-crawling measures. It then covers the preparation needed before crawling with Python: the system environment, language syntax, and integrated development environments; a comparative analysis of parsing tools such as regular expressions; a comparison of third-party libraries and of development frameworks such as Scrapy; and the basic usage of JSON and CSV files and of the three databases most commonly used in crawler projects, including how to connect to them. Finally, it crawls COVID-19 epidemic data and performs model analysis and visualization on the collected information.

Keywords: web crawler; Python; regular expression; Scrapy; JSON
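The basic workflow summarized above, checking the robots exclusion protocol before crawling and then extracting data with a regular expression, can be sketched as follows. This is a minimal offline illustration, not the thesis's actual crawler: the robots.txt content, URLs, and HTML snippet are hypothetical and inlined so the example runs without network access; a real crawler would fetch them with a library such as requests.

```python
# Minimal sketch: obey robots.txt rules, then parse a page with a regex.
import re
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Only crawl URLs that the robots rules allow.
assert rp.can_fetch("*", "https://example.com/data.html")
assert not rp.can_fetch("*", "https://example.com/private/x.html")

# Hypothetical page body; a real crawler would download it first.
html = "<html><head><title>COVID-19 Daily Report</title></head></html>"

# Regular expressions are one of the parsing tools compared in the thesis.
title = re.search(r"<title>(.*?)</title>", html).group(1)
print(title)  # -> COVID-19 Daily Report
```

For larger projects the thesis compares this hand-rolled approach with frameworks such as Scrapy, which build in robots.txt handling, request scheduling, and structured item extraction.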