
Python and Java crawl personal blog information and export it to excel

2022-01-31 07:44:35 wshanshi

1. Scenario analysis

Today let's crawl my personal blog and collect the title and address of each of my articles into an Excel file for easy viewing.


What bad intentions could wshanshi possibly have? She just... just wants to summarize her articles and their addresses.

2. Simple interface analysis

Developers know the drill: open debug mode, choose the Elements tab, then find the box that holds each article, as shown below.

After a bit of analysis, you'll find that each box contains the information for one article. Look carefully and you can see that every article tag has the same class. Since they all look alike, that's good news: combine the tags with a class selector and you can grab them all in a minute.

Click any article div and you will find it contains the article hyperlink, title, and description. Combined with the ultimate goal of this exercise, that gives us our operation steps.
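The structure just described can be sketched with BeautifulSoup: every article box shares one class, so a single class selector collects them all. The HTML below is a hypothetical stand-in mirroring the page layout described above, not the real page source.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each article lives in an <article class="blog-list-box">
# box with a title in an h4 and a hyperlink in an a tag.
html = """
<article class="blog-list-box">
  <div class="blog-list-box-top"><h4>Post one</h4></div>
  <a href="/post/1">read</a>
</article>
<article class="blog-list-box">
  <div class="blog-list-box-top"><h4>Post two</h4></div>
  <a href="/post/2">read</a>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
# One class selector grabs every article box at once
boxes = soup.select("article.blog-list-box")
titles = [box.find("h4").get_text(strip=True) for box in boxes]
links = [box.find("a")["href"] for box in boxes]
print(titles, links)
```

Because the class is identical on every box, the selector returns one element per article, and title/link extraction is the same for each.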

With the scenario analyzed, let's get straight to it.

I'll collect the article data in two ways (Python and Java). Both get the job done.

3. Python implementation

Example version: Python 3.8

Packages: requests, beautifulsoup4, pandas

Package installation commands (on Windows):

pip install requests
pip install beautifulsoup4
pip install pandas

3.1. Common library descriptions


3.1.1. Requests

What is Requests? What are its advantages and disadvantages?

Chinese documentation: …
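As a quick taste of Requests (a minimal sketch; the URL here is just a placeholder, not the blog's real address), you can build and inspect a request in a couple of lines without even sending it:

```python
import requests

# Build a GET request with a custom User-Agent but do not send it;
# prepare() gives back the exact request that would go over the wire.
req = requests.Request(
    "GET",
    "https://example.com/blog",   # placeholder URL
    headers={"User-Agent": "Mozilla/5.0"},
).prepare()

print(req.method, req.url)
print(req.headers["User-Agent"])
```

To actually send it you would call `requests.get(url, headers=...)`, which is what the crawler below does.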

3.1.2. Beautiful Soup 4.4.0

Description: a Python library for extracting data from HTML or XML files.

Chinese documentation: …


3.1.3. pandas

Description: a powerful Python data analysis library.

Chinese documentation: …

See the official documentation for the details of each; no more introduction here!

3.2. Code example

Straight to the code: grab each article's title, pair it with the article link, and export everything to a .csv file.

# -*- coding:UTF-8 -*-
from bs4 import BeautifulSoup
import requests
import pandas as pd
import os

def blogOutput(df, path):
    # If the file does not exist, write it with a header row
    if not os.path.isfile(path):
        df.to_csv(path, index=False)
    else:
        # Otherwise append rows without repeating the header
        df.to_csv(path, mode='a', index=False, header=False)

if __name__ == '__main__':
    target = ''  # blog list page URL (elided in the original)
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36 Edg/87.0.664.47'}
    res = requests.get(target, headers=headers)
    soup = BeautifulSoup(res.text, 'html.parser')
    # Collect one (title, link) pair per article box
    result = []
    for item in soup.find_all("article", {"class": "blog-list-box"}):
        # Extract the article title
        title = item.find("h4").get_text(strip=True)
        # Extract the article address link
        link = item.find("a").get("href")
        result.append([title, link])
    df = pd.DataFrame(result, columns=['Article title', 'Article address'])
    # Call the function to write the data to the table
    blogOutput(df, 'F:/blog_result.csv')
    print('Output finished')
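The header logic in blogOutput can be checked in isolation: the first call creates the file with a header row, and every later call appends rows without repeating it. A minimal sketch against a temp file:

```python
import os
import tempfile
import pandas as pd

def blogOutput(df, path):
    # Write the header only when the file does not exist yet;
    # afterwards append rows without a second header.
    if not os.path.isfile(path):
        df.to_csv(path, index=False)
    else:
        df.to_csv(path, mode='a', index=False, header=False)

path = os.path.join(tempfile.mkdtemp(), "blog_result.csv")
df = pd.DataFrame([["t1", "/a/1"]], columns=["Article title", "Article address"])
blogOutput(df, path)   # creates the file, writes the header + one row
blogOutput(df, path)   # appends one more row, no second header

out = pd.read_csv(path)
print(len(out))        # two data rows under a single header
```

This append-without-header pattern is what lets you re-run the crawler and keep accumulating rows into one csv.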

After coding, just run it (shortcut key F5).

Once the "Output finished" prompt appears, you can open the exported file.

Not bad, we got the data! That's all I'll demonstrate for Python (it's not my strong suit, after all). Now let's look at Java, my friends.


4. Java implementation

The Java version uses the Jsoup library, which essentially operates on the DOM.

Yibai tutorial site (link): …

4.1. Environment and packages

Jsoup Maven dependency:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.14.3</version> <!-- any recent version works -->
</dependency>

4.2. Code example

Define the BlogVo class:

public class BlogVo implements Serializable {
    /** Article title */
    private String title;
    /** Article address */
    private String url;

    @Excel(colName = "Article title", sort = 1)
    public String getTitle() {
        return title;
    }
    public void setTitle(String title) {
        this.title = title;
    }

    @Excel(colName = "Article address", sort = 2)
    public String getUrl() {
        return url;
    }
    public void setUrl(String url) {
        this.url = url;
    }
}

Service interface:

/**
 * Get the extracted blog information
 * @return
 */
List<BlogVo> getBlogList();

/**
 * Export the csv file
 * @param httpServletResponse
 * @throws Exception
 */
void export(HttpServletResponse httpServletResponse) throws Exception;

ServiceImpl implementation class:

public List<BlogVo> getBlogList() {
    List<BlogVo> list = new ArrayList<>();
    try {
        // Blog list page URL elided in the original
        Document document = Jsoup.connect("").timeout(20000).get();

        Elements h4 = document.select(".blog-list-box-top").select("h4");
        Elements a = document.select(".blog-list-box").select("a");
        List<String> h4List = new ArrayList<>();
        List<String> aList = new ArrayList<>();
        h4.forEach(item -> h4List.add(item.text()));
        a.forEach(item -> {
            String href = item.attr("href");
            aList.add(href);
        });
        for (int i = 0; i < h4List.size(); i++) {
            BlogVo blogVo = new BlogVo();
            blogVo.setTitle(h4List.get(i));
            blogVo.setUrl(aList.get(i));
            list.add(blogVo);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return list;
}

public void export(HttpServletResponse httpServletResponse) throws Exception {
    new ExcelExportUtils().export(BlogVo.class, getBlogList(), httpServletResponse, "blog");
}

Controller layer:

/**
 * Export the csv file
 * @param response
 * @throws Exception
 */
public void getExport(HttpServletResponse response) throws Exception {
    blogService.export(response);   // request mapping and blogService field elided in the original
}

Custom annotation class (@Excel):

package com.wshanshi.test.entity;

import java.lang.annotation.Documented;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

@Documented
@Target(ElementType.METHOD)
@Retention(RetentionPolicy.RUNTIME)   // required so the annotation is visible via reflection at runtime
public @interface Excel {
    public String colName();   // Column name
    public int sort();         // Column order
}

Export tool class:

package com.wshanshi.test.util;

import com.wshanshi.test.entity.Excel;

import org.apache.poi.hssf.usermodel.HSSFCell;
import org.apache.poi.hssf.usermodel.HSSFRow;
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;

import javax.servlet.http.HttpServletResponse;
import java.io.BufferedOutputStream;
import java.lang.reflect.Method;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/**
 * Export utility class
 */
public class ExcelExportUtils<T> {

    public void ResponseInit(HttpServletResponse response, String fileName) {
        // The Content-Disposition response header tells the browser to treat the file as a download
        response.setHeader("Content-Disposition", "attachment;filename=" + fileName + ".csv");
        // Disable caching so the browser always receives fresh data
        response.setHeader("Pragma", "no-cache");
        response.setHeader("Cache-Control", "no-cache");
        response.setDateHeader("Expires", 0);
    }

    public void POIOutPutStream(HttpServletResponse response, HSSFWorkbook wb) {
        try {
            BufferedOutputStream out = new BufferedOutputStream(response.getOutputStream());
            wb.write(out);
            out.flush();
            out.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    @SuppressWarnings({"unchecked", "rawtypes"})
    public void export(Class<?> objClass, List<?> dataList, HttpServletResponse response, String fileName) throws Exception {
        ResponseInit(response, fileName);

        Class excelClass = Class.forName(objClass.toString().substring(6));
        Method[] methods = excelClass.getMethods();

        // TreeMap keeps the columns ordered by the annotation's sort value
        Map<Integer, String> mapCol = new TreeMap<>();
        Map<Integer, String> mapMethod = new TreeMap<>();

        for (Method method : methods) {
            Excel excel = method.getAnnotation(Excel.class);
            if (excel != null) {
                mapCol.put(excel.sort(), excel.colName());
                mapMethod.put(excel.sort(), method.getName());
            }
        }

        HSSFWorkbook wb = new HSSFWorkbook();
        POIBuildBody(POIBuildHead(wb, "sheet1", mapCol), excelClass, mapMethod, (List<T>) dataList);

        POIOutPutStream(response, wb);
    }

    public HSSFSheet POIBuildHead(HSSFWorkbook wb, String sheetName, Map<Integer, String> mapCol) {
        HSSFSheet sheet01 = wb.createSheet(sheetName);
        HSSFRow row = sheet01.createRow(0);
        HSSFCell cell;
        int i = 0;
        for (Map.Entry<Integer, String> entry : mapCol.entrySet()) {
            cell = row.createCell(i++);
            cell.setCellValue(entry.getValue());
        }
        return sheet01;
    }

    public void POIBuildBody(HSSFSheet sheet01, Class<T> excelClass, Map<Integer, String> mapMethod, List<T> dataList) throws Exception {
        HSSFRow r;
        HSSFCell c;
        if (dataList != null && dataList.size() > 0) {
            for (int i = 0; i < dataList.size(); i++) {
                r = sheet01.createRow(i + 1);
                int j = 0;
                for (Map.Entry<Integer, String> entry : mapMethod.entrySet()) {
                    c = r.createCell(j++);
                    // Invoke the annotated getter reflectively and write its value into the cell
                    Object obj = excelClass.getDeclaredMethod(entry.getValue()).invoke(dataList.get(i));
                    c.setCellValue(obj == null ? "" : obj + "");
                }
            }
        }
    }
}
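The idea behind the utility — getters tagged with a column name and a sort order, discovered reflectively and emitted in sorted order — translates to just a few lines of Python. This is a sketch of the concept only, not the author's code; the decorator name and attribute are made up for illustration:

```python
def excel(col_name, sort):
    # Attach column metadata to a getter, playing the role of the @Excel annotation
    def wrap(fn):
        fn._excel = (sort, col_name)
        return fn
    return wrap

class BlogVo:
    def __init__(self, title, url):
        self._title, self._url = title, url

    @excel("Article title", 1)
    def get_title(self):
        return self._title

    @excel("Article address", 2)
    def get_url(self):
        return self._url

def export_rows(cls, items):
    # "Reflect" over the class, collect tagged getters, sort by their sort key
    getters = sorted(
        fn._excel + (fn,) for fn in vars(cls).values() if hasattr(fn, "_excel")
    )
    header = [col for _, col, _ in getters]
    rows = [[fn(item) for _, _, fn in getters] for item in items]
    return header, rows

header, rows = export_rows(BlogVo, [BlogVo("t1", "/a/1")])
print(header, rows)
```

Sorting the (sort, colName) pairs does the same job as the TreeMap in the Java version: columns come out in annotation order no matter how the methods are declared.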

Test the export with Postman; the effect is shown below.


Alright, that works.


The data was exported, but I noticed a problem: there is less of it than expected. The Python run clearly returned more than 90 articles, so why did this fetch only 20-odd? A little strange.


Take a closer look at the page: it turns out the list is lazily loaded. A paging request fires when you scroll to the bottom.

Ah, I see. But now that the paging interface is visible, let's probe it with Postman.


And it returns the data directly.


So there's another way: send an HTTP request to this interface directly and just set the page size a little larger.

The sample code is as follows:

public List<BlogVo> blogHttp() {
    List<BlogVo> list = new ArrayList<>();
    // Interface URL elided in the original; the page size goes in its query parameters
    String s = HttpClientUtils.doGetRequest("", null, null, null);
    RespDTO blogDTO = JSON.parseObject(s, RespDTO.class);
    DataEntity data = blogDTO.getData();
    data.getList().forEach(item -> {
        BlogVo blogVo = new BlogVo();
        // copy the title and url from the response item into the VO (fields elided in the original)
        list.add(blogVo);
    });
    return list;
}
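If you would rather not guess one "large enough" page size, you can keep requesting pages until a short page signals the end. The sketch below is independent of the real interface (whose URL is elided above): a hypothetical fetch_page callable stands in for the HTTP call, here simulated by a stub.

```python
def collect_all(fetch_page, page_size=100):
    # Keep requesting pages until one comes back shorter than page_size;
    # fetch_page(page_num, page_size) stands in for the real HTTP call.
    articles, page = [], 1
    while True:
        batch = fetch_page(page, page_size)
        articles.extend(batch)
        if len(batch) < page_size:
            break
        page += 1
    return articles

# Hypothetical stub simulating a 95-article blog served in pages
def fake_fetch(page, size):
    total = list(range(95))
    return total[(page - 1) * size : page * size]

print(len(collect_all(fake_fetch, page_size=40)))  # → 95
```

With page_size=40 the loop fetches 40, 40, then 15 articles and stops, which is exactly why the lazily loaded page only showed 20-odd items at first.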

The effect is as follows; try it yourself.

Well, that's it for now!


Copyright notice
Author: wshanshi. Please include the original link when reprinting. Thank you.
