When using the Hexo plugin hexo-bilibili-bangumi with Bangumi as the data source, only each series' title and cover are fetched, so I wrote a scraper to fill in the total episode count, score, and other details.
2022/9/2: 😅 Turns out bgm has had an API all along; I'm a clown 🤡.
2022/9/13: I don't know how to write Hexo plugins, so I reworked this to use the official API instead — see "Fetching bangumi info with Python (Part 2)". I also switched from requests to httpx, though I still can't do async 🤣.

A working local Python environment is required.

All scraped information comes from bangumi.tv and is used only to fill in fields the plugin does not provide; if this infringes anything, contact me and I will take it down.
This approach is deprecated — see "Fetching bangumi info with Python (Part 2)".
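Part 2 switched to the official Bangumi API, which returns all of these fields as JSON in a single request instead of requiring HTML scraping. A minimal sketch of that approach, using only the standard library (the api.bgm.tv v0 endpoint and the rating/collection/eps field names are my reading of the API schema — verify them against the official API docs before relying on this):

```python
import json
import urllib.request

# Official Bangumi API, v0 (assumed endpoint; verify against the API docs)
API = "https://api.bgm.tv/v0/subjects/{}"


def extract_fields(subject):
    """Map a v0 subject payload onto the same fields the scraper fills in.
    The nested rating/collection key names are assumptions about the schema."""
    rating = subject.get("rating") or {}
    coll = subject.get("collection") or {}
    return {
        "score": rating.get("score", "-"),
        "des": subject.get("summary", "-"),
        "wish": coll.get("wish", "-"),
        "doing": coll.get("doing", "-"),
        "collect": coll.get("collect", "-"),
        "totalCount": subject.get("eps") or "-",
    }


def fetch_subject(subject_id):
    # The API rejects requests that carry no User-Agent header
    req = urllib.request.Request(
        API.format(subject_id),
        headers={"User-Agent": "bangumi-info-example/0.1"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_fields(json.load(resp))
```

One API call per subject replaces one page download plus five XPath queries, and the numbers come back already parsed.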

Usage

  1. Install the plugin; see the plugin's homepage for details.
  2. Fetch the bangumi data, which produces \source\_data\bangumis.json.
  3. Install requests and lxml with pip:
pip install requests
pip install lxml
  4. Scrape each subject's info based on the data just fetched: save the Python code below as a file in the project root, then run it.
import json
import random
import time

import requests
from lxml import etree

USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
]


def get_data(subject_id):
    """Scrape score, summary, collection counts and episode count
    from one subject page on bangumi.tv."""

    class bangumis:
        score = "-"
        des = "-"
        wish = "-"
        doing = "-"
        collect = "-"
        totalCount = "-"

    url = "https://bangumi.tv/subject/"
    headers = {
        # Pick a fresh User-Agent for every request
        "User-Agent": random.choice(USER_AGENTS),
        "Connection": "close",
    }
    r = requests.get(url + subject_id, headers=headers)
    r.encoding = "utf-8"
    r_text = r.text
    r.close()
    time.sleep(2)  # throttle so bangumi.tv doesn't start refusing connections

    tree = etree.HTML(r_text)

    try:
        bangumis.score = tree.xpath("//span[@class='number']/text()")[0]
    except IndexError:
        pass
    bangumis.des = "".join(tree.xpath("//div[@id='subject_summary']//text()")) or "-"
    # The three collection links read like "N人想看 / N人看过 / N人在看";
    # [:-3] strips the trailing three-character label, leaving the count.
    counts = tree.xpath(
        "//div[@id='subjectPanelCollect']/span[@class='tip_i']/a[@class='l']/text()"
    )
    try:
        bangumis.wish = counts[0][:-3]
        bangumis.collect = counts[1][:-3]
        bangumis.doing = counts[2][:-3]
    except IndexError:
        pass

    try:
        tips = tree.xpath("//ul[@id='infobox']/li/span[@class='tip']/text()")
        if tips[1] == "话数: ":
            bangumis.totalCount = tree.xpath("//ul[@id='infobox']/li/text()")[1]
        else:
            bangumis.totalCount = "12"  # assume a one-cour show when the page omits it
    except IndexError:
        pass

    return bangumis


def read():
    print("\nReading bangumis.json")
    with open("bangumis.json", "r", encoding="utf-8") as js_file:
        py_data = json.load(js_file)
    # "watching" and "watched" entries both get the same extra fields
    for group in ("watching", "watched"):
        for item in py_data.get(group) or []:
            item["type"] = "番剧"
            print("\nFetching info for " + item["title"])
            bangumis = get_data(item["id"])
            item["score"] = bangumis.score
            item["des"] = bangumis.des
            item["wish"] = bangumis.wish
            item["doing"] = bangumis.doing
            item["collect"] = bangumis.collect
            item["totalCount"] = "全" + str(bangumis.totalCount) + "话"
    return py_data


def write(py_data):
    print("\nWriting output.json")
    with open("output.json", "w", encoding="utf-8") as js_output:
        json.dump(py_data, js_output, ensure_ascii=False)


if __name__ == "__main__":
    write(read())
    print("\nDone")
  5. You can also use a batch file to run the bangumi fetch and the info scrape in one go:
@echo off
hexo bangumi -u

python main.py

Known issues

  1. Running it too often still seems to trigger 443 errors, even though the headers set 'Connection': 'close' and every request is followed by r.close() and time.sleep(2). Normal runs should be fine.
  2. It is slow: the time.sleep(2) means every subject takes at least two seconds. I considered async, but wouldn't faster requests just get the IP banned sooner 🤣? And I couldn't find a free IP proxy either.