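Preface

Hi everyone, I'm Song Haha. Say you've bought a video course on Taobao: every time you watch it you need a network connection, and on a poor connection that is painful. So let's scrape the videos and save them locally. [The code below may not fit your case exactly, but the approach to getting around the anti-scraping measures may.]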

[Screenshot: the video course I bought on Taobao]

[Screenshot: preview of the videos to be scraped]

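Scraping walkthrough

Install selenium and requests first (json ships with the Python standard library). Taobao's anti-scraping measures are very strict, so I use selenium to drive a real browser. We also need a cookie, and not the cookie from loading the page itself: it is the cookie sent with the JSON request, which is covered later.

Log in to Taobao on your phone and share the course link to your computer. Opened on the computer, the page appears in its mobile layout. There is a PC version too, so why not use it? Because even in a real browser, with no scraper involved, it kept prompting me to log in and I couldn't watch the videos. I don't know whether that was my network or Taobao itself. Enough preamble.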


Once the page is open, press F12 as usual and switch to the Network tab.

Then press F5 to refresh the page so that every loaded file is captured.

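With all the loaded files captured, click the magnifying-glass button in the Network panel. A search box opens on the left. Type a keyword from the material you want to scrape (here 入門, a word from the course title) and press the circular button next to the box. One file shows up below; all of the course's information is in that file.

Click the URL of that file, and the Headers, Preview and other tabs open on the right.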


Now click Preview and expand the tree with the mouse until you find the data.

From this data we need a few IDs to piece together each video's real link, because the link shown in DevTools is a wrong, non-working URL.

Opening that link does not play the video.

Each video's URL looks like this:
https://h5.m.taobao.com/xue/play/index.html?spm=a2174.7623065.6.4&resourceId=5183000&sectionId=11093072&channel=2&courseId=78011&resourceId=5183000&live=false&title=01-Python%E5%85%A5%E9%97%A8%E5%9F%BA%E7%A1%80%E7%AE%80%E4%BB%8B&img=%2F%2Fgw.alicdn.com%2Fbao%2Fuploaded%2Fi1%2F2705259897%2FTB2l51DvdBopuFjSZPcXXc9EpXa_!!2705259897.jpg&from=detail

Now we analyze the URL's parameters and try deleting the ones that aren't needed. With the trailing img and from parameters deleted, the video still plays:

https://h5.m.taobao.com/xue/play/index.html?spm=a2174.7623065.6.4&resourceId=5183000&sectionId=11093072&channel=2&courseId=78011&resourceId=5183000&live=false&title=01-Python%E5%85%A5%E9%97%A8%E5%9F%BA%E7%A1%80%E7%AE%80%E4%BB%8B
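The same trimming can be tried in code rather than by hand. Below is a minimal sketch using only the standard library; `drop_params` is a hypothetical helper name, and the title and img values in the sample URL are shortened for readability.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def drop_params(url, unwanted):
    """Return url with the named query parameters removed; the rest keep their order."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key not in unwanted]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Shortened sample URL (title and img values are abbreviated for the example).
full = ('https://h5.m.taobao.com/xue/play/index.html?spm=a2174.7623065.6.4'
        '&resourceId=5183000&sectionId=11093072&channel=2&courseId=78011'
        '&resourceId=5183000&live=false&title=01-Python'
        '&img=%2F%2Fgw.alicdn.com%2Fexample.jpg&from=detail')
trimmed = drop_params(full, {'img', 'from'})
print(trimmed)
```

Dropping one parameter at a time and reloading the page is a quick way to find out which ones the player actually requires.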

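URL breakdown:

https://h5.m.taobao.com/xue/play/index.html? : this path is the same for every video.
spm=a2174.7623065.6.4 : this is a parameter. For the first video it is a2174.7623065.6.4 and for the second it is a2174.7623065.6.5, so only the last number changes: 4 for the first video, 5 for the second. That gives the rule a2174.7623065.6.(x+3), where x is the video's position: the first video is 1+3, the second is 2+3, so a single for loop can generate them all.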

resourceId=5183000 : the video's id (the id of an entry under resources in the JSON). It appears twice in the URL.

sectionId=11093072 : the id of the section that contains the video.

channel=2

courseId=78011 : the course id.

live=false&title=01-Python%E5%85%A5%E9%97%A8%E5%9F%BA%E7%A1%80%E7%AE%80%E4%BB%8B : the title value is the URL-encoded video title.
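The breakdown above can be sketched as a small function. This is only an illustration: `build_play_url` is a hypothetical name, and the `index + 3` rule for spm is the pattern observed on the first two videos, not a documented behaviour.

```python
from urllib.parse import urlencode

BASE = 'https://h5.m.taobao.com/xue/play/index.html'


def build_play_url(index, resource_id, section_id, course_id):
    """Assemble a play URL; index is 1 for the first video, 2 for the second, and so on."""
    params = [
        ('spm', 'a2174.7623065.6.%d' % (index + 3)),  # observed pattern: last field is index + 3
        ('resourceId', resource_id),
        ('sectionId', section_id),
        ('channel', 2),
        ('courseId', course_id),
        ('resourceId', resource_id),  # the original URL repeats resourceId
    ]
    return BASE + '?' + urlencode(params)


first = build_play_url(1, 5183000, 11093072, 78011)
print(first)
```

A list of pairs is used instead of a dict because resourceId appears twice in the original URL.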

Now that we know how the real URL is assembled, let's get to work. Open the file found earlier: it is a JSON response. From its request we need the request URL and the cookie, plus the User-Agent, Referer and Host headers. All of them can be read from the request's Headers tab in DevTools.

With these values we can disguise our requests as a browser and scrape the data.

Code [incomplete; do not copy and run as-is]:

# encoding: utf-8
import json
import requests

# Request headers copied from DevTools; paste your own cookie here.
headers = {
    'cookie': '*****************************',
    'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36",
    'host': 'api.m.taobao.com',
}
# Outline API URL copied from DevTools (the t and sign values come from the captured request).
url = "https://api.m.taobao.com/h5/mtop.lifemallweb.courseinfomtopservice.getoutline/1.0/?appKey=12574478&t=1605101345112&sign=7aeb0c41b0bc943feb126a945606f499&v=1.0&api=mtop.lifemallweb.courseinfomtopservice.getoutline&type=jsonp&dataType=jsonp&callback=mtopjsonp4&data=%7B%22courseId%22%3A%2278011%22%7D"
requ = requests.get(url, headers=headers)
# The body is JSONP, mtopjsonp4({...}); keep only what sits between the outermost parentheses.
text = requ.text
jsontext = json.loads(text[text.index('(') + 1:text.rindex(')')])
print(jsontext)
datas = jsontext['data']['data']['outline']['chapters']

# Target URL shape:
# 'https://h5.m.taobao.com/xue/play/index.html?spm=a2174.7623065.6.5&resourceId=5184940&sectionId=11064015&channel=2&courseId=78011&resourceId=5184940'
number = 0  # running video index across all chapters
for d in datas:
    video_url_info = d['sections']
    courseid = d['courseId']
    zj_title = d['title']  # chapter title
    print(zj_title)
    for v in video_url_info:
        v_id = v['resources']
        sectionid = v['id']
        for vi in v_id:
            video_id = vi['id']
            video_title = vi['title']  # video title
            print(video_title)
            number += 1
            url = 'https://h5.m.taobao.com/xue/play/index.html?spm=a2174.7623065.6.%s&resourceId=%s&sectionId=%s&channel=2&courseId=%s&resourceId=%s' % (
                number + 3, video_id, sectionid, courseid, video_id)
            print(url)

This gives us the real URL of every video.
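To make the parsing step concrete without touching the network, here is a minimal sketch that unwraps a JSONP string and walks the chapters, sections and resources nesting. The sample payload is made up; only its shape (data.data.outline.chapters) follows the listing, and both helper names are hypothetical.

```python
import json


def unwrap_jsonp(text):
    """Return the JSON object wrapped in callback(...)."""
    start = text.index('(') + 1   # first "(" follows the callback name
    end = text.rindex(')')        # last ")" closes the callback call
    return json.loads(text[start:end])


def iter_videos(payload):
    """Yield (index, chapter_title, section_id, course_id, video_id, video_title)."""
    index = 0
    for chapter in payload['data']['data']['outline']['chapters']:
        for section in chapter['sections']:
            for resource in section['resources']:
                index += 1
                yield (index, chapter['title'], section['id'],
                       chapter['courseId'], resource['id'], resource['title'])


# Made-up sample in the same shape as the real response.
sample = ('mtopjsonp4({"data": {"data": {"outline": {"chapters": ['
          '{"title": "Chapter 1 (intro)", "courseId": 78011, "sections": ['
          '{"id": 11093072, "resources": ['
          '{"id": 5183000, "title": "01-Python basics"}]}]}]}}}})')

videos = list(iter_videos(unwrap_jsonp(sample)))
print(videos)
```

Slicing between the first "(" and the last ")" leaves parentheses inside titles untouched, which a blanket replace(')', '') would strip.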

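With each video's URL in hand, we need to open the video page, find the download URL, and then download the video with requests.

Press F12 to open DevTools and, as usual, look for the video's real mp4 download link.

The mp4 links found by searching manually all turned out to be unusable; the real one is probably generated by JavaScript. So this is where selenium comes in: it gives us the rendered page source directly.

We need to locate this URL in the rendered page (the src of the source element under #J_Video) and read it.

Code: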

# encoding: utf-8
import json
import requests
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time

headers = {
    'cookie': '************************',
    'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36",
    'host': 'api.m.taobao.com',
}

chrome_options = webdriver.ChromeOptions()
# Optional: disable image loading to speed up page loads.
prefs = {"profile.managed_default_content_settings.images": 2, 'permissions.default.stylesheet': 2}
# chrome_options.add_experimental_option("prefs", prefs)
# Hide the usual automation hints from the page.
chrome_options.add_experimental_option("excludeSwitches", ['enable-automation'])
chrome_options.add_experimental_option("useAutomationExtension", False)
# The user agent has to be set before the driver is created.
chrome_options.add_argument(
    'user-agent=Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36')

# Selenium 4 style; on Selenium 3 pass the chromedriver path as the first argument instead.
driver = webdriver.Chrome(service=Service(r"D:\pro_py\auto_office\chromedriver\chromedriver.exe"), options=chrome_options)
# Make navigator.webdriver read as undefined on every new page.
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument",
                       {"source": """Object.defineProperty(navigator, 'webdriver', {get: () => undefined})""", })
fahuo_login_url = "https://h5.m.taobao.com/xue/detail.htm?ut_sk=1.WrHJH%2BF63IwDAD7l4yx9ORsJ_21380790_1605021159729.Copy.10000&itemId=552400591967&sourceType=other&ttid=201200%40taobao_iphone_8.8.0&suid=E0B7CB1C-710F-4610-95F7-2F047A369759&un=f249ba94edcb21f83911dc70f8aca8ac&share_crt_v=1&spm=a2159r.13376460.0.0&sp_tk=cmhpN2NQbXJPS0Q=&cpp=1&shareurl=true&short_name=h.43dpkg8&bxsign=scd59xbtg7Htg78I-yuxEtr95S91S-KaaGvXhZeF38L2SpC7Snd8C30bU-b2Q3xUk745Jc2fKHdj9pF3C9ORG909CyQkA8WaRT-rsnCKw-LmV0&sm=ab441c&app=chrome"

# Log in through the real login form.
driver.maximize_window()
driver.get(fahuo_login_url)
time.sleep(2)
driver.find_element(By.XPATH, '//*[@id="J_BottomBanner"]/ul/li[4]/a').click()
time.sleep(2)
driver.find_element(By.CSS_SELECTOR, '#fm-login-id').send_keys('xxxxxx')  # Taobao account
time.sleep(1)
driver.find_element(By.CSS_SELECTOR, '#fm-login-password').send_keys('xxxxxxxxxxxxxx')  # password
time.sleep(2)
driver.find_element(By.CSS_SELECTOR, '#login-form > div.fm-btn > button').click()
time.sleep(10)

# The t and sign values differ between captures; copy a fresh URL from DevTools.
url = 'https://api.m.taobao.com/h5/mtop.lifemallweb.courseinfomtopservice.getoutline/1.0/?appKey=12574478&t=1605275205588&sign=5b47cc8aafc78339c8535a48fbbca372&v=1.0&api=mtop.lifemallweb.courseinfomtopservice.getoutline&type=jsonp&dataType=jsonp&callback=mtopjsonp4&data=%7B%22courseId%22%3A%2278011%22%7D'

requ = requests.get(url, headers=headers)
# The body is JSONP, mtopjsonp4({...}); keep only what sits between the outermost parentheses.
text = requ.text
jsontext = json.loads(text[text.index('(') + 1:text.rindex(')')])
print(jsontext)
datas = jsontext['data']['data']['outline']['chapters']

# Target URL shape:
# 'https://h5.m.taobao.com/xue/play/index.html?spm=a2174.7623065.6.5&resourceId=5184940&sectionId=11064015&channel=2&courseId=78011&resourceId=5184940'
number = 0  # running video index across all chapters
for d in datas:
    video_url_info = d['sections']
    courseid = d['courseId']
    zj_title = d['title']  # chapter title
    print(zj_title)
    for v in video_url_info:
        v_id = v['resources']
        sectionid = v['id']
        for vi in v_id:
            video_id = vi['id']
            video_title = vi['title']  # video title
            print(video_title)
            number += 1
            url = 'https://h5.m.taobao.com/xue/play/index.html?spm=a2174.7623065.6.%s&resourceId=%s&sectionId=%s&channel=2&courseId=%s&resourceId=%s' % (
                number + 3, video_id, sectionid, courseid, video_id)
            print(url)
            driver.get(url)
            # Read the src attribute of the <source> element, i.e. the source URL described above.
            down_url = driver.find_element(By.CSS_SELECTOR, '#J_Video > source:nth-child(2)').get_attribute('src')
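The extraction can also be checked offline against saved page source (for example the string returned by driver.page_source). A minimal sketch using only the standard library; the HTML snippet is invented for illustration, the class name is hypothetical, and it assumes that #J_Video is a video element, which the real page may or may not match.

```python
from html.parser import HTMLParser


class VideoSourceParser(HTMLParser):
    """Collect the src of every <source> inside <video id="J_Video">."""

    def __init__(self):
        super().__init__()
        self.inside = False   # True while between <video id="J_Video"> and </video>
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'video' and attrs.get('id') == 'J_Video':
            self.inside = True
        elif self.inside and tag == 'source' and attrs.get('src'):
            self.sources.append(attrs['src'])

    def handle_endtag(self, tag):
        if tag == 'video':
            self.inside = False


# Invented page source; only the id and the <source> child mirror the listing.
page_source = ('<div><video id="J_Video"><track kind="captions">'
               '<source src="https://example.com/play/demo.mp4?auth_key=abc">'
               '</video><source src="https://example.com/ignored.mp4"></div>')

parser = VideoSourceParser()
parser.feed(page_source)
print(parser.sources)
```

This only helps once the JavaScript has run, so the input still has to come from a browser session rather than from a plain requests.get of the page.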

Once we have the real download address, Xunlei (Thunder) can already download it. requests still cannot, although in principle it should be able to:

down_url :https://cloud.video.taobao.com/play/u/2705259897/p/1/e/6/t/1/d/hd/235227068483.mp4?auth_key=YXBwX2tleT04MDQ0ODEmYXV0aF9pbmZvPXsiY291cnNlSWQiOjc4MDExLCJiaXpUeXBlIjoiYnV5IiwidmlkZW9JZCI6MjM1MjI3MDY4NDgzLCJlbmNyeXB0SWQiOiJGRTc5QjY5RjVBRDg2NTVEQTY1N0UxQzY1RUUxOUY3NCJ9JnRpbWVzdGFtcD0xNjA1MzYzNTc5&hardware=true

Here is my own workaround:

Use selenium to open that URL, then read the address of the page the browser lands on. The long URL above redirects to an address of the form https://xxxxxx.mp4.

Full code [here I create one text file per video and write the mp4 link into it]:

# encoding: utf-8
import json
import requests
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
import os

headers = {
    'cookie': '************************',  # paste your own cookie; never publish a real one
    'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36",
    'host': 'api.m.taobao.com',
}

chrome_options = webdriver.ChromeOptions()
# Optional: disable image loading to speed up page loads.
prefs = {"profile.managed_default_content_settings.images": 2, 'permissions.default.stylesheet': 2}
# chrome_options.add_experimental_option("prefs", prefs)
# Hide the usual automation hints from the page.
chrome_options.add_experimental_option("excludeSwitches", ['enable-automation'])
chrome_options.add_experimental_option("useAutomationExtension", False)
# The user agent has to be set before the driver is created.
chrome_options.add_argument(
    'user-agent=Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36')

# Selenium 4 style; on Selenium 3 pass the chromedriver path as the first argument instead.
driver = webdriver.Chrome(service=Service(r"D:\pro_py\auto_office\chromedriver\chromedriver.exe"), options=chrome_options)
# Make navigator.webdriver read as undefined on every new page.
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument",
                       {"source": """Object.defineProperty(navigator, 'webdriver', {get: () => undefined})""", })
fahuo_login_url = "https://h5.m.taobao.com/xue/detail.htm?ut_sk=1.WrHJH%2BF63IwDAD7l4yx9ORsJ_21380790_1605021159729.Copy.10000&itemId=552400591967&sourceType=other&ttid=201200%40taobao_iphone_8.8.0&suid=E0B7CB1C-710F-4610-95F7-2F047A369759&un=f249ba94edcb21f83911dc70f8aca8ac&share_crt_v=1&spm=a2159r.13376460.0.0&sp_tk=cmhpN2NQbXJPS0Q=&cpp=1&shareurl=true&short_name=h.43dpkg8&bxsign=scd59xbtg7Htg78I-yuxEtr95S91S-KaaGvXhZeF38L2SpC7Snd8C30bU-b2Q3xUk745Jc2fKHdj9pF3C9ORG909CyQkA8WaRT-rsnCKw-LmV0&sm=ab441c&app=chrome"

# Log in through the real login form.
driver.maximize_window()
driver.get(fahuo_login_url)
time.sleep(2)
driver.find_element(By.XPATH, '//*[@id="J_BottomBanner"]/ul/li[4]/a').click()
time.sleep(2)
driver.find_element(By.CSS_SELECTOR, '#fm-login-id').send_keys('xxxxxx')  # Taobao account
time.sleep(1)
driver.find_element(By.CSS_SELECTOR, '#fm-login-password').send_keys('xxxxxxxxxxxxxx')  # password
time.sleep(2)
driver.find_element(By.CSS_SELECTOR, '#login-form > div.fm-btn > button').click()
time.sleep(10)

# The t and sign values differ between captures; copy a fresh URL from DevTools.
url = 'https://api.m.taobao.com/h5/mtop.lifemallweb.courseinfomtopservice.getoutline/1.0/?appKey=12574478&t=1605275205588&sign=5b47cc8aafc78339c8535a48fbbca372&v=1.0&api=mtop.lifemallweb.courseinfomtopservice.getoutline&type=jsonp&dataType=jsonp&callback=mtopjsonp4&data=%7B%22courseId%22%3A%2278011%22%7D'

requ = requests.get(url, headers=headers)
# The body is JSONP, mtopjsonp4({...}); keep only what sits between the outermost parentheses.
text = requ.text
jsontext = json.loads(text[text.index('(') + 1:text.rindex(')')])
print(jsontext)
datas = jsontext['data']['data']['outline']['chapters']

# Target URL shape:
# 'https://h5.m.taobao.com/xue/play/index.html?spm=a2174.7623065.6.5&resourceId=5184940&sectionId=11064015&channel=2&courseId=78011&resourceId=5184940'
number = 0  # running video index across all chapters
for d in datas:
    video_url_info = d['sections']
    courseid = d['courseId']
    zj_title = d['title']  # chapter title
    print(zj_title)
    for v in video_url_info:
        v_id = v['resources']
        sectionid = v['id']
        for vi in v_id:
            video_id = vi['id']
            video_title = vi['title']  # video title
            print(video_title)
            number += 1
            url = 'https://h5.m.taobao.com/xue/play/index.html?spm=a2174.7623065.6.%s&resourceId=%s&sectionId=%s&channel=2&courseId=%s&resourceId=%s' % (
                number + 3, video_id, sectionid, courseid, video_id)
            print(url)
            driver.get(url)
            down_url = driver.find_element(By.CSS_SELECTOR, '#J_Video > source:nth-child(2)').get_attribute('src')
            time.sleep(5)
            driver.get(down_url)  # the browser follows the redirect
            time.sleep(3)
            mp4_url = driver.current_url  # final https://....mp4 address
            print(mp4_url)

            # One folder per chapter, one .txt per video holding its mp4 link.
            os.makedirs("pro_file/%s" % zj_title, exist_ok=True)
            with open("pro_file/%s/%s.txt" % (zj_title, video_title), 'w', encoding='utf-8') as f:
                f.write(mp4_url)

Once all the titles and URLs have been collected, write a second script that uses requests to download the videos:

# encoding: utf-8
import os
import requests

path = 'pro_file'

for class_name in os.listdir(path):  # one folder per chapter
    chapter_dir = os.path.join(path, class_name)
    print(chapter_dir)
    for txt_name in os.listdir(chapter_dir):
        if not txt_name.endswith('.txt'):
            continue  # skip .mp4 files left by an earlier run
        filename = txt_name[:-len('.txt')]  # video file name
        video_path = os.path.join(chapter_dir, txt_name)
        with open(video_path, 'r', encoding='utf-8') as link_file:
            down_url = link_file.read().strip()
        requ = requests.get(down_url)
        with open(os.path.join(chapter_dir, filename + '.mp4'), 'wb') as m:  # download the video
            m.write(requ.content)
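The directory walk can be separated from the download itself, which makes it easy to check without network access. A minimal sketch under that assumption: `collect_jobs` is a hypothetical helper, and the demo builds a throwaway pro_file-style tree in a temporary directory with an example.com link.

```python
import os
import tempfile


def collect_jobs(root):
    """Return (mp4_url, target_mp4_path) pairs for every .txt link file under root."""
    jobs = []
    for chapter in sorted(os.listdir(root)):
        chapter_dir = os.path.join(root, chapter)
        if not os.path.isdir(chapter_dir):
            continue
        for name in sorted(os.listdir(chapter_dir)):
            stem, ext = os.path.splitext(name)
            if ext != '.txt':
                continue  # ignore anything that is not a link file
            with open(os.path.join(chapter_dir, name), encoding='utf-8') as fh:
                url = fh.read().strip()
            jobs.append((url, os.path.join(chapter_dir, stem + '.mp4')))
    return jobs


# Demo on a temporary tree shaped like pro_file/<chapter>/<video>.txt
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, 'chapter1'))
    with open(os.path.join(root, 'chapter1', '01-intro.txt'), 'w', encoding='utf-8') as fh:
        fh.write('https://example.com/01.mp4\n')
    jobs = collect_jobs(root)
    relative_jobs = [(url, os.path.relpath(target, root)) for url, target in jobs]

print(relative_jobs)
```

Each pair could then be fetched with requests.get(url, stream=True) and written chunk by chunk with iter_content, so that large videos are not held in memory all at once.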

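[Screenshot: the generated link files]

[Screenshot: the downloaded video files]

Summary: this code relies on selenium to get around the anti-scraping measures, and I don't think that is a very good solution. If you followed this article and know a better way, please leave a comment below. Thank you very much.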