Python: requests HTTP library

2023-10-19 10:29:20 +08:00 · 2023-10-19 10:29:20 +08:00 · c08c12a5e7
commit c08c12a5e7
parent e7fabbe09b
1 changed files with 422 additions and 0 deletions
--- a/wiki/programming-language/Python/进阶/拓展模块/网络处理/requests
+++ b/wiki/programming-language/Python/进阶/拓展模块/网络处理/requests
@ -0,0 +1,422 @@
 ---
 title: requests HTTP library
 description: The requests HTTP library, built on top of urllib3.
 keywords:
  - requests
  - Python
  - extension modules
 tags:
  - requests
  - Python
 sidebar_position: 3
 author: 7Wate
 date: 2023-10-19
 ---
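 Requests is a popular and powerful HTTP library for Python. It makes it very convenient to send HTTP/HTTPS requests and then read and parse the responses. Its API is concise and Pythonic, and compared with the urllib module in Python's standard library it is considerably easier to use.
 ## Overview
 **The goal of Requests is to make HTTP requests simple and Pythonic.** It noticeably reduces the work of sending HTTP requests: there is no need to append query strings to URLs by hand or to form-encode POST data yourself, because Requests does both automatically.
 ### Requests vs urllib
 Python's built-in urllib module can also send network requests, but its API is less concise. Compared with urllib, Requests is more Pythonic and simpler to use.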
 ```python
 # Fetch a web page with urllib
 import urllib.request
 response = urllib.request.urlopen('https://www.python.org')
 html = response.read()
 # Fetch the same page with Requests
 import requests
 response = requests.get('https://www.python.org')
 html = response.text
 # Requests accepts query parameters through the params keyword and encodes them automatically
 payload = {'key1': 'value1', 'key2': 'value2'}
 r = requests.get('https://httpbin.org/get', params=payload)
 # With urllib the parameters have to be encoded by hand
 import urllib.parse
 import urllib.request
 url = 'https://httpbin.org/get' + '?' + urllib.parse.urlencode(payload)
 resp = urllib.request.urlopen(url)
 ```
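 In short, Requests is more concise and easier to use than urllib.
 | Aspect | requests | urllib |
 | :----------- | :---------------------------------- | :-------------------------------- |
 | Sending a request | Concise `requests.get()` / `requests.post()` | More verbose `urllib.request.urlopen()` |
 | Passing parameters | Pass a dict; encoded automatically | Manual `urlencode` |
 | Request headers | Pass a dict as `headers` | Set through the `Request` class |
 | Response body | `text`, `content`, `json()`, `raw`, and more | Mainly `response.read()`, which returns bytes |
 | Encoding | Response text decoded automatically | Manual decoding |
 | Connection pooling | Supported through `Session` | Not supported |
 | Exceptions | Dedicated hierarchy in `requests.exceptions` | Only `urllib.error` exceptions |
 | Certificate verification | `verify` parameter | `context` parameter |
 | Proxies | `proxies` parameter | More involved `ProxyHandler` |
 | Cookies | `cookies` parameter | Managed with `http.cookiejar` (`cookielib` in Python 2) |
 | Redirects | Followed automatically; limit set with `Session.max_redirects` | Followed automatically; customizing requires a handler subclass |
 | Basic authentication | `auth` parameter | `HTTPBasicAuthHandler` |
 | Streaming | Built in via `stream=True` | Manual chunked reads |
 | Asynchronous requests | No native async API; usually combined with gevent or threads | No native async API |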
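 ### Key features of Requests
 - Built on top of urllib3
 - HTTP keep-alive and connection pooling for better efficiency
 - Session tracking with cookies
 - File uploads
 - Automatic content decoding
 - Internationalized URLs and automatic encoding of POST data
 - A more Pythonic API
 - Connection timeouts
 - HTTPS requests with SSL certificate verification
 - Automatic decompression
 - Streaming downloads
 - Basic and Digest authentication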
 ## Basic usage
 ### Building a request
 ```python
 import requests
 # requests.get fetches a page
 response = requests.get('https://www.example.com')
 # requests.post submits a POST request
 response = requests.post('https://httpbin.org/post', data={'key': 'value'})
 ```
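To see what a POST body looks like before it goes on the wire, one option is to prepare the request without sending it. The sketch below compares `data=` with the `json=` keyword; `json=` does not appear elsewhere in this article, and the httpbin URL is only a placeholder because nothing is actually sent.

```python
import json

import requests

url = 'https://httpbin.org/post'  # placeholder; nothing is sent in this sketch

# data= with a dict produces a form-encoded body
form_req = requests.Request('POST', url, data={'key': 'value'}).prepare()
# json= serializes the dict and sets a JSON Content-Type
json_req = requests.Request('POST', url, json={'key': 'value'}).prepare()

print(form_req.headers['Content-Type'], form_req.body)
print(json_req.headers['Content-Type'], json.loads(json_req.body))
```

Preparing a request this way is also a convenient check that the encoding matches what the server expects.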
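 ### Reading the response
 ```python
 # The text attribute returns the body as a decoded string
 html = response.text
 # The content attribute returns the body as raw bytes
 png_data = response.content
 # The json() method parses a JSON body (it raises an error if the body is not valid JSON)
 json_data = response.json()
 ```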
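 ### Getting the status code
 To check the status code of a response, read `response.status_code`:
 ```python
 print(response.status_code)  # e.g. 200
 ```
 Requests also ships a built-in status-code lookup object, `requests.codes`. For example:
 ```python
 print(requests.codes.ok)  # 200
 ```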
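Status codes are usually checked together with `Response.ok` and `raise_for_status()`. The following is a minimal sketch that needs no network: `make_response` is a hypothetical helper that builds a bare `Response` object by hand, standing in for a real reply from a server.

```python
import requests


def make_response(status_code):
    """Hypothetical helper: build a bare Response so no network is needed."""
    response = requests.Response()
    response.status_code = status_code
    return response


ok = make_response(requests.codes.ok)
missing = make_response(requests.codes.not_found)

# .ok is True for status codes below 400
print(ok.ok, missing.ok)

try:
    missing.raise_for_status()  # raises HTTPError for 4xx and 5xx codes
    outcome = 'no error'
except requests.HTTPError as exc:
    outcome = f'HTTP error {exc.response.status_code}'
print(outcome)
```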
 ### Request parameters
 There are several ways to pass parameters with a request:
 1. Pass key-value pairs through the `params` keyword
 ```python
 payload = {'key1': 'value1', 'key2': 'value2'}
 r = requests.get('https://httpbin.org/get', params=payload)
 ```
 2. Pass the dict positionally (`params` is the second parameter of `requests.get`)
 ```python
 params = {'key1': 'value1', 'key2': 'value2'}
 r = requests.get('https://httpbin.org/get', params)
 ```
 3. Put the query string directly in the URL
 ```python
 url = 'https://httpbin.org/get?key1=val1&key2=val2'
 r = requests.get(url)
 ```
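The encoding that `params` performs can be inspected by preparing a request and looking at its final URL. This sketch assumes nothing beyond Requests and the standard library; the list value and the `skipped` key are made up to illustrate repeated keys and dropped `None` values.

```python
from urllib.parse import urlencode

import requests

base = 'https://httpbin.org/get'  # placeholder; nothing is sent in this sketch
payload = {'key1': 'value1', 'key2': ['a', 'b'], 'skipped': None}

prepared = requests.Request('GET', base, params=payload).prepare()

# Manual equivalent: list values repeat the key, and None values are left out
manual = base + '?' + urlencode({'key1': 'value1', 'key2': ['a', 'b']}, doseq=True)

print(prepared.url)
print(prepared.url == manual)
```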
 ### Setting request headers
 HTTP request headers are set through the `headers` parameter, for example:
 ```python
 url = 'https://httpbin.org/get'
 headers = {'user-agent': 'my-app/0.0.1'}
 r = requests.get(url, headers=headers)
 ```
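Headers can also be set once on a Session and merged with per-request headers. The sketch below prepares a request through a session to show the merged result without sending anything; the `Accept` value is an arbitrary example.

```python
import requests

session = requests.Session()
session.headers.update({'Accept': 'application/json'})  # applied to every request

request = requests.Request(
    'GET', 'https://httpbin.org/get', headers={'user-agent': 'my-app/0.0.1'}
)
prepared = session.prepare_request(request)

# Header names are matched case-insensitively
print(prepared.headers['User-Agent'])
print(prepared.headers['accept'])
```

Per-request headers take precedence over session-level ones with the same name.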
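 ### Response content
 A response exposes its content through several attributes:
 | Attribute | Description |
 | :------------ | :----------------------------- |
 | r.text | Response body as a string, decoded automatically |
 | r.content | Response body as bytes |
 | r.json() | Parses a JSON body into Python objects (dict or list) |
 | r.raw | Raw response stream; requires `stream=True` and manual decoding |
 | r.encoding | Encoding used to decode the body |
 | r.status_code | HTTP status code |
 | r.headers | Response headers as a case-insensitive dict |
 | r.request | The PreparedRequest that produced this response |
 | r.url | Final URL of the request |
 | r.history | List of responses from any redirects |
 For example:
 ```python
 r = requests.get('https://api.github.com')
 print(r.text)  # response body as a string
 print(r.content)  # response body as bytes
 print(r.json())  # JSON body parsed into a dict
 print(r.raw)  # underlying raw response object
 ```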
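The relationship between `encoding`, `content`, `text`, and `json()` is easier to see against a server whose reply is known. This is a sketch that starts a throwaway local HTTP server from the standard library; `JsonHandler` and the payload are invented for the demonstration.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class JsonHandler(BaseHTTPRequestHandler):
    """Tiny local server used as a stand-in for a real API."""

    def do_GET(self):
        body = json.dumps({'greeting': 'héllo'}, ensure_ascii=False).encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'application/json; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(('127.0.0.1', 0), JsonHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

session = requests.Session()
session.trust_env = False  # ignore proxy environment variables for localhost
r = session.get(f'http://127.0.0.1:{server.server_port}/', timeout=5)
server.shutdown()
server.server_close()

print(r.encoding)       # taken from the charset in the Content-Type header
print(type(r.content))  # raw bytes
print(type(r.text))     # str, decoded using r.encoding
print(r.json())
```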
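 ## Advanced usage
 Requests also offers a number of advanced features that broaden the range of situations it can handle.
 ### Session persistence
 Requests provides a Session object for keeping state across requests:
 ```python
 s = requests.Session()
 s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
 r = s.get("http://httpbin.org/cookies")
 print(r.text)
 # '{"cookies": {"sessioncookie": "123456789"}}'
 ```
 **Module-level functions such as `requests.get()` do not keep cookies between calls**, so no state carries over from one request to the next. To maintain a session, use a Session object.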
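The difference can be illustrated without contacting a server by preparing the same request with and without a session. In this sketch the cookie is placed in the session's jar by hand, which stands in for a cookie a server would normally set.

```python
import requests

url = 'https://httpbin.org/cookies'  # placeholder; nothing is sent in this sketch

session = requests.Session()
# Stand-in for a cookie received from the server on an earlier request
session.cookies.set('sessioncookie', '123456789', domain='httpbin.org', path='/')

with_session = session.prepare_request(requests.Request('GET', url))
without_session = requests.Request('GET', url).prepare()

print(with_session.headers.get('Cookie'))     # cookie attached by the session
print(without_session.headers.get('Cookie'))  # no cookie without a session
```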
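 ### SSL certificate verification
 Requests verifies the server's TLS certificate by default. To authenticate yourself to the server, pass a local certificate as a client certificate through the `cert` parameter:
 ```python
 import requests
 resp = requests.get('https://example.com', cert='path/to/certfile')
 ```
 Alternatively, point `verify` at a local CA bundle that is used to validate the server's TLS certificate:
 ```python
 import requests
 resp = requests.get('https://example.com', verify='path/to/cacert.pem')
 ```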
 ### Proxies
 Using a proxy is straightforward as well:
 ```python
 import requests
 proxies = {
     "http": "http://10.10.1.10:3128",
     "https": "http://10.10.1.10:1080",
 }
 requests.get("http://example.org", proxies=proxies)
 ```
 Proxies can also be configured through the HTTP_PROXY and HTTPS_PROXY environment variables.
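Which proxy applies to a given URL depends on the keys of the `proxies` dict. The sketch below uses `select_proxy` from `requests.utils` to show the lookup; the per-host key and the `internal.example` host name are made up for illustration.

```python
from requests.utils import select_proxy

proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
    # Per-host override; this host name is invented for the example
    'http://internal.example': 'http://10.10.1.20:8080',
}

urls = (
    'http://example.org/',
    'https://example.org/',
    'http://internal.example/api',
)
for url in urls:
    # A scheme://host key wins over a plain scheme key
    print(url, '->', select_proxy(url, proxies))
```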
 ### Timeouts
 The `timeout` parameter tells Requests how long to wait for the server, in seconds:
 ```python
 requests.get('https://github.com', timeout=0.001)
 ```
 To set the connect timeout and the read timeout separately, pass a tuple:
 ```python
 requests.get('https://github.com', timeout=(3.05, 10))
 ```
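The two timeouts fail in different ways, which can be observed against a local socket that accepts connections but never answers. This is only a sketch: the silent listener stands in for a slow server, and the 0.5 second read timeout is an arbitrary value chosen to keep the demo short.

```python
import socket

import requests

# Local stand-in for a slow server: it accepts TCP connections but never replies
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(('127.0.0.1', 0))
listener.listen(1)
port = listener.getsockname()[1]

session = requests.Session()
session.trust_env = False  # ignore proxy environment variables for localhost

try:
    # (connect timeout, read timeout)
    session.get(f'http://127.0.0.1:{port}/', timeout=(3.05, 0.5))
    outcome = 'response received'
except requests.ConnectTimeout:
    outcome = 'connect timeout'
except requests.ReadTimeout:
    outcome = 'read timeout'
finally:
    listener.close()

print(outcome)
```

Here the connection is established quickly, so only the read timeout is expected to fire.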
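 ### Exception handling
 The exceptions raised by Requests fall into the following groups:
 - **Connection errors**: `ConnectionError` and `ConnectTimeout`, raised when connecting to the remote server fails.
 - **Timeouts**: `Timeout` means the request timed out. It is the base class of `ConnectTimeout` and `ReadTimeout`.
 - **TooManyRedirects**: the number of redirects exceeded the limit (30 by default).
 - **HTTP errors**: `HTTPError` represents an HTTP error response such as 404 or 500. Requests does not raise it on its own; call `response.raise_for_status()` to turn an error status into an exception.
 - **Request exceptions**: `RequestException` is the base class of the exceptions defined by Requests.
 - **SSL errors**: `SSLError` indicates that SSL certificate verification failed.
 - **Proxy errors**: `ProxyError` indicates that connecting through the proxy failed.
 - **Decoding errors**: `JSONDecodeError` (raised by `response.json()` in recent versions) and `ContentDecodingError` indicate that the response data could not be decoded.
 - **Others**: `InvalidURL`, `MissingSchema`, and similar exceptions.
 These exceptions can be caught with a try/except statement: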
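 ```python
 import requests
 try:
     response = requests.get('https://httpbin.org/delay/10', timeout=2)
 except requests.ConnectTimeout:
     print('Connection timed out')
 except requests.ReadTimeout:
     # The server accepted the connection but did not answer within 2 seconds
     print('Read timed out')
 except requests.ConnectionError:
     print('Connection error')
 ```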
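 If an exception is not caught, it propagates and terminates the program.
 | Exception | Description |
 | :---------------------- | :-------------- |
 | ConnectionError | Network connection error |
 | ConnectTimeout | Connection timed out |
 | Timeout | Request timed out |
 | TooManyRedirects | Too many redirects |
 | HTTPError | HTTP error response |
 | RequestException | Base class for request exceptions |
 | SSLError | SSL certificate verification error |
 | ProxyError | Proxy connection error |
 | JSONDecodeError | JSON parsing error |
 | InvalidURL | Malformed URL |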
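Because several of these classes inherit from one another, the order of the checks matters. The sketch below uses a hypothetical `describe_failure` helper to show one reasonable ordering, from most specific to most general; it inspects exception instances directly and needs no network.

```python
import requests


def describe_failure(exc):
    """Hypothetical helper: map a Requests exception to a short label."""
    # Most specific classes first, mirroring the order of except clauses
    if isinstance(exc, requests.ConnectTimeout):
        return 'connect timeout'
    if isinstance(exc, requests.ReadTimeout):
        return 'read timeout'
    if isinstance(exc, requests.ConnectionError):
        return 'connection error'
    if isinstance(exc, requests.HTTPError):
        return 'http error'
    if isinstance(exc, requests.RequestException):
        return 'other requests error'
    return 'not a requests error'


# SSLError and ProxyError are subclasses of ConnectionError
print(describe_failure(requests.exceptions.SSLError()))
print(describe_failure(requests.exceptions.InvalidURL()))
# ConnectTimeout inherits from both ConnectionError and Timeout
print(issubclass(requests.ConnectTimeout, requests.Timeout))
```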
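 ### Streaming downloads
 For large downloads, streaming mode saves memory:
 ```python
 with requests.get('http://httpbin.org/stream/100', stream=True) as r:
     for chunk in r.iter_content(chunk_size=1024):
         print(chunk)
 ```
 In this mode the body is downloaded only as you iterate over it. The stream can be consumed only once, so if you need to read the body several times, access `r.content` instead, which loads the whole body into memory.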
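In practice the chunks are usually written to a file rather than printed. The following sketch shows that pattern with a hypothetical `save_stream` helper; to stay runnable offline it feeds the helper a hand-built `Response` over an in-memory stream, which stands in for `requests.get(url, stream=True)`.

```python
import io
import os
import tempfile

import requests


def save_stream(response, path, chunk_size=1024):
    """Hypothetical helper: write a streamed response to disk chunk by chunk."""
    written = 0
    with open(path, 'wb') as f:
        for chunk in response.iter_content(chunk_size=chunk_size):
            f.write(chunk)
            written += len(chunk)
    return written


# Stand-in for a real streamed response: a bare Response over in-memory bytes
fake = requests.Response()
fake.status_code = 200
fake.raw = io.BytesIO(b'x' * 5000)

target = os.path.join(tempfile.mkdtemp(), 'download.bin')
total = save_stream(fake, target)
print(total, os.path.getsize(target))
```

Only one chunk is held in memory at a time, so the approach scales to files larger than available RAM.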
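 ### Retrying connections
 Individual request calls have no `retries` parameter. Instead, mount an `HTTPAdapter` configured with a urllib3 `Retry` object on a Session so that failed requests are retried automatically:
 ```python
 import requests
 from requests.adapters import HTTPAdapter
 from urllib3.util.retry import Retry
 s = requests.Session()
 retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
 # Mount the adapter for both schemes so HTTPS requests are retried as well
 s.mount('http://', HTTPAdapter(max_retries=retries))
 s.mount('https://', HTTPAdapter(max_retries=retries))
 ```
 Requests sent through this session are retried up to 5 times when the response status is 502, 503, or 504.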
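A retry policy is often wrapped in a small factory so every part of a program gets the same configuration. This is a sketch under that assumption: `build_retry_session` is a hypothetical helper name, and the final lines only inspect the mounted adapter rather than sending a request.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_retry_session(total=5, backoff_factor=1, statuses=(502, 503, 504)):
    """Hypothetical helper: a Session that retries on both http:// and https://."""
    retry = Retry(total=total, backoff_factor=backoff_factor, status_forcelist=statuses)
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session


session = build_retry_session()
# Look up the adapter that would handle an HTTPS URL and read back its policy
policy = session.get_adapter('https://example.com/').max_retries
print(policy.total, sorted(policy.status_forcelist))
```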
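 ## Practical tips
 ### File upload
 Requests makes file uploads very simple:
 ```python
 url = 'https://httpbin.org/post'
 # Open the file in binary mode; the context manager closes it after the upload
 with open('report.pdf', 'rb') as f:
     r = requests.post(url, files={'file': f})
 ```
 All you need is a dict that maps the form field name to a file object; Requests encodes the multipart body and sends it for you.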
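The multipart encoding can be inspected by preparing the upload instead of sending it. In this sketch an in-memory `BytesIO` object stands in for the real `report.pdf`, and the tuple form of `files` sets the file name and content type explicitly.

```python
import io

import requests

url = 'https://httpbin.org/post'  # placeholder; nothing is sent in this sketch
fake_pdf = io.BytesIO(b'%PDF-1.4 demo')  # in-memory stand-in for open('report.pdf', 'rb')

# (filename, file object, content type) controls how the part is described
files = {'file': ('report.pdf', fake_pdf, 'application/pdf')}
prepared = requests.Request('POST', url, files=files).prepare()

print(prepared.headers['Content-Type'])            # multipart/form-data with a boundary
print(b'filename="report.pdf"' in prepared.body)  # file name appears in the part header
```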
 ### Downloading an image
 An image is just binary data, so it can be fetched like this:
 ```python
 url = 'https://images.pexels.com/photos/1562477/pexels-photo-1562477.jpeg'
 r = requests.get(url)
 with open('image.jpeg', 'wb') as f:
     f.write(r.content)
 ```
 The image bytes are available in `r.content` and can be written straight to a file.
 ### Prepared Request
 To build a request once and send it several times, use a Prepared Request:
 ```python
 url = 'https://httpbin.org/post'
 data = {'key': 'value'}
 headers = {'User-Agent': 'my-app'}
 request = requests.Request('POST', url, data=data, headers=headers)
 prepared_request = request.prepare()
 s = requests.Session()
 response = s.send(prepared_request)
 ```
 ## Asynchronous Requests
 ### Async with Gevent
 ```python
 from gevent import monkey
 monkey.patch_all()  # patch the standard library before importing requests
 import gevent
 import requests
 urls = [
     'https://www.python.org',
     'https://www.mi.com',
     'https://www.baidu.com'
 ]
 jobs = [gevent.spawn(requests.get, url) for url in urls]
 gevent.joinall(jobs)
 print([job.value.text for job in jobs])
 ```
 ### Async with asyncio
 ```python
 import asyncio
 import requests
 # Requests has no native async API, so each blocking call runs in a worker
 # thread via asyncio.to_thread (Python 3.9+).
 def download_site(url):
     response = requests.get(url, timeout=10)
     print(f"Read {len(response.content)} from {url}")
 async def download_all_sites(sites):
     tasks = [asyncio.to_thread(download_site, url) for url in sites]
     await asyncio.gather(*tasks)
 if __name__ == "__main__":
     sites = [
         "https://www.jython.org",
         "http://olympus.realpython.org/dice",
     ] * 80
     asyncio.run(download_all_sites(sites))
 ```