This article covers how to handle invalid characters with aiohttp (UnicodeDecodeError: 'utf-8' codec can't decode bytes in position......). I hope it offers a useful reference to developers who run into the same problem; if that's you, follow along!
This problem cost me nearly a full day. Calling text() kept raising "UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 24461-24462: invalid continuation byte", while calling read() returns raw bytes, and the Chinese text came out garbled when I parsed it later. I searched a lot of material online, but in the end it was my own oversight: reading the source code, I was so fixated on the encoding that I ignored the other parameter, which has a default value.
Here is the solution:
import aiohttp
import asyncio
from bs4 import BeautifulSoup  # needed by cc() below

headers = {
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, sdch, br",
    "Accept-Language": "zh-CN,zh;q=0.8",
}


async def ss():
    async with aiohttp.ClientSession() as session:
        async with session.get('http://www.iteye.com/blogs/tag/java', headers=headers) as resp:
            print(resp.status)
            # "ignore" skips invalid characters; the default is "strict",
            # which raises an exception when an invalid character is encountered
            d = (await resp.text("utf-8", "ignore"))
            # d = await resp.read()
            # d = await resp.text()
            cc(d)


def cc(v):
    print(v)
    soup = BeautifulSoup(v, "lxml")
    contents = soup.select("div.content")
    for conten in contents:
        articleAuthor = conten.select("div.blog_info > a")
        if articleAuthor:
            print(articleAuthor)
            articleAuthor = articleAuthor[0]
        else:
            articleAuthor = ""
        print(articleAuthor)


loop = asyncio.get_event_loop()
tasks = [ss()]
loop.run_until_complete(asyncio.gather(*tasks))
With this, the Chinese text displays correctly (the "utf-8" argument is actually optional, since utf-8 is already the default).
With await resp.text(), the error above is raised immediately.
With await resp.read(), the Chinese text comes out garbled during parsing.
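If you would rather keep await resp.read(), the raw bytes can also be decoded by hand before parsing, which gives the same effect. The following is only a minimal sketch of that alternative (fetch_decoded is a hypothetical helper, not part of the original fix), assuming the page really is UTF-8:

import asyncio

import aiohttp


async def fetch_decoded(url):
    # Hypothetical helper: read the undecoded bytes ourselves and drop any
    # invalid byte sequences, mirroring what resp.text("utf-8", "ignore") does.
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            raw = await resp.read()          # bytes, nothing decoded yet
            return raw.decode("utf-8", errors="ignore")


# d = asyncio.get_event_loop().run_until_complete(
#     fetch_decoded('http://www.iteye.com/blogs/tag/java'))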
The source code of text():
@asyncio.coroutine
def text(self, encoding=None, errors='strict'):
    """Read response payload and decode."""
    if self._content is None:
        yield from self.read()
    if encoding is None:
        encoding = self._get_encoding()
    return self._content.decode(encoding, errors=errors)
The default value of the errors parameter is 'strict', which means an exception is raised when an invalid character is encountered; if it is set to 'ignore', invalid characters are simply skipped.
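The difference between the two values can be reproduced with a plain bytes.decode() call, without aiohttp at all. A quick illustration (the sample bytes below are made up for demonstration):

# b"\xe4\xb8" starts a 3-byte UTF-8 sequence that is never completed, so the
# byte that follows it triggers the same "invalid continuation byte" error
# seen above.
data = "中文".encode("utf-8") + b"\xe4\xb8" + b"text"

try:
    data.decode("utf-8")                       # errors defaults to 'strict'
except UnicodeDecodeError as e:
    print("strict:", e)                        # raises, just like resp.text()

print("ignore:", data.decode("utf-8", errors="ignore"))   # prints: ignore: 中文text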
That concludes this article on handling invalid characters with aiohttp (UnicodeDecodeError: 'utf-8' codec can't decode bytes in position......). I hope it proves helpful to other programmers!