pandas (2): Using factorize to convert nominal data to numeric codes


1. factorize()

Official documentation

This method is useful for obtaining a numeric representation of an array when all that matters is identifying distinct values. factorize is available as both a top-level function pandas.factorize(), and as a method Series.factorize() and Index.factorize().
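As a minimal sketch of the three entry points named above, the snippet below factorizes the same values through the top-level function, a Series, and an Index. The sample values are made up for illustration, and the input is wrapped in a NumPy object array rather than a plain list, since recent pandas releases prefer array-like inputs.

```python
import numpy as np
import pandas as pd

# Illustrative sample data (not from the original article)
values = np.array(["b", "a", "b", "c"], dtype=object)

# 1) Top-level function on a NumPy array
codes_fn, uniques_fn = pd.factorize(values)

# 2) Method on a Series
codes_s, uniques_s = pd.Series(values).factorize()

# 3) Method on an Index
codes_i, uniques_i = pd.Index(values).factorize()

# All three produce the same integer codes: 0, 1, 0, 2
print(codes_fn, codes_s, codes_i)
```

The codes agree across all three calls; only the container type of the uniques differs (an ndarray for the array input, an Index for the Series and Index inputs).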

pandas.factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None)
Encode input values as an enumerated type or categorical variable

Parameters:

values: sequence

A 1-D sequence. Sequences that aren’t pandas objects are coerced to ndarrays before factorization.

sort: bool, default False

Sort uniques and shuffle codes to maintain the relationship.
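To make the effect of sort concrete, here is a small sketch comparing the default with sort=True on the same illustrative input. The values are invented for the example.

```python
import numpy as np
import pandas as pd

# Illustrative sample data
values = np.array(["b", "b", "a", "c", "b"], dtype=object)

# Default: uniques keep the order of first appearance
codes_unsorted, uniques_unsorted = pd.factorize(values)
# codes: 0, 0, 1, 2, 0   uniques: b, a, c

# sort=True: uniques are sorted and codes are renumbered to match
codes_sorted, uniques_sorted = pd.factorize(values, sort=True)
# codes: 1, 1, 0, 2, 1   uniques: a, b, c

print(codes_unsorted, uniques_unsorted)
print(codes_sorted, uniques_sorted)
```

In both cases each code still points at the right entry in uniques; only the numbering changes.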

na_sentinel: int or None, default -1

Value to mark “not found”. If None, will not drop the NaN from the uniques of the values.
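The sketch below shows the default handling of missing values with an invented input. It deliberately does not pass na_sentinel, because that keyword belongs to the signature quoted here and was replaced by use_na_sentinel in pandas 2.0. Relying on the default keeps the example runnable on both older and newer versions.

```python
import numpy as np
import pandas as pd

# Illustrative sample data containing one missing value
values = np.array(["b", None, "a", "c", "b"], dtype=object)

codes, uniques = pd.factorize(values)

# The missing value is coded as -1 and is not included in uniques
print(codes)    # codes: 0, -1, 1, 2, 0
print(uniques)  # uniques: b, a, c
```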

Changed in version 1.1.2.

size_hint: int, optional

Hint to the hashtable sizer.

Returns

codes: ndarray

An integer ndarray that’s an indexer into uniques. uniques.take(codes) will have the same values as values.
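The round-trip property described above can be checked with a short sketch. The input is illustrative and contains no missing values, so every code is a valid position in uniques.

```python
import numpy as np
import pandas as pd

# Illustrative sample data without missing values
values = np.array(["b", "b", "a", "c", "b"], dtype=object)

codes, uniques = pd.factorize(values)

# Using codes as positions into uniques rebuilds the original values
restored = uniques.take(codes)

print(list(restored))  # b, b, a, c, b
```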

uniques: ndarray, Index, or Categorical

The unique valid values. When values is Categorical, uniques is a Categorical. When values is some other pandas object, an Index is returned. Otherwise, a 1-D ndarray is returned.
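A small sketch of the three return types for uniques, using made-up inputs: a NumPy array, a Series, and a Categorical with one unused category.

```python
import numpy as np
import pandas as pd

# Illustrative sample data
arr = np.array(["a", "c", "a"], dtype=object)

# NumPy array in -> ndarray of uniques
_, uniques_arr = pd.factorize(arr)

# Series (a pandas object) in -> Index of uniques
_, uniques_ser = pd.factorize(pd.Series(arr))

# Categorical in -> Categorical of uniques; the unused category "b"
# stays in the categories but is not among the unique values
cat = pd.Categorical(["a", "a", "c"], categories=["a", "b", "c"])
codes_cat, uniques_cat = pd.factorize(cat)

print(type(uniques_arr).__name__)  # ndarray
print(type(uniques_ser).__name__)  # Index
print(type(uniques_cat).__name__)  # Categorical
```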

My understanding

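The factorize function maps the nominal values in a Series to integers, with identical values mapped to the same integer. With the default sort=False, codes are assigned in order of first appearance: earlier values get smaller codes, later ones larger. The first row of the column is therefore always 0 (provided it is not a missing value); the second row is 1 if its value differs from the first and 0 otherwise; and so on. Missing values are coded as -1 by default and are left out of the unique values.

factorize returns a tuple with two elements. The first is an array holding the integer code for each input value. The second holds all the distinct nominal values with no duplicates; it is an Index when the input is a Series, and a plain ndarray when the input is not a pandas object.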
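As a minimal sketch of this typical use, the snippet below encodes one DataFrame column. The DataFrame, the column names color and color_code, and the mapping dictionary are all hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical data: a single nominal column
df = pd.DataFrame({"color": ["red", "green", "red", "blue", "green"]})

# factorize returns a (codes, uniques) tuple
codes, uniques = df["color"].factorize()

# Store the integer codes next to the original column
df["color_code"] = codes

# Build a label -> code lookup from the uniques (first seen gets 0)
mapping = {label: code for code, label in enumerate(uniques)}

print(df)
print(mapping)  # red -> 0, green -> 1, blue -> 2
```

Keeping uniques (or the mapping built from it) around makes it possible to translate the codes back to the original labels later.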