温馨提示:本文翻译自stackoverflow.com，查看原文请点击：python - How to pivot a dataframe

group-by pandas pandas-groupby pivot python

python - 如何旋转数据框

发布于 2020-03-27 10:30:09

什么是支点？
我如何枢纽？
这是支点吗？
长格式到宽格式？

我已经看到很多有关数据透视表的问题。即使他们不知道他们在询问数据透视表，通常也是如此。几乎不可能写出涵盖枢纽各个方面的规范问答。

...但是我要去尝试一下。

现有问题和答案的问题在于，问题通常集中在OP难以推广的细微差别上，以便使用许多现有的良好答案。但是，没有一个答案想给出全面的解释（因为这是一项艰巨的任务）

Look a few examples from my google search

How to pivot a dataframe in Pandas?
- Good question and answer. But the answer only answers the specific question with little explanation.
pandas pivot table to data frame
- In this question, the OP is concerned with the output of the pivot. Namely how the columns look. OP wanted it to look like R. This isn't very helpful for pandas users.
pandas pivoting a dataframe, duplicate rows
- Another decent question but the answer focuses on one method, namely pd.DataFrame.pivot

So whenever someone searches for pivot they get sporadic results that are likely not going to answer their specific question.

Setup

You may notice that I conspicuously named my columns and relevant column values to correspond with how I'm going to pivot in the answers below.

import numpy as np
import pandas as pd
from numpy.core.defchararray import add

np.random.seed([3,1415])
n = 20

cols = np.array(['key', 'row', 'item', 'col'])
arr1 = (np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str)

df = pd.DataFrame(
    add(cols, arr1), columns=cols
).join(
    pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val')
)
print(df)

     key   row   item   col  val0  val1
0   key0  row3  item1  col3  0.81  0.04
1   key1  row2  item1  col2  0.44  0.07
2   key1  row0  item1  col0  0.77  0.01
3   key0  row4  item0  col2  0.15  0.59
4   key1  row0  item2  col1  0.81  0.64
5   key1  row2  item2  col4  0.13  0.88
6   key2  row4  item1  col3  0.88  0.39
7   key1  row4  item1  col1  0.10  0.07
8   key1  row0  item2  col4  0.65  0.02
9   key1  row2  item0  col2  0.35  0.61
10  key2  row0  item2  col1  0.40  0.85
11  key2  row4  item1  col2  0.64  0.25
12  key0  row2  item2  col3  0.50  0.44
13  key0  row4  item1  col4  0.24  0.46
14  key1  row3  item2  col3  0.28  0.11
15  key0  row3  item1  col1  0.31  0.23
16  key0  row0  item2  col3  0.86  0.01
17  key0  row4  item0  col3  0.64  0.21
18  key2  row2  item2  col0  0.13  0.45
19  key0  row2  item0  col4  0.37  0.70

Question(s)

Why do I get ValueError: Index contains duplicate entries, cannot reshape

How do I pivot df such that the col values are columns, row values are the index, and mean of val0 are the values?

col   col0   col1   col2   col3  col4
row                                  
row0  0.77  0.605    NaN  0.860  0.65
row2  0.13    NaN  0.395  0.500  0.25
row3   NaN  0.310    NaN  0.545   NaN
row4   NaN  0.100  0.395  0.760  0.24

How do I pivot df such that the col values are columns, row values are the index, mean of val0 are the values, and missing values are 0?

col   col0   col1   col2   col3  col4
row                                  
row0  0.77  0.605  0.000  0.860  0.65
row2  0.13  0.000  0.395  0.500  0.25
row3  0.00  0.310  0.000  0.545  0.00
row4  0.00  0.100  0.395  0.760  0.24

Can I get something other than mean, like maybe sum?

col   col0  col1  col2  col3  col4
row                               
row0  0.77  1.21  0.00  0.86  0.65
row2  0.13  0.00  0.79  0.50  0.50
row3  0.00  0.31  0.00  1.09  0.00
row4  0.00  0.10  0.79  1.52  0.24

Can I do more that one aggregation at a time?

       sum                          mean                           
col   col0  col1  col2  col3  col4  col0   col1   col2   col3  col4
row                                                                
row0  0.77  1.21  0.00  0.86  0.65  0.77  0.605  0.000  0.860  0.65
row2  0.13  0.00  0.79  0.50  0.50  0.13  0.000  0.395  0.500  0.25
row3  0.00  0.31  0.00  1.09  0.00  0.00  0.310  0.000  0.545  0.00
row4  0.00  0.10  0.79  1.52  0.24  0.00  0.100  0.395  0.760  0.24

Can I aggregate over multiple value columns?

      val0                             val1                          
col   col0   col1   col2   col3  col4  col0   col1  col2   col3  col4
row                                                                  
row0  0.77  0.605  0.000  0.860  0.65  0.01  0.745  0.00  0.010  0.02
row2  0.13  0.000  0.395  0.500  0.25  0.45  0.000  0.34  0.440  0.79
row3  0.00  0.310  0.000  0.545  0.00  0.00  0.230  0.00  0.075  0.00
row4  0.00  0.100  0.395  0.760  0.24  0.00  0.070  0.42  0.300  0.46

Can Subdivide by multiple columns?

item item0             item1                         item2                   
col   col2  col3  col4  col0  col1  col2  col3  col4  col0   col1  col3  col4
row                                                                          
row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.605  0.86  0.65
row2  0.35  0.00  0.37  0.00  0.00  0.44  0.00  0.00  0.13  0.000  0.50  0.13
row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.000  0.28  0.00
row4  0.15  0.64  0.00  0.00  0.10  0.64  0.88  0.24  0.00  0.000  0.00  0.00

item      item0             item1                         item2                  
col        col2  col3  col4  col0  col1  col2  col3  col4  col0  col1  col3  col4
key  row                                                                         
key0 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.86  0.00
     row2  0.00  0.00  0.37  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.50  0.00
     row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.00  0.00  0.00
     row4  0.15  0.64  0.00  0.00  0.00  0.00  0.00  0.24  0.00  0.00  0.00  0.00
key1 row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.81  0.00  0.65
     row2  0.35  0.00  0.00  0.00  0.00  0.44  0.00  0.00  0.00  0.00  0.00  0.13
     row3  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.28  0.00
     row4  0.00  0.00  0.00  0.00  0.10  0.00  0.00  0.00  0.00  0.00  0.00  0.00
key2 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.40  0.00  0.00
     row2  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.13  0.00  0.00  0.00
     row4  0.00  0.00  0.00  0.00  0.00  0.64  0.88  0.00  0.00  0.00  0.00  0.00

Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?

col   col0  col1  col2  col3  col4
row                               
row0     1     2     0     1     1
row2     1     0     2     1     2
row3     0     1     0     2     0
row4     0     1     2     2     1

How do I convert a DataFrame from long to wide by pivoting on ONLY two columns? Given,

np.random.seed([3, 1415])
df2 = pd.DataFrame({'A': list('aaaabbbc'), 'B': np.random.choice(15, 8)})        
df2        
   A   B
0  a   0
1  a  11
2  a   2
3  a  11
4  b  10
5  b  10
6  b  14
7  c   7

The expected should would look something like

      a     b    c
0   0.0  10.0  7.0
1  11.0  10.0  NaN
2   2.0  14.0  NaN
3  11.0   NaN  NaN

How do I flatten the multiple index to single index after pivot

From

   1|1  2|1  2|2               
a    2    1    1
b    2    1    0
c    1    0    0

提问者

piRSquared

被浏览

282

查看英文版

查看原文

198k 2019-12-19 07:01

We start by answering the first question:

Question 1

Why do I get ValueError: Index contains duplicate entries, cannot reshape

This occurs because pandas is attempting to reindex either a columns or index object with duplicate entries. There are varying methods to use that can perform a pivot. Some of them are not well suited to when there are duplicates of the keys in which it is being asked to pivot on. For example. Consider pd.DataFrame.pivot. I know there are duplicate entries that share the row and col values:

df.duplicated(['row', 'col']).any()

True

So when I pivot using

df.pivot(index='row', columns='col', values='val0')

I get the error mentioned above. In fact, I get the same error when I try to perform the same task with:

df.set_index(['row', 'col'])['val0'].unstack()

Here is a list of idioms we can use to pivot

pd.DataFrame.groupby + pd.DataFrame.unstack
- Good general approach for doing just about any type of pivot
- You specify all columns that will constitute the pivoted row levels and column levels in one group by. You follow that by selecting the remaining columns you want to aggregate and the function(s) you want to perform the aggregation. Finally, you unstack the levels that you want to be in the column index.
pd.DataFrame.pivot_table
- A glorified version of groupby with more intuitive API. For many people, this is the preferred approach. And is the intended approach by the developers.
- Specify row level, column levels, values to be aggregated, and function(s) to perform aggregations.
pd.DataFrame.set_index + pd.DataFrame.unstack
- Convenient and intuitive for some (myself included). Cannot handle duplicate grouped keys.
- Similar to the groupby paradigm, we specify all columns that will eventually be either row or column levels and set those to be the index. We then unstack the levels we want in the columns. If either the remaining index levels or column levels are not unique, this method will fail.
pd.DataFrame.pivot
- Very similar to set_index in that it shares the duplicate key limitation. The API is very limited as well. It only takes scalar values for index, columns, values.
- Similar to the pivot_table method in that we select rows, columns, and values on which to pivot. However, we cannot aggregate and if either rows or columns are not unique, this method will fail.
pd.crosstab
- This a specialized version of pivot_table and in it's purest form is the most intuitive way to perform several tasks.
pd.factorize + np.bincount
- This is a highly advanced technique that is very obscure but is very fast. It cannot be used in all circumstances, but when it can be used and you are comfortable using it, you will reap the performance rewards.
pd.get_dummies + pd.DataFrame.dot
- I use this for cleverly performing cross tabulation.

Examples

What I'm going to do for each subsequent answer and question is to answer it using pd.DataFrame.pivot_table. Then I'll provide alternatives to perform the same task.

Question 3

How do I pivot df such that the col values are columns, row values are the index, mean of val0 are the values, and missing values are 0?

pd.DataFrame.pivot_table

fill_value is not set by default. I tend to set it appropriately. In this case I set it to 0. Notice I skipped question 2 as it's the same as this answer without the fill_value

aggfunc='mean' is the default and I didn't have to set it. I included it to be explicit.

df.pivot_table(
    values='val0', index='row', columns='col',
    fill_value=0, aggfunc='mean')

col   col0   col1   col2   col3  col4
row                                  
row0  0.77  0.605  0.000  0.860  0.65
row2  0.13  0.000  0.395  0.500  0.25
row3  0.00  0.310  0.000  0.545  0.00
row4  0.00  0.100  0.395  0.760  0.24

pd.DataFrame.groupby

df.groupby(['row', 'col'])['val0'].mean().unstack(fill_value=0)

pd.crosstab

pd.crosstab(
    index=df['row'], columns=df['col'],
    values=df['val0'], aggfunc='mean').fillna(0)

Question 4

Can I get something other than mean, like maybe sum?

pd.DataFrame.pivot_table

df.pivot_table(
    values='val0', index='row', columns='col',
    fill_value=0, aggfunc='sum')

col   col0  col1  col2  col3  col4
row                               
row0  0.77  1.21  0.00  0.86  0.65
row2  0.13  0.00  0.79  0.50  0.50
row3  0.00  0.31  0.00  1.09  0.00
row4  0.00  0.10  0.79  1.52  0.24

pd.DataFrame.groupby

df.groupby(['row', 'col'])['val0'].sum().unstack(fill_value=0)

pd.crosstab

pd.crosstab(
    index=df['row'], columns=df['col'],
    values=df['val0'], aggfunc='sum').fillna(0)

Question 5

Can I do more that one aggregation at a time?

Notice that for pivot_table and crosstab I needed to pass list of callables. On the other hand, groupby.agg is able to take strings for a limited number of special functions. groupby.agg would also have taken the same callables we passed to the others, but it is often more efficient to leverage the string function names as there are efficiencies to be gained.

pd.DataFrame.pivot_table

df.pivot_table(
    values='val0', index='row', columns='col',
    fill_value=0, aggfunc=[np.size, np.mean])

     size                      mean                           
col  col0 col1 col2 col3 col4  col0   col1   col2   col3  col4
row                                                           
row0    1    2    0    1    1  0.77  0.605  0.000  0.860  0.65
row2    1    0    2    1    2  0.13  0.000  0.395  0.500  0.25
row3    0    1    0    2    0  0.00  0.310  0.000  0.545  0.00
row4    0    1    2    2    1  0.00  0.100  0.395  0.760  0.24

pd.DataFrame.groupby

df.groupby(['row', 'col'])['val0'].agg(['size', 'mean']).unstack(fill_value=0)

pd.crosstab

pd.crosstab(
    index=df['row'], columns=df['col'],
    values=df['val0'], aggfunc=[np.size, np.mean]).fillna(0, downcast='infer')

Question 6

Can I aggregate over multiple value columns?

pd.DataFrame.pivot_table we pass values=['val0', 'val1'] but we could've left that off completely

df.pivot_table(
    values=['val0', 'val1'], index='row', columns='col',
    fill_value=0, aggfunc='mean')

      val0                             val1                          
col   col0   col1   col2   col3  col4  col0   col1  col2   col3  col4
row                                                                  
row0  0.77  0.605  0.000  0.860  0.65  0.01  0.745  0.00  0.010  0.02
row2  0.13  0.000  0.395  0.500  0.25  0.45  0.000  0.34  0.440  0.79
row3  0.00  0.310  0.000  0.545  0.00  0.00  0.230  0.00  0.075  0.00
row4  0.00  0.100  0.395  0.760  0.24  0.00  0.070  0.42  0.300  0.46

pd.DataFrame.groupby

df.groupby(['row', 'col'])['val0', 'val1'].mean().unstack(fill_value=0)

Question 7

Can Subdivide by multiple columns?

pd.DataFrame.pivot_table

df.pivot_table(
    values='val0', index='row', columns=['item', 'col'],
    fill_value=0, aggfunc='mean')

item item0             item1                         item2                   
col   col2  col3  col4  col0  col1  col2  col3  col4  col0   col1  col3  col4
row                                                                          
row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.605  0.86  0.65
row2  0.35  0.00  0.37  0.00  0.00  0.44  0.00  0.00  0.13  0.000  0.50  0.13
row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.000  0.28  0.00
row4  0.15  0.64  0.00  0.00  0.10  0.64  0.88  0.24  0.00  0.000  0.00  0.00

pd.DataFrame.groupby

df.groupby(
    ['row', 'item', 'col']
)['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)

Question 8

Can Subdivide by multiple columns?

pd.DataFrame.pivot_table

df.pivot_table(
    values='val0', index=['key', 'row'], columns=['item', 'col'],
    fill_value=0, aggfunc='mean')

item      item0             item1                         item2                  
col        col2  col3  col4  col0  col1  col2  col3  col4  col0  col1  col3  col4
key  row                                                                         
key0 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.86  0.00
     row2  0.00  0.00  0.37  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.50  0.00
     row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.00  0.00  0.00
     row4  0.15  0.64  0.00  0.00  0.00  0.00  0.00  0.24  0.00  0.00  0.00  0.00
key1 row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.81  0.00  0.65
     row2  0.35  0.00  0.00  0.00  0.00  0.44  0.00  0.00  0.00  0.00  0.00  0.13
     row3  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.28  0.00
     row4  0.00  0.00  0.00  0.00  0.10  0.00  0.00  0.00  0.00  0.00  0.00  0.00
key2 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.40  0.00  0.00
     row2  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.13  0.00  0.00  0.00
     row4  0.00  0.00  0.00  0.00  0.00  0.64  0.88  0.00  0.00  0.00  0.00  0.00

pd.DataFrame.groupby

df.groupby(
    ['key', 'row', 'item', 'col']
)['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)

pd.DataFrame.set_index because the set of keys are unique for both rows and columns

df.set_index(
    ['key', 'row', 'item', 'col']
)['val0'].unstack(['item', 'col']).fillna(0).sort_index(1)

Question 9

我是否可以汇总列和行一起出现的频率，又称为“交叉表”？

pd.DataFrame.pivot_table

df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')

    col   col0  col1  col2  col3  col4
row                               
row0     1     2     0     1     1
row2     1     0     2     1     2
row3     0     1     0     2     0
row4     0     1     2     2     1

pd.DataFrame.groupby

df.groupby(['row', 'col'])['val0'].size().unstack(fill_value=0)

pd.crosstab
```
pd.crosstab(df['row'], df['col'])
```

pd.factorize + np.bincount

# get integer factorization `i` and unique values `r`
# for column `'row'`
i, r = pd.factorize(df['row'].values)
# get integer factorization `j` and unique values `c`
# for column `'col'`
j, c = pd.factorize(df['col'].values)
# `n` will be the number of rows
# `m` will be the number of columns
n, m = r.size, c.size
# `i * m + j` is a clever way of counting the 
# factorization bins assuming a flat array of length
# `n * m`.  Which is why we subsequently reshape as `(n, m)`
b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
# BTW, whenever I read this, I think 'Bean, Rice, and Cheese'
pd.DataFrame(b, r, c)

      col3  col2  col0  col1  col4
row3     2     0     0     1     0
row2     1     2     1     0     2
row0     1     0     1     2     1
row4     2     2     0     1     1

pd.get_dummies

pd.get_dummies(df['row']).T.dot(pd.get_dummies(df['col']))

      col0  col1  col2  col3  col4
row0     1     2     0     1     1
row2     1     0     2     1     2
row3     0     1     0     2     0
row4     0     1     2     2     1

问题10

如何通过仅旋转两列将DataFrame从长转换为宽？

第一步是为每行分配一个数字-该数字将成为透视结果中该值的行索引。使用GroupBy.cumcount以下命令完成此操作：

df2.insert(0, 'count', df.groupby('A').cumcount())
df2

   count  A   B
0      0  a   0
1      1  a  11
2      2  a   2
3      3  a  11
4      0  b  10
5      1  b  10
6      2  b  14
7      0  c   7

第二步是使用新创建的列作为要调用的索引DataFrame.pivot。

df2.pivot(*df)
# df.pivot(index='count', columns='A', values='B')

A         a     b    c
count                 
0       0.0  10.0  7.0
1      11.0  10.0  NaN
2       2.0  14.0  NaN
3      11.0   NaN  NaN

问题11

我如何将多重索引展平为单索引 pivot

如果columns输入object字符串join

df.columns = df.columns.map('|'.join)

其他 format

df.columns = df.columns.map('{0[0]}|{0[1]}'.format)

MaxU 2017-12-15 18:31:13

您可以考虑扩展正式文档吗？

Monica Heddneck 2019-09-28 02:06:51

问题10的答案发生了什么？我懂了KeyError: 'A'。还有更多答案吗？

piRSquared 2019-09-28 04:19:45

@MonicaHeddneck我将再次进行审查，并在必要时进行更新。但是，'A'假设'A'您的数据框中有一个要分组的列。

Anil Kumar 2020-01-22 20:59:09

我可以汇总多个值列吗？答案将适用于不同数据类型的列。例如：value = ['val0'，'val1']，其中val0是int而val1是string

python - 如何旋转数据框

Setup

Question(s)

Question 1

Examples

Question 3

Question 4

Question 5

Question 6

Question 7

Question 8

Question 9

问题10

问题11

相关问题

热门github