Note: this article is reproduced from stackoverflow.com.
Tags: dataframe, pandas, python, stata

View Stata variable labels in Pandas

Published on 2020-05-15 13:29:15

Stata .dta files include a label (description) for each variable, which can be viewed in Stata using the describe command. For example, in this online dataset the adults and kids variables have the descriptions "number of adults in household" and "number of children in household", respectively:

clear
use http://www.principlesofeconometrics.com/stata/alcohol.dta

describe

Contains data from http://www.principlesofeconometrics.com/stata/alcohol.dta
  obs:         1,000                          
 vars:             4                          10 Nov 2007 11:33
 size:         5,000                          (_dta has notes)
-------------------------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------------------------------------------------------------
adults          byte    %8.0g                 number of adults in household
kids            byte    %8.0g                 number of children in household
income          int     %8.0g                 weekly income
consume         byte    %8.0g                 =1 if consume alcohol, =0 otherwise
-------------------------------------------------------------------------------------------------------------------------------------
Sorted by: 

Those descriptions do not show up in Pandas, for example when displaying the DataFrame or calling describe():

import pandas as pd
df = pd.read_stata('http://www.principlesofeconometrics.com/stata/alcohol.dta')
df

     adults  kids  income  consume
0         2     2     758        1
1         2     3    1785        1
2         3     0    1200        1
..      ...   ...     ...      ...
997       2     0    1383        1
998       2     2     816        0
999       2     2     387        0

df.describe()

            adults         kids       income      consume
count  1000.000000  1000.000000  1000.000000  1000.000000
mean      2.012000     0.722000   649.528000     0.766000
std       0.815181     1.078833   460.657826     0.423584
min       1.000000     0.000000    12.000000     0.000000
25%       2.000000     0.000000   295.000000     1.000000
50%       2.000000     0.000000   562.500000     1.000000
75%       2.000000     1.000000   887.500000     1.000000
max       6.000000     5.000000  3846.000000     1.000000

Is there a way to view this information after loading it to a Pandas DataFrame using read_stata()?

Asked by Max Ghenis

Answer by Pearly Spencer, 2020-03-03 17:42

Using Stata's toy dataset auto as an example:

sysuse auto, clear

describe

Contains data from auto.dta
  obs:            74                          1978 Automobile Data
 vars:            12                          13 Apr 2014 17:45
 size:         3,182                          (_dta has notes)
-------------------------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------------------------------------------------------------
make            str18   %-18s                 Make and Model
price           int     %8.0gc                Price
mpg             int     %8.0g                 Mileage (mpg)
rep78           int     %8.0g                 Repair Record 1978
headroom        float   %6.1f                 Headroom (in.)
trunk           int     %8.0g                 Trunk space (cu. ft.)
weight          int     %8.0gc                Weight (lbs.)
length          int     %8.0g                 Length (in.)
turn            int     %8.0g                 Turn Circle (ft.)
displacement    int     %8.0g                 Displacement (cu. in.)
gear_ratio      float   %6.2f                 Gear Ratio
foreign         byte    %8.0g      origin     Car type
-------------------------------------------------------------------------------------------------------------------------------------
Sorted by: foreign

The following works for me:

import pandas as pd
# iterator=True returns a StataReader instead of a DataFrame
data = pd.read_stata('auto.dta', iterator=True)
labels = data.variable_labels()
labels

Out[5]: 
{'make': 'Make and Model',
 'price': 'Price',
 'mpg': 'Mileage (mpg)',
 'rep78': 'Repair Record 1978',
 'headroom': 'Headroom (in.)',
 'trunk': 'Trunk space (cu. ft.)',
 'weight': 'Weight (lbs.)',
 'length': 'Length (in.)',
 'turn': 'Turn Circle (ft.) ',
 'displacement': 'Displacement (cu. in.)',
 'gear_ratio': 'Gear Ratio',
 'foreign': 'Car type'}
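
To get both the data and the labels in one pass, and to show the labels next to the describe() output the question asks about, something like the following might work. This is only a sketch: read_stata_with_labels is a hypothetical helper name, not part of pandas, and the small demo file is made up here (reusing two labels from the question) so the example runs without auto.dta or a network connection.

```python
import os
import tempfile

import pandas as pd


def read_stata_with_labels(path):
    """Hypothetical helper: return (DataFrame, {column: variable label})."""
    # The StataReader is used as a context manager so the file gets closed.
    with pd.read_stata(path, iterator=True) as reader:
        labels = reader.variable_labels()
        df = reader.read()
    return df, labels


# Made-up demo data, with two labels borrowed from the question's dataset.
demo = pd.DataFrame({"adults": [2, 2, 3], "kids": [2, 3, 0]})
demo_labels = {
    "adults": "number of adults in household",
    "kids": "number of children in household",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.dta")
    # to_stata can store variable labels in the .dta file.
    demo.to_stata(demo_path, write_index=False, variable_labels=demo_labels)
    df, labels = read_stata_with_labels(demo_path)

# One row per variable, with its Stata label as the first column.
summary = df.describe().T
summary.insert(0, "label", [labels[col] for col in summary.index])
print(summary)
```

For the question's dataset, the same helper should accept the URL in place of the local path, since read_stata takes URLs as shown above.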