Channel: Stata Training – Data Analysis

Exporting Multiple Regression Results from Stata


Keywords: Stata multiple regression analysis, Stata multiple linear regression, running multiple regression in Stata, data analyst, data analysis

Stata is a widely used statistical package whose features and workflow many people appreciate.

Tools/Materials

Stata software

Data for the multiple regression model

Steps

Import the data: click "File" → "Import" → "Excel spreadsheet".

Click "Browse" to select the file, then tick "Import first row as variable names" so that the first row of the spreadsheet is used as the variable names.

In the Command window, enter the command "regress y x1 x2 x3".

The regression output appears in the Results window.
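The title promises exporting the results, but the walkthrough stops at the regression itself. A minimal sketch of one common route, assuming the community-contributed estout package (install once with ssc install estout) and the variable names used above:

regress y x1 x2 x3
esttab using regression_results.rtf, b se r2 replace    // writes a formatted coefficient table to an RTF file

The user-written outreg2 command (ssc install outreg2) is a popular alternative with similar one-line usage.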

Notes

The data must be numeric, not string.

Variable names (and the rest of the workflow) must not contain Chinese characters.



A Worked Example of Meta-Analysis in Stata


Keywords: Stata meta-analysis module, Stata meta-analysis tutorial, data analyst, applied meta-analysis in Stata

. sum

    Variable |       Obs        Mean    Std. Dev.       Min       Max
-------------+--------------------------------------------------------
         day |         7           1           0          1          1
        year |         7    2005.286    7.040698       1991      2011
     country |         0
       study |         0
          dn |         7    16.85714    11.66803          6         38
-------------+--------------------------------------------------------
          dm |         7    14.02571    3.834401        8.5         19
          ds |         7    5.448572    3.270578          1         10
          sn |         7    23.28571    14.02039         10         51
          sm |         7    12.95286    4.237431       7.83       19.9
          ss |         7    5.165714    2.857661        1.8          9

. list

     +----------------------------------------------------------------------------+
     | day   year   country          study   dn      dm     ds   sn     sm     ss |
     |----------------------------------------------------------------------------|
  1. |   1   2011      中国         陈裕胜   11    14.4      1   10   12.7    1.8 |
  2. |   1   2011      中国         梁道业    9     8.5   2.69   23   7.83   2.36 |
  3. |   1   2009      中国           徐波    8    18.7    8.1   12   19.9      8 |
  4. |   1   2007      中国           顾勤    6   12.48   6.55   14   9.04    5.3 |
  5. |   1   2006      中国         杨从山   24    11.1    6.8   26   12.2    6.7 |
     |----------------------------------------------------------------------------|
  6. |   1   2002      德国    Samir Sakka   22      14      3   27     12      3 |
  7. |   1   1991      美国   Dan Schuller   38      19     10   51     17      9 |
     +----------------------------------------------------------------------------+

. metan dn-ss, fixed lcols(day year country) texts(200) boxsca(200) nowt

          Study      |     SMD    [95% Conf. Interval]
---------------------+---------------------------------
1                    |  1.184       0.249     2.120
1                    |  0.273      -0.501     1.047
1                    | -0.149      -1.045     0.747
1                    |  0.606      -0.370     1.583
1                    | -0.163      -0.719     0.393
1                    |  0.667       0.088     1.246
1                    |  0.212      -0.209     0.633
---------------------+---------------------------------
I-V pooled SMD       |  0.290       0.047     0.532
---------------------+---------------------------------

Heterogeneity chi-squared =   9.15 (d.f. = 6) p = 0.165
I-squared (variation in SMD attributable to heterogeneity) =  34.4%

Test of SMD=0 : z=   2.34 p = 0.019

(Forest plot from metan omitted.)
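With I-squared at 34.4% (moderate heterogeneity that does not reach significance, p = 0.165), a random-effects model is a natural sensitivity check; a minimal sketch with the same variables, not run in the original:

. metan dn-ss, random lcols(day year country)    // DerSimonian-Laird random-effects pooling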

. metafunnel  _ES _seES

(Funnel plot from metafunnel omitted.)
. metabias  _ES _seES,begg

Note: data input format theta se_theta assumed.

Begg’s test for small-study effects:
Rank correlation between standardized intervention effect and its standard error

adj. Kendall’s Score (P-Q) =       7
         Std. Dev. of Score =    6.66
          Number of Studies =       7
                         z  =    1.05
                   Pr > |z| =   0.293
                         z  =    0.90 (continuity corrected)
                   Pr > |z| =   0.368 (continuity corrected)

. metabias  _ES _seES,egger

Note: data input format theta se_theta assumed.

Egger’s test for small-study effects:
Regress standard normal deviate of intervention
effect estimate against its standard error

Number of studies =  7                                 Root MSE      =   1.269
------------------------------------------------------------------------------
     Std_Eff |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       slope |  -.1250773     .52751    -0.24   0.822    -1.481085     1.23093
        bias |    1.32596   1.609626     0.82   0.448    -2.811715    5.463634
------------------------------------------------------------------------------

Test of H0: no small-study effects          P = 0.448


A Worked Example of ROC Curve Analysis in Stata


Keywords: worked ROC curve example, ROC curves in Stata, parametric ROC curves in Stata, data analyst
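The variable pre used below behaves like a model-predicted probability (all of its cutpoints lie between 0 and 1). Purely as an assumption, since the article never shows this step, such a variable might be created from a logistic model of the outcome mods:

logistic mods ldh cr abl
predict pre, pr    // predicted probability of mods for each observation

roctab mods pre then evaluates how well that score discriminates.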

roctab mods pre,g

(ROC curve from roctab omitted.)

. roccomp mods pre ldh cr abl, g
(Overlaid ROC curves from roccomp omitted.)

. roccomp mods  pre ldh cr abl

                             ROC                    --Asymptotic Normal--
                  Obs       Area     Std. Err.      [95% Conf. Interval]
-------------------------------------------------------------------------
pre                113     0.9273       0.0268        0.87485     0.97977
ldh                113     0.9034       0.0285        0.84752     0.95921
cr                 113     0.7998       0.0633        0.67580     0.92378
abl                113     0.1483       0.0444        0.06136     0.23528
-------------------------------------------------------------------------
Ho: area(pre) = area(ldh) = area(cr) = area(abl)
   chi2(3) =   189.39       Prob>chi2 =   0.0000

. rocgold mods  pre ldh cr abl

-------------------------------------------------------------------------------
                      ROC                                           Bonferroni
                     Area     Std. Err.       chi2    df  Pr>chi2     Pr>chi2
-------------------------------------------------------------------------------
pre (standard)      0.9273       0.0268
ldh                 0.9034       0.0285      0.6873     1   0.4071      1.0000
cr                  0.7998       0.0633      4.9712     1   0.0258      0.0773
abl                 0.1483       0.0444    135.4836     1   0.0000      0.0000
-------------------------------------------------------------------------------

. roctab mods pre,d

Detailed report of sensitivity and specificity
------------------------------------------------------------------------------
                                          Correctly
Cutpoint      Sensitivity   Specificity   Classified          LR+          LR-
------------------------------------------------------------------------------
( >= .00382 )     100.00%         0.00%       24.78%       1.0000    
( >= .0053 )      100.00%         1.18%       25.66%       1.0119       0.0000
( >= .00681 )     100.00%         2.35%       26.55%       1.0241       0.0000
( >= .00703 )     100.00%         3.53%       27.43%       1.0366       0.0000
( >= .00819 )     100.00%         4.71%       28.32%       1.0494       0.0000
( >= .0102 )      100.00%         5.88%       29.20%       1.0625       0.0000
( >= .01031 )     100.00%         7.06%       30.09%       1.0759       0.0000
( >= .01236 )     100.00%         9.41%       31.86%       1.1039       0.0000
( >= .0135 )      100.00%        10.59%       32.74%       1.1184       0.0000
( >= .01361 )     100.00%        11.76%       33.63%       1.1333       0.0000
( >= .01409 )     100.00%        12.94%       34.51%       1.1486       0.0000
( >= .01531 )     100.00%        14.12%       35.40%       1.1644       0.0000
( >= .01704 )     100.00%        15.29%       36.28%       1.1806       0.0000
( >= .01739 )     100.00%        16.47%       37.17%       1.1972       0.0000
( >= .0175 )      100.00%        17.65%       38.05%       1.2143       0.0000
( >= .01765 )     100.00%        18.82%       38.94%       1.2319       0.0000
( >= .01915 )     100.00%        20.00%       39.82%       1.2500       0.0000
( >= .02164 )     100.00%        21.18%       40.71%       1.2687       0.0000
( >= .02345 )     100.00%        22.35%       41.59%       1.2879       0.0000
( >= .02392 )     100.00%        23.53%       42.48%       1.3077       0.0000
( >= .02456 )     100.00%        24.71%       43.36%       1.3281       0.0000
( >= .02502 )     100.00%        25.88%       44.25%       1.3492       0.0000
( >= .02676 )     100.00%        27.06%       45.13%       1.3710       0.0000
( >= .02719 )     100.00%        28.24%       46.02%       1.3934       0.0000
( >= .02971 )     100.00%        29.41%       46.90%       1.4167       0.0000
( >= .03233 )     100.00%        30.59%       47.79%       1.4407       0.0000
( >= .03243 )     100.00%        31.76%       48.67%       1.4655       0.0000
( >= .03558 )     100.00%        32.94%       49.56%       1.4912       0.0000
( >= .03689 )     100.00%        34.12%       50.44%       1.5179       0.0000
( >= .03716 )     100.00%        35.29%       51.33%       1.5455       0.0000
( >= .03743 )     100.00%        36.47%       52.21%       1.5741       0.0000
( >= .03931 )     100.00%        37.65%       53.10%       1.6038       0.0000
( >= .04133 )     100.00%        38.82%       53.98%       1.6346       0.0000
( >= .04146 )     100.00%        40.00%       54.87%       1.6667       0.0000
( >= .04217 )     100.00%        41.18%       55.75%       1.7000       0.0000
( >= .04241 )     100.00%        42.35%       56.64%       1.7347       0.0000
( >= .04371 )     100.00%        43.53%       57.52%       1.7708       0.0000
( >= .04376 )     100.00%        44.71%       58.41%       1.8085       0.0000
( >= .04423 )     100.00%        45.88%       59.29%       1.8478       0.0000
( >= .04763 )     100.00%        47.06%       60.18%       1.8889       0.0000
( >= .04788 )     100.00%        48.24%       61.06%       1.9318       0.0000
( >= .0508 )      100.00%        49.41%       61.95%       1.9767       0.0000
( >= .05191 )     100.00%        50.59%       62.83%       2.0238       0.0000
( >= .05406 )     100.00%        51.76%       63.72%       2.0732       0.0000
( >= .05513 )     100.00%        52.94%       64.60%       2.1250       0.0000
( >= .05603 )     100.00%        54.12%       65.49%       2.1795       0.0000
( >= .05649 )     100.00%        55.29%       66.37%       2.2368       0.0000
( >= .05758 )     100.00%        56.47%       67.26%       2.2973       0.0000
( >= .0593 )      100.00%        57.65%       68.14%       2.3611       0.0000
( >= .05947 )      96.43%        57.65%       67.26%       2.2768       0.0620
( >= .06098 )      96.43%        58.82%       68.14%       2.3418       0.0607
( >= .0614 )       96.43%        60.00%       69.03%       2.4107       0.0595
( >= .07237 )      92.86%        60.00%       68.14%       2.3214       0.1190
( >= .07364 )      92.86%        61.18%       69.03%       2.3918       0.1168
( >= .07581 )      92.86%        62.35%       69.91%       2.4665       0.1146
( >= .07946 )      92.86%        63.53%       70.80%       2.5461       0.1124
( >= .08124 )      92.86%        64.71%       71.68%       2.6310       0.1104
( >= .08678 )      92.86%        65.88%       72.57%       2.7217       0.1084
( >= .08864 )      92.86%        67.06%       73.45%       2.8189       0.1065
( >= .09364 )      92.86%        68.24%       74.34%       2.9233       0.1047
( >= .09599 )      92.86%        69.41%       75.22%       3.0357       0.1029
( >= .10633 )      89.29%        69.41%       74.34%       2.9190       0.1544
( >= .10826 )      89.29%        70.59%       75.22%       3.0357       0.1518
( >= .10833 )      89.29%        71.76%       76.11%       3.1622       0.1493
( >= .11039 )      89.29%        72.94%       76.99%       3.2997       0.1469
( >= .11556 )      85.71%        72.94%       76.11%       3.1677       0.1959
( >= .12284 )      85.71%        74.12%       76.99%       3.3117       0.1927
( >= .12624 )      85.71%        75.29%       77.88%       3.4694       0.1897
( >= .12727 )      85.71%        76.47%       78.76%       3.6429       0.1868
( >= .12873 )      85.71%        77.65%       79.65%       3.8346       0.1840
( >= .13795 )      85.71%        78.82%       80.53%       4.0476       0.1812
( >= .1432 )       85.71%        80.00%       81.42%       4.2857       0.1786
( >= .14782 )      82.14%        80.00%       80.53%       4.1071       0.2232
( >= .14949 )      82.14%        81.18%       81.42%       4.3638       0.2200
( >= .1504 )       82.14%        82.35%       82.30%       4.6548       0.2168
( >= .16177 )      82.14%        83.53%       83.19%       4.9872       0.2138
( >= .16314 )      82.14%        84.71%       84.07%       5.3709       0.2108
( >= .17659 )      82.14%        85.88%       84.96%       5.8185       0.2079
( >= .17736 )      78.57%        85.88%       84.07%       5.5655       0.2495
( >= .17874 )      78.57%        87.06%       84.96%       6.0714       0.2461
( >= .19822 )      78.57%        88.24%       85.84%       6.6786       0.2429
( >= .23909 )      78.57%        89.41%       86.73%       7.4206       0.2397
( >= .24636 )      75.00%        89.41%       85.84%       7.0833       0.2796
( >= .26112 )      75.00%        90.59%       86.73%       7.9687       0.2760
( >= .34388 )      75.00%        91.76%       87.61%       9.1071       0.2724
( >= .41857 )      75.00%        92.94%       88.50%      10.6250       0.2690
( >= .44136 )      71.43%        92.94%       87.61%      10.1190       0.3074
( >= .51026 )      71.43%        94.12%       88.50%      12.1429       0.3036
( >= .52563 )      67.86%        94.12%       87.61%      11.5357       0.3415
( >= .54136 )      67.86%        95.29%       88.50%      14.4197       0.3373
( >= .54198 )      64.29%        95.29%       87.61%      13.6607       0.3748
( >= .59103 )      64.29%        96.47%       88.50%      18.2143       0.3702
( >= .60721 )      64.29%        97.65%       89.38%      27.3214       0.3657
( >= .69305 )      64.29%        98.82%       90.27%      54.6430       0.3614
( >= .74518 )      60.71%        98.82%       89.38%      51.6073       0.3975
( >= .84005 )      60.71%       100.00%       90.27%                    0.3929
( >= .84678 )      57.14%       100.00%       89.38%                    0.4286
( >= .85097 )      53.57%       100.00%       88.50%                    0.4643
( >= .86829 )      50.00%       100.00%       87.61%                    0.5000
( >= .92791 )      46.43%       100.00%       86.73%                    0.5357
( >= .92918 )      42.86%       100.00%       85.84%                    0.5714
( >= .93128 )      39.29%       100.00%       84.96%                    0.6071
( >= .93165 )      35.71%       100.00%       84.07%                    0.6429
( >= .94542 )      32.14%       100.00%       83.19%                    0.6786
( >= .95914 )      28.57%       100.00%       82.30%                    0.7143
( >= .99376 )      25.00%       100.00%       81.42%                    0.7500
( >= .99519 )      21.43%       100.00%       80.53%                    0.7857
( >= .99814 )      17.86%       100.00%       79.65%                    0.8214
( >= .99972 )      14.29%       100.00%       78.76%                    0.8571
( >= .99992 )       7.14%       100.00%       76.99%                    0.9286
( >= .99995 )       3.57%       100.00%       76.11%                    0.9643
( >  .99995 )       0.00%       100.00%       75.22%                    1.0000
----------------------------------------------------------------------------


A Worked Example of Survival Analysis in Stata


Keywords: survival analysis in Stata, steps of survival analysis in Stata, data analyst

. stset time,f(outcome)

    failure event:  outcome != 0 & outcome < .
obs. time interval:  (0, time]
exit on or before:  failure

------------------------------------------------------------------------------
      33  total obs.
       0  exclusions
------------------------------------------------------------------------------
      33  obs. remaining, representing
      18  failures in single record/single failure data
   11370  total analysis time at risk, at risk from t =         0
                            earliest observed entry t =         0
                                 last observed exit t =      1045

. sts test treat,logrank

        failure _d:  outcome
  analysis time _t:  time
Log-rank test for equality of survivor functions

      |   Events         Events
treat |  observed       expected
------+-------------------------
1     |        14           8.57
2     |         4           9.43
------+-------------------------
Total |        18          18.00

           chi2(1) =       6.71
           Pr>chi2 =     0.0096

. sts graph,by(treat)

        failure _d:  outcome
  analysis time _t:  time

(Kaplan-Meier survival curves by treat group omitted.)
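A natural next step, added here as a sketch rather than part of the original example, is a Cox proportional-hazards model of the treatment effect; the stset declaration above still applies:

. stcox treat    // hazard ratio associated with the two-level treat variable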


Panel Data, Instrument Selection, and the Hausman Test in Stata


Section 1: Panel data
1. Why panel-data regression is attractive
In general, the error term of a panel-data model has two components. The first is tied to the individual cross-sectional unit: it summarizes every factor that affects the dependent variable but does not vary over time, which is why panel-data models are often called unobserved-effects models. The second captures unobserved factors that vary across both units and time, usually called the idiosyncratic error or idiosyncratic disturbance. (Strictly, this second part can be split further: one piece, Vt, varies over time but not across units and is usually removed and controlled by adding time dummies to the model; the remainder varies across both units and time. Standard econometrics treatments of panel data discuss only two components; more advanced statistics or econometrics treatments of error-components models discuss all three.)
Unobserved-effects models are classified as fixed-effects or random-effects models according to the assumption placed on the time-invariant unobserved effect. The traditional classification runs as follows: if the unobserved effect is treated as an estimable parameter, specific to each cross-section or individual and constant over time, the model is a fixed-effects model; if it is treated as a random variable following a particular distribution, the model is a random-effects model.
This definition is not very rigorous, however, and it easily misleads: it suggests that the unobserved effect in a fixed-effects model is time-invariant and "fixed", while that in a random-effects model is not fixed and varies over time.
A logically more consistent, and increasingly accepted, formulation (see Wooldridge's textbook and Mundlak's 1978 paper) is that both fixed and random effects are random: each summarizes unobserved, time-invariant factors that affect the dependent variable (a particularly reasonable assumption when the number of cross-sectional units is large). Whether the unobserved effect should be treated as fixed or random turns on whether the time-invariant unobserved factors are correlated with the observed regressors included in the model: if the effect is uncorrelated with the observed regressors, it is a random effect. This is precisely the hypothesis the Hausman specification test examines.
Depending on the assumption about the unobserved effect and on how the panel information is used, the model can be estimated in different ways, giving four standard estimators:
(1) the within estimator (FE, or FD: first difference);
(2) the between estimator;
(3) the pooled OLS estimator;
(4) the random-effects estimator (RE; the GLS or FGLS estimator).
The four estimators differ in their assumptions and in the information they use; each has advantages and disadvantages, and they are closely related. Estimators 3 and 4 are weighted averages of 1 and 2; under particular assumptions 4 reduces to 1 or to 3; and when a Hausman test shows no difference between 4 and 1, this implies no difference between 1 and 2.
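A minimal Stata sketch of the four estimators (illustrative variables y, x1, x2; the panel is declared with xtset in current Stata, while the examples later in this note use the older tis/iis commands):

xtset code year
xtreg y x1 x2, fe    // (1) within (fixed-effects) estimator
xtreg y x1 x2, be    // (2) between estimator
regress y x1 x2      // (3) pooled OLS
xtreg y x1 x2, re    // (4) random-effects (FGLS) estimator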
RE assumes the unobserved factors are orthogonal to the regressors; the unobserved component simply has two parts, one tied to the individual unit and one purely random. In estimation, RE uses the variances of these two parts to compute an index λ for quasi-demeaning: instead of subtracting the full mean, it subtracts λ times the mean of y or x from the original value, and then estimates by GLS. In the extreme case λ = 0, the unobserved effect is a constant common to all units and RE is equivalent to pooled OLS; when λ = 1, the purely random part is negligible, all unobserved factors are unit-specific, and RE is equivalent to FE. FE, by contrast, does not require the unobserved factors to be orthogonal to the regressors: in FE estimation the fixed effects are differenced away, so consistent results are obtained regardless.

* This note is aimed at readers who already have a basic grasp of panel data and instrumental variables and have read the relevant chapters of an intermediate textbook. It is for reference only; if you find errors, please contact minglu73@263.net so they can be corrected promptly. Please forgive the mixture of Chinese and English. Dr. Xu Zhigang of the Chinese Academy of Sciences pointed out this note's errors one by one and substantially supplemented the original text, for which the author is grateful.
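In symbols (standard RE notation; the formula itself is not in the original note): quasi-demeaning regresses (y_it - λ*ybar_i) on (x_it - λ*xbar_i), with

λ = 1 - sqrt( σ²_e / (σ²_e + T*σ²_u) )

where σ²_u is the variance of the unit effect, σ²_e the idiosyncratic variance, and T the number of periods; λ = 0 reduces to pooled OLS and λ = 1 to FE.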
One benefit of panel data is that, if the unobserved component is a fixed effect, demeaning differences it away. This reduces the endogeneity problem that arises when unobserved factors are correlated with the regressors.
2. Does FE or RE estimation of panel data therefore avoid endogeneity?
Only partly. If the endogeneity arises solely because unit-specific, time-invariant omitted variables are correlated with the regressors, then differencing the data solves the problem. But remember the other error component: if the factors it contains can also affect the regressors, differencing fixes only the first problem, and endogeneity driven by the correlation between the idiosyncratic disturbance and the regressors may remain.
3. What to do, then?
Find an IV. The idea parallels finding an IV for OLS, but an instrument for a panel should itself have a panel structure, unless the underlying estimation does not use panel methods, say pooled OLS on the panel data; and the conditions under which pooled OLS is appropriate for panel data are strict.

Section 2: Choosing instrumental variables
1. The IV should be as exogenous as possible (history, nature, climate, geography, and the like). In theory it should have no direct effect on the dependent variable (call it Y) but should affect Y only indirectly, through the instrumented variable (call it X).
2. If that logic holds, run the first-stage regression of the endogenous variable X on the IV and the other regressors (X2) and check that the IV is significant; it should be. With several IVs, use an F test of their joint significance. Moreover, if one of the IVs is known to be exogenous, a Sargan test of overidentifying restrictions can check whether the other IVs are indeed exogenous.
3. If all of the above holds, run the IV regression. Afterwards run a Hausman test, whose null hypothesis is that the coefficients of the IV regression and the original (non-IV) regression do not differ significantly. Look at the p-value: if p is below, say, 0.1 or 0.05, the IV regression differs significantly from the original one, and the original equation did suffer estimation bias from endogeneity. Conversely, a high p, above 0.1 or 0.05, means the IV regression does not differ significantly from the original, and one cannot reject the null that the original regression had no significant endogeneity bias.
4. If a candidate IV affects Y by itself, it cannot serve as an IV. For example, with Y on the left and X (instrumented), X2, and the IV on the right: placed on the right-hand side, the IV should ideally not affect Y significantly. Acemoglu et al. (2001) test exactly this, find no direct effect, and conclude their IV is sound. A good IV may of course still come out significant in such a regression (though if, as the theory says, the IV affects Y only through the instrumented endogenous regressor, that regressor will usually render the IV insignificant, or their high correlation will make both insignificant); the criterion is still just the t statistic. The IV may be significant merely because it affects other significant variables (such as the instrumented one); if so, once the IV enters the original equation, the coefficients of the other variables (watch the instrumented X in particular) may change noticeably.
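A minimal sketch of these checks using Stata's later ivregress syntax (illustrative names y, x, x2, iv1, iv2 assumed; the note itself predates this command, but the logic is identical):

ivregress 2sls y x2 (x = iv1 iv2), first    // report the first stage: the IVs should be significant there
estat firststage                            // first-stage F statistic, a weak-instrument check
estat overid                                // Sargan/Basmann test of overidentifying restrictions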

Section 3: Details of the Hausman test (hereafter HT)
For specifics, see Stata Corporation, 2001, Stata 7 Reference H-P, Stata Press.
1. Meaning: "The null hypothesis is that the efficient estimator is a consistent and efficient estimator of the true parameters. If it is, there should be no systematic difference between the coefficients of the efficient estimator and a comparison estimator that is known to be consistent for the true parameters. If the two models display a systematic difference in the estimated coefficients, then we have reason to doubt the assumptions on which the efficient estimator is based." (Stata Corporation, 2001, Stata 7 Reference H-P, Stata Press.) The command programs the procedure of Hausman (1978). The hausman command therefore assumes the user knows which of the two compared models is "consistent whether or not the null holds" and which is "efficient and consistent under the null but inconsistent otherwise".[1] In Stata 8, the steps are:
(1) obtain an estimator that is consistent whether or not the hypothesis is true;
(2) store the estimation results under a name-consistent using estimates store;
(3) obtain an estimator that is efficient (and consistent) under the hypothesis that
you are testing, but inconsistent otherwise;
(4) store the estimation results under a name-efficient using estimates store;
(5) use hausman to perform the test
hausman name-consistent name-efficient [, options]
Examples:
(1) In the FE-versus-RE test, the null hypothesis is that the unobserved effect is uncorrelated with the regressors; the alternative is that they are correlated. FE is consistent whether or not the null holds, while RE is consistent and asymptotically efficient (the larger the sample, the more efficient) under the null but inconsistent if the null is rejected (Hausman, 1978).
So the procedure is (Stata 8 commands):
sort code year       (sort the data)
tis year             (the time variable is year)
iis code             (code identifies the units)
xtreg y x x2, fe     (suppose x is the variable to be instrumented)
est store fixed      (the command changed in Stata 8; it is no longer hausman, save. Here "fixed" is just a name; anything works)
xtreg y x x2, re
hausman fixed
(2) Comparing OLS (or FE) with IV (or IV-FE)
Run the IV estimation first, because it is consistent in any case, whereas OLS is consistent only under the null, i.e. only when the OLS and IV results coincide and there is no endogeneity problem. So run IV first.
In older versions of Stata, unless told otherwise, hausman assumed that the first regression command you ran gave the always-consistent estimates and that the second gave the estimates that are inconsistent under the alternative. The command has now been standardized and extended: it no longer matters which model you run first; what matters is the order of the names in the final hausman command, and if the most recent model was not stored with est store, refer to it with ".".

[1] Refer to the appendix for the definition of unbiased, consistent, and efficient.
2. Notes:
(1) Another way to understand the test: first run the estimator that needs the weaker assumptions, then the one that needs the stricter assumptions. By this standard, IV (IV-FE) is less demanding than OLS (FE). The point that is easy to confuse is that FE rests on weaker assumptions than RE: as described above, RE assumes the unobserved factors are orthogonal to the regressors and uses the variance-based index λ for quasi-demeaning before GLS (λ = 0 is equivalent to pooled OLS; λ = 1, where the purely random part is negligible and all unobserved factors are unit-specific, is equivalent to FE), whereas FE needs no orthogonality assumption because the fixed effects are differenced away, so it is consistent either way. When the stricter estimator is run first, HT works like any other test: a large statistic and a small p reject the null, so accept the estimator with the weaker assumptions. In the FE-versus-RE comparison, a large chi-squared means accepting FE. In the OLS (FE) versus IV (IV-FE) comparison, a large chi-squared and a small p reject the null, the IV results differ from OLS (or FE), and the IV results should be accepted.
(2) It follows that we must know in advance the order and the roles of the two models in HT. In Stata 7 and earlier, the hausman command's default (the implicit more ordering) was exactly the order above. If you ran the models in the opposite order, you had to add the option: hausman, less. Without less, Stata had no way of knowing which estimator was the more efficient, and a statistic that should have been positive could come out negative merely because the order was reversed and less was forgotten.
In Stata 8 the command changed: the order can be reversed, but the user must take care to use the arguments correctly:
The order of computing the two estimators may be reversed. You have to be careful
though to specify to hausman the models in the order “always consistent” first and
“efficient under H0” second. It is possible to skip storing the second model and refer
to the last estimation results by a period (.).
(3) In other comparable settings, the order does not matter (when neither estimator is more efficient than the other):
hausman may be used in any context. The order in which you specify the regressors in
each model does not matter, but it is your responsibility to assure that the estimators
and models are comparable, and satisfy the theoretical conditions (see (1) and (3)
above).
(4) When HT returns a negative value
First check whether the order of the models is wrong. If it is not, a negative value is still possible with small samples. A negative Hausman chi-squared strongly suggests that the null, that the coefficients of the two compared regressions do not differ significantly, cannot be rejected, and it is especially likely to occur in small samples; so says an example in the Stata 7 manual. The Stata 8 manual adds that when this happens: "If this is the case, the Hausman test is undefined. Unfortunately, this is not a rare event. Stata supports a generalized Hausman test that overcomes both of these problems. See suest for details." Type help suest to learn more.
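One practical remedy often tried before resorting to suest, not mentioned in the original note, is the sigmamore option, which bases both covariance matrices on the efficient model's disturbance-variance estimate and thereby rules out a negative statistic; a minimal sketch using the stored results from the examples below:

hausman fixed ., sigmamore    // "." refers to the most recent (efficient) estimates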
3. Stata commands
(1) Comparing FE and RE
sort code year       (sort the data)
tis year             (the time variable is year)
iis code             (code identifies the units)
xtreg y x x2, fe     (suppose x is the variable to be instrumented)
est store fixed      (the command changed in Stata 8; "fixed" is just a name; anything works)
xtreg y x x2, re
hausman fixed
(2) Comparing IV-FE and IV-RE
xtivreg y (x=iv) x2, fe
est store f1
xtivreg y (x=iv) x2, re
hausman f1
Generally this comparison is unnecessary: by this point you already know whether FE or RE is better, so simply compare the better one with its IV counterpart.
(3) Comparing IV-FE and FE
xtivreg y (x=iv) x2, fe
est store f2
xtreg y x x2, fe
hausman f2
To repeat: only a small p and a large chi-squared indicate that the IV regression is necessary, i.e. that the original equation did have an endogeneity problem.

Section 4: An example
Acemoglu et al. (2001) is a highly representative application of instrumental variables. The authors seek to show that institutions affect per-capita income. A direct regression plainly makes institutions endogenous, because good institutions may emerge where per-capita income is high. Their instrument is a country's settler mortality in the colonial era: where mortality was high, Europeans would not settle and instead set up extractive institutions; where it was low, they built good institutions; and those early institutions still influence the present.
Section 6.3 of the paper, on the validity of the instrument, deserves particular attention. First, the authors repeat the IV regressions with other plausible instruments and find results essentially the same as with mortality as the IV. (A reassuring result, though in my view not essential, since alternative IVs may simply not exist.) Second, they put mortality itself into the original regression as an exogenous variable and find it does not significantly affect the dependent variable, showing it has no direct effect. Third, they compare, with a chi-squared test, the IV results using mortality alone against those using mortality together with the other IVs, and find no significant difference, again indicating that mortality neither has a direct effect nor operates through variables other than institutions. In my view this last step is also not essential, because without other IVs it cannot be performed.
References:
Acemoglu, Daron, Simon Johnson and James A. Robinson (2001) “The Colonial Origins of Comparative Development: An Empirical Investigation,” American Economic Review, December, Volume 91, Number 5, 1369-1401.
Stata corporation, 2001, STATA 7 Reference H-P, Stata Press.
Hausman, Jerry A. and William E. Taylor, 1981, “Panel Data and Unobservable Individual Effects,” Econometrica, Vol. 49, No. 6, 1377-1398.
Hausman, Jerry A., 1978, “Specification Tests in Econometrics,” Econometrica, Vol. 46, No. 6, 1251-1271.
Appendix:
(1) Definitions of unbiased, consistent, and efficient: an estimator is unbiased if its expectation equals the true parameter; consistent if it converges in probability to the true parameter as the sample size grows; and efficient if it has the smallest (asymptotic) variance within its class.


How to Run GMM in Stata


Keywords: doing GMM in Stata, system GMM, Stata GMM commands, Stata GMM panel models

Generalized Method of Moments (GMM)
1. Testing regressors for endogeneity
First, test the regressors for endogeneity (the Hausman test for endogenous regressors): the premise of the IV approach is that endogenous regressors exist. The null of the Hausman test is that all regressors are exogenous; if it is rejected, endogenous regressors are present and IV should be used; if it is accepted, there are none and OLS should be used.
reg ldi lofdi
estimates store ols
xtivreg ldi (lofdi=l.lofdi ldep lexr)
estimates store iv
hausman iv ols
(To use instrumental variables with panel data, Stata provides the following 2SLS command: xtivreg depvar [varlist1] (varlist2 = varlist_iv); options such as fe and re request fixed or random effects. See help xtivreg.)
If endogenous regressors are present, instruments should be used, with at least as many instruments as endogenous regressors in the equation. Under exact identification, use 2SLS. In essence, 2SLS splits each endogenous regressor into two parts, the exogenous variation induced by the instruments and the remaining part correlated with the disturbance, and then regresses the dependent variable on the exogenous part; this satisfies OLS's requirement of predetermined regressors and yields a consistent estimator.
2. Testing for heteroskedasticity and autocorrelation
Under spherical disturbances, 2SLS is the most efficient estimator; whether the disturbances are in fact heteroskedastic or autocorrelated can be tested as follows.
Panel heteroskedasticity test (a likelihood-ratio comparison of the heteroskedastic and homoskedastic GLS fits):
xtgls enc invs exp imp esc mrl,igls panel(het)
estimates store hetero
xtgls enc invs exp imp esc mrl,igls
estimates store homo
local df = e(N_g) - 1
lrtest hetero homo, df(`df')
Panel autocorrelation test: xtserial enc invs exp imp esc mrl
If they are, a more efficient method exists: GMM. In a sense, GMM is to 2SLS what GLS is to OLS. Under exact identification, GMM reduces to the ordinary IV estimator; under overidentification the traditional method of moments breaks down, and only then is GMM truly needed. Overidentification test (J test): estat overid.
3. Validating the instruments
Instruments: an instrument must be correlated with the endogenous regressor yet uncorrelated with the disturbance of the structural equation. Because these two requirements often pull in opposite directions, finding suitable instruments is hard in practice and takes considerable imagination and creativity; lagged variables are a common choice.
The required checks of instrument validity:
(1) Correlation between the instruments and the endogenous regressors
If an instrument z is completely uncorrelated with the endogenous regressor, the IV method cannot be used; if it is only weakly correlated, it is a "weak instrument", whose consequences resemble those of too small a sample. A rule of thumb for detecting weak instruments: if the first-stage F statistic exceeds 10, weak instruments need not be a worry. Stata command: estat first (displays the first-stage statistics).
(2) Exogeneity of the instruments (here, accepting the null is good)
Under exact identification, there is no way to test whether the instruments are correlated with the disturbance. Under overidentification (more instruments than endogenous variables), the overidentification test is available, with null hypothesis H0 that all instruments are exogenous. Rejecting it means at least one instrument is not exogenous, i.e. is correlated with the disturbance.
Sargan statistic; Stata command: estat overid
4. The GMM procedure
Entering the following commands in Stata runs GMM estimation on panel data.
. ssc install ivreg2                  (install the ivreg2 package)
. ssc install ranktest               (install ranktest, an auxiliary package needed by ivreg2)
. use "traffic.dta"                  (open the panel dataset)
. xtset panelvar timevar             (declare the panel and time variables)
. ivreg2 y x1 (x2=z1 z2), gmm2s      (run panel GMM estimation; "2s" means two-step GMM)


Merging Data in Stata with merge


Keywords: merging data in Stata, vertical (append) and horizontal (merge) combination, merging panel data in Stata

It is not uncommon for data, especially survey data, to come in multiple datasets (there are practical reasons for distributing datasets this way). When data is distributed in multiple files, the variables you want to use will often be scattered across several datasets. In order to work with information contained in two or more data files it is necessary to merge the segments into a new file that contains all of the variables you intend to work with.

First, you'll need to figure out which variables you need and which datasets contain them; you can do this by consulting the codebook. In addition to finding the variables you want for your analysis, you need to know the name of the id variable. An id variable is a variable that is unique to a case (observation) in the dataset. For a given individual, the id should be the same across all datasets; this allows you to match the data from different datasets to the right person. For cross-sectional data this is typically a single variable; in other cases two or more variables are needed, as is common in panel data, where subject id and date or wave are often needed to uniquely identify an observation. For Stata to merge the datasets, the id variable or variables must have the same name across all files. Additionally, if the variable is a string in one dataset, it must also be a string in all other datasets, and the same is true of numeric variables (the specific storage type is not important, as long as they are numerical). Once you have identified all the variables you need and know what the id variable(s) are, you can begin to merge the datasets.

A simple example

A good first step is to describe our data. We can do this without actually opening the file (handy if the files are very large); all we have to do is open Stata and issue the describe using command. The describe command gives us a lot of useful information; for our purposes the most important things it shows are that the variable id is numeric and that the data are unsorted (the data must be sorted by the id variable or variables in order to merge). We also note that the variables we want from this dataset are in fact in it. We would want to do this for all three of our datasets, but to save space we show the output for only one. Let's assume that the datasets are all unsorted and that the id variable has the same name (id) in all three.

describe using http://statistics.ats.ucla.edu/stat/data/stata_faq_multmerge/data1

Contains data                  highschool and beyond (200 cases)
  obs:           200                          22 Jul 2008 13:47
 vars:             4
 size:         4,000
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
id              float  %9.0g
female          float  %9.0g       fl
race            float  %9.0g       rl
ses             float  %9.0g       sl
-------------------------------------------------------------------------------
Sorted by:

(output omitted)
describe using http://statistics.ats.ucla.edu/stat/data/stata_faq_multmerge/data2
describe using http://statistics.ats.ucla.edu/stat/data/stata_faq_multmerge/data3

Since the datasets aren't sorted, we will need to open each dataset, sort it, and then save the sorted dataset. Although we can use data from a website easily within Stata, we cannot save it there; so note that all of the use commands pull datasets from the website but save them to the directory "d:\data" on the user's computer. The syntax below opens each dataset, sorts it by id, and then saves it in a new location with a new name. If a dataset were already on our computer, we could save it in the same location, possibly even under the same name (replacing the old dataset); this is the user's choice.

use http://statistics.ats.ucla.edu/stat/data/stata_faq_multmerge/data1, clear
sort id
save d:\data\data1_a, replace
use http://statistics.ats.ucla.edu/stat/data/stata_faq_multmerge/data2, clear
sort id
save d:\data\data2_a, replace
use http://statistics.ats.ucla.edu/stat/data/stata_faq_multmerge/data3, clear
sort id
save d:\data\data3_a, replace

Next, we actually merge the datasets. The merge command merges corresponding observations from the dataset currently in memory (called the master dataset) with those from a different Stata-format dataset (called the using dataset) into single observations. Assuming that we have data3 open from running the above syntax, that will be our master dataset. The first line of syntax below merges the data. Directly after the merge command is the name of the variable (or variables) that serve as id variables, in this case id. Next is the argument using; this tells Stata that we are done listing the id variables and that what follows are the dataset(s) to be merged. The names are listed with only spaces (no commas, etc.) between them. (Note: if the names or paths of your datasets include spaces, be sure to enclose them in quotation marks, i.e. " ".) The next line of syntax saves our new merged dataset. Note that merge does not produce output.

merge id using d:\data\data1_a d:\data\data2_a
save d:\data\merged_data

Now we can have a look at our newly merged dataset.

describe

Contains data from data3.dta
  obs:           200                          highschool and beyond (200 cases)
 vars:            14                          24 Jul 2008 15:54
 size:        10,200 (99.0% of memory free)
---------------------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
---------------------------------------------------------------------------------------------
id              float  %9.0g
schtyp          float  %9.0g       scl        type of school
prog            float  %9.0g       sel        type of program
female          float  %9.0g       fl
race            float  %9.0g       rl
ses             float  %9.0g       sl
_merge1         byte   %8.0g                  _merge representing data1
read            float  %9.0g                  reading score
write           float  %9.0g                  writing score
math            float  %9.0g                  math score
science         float  %9.0g                  science score
socst           float  %9.0g                  social studies score
_merge2         byte   %8.0g                  _merge representing data2
_merge          byte   %8.0g
---------------------------------------------------------------------------------------------
Sorted by:

In the above output we see the number of cases (200), which is correct. This is important, since problems with the merge process often result in too few, or more often too many, cases in the merged dataset. We also see a list of the variables, which includes all the variables we want. The merged dataset contains three extra variables: _merge, _merge1, and _merge2. The merge command will always generate at least one additional variable named _merge; when multiple files are specified in using, the command produces additional _merge* variables, one for each of the datasets in the using list (in our case _merge1 and _merge2). These variables tell us where each observation in the dataset came from, which is useful as a check that your data merged properly. Sometimes an observation will not be present in a given dataset; this does not necessarily mean that something went wrong in the merge process, but it is another place where one can often find clues about what might have gone wrong. Because in this example all of the datasets include all of the cases, and because the merge went as it should, the _merge* variables aren't very interesting. We will discuss these variables in greater detail below, when we deal with datasets where not all cases are present in all files.

Dropping unwanted variables

It is not uncommon to find that a large dataset contains many variables you are not going to use in your analysis. You can simply leave those variables in your datasets when you merge them together, but there are several reasons you might not want to. First, there is a limit on the number of variables Stata can handle: in Small Stata the limit is 99, in Stata/IC it is 2,047, and in Stata/SE and Stata/MP it is 32,767. These limits may seem high, but if you merge multiple datasets, each with a large number of variables, you may exceed the limit for your type of Stata. The second reason is that each variable in memory uses additional system resources. A few extra variables aren't going to hurt anything, but if you have a large number of unwanted variables, you may be wasting system resources. Below we show several methods of eliminating extra variables. One option is to eliminate the variables you don't plan to use when you open the datasets to sort them; depending on whether it is easier to list the variables you plan to use in your analysis or those you don't need, you can use the command keep or drop. There is at least one additional option: you can open the datasets placing only those variables you need in memory. If I have a dataset containing a number of variables, but the only variables I need from it are id and read, I can add the variable names to my use command, as shown in the first line of syntax below. This is particularly useful with very large files that require a lot of memory to open. Once you have opened the desired subset of variables, all you have to do is save the subset of data under a new name.

use id read using http://statistics.ats.ucla.edu/stat/data/stata_faq_multmerge/data2
save d:\data\data2_subset

In the above example, dataset2 contained the following variables: id, read, write, math, science, and socst. Assume that my analysis requires only the variables read and write; then the only variables needed from dataset2 are those two plus the variable id, to merge the data with another dataset. Below, the same sort of data preparation done above is shown with each of the techniques described. These techniques are equivalent in that they produce the same end result; the efficiency of each varies with the situation.

Using keep to select variables:

use http://statistics.ats.ucla.edu/stat/data/stata_faq_multmerge/data2, clear
keep id read write
sort id
save d:\data\data2_b

Using drop to remove unwanted variables:

use http://statistics.ats.ucla.edu/stat/data/stata_faq_multmerge/data2, clear
drop math science socst
sort id
save d:\data\data2_b

Opening a subset of the data:

use id read write using http://statistics.ats.ucla.edu/stat/data/stata_faq_multmerge/data2, clear
sort id
save d:\data\data2_b

The _merge variables

The _merge variable(s) created by the merge command are easy to miss, but they are very important. As discussed above, they tell us which dataset(s) each case came from. This matters because many cases coming from only one dataset may signal a problem in the merge process. However, it is not uncommon for some cases to be in one dataset but not another. In panel data this can occur when a given respondent did not participate in all the waves of the study. It can also occur for a number of other reasons: for example, a female respondent might appear in the subset of the data with demographic information but be completely absent from the subset with information on female respondents' children, because she has no children. Because cases that are not present in all datasets are not necessarily a problem, the information in the _merge variables is useful only if you know what to expect when the datasets merge correctly. In the example above, where the same 200 cases appeared in three datasets, I would expect to see 200 cases, all of which came from all three datasets. If some cases are missing from some of the datasets, I would expect a certain number of cases that did not come from all the datasets, but I still need to make sure there aren't too many that come from only some of them. Having too many, or all, of the cases in your merged dataset come from one, or only a few, of the datasets you've merged is a sign that the id variable does not match correctly across datasets; this is particularly common when the id variable is a string. Below we examine a dataset after merging to see whether all went as expected.

The output below shows the describe output for the dataset data1m.dta. Looking at the number of observations (obs), we see that the dataset contains only 197 cases, but we know the study overall included 200 cases, so three cases are missing entirely from data1m. This is important information if we are going to correctly interpret the _merge variables later on. Finally, we sort the data and save it under a new name. To save space we won't show the output for the other two datasets (the code appears below in case you want to run it). Assume that when we run describe on data2m and data3m we discover that they are also missing cases: data2m contains 196 observations, and data3m contains 197. It is possible that some of these cases are missing from all three datasets (i.e., the missing observations overlap across datasets), but it is also possible that all 200 observations occur in at least one of the datasets. We will find out once we merge the data.

use http://statistics.ats.ucla.edu/stat/data/stata_faq_multmerge/data1m, clear
(highschool and beyond (200 cases))

describe

Contains data from data1m.dta
  obs:           197                          highschool and beyond (200 cases)
 vars:             4                          24 Jul 2008 16:31
 size:         3,940 (99.6% of memory free)
---------------------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
---------------------------------------------------------------------------------------------
id              float  %9.0g
female          float  %9.0g       fl
race            float  %9.0g       rl
ses             float  %9.0g       sl
---------------------------------------------------------------------------------------------
Sorted by:

sort id
save d:\data\data1m_a, replace
use http://statistics.ats.ucla.edu/stat/data/stata_faq_multmerge/data2m, clear
describe
sort id
save d:\data\data2m_a, replace
use http://statistics.ats.ucla.edu/stat/data/stata_faq_multmerge/data3m, clear
describe
sort id
save d:\data\data3m_a, replace

Once we have examined and sorted the datasets we can merge them. The syntax below does this; note that the command is the same as in the first example. By default, Stata will allow cases to come from any of the three datasets. There are options that control which datasets the cases may come from; you can find out about them by typing "help merge" (without the quotes) in Stata.

merge id using d:\data\data1m_a d:\data\data2m_a

As before, the merge command created three new variables: _merge, _merge1, and _merge2. The variable _merge gives information about which datasets each case was present in; it takes on one of three values:

_merge = 1  The observation is present only in the master dataset.
_merge = 2  The observation is present only in one of the using datasets (but not the master dataset).
_merge = 3  The observation is present in at least two datasets, either master or using.

When more than one dataset appears in the using list, merge creates additional _merge variables, one for each dataset listed in using (e.g. _merge1, _merge2). These variables are equal to 1 if the observation was present in the dataset associated with that variable, and zero otherwise.

We will start our examination of these variables with _merge. Below we have used tab to look at it. The results show that we ended up with a total of 200 observations, which is what we expected. Looking at the breakdown, 198 observations were present in at least two of the three datasets (_merge=3); this includes both the cases that occur in all three datasets and those that occur in only two of the three. One case is present only in the master dataset, that is, data3m (_merge=1). Finally, one case is present only in one of the using datasets (_merge=2); that is, one case exists in data1m or data2m that does not exist in either of the other two datasets. Because there are two using datasets, the information in _merge alone is not complete. This is where _merge1 and _merge2 become useful.

tab _merge

     _merge |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          1        0.50        0.50
          2 |          1        0.50        1.00
          3 |        198       99.00      100.00
------------+-----------------------------------
      Total |        200      100.00

We can use _merge1 and _merge2 to check our merge more closely. When there is more than one dataset in the using statement, there will be one of these variables for each dataset in the using statement. The variables are numbered according to the order of the datasets in the using statement, and the variable label also indicates which dataset each one represents. Here _merge1 indicates whether a case occurred in dataset data1m_a. _merge# is equal to 1 if the case occurred in the corresponding dataset, and 0 if it did not. Below we have used the tab command to look at _merge1. Recall from above that data1m_a had 197 cases, and 197 cases equal one on the variable _merge1. If the number of cases is correct (which we already know from _merge), you should almost always see what you expect to see.

tab _merge1

     _merge |
representin |
 g data1m_a |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |          3        1.50        1.50
          1 |        197       98.50      100.00
------------+-----------------------------------
      Total |        200      100.00

Crosstabulating the _merge variables (again using the tab command) is often a better check of the merge process than one-way frequencies like the output above. Below we show _merge crossed with _merge1. From this we can see that 196 cases are present both in data1m_a and in at least one of the other datasets. Looking at the row for _merge=2, we see that one of the cases present in data1m_a was not present in the master dataset (i.e., data3m_a). The row for _merge=1 shows that one case is present in the master dataset but not in data1m_a; since _merge=1 indicates a case found only in the master dataset, this is logical. Although it can sometimes be difficult to figure out exactly what the tables are telling you, crosstabulating the _merge variables is probably the most effective way to check that your data merged properly.

tab _merge _merge1

           | _merge representing
           |      data1m_a
    _merge |         0          1 |     Total
-----------+----------------------+----------
         1 |         1          0 |         1
         2 |         0          1 |         1
         3 |         2        196 |       198
-----------+----------------------+----------
     Total |         3        197 |       200

One final note on the _merge variables: they are temporary, which means they will be discarded when you close Stata. If you wish to keep these variables you need to rename them using the rename command, or tell merge to create them as permanent variables using the option _merge(varname) (this must be done at the time you run merge).
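For reference, and as a note about later Stata versions rather than part of this article: Stata 11 and newer use an explicit match type and merge one using file at a time. A minimal sketch of the first example in that newer syntax (same file paths as above):

use d:\data\data3_a, clear
merge 1:1 id using d:\data\data1_a, nogenerate   // nogenerate drops _merge so the next merge can recreate it
merge 1:1 id using d:\data\data2_a
save d:\data\merged_data11, replace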


About xtdpdsys and xtabond2


Keywords: Stata xtabond, xtabond2, the Stata xtabond2 command

The Arellano-Bond test checks whether the differenced disturbances show first- and second-order autocorrelation, which underpins the consistency of the GMM estimator. Because the panel is dynamic, the differenced disturbances will generally show first-order autocorrelation; but if there is no second- or higher-order autocorrelation, the null hypothesis that the disturbances themselves are not autocorrelated is accepted.

Description

Linear dynamic panel-data models include p lags of the dependent variable as covariates and contain unobserved panel-level effects, fixed or random. By construction, the unobserved panel-level effects are correlated with the lagged dependent variables, making standard estimators inconsistent. Arellano and Bond (1991) derived a consistent generalized method-of-moments (GMM) estimator for the parameters of this model; xtabond implements this estimator.

This estimator is designed for datasets with many panels and few periods, and it requires that there be no autocorrelation in the idiosyncratic errors. For a related estimator that uses additional moment conditions, but still requires no autocorrelation in the idiosyncratic errors, see [XT] xtdpdsys. For estimators that allow for some autocorrelation in the idiosyncratic errors, at the cost of a more complicated syntax, see [XT] xtdpd.

1. xtdpdsys is an official command shipped since Stata 10, with a relatively concise syntax, whereas xtabond2 is a user-written command published by Roodman (2009), with a more elaborate syntax.

2. xtdpdsys can declare some regressors as predetermined through the pre() option and others as endogenous through the endog() option; xtabond2 declares endogenous regressors only through its gmm() option and has no dedicated option for predetermined variables.

3. xtdpdsys does not directly report the Sargan and AR(2) statistics after estimation (estat sargan and estat abond must be run to obtain them), whereas xtabond2 reports them directly and also reports the Hansen statistic.

xtdpdsys or xtdpd is a more concise way to write code for system GMM, but basically similar to xtabond2.

xtdpdsys or xtdpd can set the predetermined vars in pre() and endogenous vars in endog(), but they do not report the Sargan test and AR(2); you need estat sargan and estat abond to get those postestimation results, whereas xtabond2 reports them automatically.

Here is what the Stata website says about their differences:
http://www.stata-press.com/manuals/stata10/xtintro.pdf

b. New estimation command xtdpdsys fits dynamic panel-data models by using the Arellano–Bover/Blundell–Bond system estimator. xtdpdsys is an extension of xtabond and produces estimates with smaller bias when the AR process is too persistent. xtdpdsys is also more efficient than xtabond. Whereas xtabond uses moment conditions based on the differenced errors in producing results, xtdpdsys uses moment conditions based on differences and levels. See [XT] xtdpdsys.

c. New estimation command xtdpd fits dynamic panel-data models extending the Arellano–Bond or the Arellano–Bover/Blundell–Bond system estimator and allows a richer syntax for specifying models, and so will fit a broader class of models than either xtabond or xtdpdsys. xtdpd can be used to fit models with serially correlated idiosyncratic errors, whereas xtdpdsys and xtabond assume no serial correlation. xtdpd can be used with models where the structure of the predetermined variables is more complicated than that assumed by xtdpdsys or xtabond. See [XT] xtdpd.

d. New postestimation command estat abond tests for serial correlation in the first-differenced errors. See [XT] xtabond postestimation, [XT] xtdpdsys postestimation, and [XT] xtdpd postestimation.

e. New postestimation command estat sargan performs the Sargan test of overidentifying restrictions. See [XT] xtabond postestimation, [XT] xtdpdsys postestimation, and [XT] xtdpd postestimation.

clear
set more off
infile exp wks occ ind south smsa ms fem union ed blk lwage ///
    using "D:\软件培训资料\动态面板\aa.txt"
drop in 1
describe
summarize
generate person=group(595)
bysort person: generate period=group(7)

* panel data definition
xtset person period
xtdes
xtsum

generate exp2=exp^2
local x1 exp exp2 wks occ ind south smsa ms union
local x2 ed blk fem

* panel data regression: y=lwage
* x1=[1 exp exp2 wks occ ind south smsa ms union],
* x2=[ed blk fem] (time-invariant regressors)
xtdpdsys lwage occ ind south smsa, lags(1) maxldep(3) vce(robust) ///
    endogenous(ms union,lag(0,2)) pre(wks,lag(1,2)) twostep
estimates store ABB1
xtdpdsys lwage occ ind south smsa, lags(2) maxldep(3) vce(robust) ///
    endogenous(ms union,lag(0,2)) pre(wks,lag(1,2)) twostep
estimates store ABB2
xtdpdsys lwage occ ind south smsa, lags(3) maxldep(3) vce(robust) ///
    endogenous(ms union,lag(0,2)) pre(wks,lag(1,2)) twostep
estimates store ABB3
estimates table ABB1 ABB2 ABB3, b se t p

* hypothesis testing
quietly xtdpdsys lwage occ ind south smsa, lags(2) maxldep(3) ///
    endogenous(ms union,lag(0,2)) pre(wks,lag(1,2)) twostep artest(4)
estat abond   // test for autocorrelation
estat sargan  // test for IV overidentification

xtabond2 df age age2 ed12 nwe12 perd2 perd3 perd4 lnrtb3 ///
    dna dnk dms dhrsw dhrsh dyu2, gmm(L.(lnrtb3 dms dna dnk dfu dyu2 dhrsh dhrsw), lag(3) collapse) ///
    iv(age age2 edCol edColp ednoHS) twostep robust ///
    noconstant small orthogonal art(3)

* Examples copied directly from the xtabond2 help file
use http://www.stata-press.com/data/r7/abdata.dta
xtabond2 n l.n l(0/1).(w k) yr1980-yr1984, gmm(l.n w k) iv(yr1980-yr1984, passthru) noleveleq small
xtabond2 n l.n l(0/1).(w k) yr1980-yr1984, gmm(l.n w k) iv(yr1980-yr1984, mz) robust twostep small h(2)
xtabond2 n l(1/2).n l(0/1).w l(0/2).(k ys) yr1980-yr1984, gmm(l.n w k) iv(yr1980-yr1984) robust twostep small

* Next two are equivalent, assuming id is the panel identifier
ivreg2 n cap (w = k ys rec) [pw=_n], cluster(ind) orthog(rec)
xtabond2 n w cap [pw=_n], iv(cap k ys, eq(level)) iv(rec, eq(level)) cluster(ind) h(1)

* Same for next two
regress n w k
xtabond2 n w k, iv(w k, eq(level)) small h(1)

* And next two, assuming xtabond updated since May 2004 with update command.
xtabond n yr*, lags(1) pre(w, lags(1,.)) pre(k, endog) robust small noconstant
xtabond2 n L.n w L.w k yr*, gmm(L.(w n k)) iv(yr*) noleveleq robust small

* And next two
xtdpd n L.n L(0/1).(w k) yr1978-yr1984, dgmm(w k n) lgmm(w k n) liv(yr1978-yr1984) vce(robust) two hascons
xtabond2 n L.n L(0/1).(w k) yr1978-yr1984, gmm(L.(w k n)) iv(yr1978-yr1984, eq(level)) h(2) robust twostep

* Three ways to reduce the instrument count
xtabond2 n L.n L(0/1).(w k) yr1978-yr1984, gmm(L.(w k n)) iv(yr1978-yr1984, eq(level)) h(2) robust twostep pca
xtabond2 n L.n L(0/1).(w k) yr1978-yr1984, gmm(L.(w k n), collapse) iv(yr1978-yr1984, eq(level)) h(2) robust twostep
xtabond2 n L.n L(0/1).(w k) yr1978-yr1984, gmm(L.(w k n), lag(1 1)) iv(yr1978-yr1984, eq(level)) h(2) robust twostep


The Sargan test is a statistical test used to check for overidentifying restrictions in a statistical model. It is also known as the Hansen test or J test of overidentifying restrictions. The Sargan test is based on the observation that the residuals should be uncorrelated with the set of exogenous variables if the instruments are truly exogenous. The Sargan test statistic can be calculated as TR² (the number of observations multiplied by the coefficient of determination) from the OLS regression of the residuals (from IV estimation) on the set of exogenous variables. This statistic is asymptotically chi-squared with m − k degrees of freedom (where m is the number of instruments and k is the number of endogenous variables) under the null that the error term is uncorrelated with the instruments.



Data and Date Format Conversion in Stata


Keywords: Stata data format conversion, converting date formats in Stata, Stata panel data formats, Stata data type conversion

Data format conversion in Stata has long given me headaches. I have two datasets with the same yeartime variable, both displayed in yyyymm form, e.g. 200110. But in one dataset its type/format is long %12.0g, and in the other it is float %tmCCYYNN. I need to merge the two datasets on yeartime, and no amount of format conversion seems to work. Could someone show me how to handle this?


Featured answer:

Suppose the time variable is v1, displayed in yyyymmdd form. If v1 is numeric, the following code converts it to a Stata date:
gen year=int(v1/10000)
gen month=int((v1-year*10000)/100)
gen day=int((v1-year*10000-month*100))
gen date=mdy(month,day,year)
format date %td
If v1 is a string, use:
gen date=date(v1,"YMD")
format date %td
format date %td


Follow-up question:

Thanks for the reply! This date format business is driving me crazy. Data in %td format changes completely the moment I convert it to monthly: for example, 01nov1999 displays fine under %td, but once converted to %tm it becomes 3172m6. Why does this happen? Please teach me, thanks! Also, when converting a date variable to numeric or string, how can I make sure the display matches the original date? Using a method like yours, I converted the monthly value 2012022 to a string with g begin=string(year(yeartime)*10^2+month(yeartime,"%6.0f"), and the result came out as 196109. I have no idea what went wrong!

Your first question, converting daily data to monthly (note that changing the display format alone does not convert the underlying value; mofd() does the conversion). Again assuming the time variable is v1:
gen ym=mofd(v1)
format ym %tm
To convert to yearly data:
gen Year=year(DateAnnounced)
To convert to quarterly data:
gen yq=qofd(DateAnnounced)

On your second question: the best way to keep all date data consistent is to convert every version (string or numeric) into Stata's own date format.
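As for the 196109 result: year() and month() expect a daily date, so applying them to a %tm monthly value treats the month count (months since January 1960) as a day count. Convert month to day first with dofm(); a minimal sketch, with your variable name yeartime assumed to hold a %tm value:

gen begin = string(year(dofm(yeartime))*100 + month(dofm(yeartime)), "%06.0f")
// dofm() turns the monthly date into a daily date, so year() and month()
// then return the intended calendar year and month (e.g. "201202")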


Date Format Issues in Stata


Keywords: Stata date data formats, Stata date formats, data analysis

Reference: http://www.cdadata.com/16708

Reference document download: http://pan.baidu.com/s/1c0H7M28
Editor's note: When handling dates, and especially when reading data back and forth between different software packages, we often tear our hair out wondering why we cannot get the result we want. In my view, the various date formats deserve a systematic summary; the important functions in particular must be learned by heart before one can use them with ease.
Key points:
1. techniques for converting among string, numeric, and date types;
2. the many date functions;
3. using the string-extraction (substring) functions.
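A compact sketch of the conversions these key points mention (illustrative values and variable names assumed):

gen d1 = date("20080105", "YMD")    // string -> daily date value
gen d2 = mdy(1, 5, 2008)            // numeric components -> daily date value
format d1 d2 %td                    // display as 05jan2008
gen ym = mofd(d1)                   // daily date -> monthly date
format ym %tm                       // display as 2008m1
gen s = string(d1, "%td")           // date -> string; substr() can then extract pieces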

http://www.stata.com/statalist/archive/2011-10/msg00329.html

Re: st: RE: dates
From: Nick Cox <njcoxstata@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: RE: dates
Date: Sun, 9 Oct 2011 14:37:00 +0100
As you say, and as I implied, "month" here is in the sense of a Stata
monthly date, as returned by -mofd()-, not month of year alone.
Nick
On Sun, Oct 9, 2011 at 1:48 PM, Steven Samuels <sjsamuels@gmail.com> wrote:
> The one-step solution is very neat! I wasn't even aware of -dofm-. But substituting the month for -mofd- won't do, because information on year is missing. In the original one-step formula, substitute for "visdate" any date in the visit month, e.g. the first:
>
> day(dofm(1 + mofd(mdy(vmonth,1,vyear)))-1)
>
>
> Steve
>
>
> On Oct 7, 2011, at 8:01 AM, Nick Cox wrote:
>
> In this particular problem you don't have a daily date, but just use the month instead of the result of -mofd()-.
>
> Nick
> n.j.cox@durham.ac.uk
>
> Nick Cox
>
> Suppose -visdate- as here is a daily date variable.
>
> Then the length of the current month is given by the last day of the
> current month, which is given by the first day of the next month less
> 1.
>
> day(dofm(1 + mofd(visdate)) - 1)
>
> In steps:
>
> 1. current month is mofd(visdate)
> 2. next month is 1 + mofd(visdate)
> 3. first day of next month is dofm(1 + mofd(visdate))
> 4. last day of this month is dofm(1 + mofd(visdate)) - 1
> 5. day of last day ... you got it long since.
>
> But I never remember most of the function names and always have to
> look them up.
>
> It's key _never_ to type in rules about 28/31 or leap years, because
> Stata already knows.
>
> Nick
>
> On Thu, Oct 6, 2011 at 11:55 PM, Steven Samuels <sjsamuels@gmail.com> wrote:
>> Oops! The original algorithm assigned days only from 1 to 15. The correction is below.  A better version  would assign days according to whether the month has 28, 29, 30, or 31 days, but I'll leave that to others.
>>
>>
>> Steve
>>
>>
>>
>> With enough missing dates  it might be better to randomly assign a day of the month, or you risk distorting the distribution of inter-visit intervals.
>>
>>
>>
>>
>> *********************************
>> clear
>> input str10 date
>> 200801
>> 20080113
>> end
>> set seed 21932
>> gen visdate = date(date, "YMD")
>> tempvar day
>> gen str2 `day' = string(ceil(30*runiform())) if length(date)==6
>> replace `day' = "0"+`day' if real(`day')<10
>> gen fakeday = (length(date)==6)
>> replace visdate = date(date + `day', "YMD") if length(date)==6
>> format visdate %td
>> list date visdate fakeday
>> *****************************
>>
>>
>>
>> On Oct 6, 2011, at 5:46 PM, Michael Eisenberg wrote:
>>
>> Thanks so much.
>>
>> On Thu, Oct 6, 2011 at 8:23 AM, Nick Cox <n.j.cox@durham.ac.uk> wrote:
>>> You don't say what "without success" means precisely.
>>>
>>> "200801" does not match either date pattern. If there is no information on day of month, Stata can only return missing for a daily date.
>>>
>>> -date("200801" + "15", "YMD")- seems to be the most common fudge. I would always tag such guessed dates with an indicator variable.
>>>
>>> Nick
>>> n.j.cox@durham.ac.uk
>>>
>>> Michael Eisenberg
>>>
>>> I have a list of visit dates for patients.  Unfortunately, the format
>>> is not constant.
>>>
>>> Most are listed with the year, month, day such as 20080105 for Jan 5,
>>> 2008 but some are listed only with the year and month 200801 for Jan
>>> 2008.
>>>
>>> I attempted to convert them into stata dates with the commands below
>>> without success.
>>>
>>> gen ndate = date(dx_date, "YMD")
>>> or
>>> gen ndate = date(dx_date, "CCYYNNDD")
>>>
>>> Can stata handle such inconsistent data?

Code example:
clear
input str10 date
200801
20080113
end
set seed 21932
gen visdate = date(date, "YMD")
tempvar day
gen str2 `day' = string(ceil(30*runiform())) if length(date)==6
replace `day' = "0"+`day' if real(`day')<10
gen fakeday = (length(date)==6)
replace visdate = date(date + `day', "YMD") if length(date)==6
format visdate %td
list date visdate fakeday
gen ym=mofd(visdate)
format ym %tm
http://www.ssc.wisc.edu/sscc/pubs/stata_dates.htm

Working with Dates in Stata

Stata has many tools for working with dates. This article will introduce you to some of the most useful and easy to use features.
A Stata date is simply a number, but with the %td format applied Stata will interpret that number as "number of days since January 1, 1960." You can then use that number in a variety of ways. Stata has similar tools that measure time in terms of milliseconds, months, quarters, years and more. This article will focus on days, but if you know how to work with days you can quickly learn the others.
Often the first task is to convert the data you've been given into official Stata dates.

Converting Strings to Dates

If you've been given a date in string form, such as "November 3, 2010", "11/3/2010" or "2010-11-03 08:35:12" it can be converted using the date function. The date function takes two arguments, the string to be converted, and a series of letters called a "mask" that tells Stata how the string is structured. In a date mask, Y means year, M means month, D means day and # means an element should be skipped.
Thus the mask MDY means "month, day, year" and can be used to convert both "November 3, 2010" and "11/3/2010". A date like "2010-11-03 08:35:12" requires the mask YMD### so that the last three numbers are skipped. If you are interested in tracking the time of day you need to switch to the clock function and the %tc format so time is measured in milliseconds rather than days, but they are very similar.
To see this in action, type (or copy and paste) the following into Stata:
use http://www.ssc.wisc.edu/sscc/pubs/files/dates.dta
This is an example data set containing the above dates as dateString1, dateString2 and dateString3. To convert them to Stata dates do the following:
gen date1=date(dateString1,"MDY")
gen date2=date(dateString2,"MDY")
gen date3=date(dateString3,"YMD###")
Note that the mask goes in quotes.

Converting Numbers to Dates

Another common scenario gives you dates as three separate numeric variables, one for the year, one for the month and one for the day. The year, month and day variables in the example data set contain the same date as the others but in this format. To convert such dates to Stata dates, use the mdy function. It takes three numeric arguments: the month, day and year to be converted.
gen date4=mdy(month,day,year)

Formatting Date Variables

While the four date variables you've created are perfectly functional dates as far as Stata is concerned, they're difficult for humans to interpret. However, the %td format tells Stata to print them out as human readable dates:
format date1 %td
format date2 %td
format date3 %td
format date4 %td
This turns the 18569 now stored in all four variables into 03nov2010 (18,569 days since January 1, 1960) in all output. Try a list to see the result. If you remember your varlist syntax, you can do them all at once with:
format date? %td
You can have Stata output dates in different formats as well. For instructions type help dates and then click on the link Formatting date and time values.

Using Dates

Often your goal in creating a Stata date will be to create a time variable that can be included in a statistical command. If so, you can probably use it with no further modification. However, there are some common data preparation tasks involving dates.

Date Constants

If you need to refer to a particular date in your code, then in principle you could refer to it by number. However, it's usually more convenient to use the same functions used to import date variables. For example, the following are all equivalent ways of referring to November 3, 2010:
18569
date("November 3, 2010","MDY")
mdy(11,3,2010)
The td pseudofunction was designed for tasks like this and is somewhat more convenient to use. It takes a single argument (which cannot be a variable name) and converts it to a date on the assumption that the argument is a string containing a date in the format day, month, year. This matches the output of the %td format, e.g. 3nov2010. Thus the following is also equivalent:
td(3nov2010)
However, the following is not:
td(11/3/2010)
This will be interpreted as March 11, 2010, not November 3, 2010.
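A quick interactive check of both literals (the first line displays 03nov2010, the second 11mar2010):
display %td td(3nov2010)
display %td td(11/3/2010)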

Extracting Date Components

Sometimes you need to pull out the components of a date. You can do so with the year, month and day functions:
gen year1=year(date1)
gen month1=month(date1)
gen day1=day(date1)

Before and After

Since dates are just numbers, before and after are equivalent to less than and greater than. Thus:
gen before2010=(date1<date("January 1, 2010","MDY"))
gen after2010=(date1>date("January 1, 2010","MDY"))

Durations and Intervals

Durations in days can be found using simple subtraction. The example data set contains the dates beginning and ending, and you can find out the duration of the interval between them with:

gen duration=ending-beginning

Durations in months are more difficult because months vary in length. One common approach is to ignore days entirely and calculate the duration solely from the year and month components of the dates involved:

gen durationInMonths=(year(ending)-year(beginning))*12+month(ending)-month(beginning)

Just keep in mind that this approach says January 31 and February 1 are one month apart, while January 1 and January 31 are zero months apart.

Date Arithmetic

If you need to add (or subtract) a period measured in days to a date, it is straightforward to do so. Just remember to format all new date variables as dates with %td:

gen tenDaysLater=date1+10
gen yesterday=date1-1
format %td tenDaysLater yesterday

If the period is measured in weeks, just multiply by 7. Months are again problematic since different months have different lengths. Years have the same problem if you need to be precise enough to care about leap years.

You can avoid this by building a new date based on the components of the old one, modified as required. The only trick is that you must handle year changes properly. For example, the following works properly:

gen oneMonthLater=mdy(month(date1)+1,day(date1),year(date1))
format %td oneMonthLater

oneMonthLater is now December 3, 2010. But the following does not:

gen twoMonthsLaterBad=mdy(month(date1)+2,day(date1),year(date1))
format %td twoMonthsLaterBad

This tries to set the month component of the new date to 13, which is invalid. It needs to be January of the next year instead. The following code will allow you to add or subtract any number of months (just change the final number in the first line and the name of the new variable):

gen newMonth=month(date1)+2
gen newYear=year(date1)+floor((newMonth-1)/12)
replace newMonth=mod((newMonth-1),12)+1
gen twoMonthsLater=mdy(newMonth,day(date1),newYear)
format %td twoMonthsLater
drop newMonth newYear

If you need to do such things frequently you might want to turn this bit of code into a program, or even an ado file.
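For instance, here is a minimal sketch of such a program; the name addmonths and its argument order are invented for illustration, and note that mdy() returns missing when the target day does not exist (e.g. adding one month to January 31):

capture program drop addmonths
program define addmonths
    args datevar nmonths newvar
    tempvar m
    * month counter, possibly running past 12 or below 1
    gen `m' = month(`datevar') + `nmonths'
    * wrap the month into 1-12 and carry whole years into the year component
    gen `newvar' = mdy(mod(`m'-1,12)+1, day(`datevar'), year(`datevar') + floor((`m'-1)/12))
    format `newvar' %td
end

addmonths date1 2 twoMonthsLater

The temporary month counter is dropped automatically when the program exits, and negative month counts also work, because mod() in Stata always returns a non-negative result.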

Learning More

To read the full documentation on Stata dates, type help dates and then click on the dates and times link at the top (the PDF documentation is much easier to read in this case). There you'll learn to:

  • Work with times
  • Use intervals other than days, such as months, quarters or years
  • Create your own date format for output (e.g. November 3rd, 2010 rather than 3nov2010)
  • Track leap seconds, in case you need to be extremely precise–you'll also find an explanation of why such things exist

Last Revised: 11/9/2010

http://dss.princeton.edu/online_help/stats_packages/stata/time_series_data.htm

Time Series Data in Stata

Time series data and tsset

To use Stata’s time-series functions and analyses, you must first make sure that your data are, indeed, time-series. First, you must have a date variable that is in Stata date format. Secondly, you must make sure that your data are sorted by this date variable. If you have panel data, then your data must be sorted by the date variable within the variable that identifies the panel. Finally, you must use the tsset command to tell Stata that your data are time-series:

sort datevar
tsset datevar

or

sort panelvar datevar
tsset panelvar datevar

The first example tells Stata that you have simple time-series data, and the second tells Stata that you have panel data.

Stata Date Format

Stata stores dates as the number of elapsed days since January 1, 1960. There are different ways to create elapsed Stata dates that depend on how dates are represented in your data. If your original dataset already contains a single date variable, then use the date() function or one of the other string-date commands. If you have separate variables storing different parts of the date (month, day and year; year and quarter, etc.) then you will need to use the partial date variable functions.

Date functions for a single string date variable

Sometimes, your data will have the dates in string format. (A string variable is simply a variable containing anything other than just numbers.) Stata provides a way to convert these to time-series dates. The first thing you need to know is that the string must be easily separated into its components. In other words, strings like “01feb1990” “February 1, 1990” “02/01/90” are acceptable, but “020190” is not.

For example, let’s say that you have a string variable “sdate” with values like “01feb1990” and you need to convert it to a daily time-series date:

gen daily=date(sdate,"DMY")

Note that in this function, as with the other functions to convert strings to time-series dates, the “DMY” portion indicates the order of the day, month and year in the variable. Had the values been coded as “February 1, 1990” we would have used “MDY” instead. What if the original date only has two digits for the year? Then we would use:

gen daily=date(sdate,"DM19Y")

Whenever you have two digit years, simply place the century before the “Y.” If you have the last two digit years mixed, such as 1/2/98 and 1/2/00, use:

gen daily=date(sdate,"DMY",2020)

where 2020 is the largest year you have in your data set. Here are the other functions:

weekly(stringvar,"wy")
monthly(stringvar,"my")
quarterly(stringvar,"qy")
halfyearly(stringvar,"hy")
yearly(stringvar,"y")

Note: Stata 10 uses upper case letters, as in DMY, whereas earlier versions of Stata use lower case, dmy.

Date functions for partial date variables

Often you will have separate variables for the various components of the date; you need to put them together before you can designate them as proper time-series dates. Stata provides an easy way to do this with numeric variables. If you have separate variables for month, day and year then use the mdy() function to create an elapsed date variable. Once you have created an elapsed date variable, you will probably want to format it, as described below.

Use the mdy() function to create an elapsed Stata date variable when your original data contains separate variables for month, day and year. The month, day and year variables must be numeric. For example, suppose you are working with these data:

month day year
7 11 1948
1 21 1952
11 2 1994
8 12 1993

Use the following Stata command to generate a new variable named mydate:

gen mydate = mdy(month,day,year)

where mydate is an elapsed date variable, mdy() is the Stata function, and month, day, and year are the names of the variables that contain data for month, day and year, respectively.

If you have two variables, “year” and “quarter” use the “yq()” function:

gen qtr=yq(year,quarter)
gen qtr=yq(1990,3)

The other functions are:

mdy(month,day,year) for daily data
yw(year, week) for weekly data
ym(year,month) for monthly data
yq(year,quarter) for quarterly data
yh(year,half-year) for half-yearly data

Converting a date variable stored as a single number

If you have a date variable where the date is stored as a single number of the form yyyymmdd (for example, 20041231 for December 31, 2004) the following set of functions will convert it into a Stata elapsed date.

gen year = int(date/10000)
gen month = int((date-year*10000)/100)
gen day = int((date-year*10000-month*100))
gen mydate = mdy(month,day,year)
format mydate %d

Time series date formats

Use the format command to display elapsed Stata dates as calendar dates. In the example given above, the elapsed date variable, mydate, has the following values, which represent the number of days before or after January 1, 1960.

month day year mydate
7 11 1948 -4191
1 21 1952 -2902
8 12 1993 12277
11 2 1994 12724

You can use the format command to display elapsed dates in a more customary way. For example:

format mydate %d

where mydate is an elapsed date variable and %d is the format which will be used to display values for that variable.

month day year mydate
7 11 1948 11jul48
1 21 1952 21jan52
8 12 1993 12aug93
11 2 1994 02nov94

Other formats are available to control the display of elapsed dates.

Time-series dates in Stata have their own formats similar to regular date formats. The main difference is that for a regular date format a “unit” or single “time period” is one day. For time series formats, a unit or single time period can be a day, week, month, quarter, half-year or year. There is a format for each of these time periods:

Format Description Beginning +1 Unit +2 Units +3 Units
%td daily 01jan1960 02jan1960 03Jan1960 04Jan1960
%tw weekly week 1, 1960 week 2, 1960 week 3, 1960 week 4, 1960
%tm monthly Jan, 1960 Feb, 1960 Mar, 1960 Apr, 1960
%tq quarterly 1st qtr, 1960 2nd qtr, 1960 3rd qtr, 1960 4th qtr, 1960
%th half-yearly 1st half, 1960 2nd half, 1960 1st half, 1961 2nd half, 1961
%ty yearly 1960 1961 1962 1963

You should note that in the weekly format, the year is divided into 52 weeks. The first week is defined as the first seven days, regardless of what day of the week it may be. Also, the last week, week 52, may have 8 or 9 days. For the quarterly format, the first quarter is January through March. For the half-yearly format, the first half of the year is January through June.

It’s even more important to note that you cannot jump from one format to another by simply re-issuing the format command because the units are different in each format. Here are the corresponding results for January 1, 1999, which is an elapsed date of 14245:

%td        %tw       %tq      %th      %ty
01jan1999  2233w50   5521q2   9082h2   .

These dates are so different because the elapsed date is actually the number of weeks, quarters, etc., from the first week, quarter, etc of 1960. The value for %ty is missing because it would be equal to the year 14,245 which is beyond what Stata can accept.

Any of these time units can be translated to any of the others. Stata provides functions to translate any time unit to and from %td daily units, so all that is needed is to combine these functions, as the short example after the two lists below shows.

These functions translate to %td dates:

dofw() weekly to daily
dofm() monthly to daily
dofq() quarterly to daily
dofy() yearly to daily

These functions translate from %td dates:

wofd() daily to weekly
mofd() daily to monthly
qofd() daily to quarterly
yofd() daily to yearly
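For example, a round trip through monthly units for the 01jan1999 date above; the first line displays 1999m1 and the second 01jan1999, since dofm() returns the first day of the month:
display %tm mofd(td(01jan1999))
display %td dofm(mofd(td(01jan1999)))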

For more information see the Stata User’s Guide, chapter 27.

Specifying dates

Often we need to conduct a particular analysis only on observations that fall on a certain date. To do this, we have to use something called a date literal. A date literal is simply a way of entering a date in words and having Stata automatically convert it to an elapsed date. As with the d() literal to specify a regular date, there are the w(), m(), q(), h(), and y() literals for entering weekly, monthly, quarterly, half-yearly, and yearly dates, respectively. Here are some examples:

reg x y if w(1995w9)
sum income if q(1988-3)
tab gender if y(1999)

If you want to specify a range of dates, you can use the tin() and twithin() functions:

reg y x if tin(01feb1990,01jun1990)
sum income if twithin(1988-3,1998-3)

The difference between tin() and twithin() is that tin() includes the beginning and end dates, whereas twithin() excludes them. Always enter the beginning date first, and write them out as you would for any of the d(), w(), etc. functions.

Time Series Variable Lists

Often in time-series analyses we need to "lag" or "lead" the values of a variable from one observation to the next. If we have many variables, this can be cumbersome, especially if we need to lag a variable more than once. In Stata, we can specify which variables are to be lagged and how many times without having to create new variables, thus saving a lot of disk space and memory. You should note that the tsset command must have been issued before any of the "tricks" in this section will work. Also, if you have defined your data as panel data, Stata will automatically re-start the calculations as it comes to the beginning of a panel, so you need not worry about values from one panel being carried over to the next.

L.varname and F.varname

If you need to lag or lead a variable for an analysis, you can do so by using the L.varname (to lag) and F.varname (to lead). Both work the same way, so we’ll just show some examples with L.varname. Let’s say you want to regress this year’s income on last year’s income:

reg income L.income

would accomplish this. The “L.” tells Stata to lag income by one time period. If you wanted to lag income by more than one time period, you would simply change the L. to something like “L2.” or “L3.” to lag it by 2 and 3 time periods, respectively. The following two commands will produce the same results:

reg income L.income L2.income L3.income
reg income L(1/3).income

D.varname

Another useful shortcut is D.varname, which takes the difference between the current value of a variable and its previous value. For example, let's say a person earned $20 in one period and $30 in the next.

Date income D.income D2.income
02feb1999 20 . .
02mar1999 30 10 .
02apr1999 45 15 5

So, you can see that D.income_t = income_t - income_(t-1) and D2.income_t = (income_t - income_(t-1)) - (income_(t-1) - income_(t-2)).

S.varname

S.varname refers to seasonal differences and works like D.varname, except that the difference is always taken between the current observation and the nth preceding observation:

Date income S.income S2.income
02feb1999 20 . .
02mar1999 30 10 .
02apr1999 45 15 25

In other words: S.income_t = income_t - income_(t-1) and S2.income_t = income_t - income_(t-2)

For more on lags, leads, differences and seasonal differences, check the Time Series 101 guide.

Repost with attribution: 数据分析 » Date Format Issues in Stata

Tips 102: Handling Dynamic Panels with xtabond2


Tips 102: Handling Dynamic Panels with xtabond2

[Problem]

Ever lost sleep over dynamic panels?

I just saw an update to xtabond2, which newly adds e(Ze):

Z'E where E = 2nd-step residuals, used in computing the Hansen statistic

[Method]

Stata has a command for this, xtabond2, written by David Roodman:

http://www.cgdev.org/content/expert/detail/2719/

He has also written abar, newey2, ivvif, collapse2, and others.

Detailed documentation for xtabond2:

How to Do xtabond2:
An Introduction to "Difference" and "System" GMM in Stata

http://www.cgdev.org/files/11619_file_HowtoDoxtabond8_with_foreword.pdf

There is also a dedicated introductory PPT:

repec.org/nasug2006/How2Do_xtabond2.ppt

[Example]

* Examples copied straight from the help file

use http://www.stata-press.com/data/r7/abdata.dta

xtabond2 n l.n l(0/1).(w k) yr1980-yr1984, gmm(l.n w k) iv(yr1980-yr1984, passthru) noleveleq small

xtabond2 n l.n l(0/1).(w k) yr1980-yr1984, gmm(l.n w k) iv(yr1980-yr1984, mz) robust twostep small h(2)

xtabond2 n l(1/2).n l(0/1).w l(0/2).(k ys) yr1980-yr1984, gmm(l.n w k) iv(yr1980-yr1984) robust twostep small

* Next two are equivalent, assuming id is the panel identifier

ivreg2 n cap (w = k ys rec) [pw=_n], cluster(ind) orthog(rec)

xtabond2 n w cap [pw=_n], iv(cap k ys, eq(level)) iv(rec, eq(level)) cluster(ind) h(1)

* Same for next two

regress n w k

xtabond2 n w k, iv(w k, eq(level)) small h(1)

* And next two, assuming xtabond updated since May 2004 with update command.

xtabond n yr*, lags(1) pre(w, lags(1,.)) pre(k, endog) robust small noconstant

xtabond2 n L.n w L.w k yr*, gmm(L.(w n k)) iv(yr*) noleveleq robust small

* And next two

xtdpd n L.n L(0/1).(w k) yr1978-yr1984, dgmm(w k n) lgmm(w k n) liv(yr1978-yr1984) vce(robust) two hascons

xtabond2 n L.n L(0/1).(w k) yr1978-yr1984, gmm(L.(w k n)) iv(yr1978-yr1984, eq(level)) h(2) robust twostep

* Three ways to reduce the instrument count

xtabond2 n L.n L(0/1).(w k) yr1978-yr1984, gmm(L.(w k n)) iv(yr1978-yr1984, eq(level)) h(2) robust twostep pca

xtabond2 n L.n L(0/1).(w k) yr1978-yr1984, gmm(L.(w k n), collapse) iv(yr1978-yr1984, eq(level)) h(2) robust twostep

xtabond2 n L.n L(0/1).(w k) yr1978-yr1984, gmm(L.(w k n), lag(1 1)) iv(yr1978-yr1984, eq(level)) h(2) robust twostep

Repost with attribution: 数据分析 » Tips 102: Handling Dynamic Panels with xtabond2

Panel Data Analysis: Key Steps and Caveats (panel unit roots, panel cointegration, regression analysis)


Panel Data Analysis: Key Steps and Caveats (panel unit roots, panel cointegration, regression analysis)

Keywords: stata panel regression steps, panel data analysis, panel data analysis steps, panel data regression steps, panel data regression model steps

Step 1: Test the data for stationarity (unit root tests)

Formally, the data entering a panel model should be tested for stationarity before any regression. Li Zinai has noted that non-stationary economic time series often share a common trend even when the series have no direct relationship; regressing them on one another may yield a high R-squared, yet the result is meaningless. This situation is called spurious regression. In his formulation, stationarity really means that once a constant mean (viewed as the intercept) and a time trend are removed, the remaining series is zero-mean, homoskedastic white noise. A unit root test therefore comes in three modes: with both trend and intercept, with intercept only, and with neither.
So, to avoid spurious regression and ensure valid estimates, the stationarity of each panel series must be tested, and the most common tool for this is the unit root test. It helps to begin by plotting each series against time, to judge roughly whether the lines traced by the observations contain a trend term and/or an intercept, which prepares the choice of mode for the subsequent unit root test.
A brief literature review of panel unit root methods: for non-stationary panels, Levin and Lin (1993) found early on that the limiting distributions of the estimators are Gaussian; the results also apply to heteroskedastic panels, and they built an early version of a panel unit root test. Levin et al. (2002) refined this into the LLC test, which allows individual intercepts and time trends, heteroskedasticity, and higher-order serial correlation, and suits panels of moderate dimensions (25 to 250 time periods, 10 to 250 cross-sectional units). Im et al. (1997) proposed the IPS test, but Breitung (2000) found IPS very sensitive to the specification of restricted trends and proposed the Breitung test. Maddala and Wu (1999) then added the ADF-Fisher and PP-Fisher panel unit root tests.
From this review, five methods are available for panel unit root testing: LLC, IPS, Breitung, ADF-Fisher, and PP-Fisher.
Here LLC-T, BR-T, IPS-W, ADF-FCS, PP-FCS, and H-Z denote the Levin-Lin-Chu t* statistic, the Breitung t statistic, the Im-Pesaran-Shin W statistic, the ADF-Fisher chi-square statistic, the PP-Fisher chi-square statistic, and the Hadri Z statistic, respectively. The null hypothesis of the Levin-Lin-Chu t* and Breitung t statistics is that a common unit root process exists; the null of the Im-Pesaran-Shin W, ADF-Fisher chi-square, and PP-Fisher chi-square statistics is that an individual unit root process exists; and the null of the Hadri Z statistic is that no common unit root process exists.
Sometimes, for convenience, only two panel unit root tests are used: the common-root LLC (Levin-Lin-Chu) test and the individual-root Fisher-ADF test (for ordinary, non-panel series the usual method is the ADF test). If both tests reject the null of a unit root, the series is declared stationary; otherwise it is non-stationary.
If T (trend) denotes a series containing a trend term, I (intercept) one containing an intercept, T&I both, and N (none) neither, then the conclusions drawn from the time-series plots can guide the choice of the corresponding mode in the unit root test.
Conclusions based on plots are only rough, however; strictly speaking, each specification should be tested in turn. One can follow Li Zinai's prescription: the ADF test works through three models, starting from the one containing both intercept and trend, then the one with intercept only, and finally the one with neither. Only when none of the three rejects the null do we conclude that the series is non-stationary; as soon as any one of them rejects the null, the series is deemed stationary.
In addition, unit root testing normally starts from the level series; if a unit root is found, the series is first-differenced and tested again, and if a unit root remains, second or higher differences are taken until the series is stationary. We write I(0) for integration of order zero, I(1) for order one, and so on, up to I(N) for order N.
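In Stata, the tests named above map onto the xtunitroot command (built in since Stata 11; some subcommands require a balanced panel). A minimal sketch, assuming an xtset panel, a variable y, a lag length of 1, and trend terms chosen from the time-series plots:

xtunitroot llc y, trend
xtunitroot breitung y, trend
xtunitroot ips y, trend
xtunitroot fisher y, dfuller lags(1)
xtunitroot fisher y, pperron lags(1)
xtunitroot hadri y
* if the level series has a unit root, difference and re-test until stationary
gen dy = D.y
xtunitroot llc dy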

Step 2: Cointegration tests or model revision

Case 1: if the unit root tests show that the variables are integrated of the same order, a cointegration test can be run. Cointegration tests examine long-run equilibrium relationships among variables. Two or more non-stationary series are said to be cointegrated if some linear combination of them is stationary; same-order integration is therefore the requirement, or premise, for cointegration.
There is also a looser convention: if there are more than two variables, i.e. more than one explanatory variable, the dependent variable's order of integration must not exceed that of any explanatory variable; and if any explanatory variable's order exceeds the dependent variable's, then at least two explanatory variables must exceed it. If there are only two variables in the pair, the two should be integrated of the same order.
In other words, when non-stationary series of different orders enter a cointegration test together, the lower-order ones, whose fluctuations are faint relative to those of higher-order series (the amplitudes may differ too), have little effect on the cointegration result, so whether they are included matters little. The highest-order series, by contrast, fluctuates strongly and has a large effect on the stationarity of the regression residuals; so if the candidate set contains some series of higher integration order than the rest (unless all variables share that same higher order, i.e. are still same-order integrated, which is a different situation), those series must not be entered into the cointegration test.
A brief literature review of cointegration tests: (1) Kao (1999) and Kao and Chiang (2000) extended the DF and ADF tests to panel cointegration; the null hypothesis is no cointegration, and the statistics are built from the residuals of a static panel regression. (2) Pedroni (1999) proposed seven residual-based panel cointegration tests under the null of no cointegration in a dynamic multivariate panel regression; unlike Kao's tests, Pedroni's allow for heterogeneous panels. (3) Larsson et al. (2001) developed a panel cointegration test based on the likelihood test of Johansen's (1995) vector autoregression, which tests whether the units share a common cointegrating rank.
The methods mainly used are those of Pedroni, Kao, and Johansen.
Passing the cointegration test means the variables share a long-run, stable equilibrium relationship and the residuals of the regression equation are stationary. The original equation can then be estimated directly, and the results are fairly precise.
At this point one may want to go further and run a panel Granger causality test (causality testing presupposes cointegration). If the variables are not cointegrated (not integrated of the same order), a Granger causality test cannot be run directly, but the data can be processed first. To quote Zhang Xiaotong: "If y and x are integrated of different orders, a Granger causality test cannot be done, but same-order integrated series can be obtained by differencing or other processing, and one must then check whether they still carry economic meaning."
Briefly, what causality means here: it is causality in the statistical sense, expressed through probabilities or distribution functions. Holding the occurrence of all other events fixed, if whether an event X occurs affects the probability that another event Y occurs (or, once random variables are defined on the events, their distribution functions), and X precedes Y in time, then we may say X is a cause of Y. In its simplest form, the Granger test uses an F statistic to test whether the lagged values of X significantly affect Y (in the statistical sense, after Y's own lags have been accounted for); if the effect is not significant, X is not a "Granger cause" of Y; if it is significant, X is. The same procedure tests whether Y "causes" X, i.e. whether lagged Y affects X (after allowing for the effect of X's own lags).
Eviews does not seem to offer a Granger causality test in the POOL window, only unit root and cointegration tests, which means Eviews cannot run Granger tests on panel series; its Granger test applies only to ordinary pairs of series (pairwise). If you want a causality test on some pooled series from panel data, first export the relevant series to a group (Proc/Make Group in the POOL window) and try from there.
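In Stata, by contrast, a minimal sketch might look as follows, assuming Stata 15 or newer for the built-in xtcointtest command, I(1) variables y and x, and the user-written xtgcause package (ssc install xtgcause) for a Dumitrescu-Hurlin panel Granger test on differenced, same-order-stationary series:

xtcointtest kao y x
xtcointtest pedroni y x
* panel Granger causality on first differences (user-written command)
gen dy = D.y
gen dx = D.x
xtgcause dy dx, lags(1)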

Case 2: if the unit root tests show that the variables are not integrated of the same order, i.e. some panel series are stationary while others are not, then neither a cointegration test nor a direct regression on the original series is possible. No need to panic, though: provided the variables keep their economic meaning, the model proposed earlier can be revised to remove the damage that non-stationary data do to the regression, for example by differencing certain series, turning level data at a given time frequency into change or growth-rate data at that frequency. The research then shifts to a new model, which must still make economic sense. For this reason the original series should generally not be second-differenced: differencing change or growth-rate data leaves something with no ready economic interpretation. Would you call it the rate of change of a rate of change?

Step 3: Choosing and estimating the panel model

Panel data models usually take one of three forms:
The first is the pooled regression model: if there are no significant differences across individuals over time, nor across cross-sections, the panel can simply be pooled and the parameters estimated by ordinary least squares (OLS). The second is the fixed effects regression model: if the intercept differs across cross-sections or across time series, the regression parameters can be estimated by adding dummy variables to the model. The third is the random effects regression model: if the intercept of the fixed effects model comprises the average effect of a cross-sectional random error term and a time random error term, and both error terms are normally distributed, the fixed effects model becomes a random effects model.
For choosing the model form, an F test is commonly used to decide between the pooled and fixed effects models, and a Hausman test then determines whether to build a random effects or a fixed effects model.
Once these tests are done, we know which model to use, and the regression can begin; see the sketch at the end of this section for the Stata commands involved.
When regressing, the weights can be set as cross-section weights, all the more so when the number of cross-sections exceeds the number of time periods, to allow different sections to display heteroskedasticity. Estimation can use the PCSE (Panel Corrected Standard Errors) method. The PCSE estimator introduced by Beck and Katz (1995) was an innovation in panel estimation and handles complex panel error structures, such as contemporaneous correlation, heteroskedasticity, and serial correlation, effectively; it is especially useful when the sample is not large.
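A minimal sketch of that selection sequence and a PCSE regression in Stata, assuming an xtset panel with dependent variable y and regressors x1 and x2:

xtreg y x1 x2, fe
* the F test of "all u_i = 0" at the foot of the FE output compares FE with pooled OLS
estimates store fe
xtreg y x1 x2, re
estimates store re
hausman fe re
* pooled OLS with panel-corrected standard errors (Beck and Katz 1995)
xtpcse y x1 x2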

Repost with attribution: 数据分析 » Panel Data Analysis: Key Steps and Caveats (panel unit roots, panel cointegration, regression analysis)

Handling Heteroskedasticity in Panel Data with Stata


Handling Heteroskedasticity in Panel Data with Stata

Keywords: stata panel heteroskedasticity tests, handling panel data in stata, using stata for panel data

I. Introduction

The widespread use of computing and Internet technology has greatly improved the availability of data, allowing large volumes of data to be collected, stored, and organized. At the same time, the standing of econometrics within economics keeps rising: applied econometric papers now make up a substantial share of the articles in top economics journals. Against this background, panel data have won over more and more economic researchers, and applied panel data work has become a hot topic.

Panel data are a research hot spot partly because of their own excellent properties, and partly because many problems and unexplored areas remain in applying them. In panel regression analysis, if heteroskedasticity is present, the least-squares coefficient estimates are still linear, unbiased, and consistent, but they are not efficient, not even asymptotically efficient. This undermines both parameter estimation and hypothesis testing.

II. Where heteroskedasticity comes from

Many factors produce heteroskedasticity: important explanatory variables omitted from the model, an inaccurately specified functional form, measurement error in the sample data, outliers, differences between cross-sectional units, and so on. Panel data combine time-series and cross-sectional dimensions, so heteroskedasticity can appear not only along the time series but also across the cross-section, making the problem far more complex in panel models than in pure time-series or pure cross-section models.

III. Ways to handle heteroskedasticity in panel data

In practice, when running linear regressions on panel data, the heteroskedasticity question mainly concerns the fixed effects model and pooled OLS, because the random effects model is estimated by GLS and so already controls for heteroskedasticity.

Huber (1967), Eicker (1967), and White (1980) proposed the heteroskedasticity-robust variance matrix estimator, which delivers robust standard errors in the presence of heteroskedasticity. t tests and F tests of the regression coefficients based on these robust standard errors are asymptotically valid. This means that when heteroskedasticity appears, OLS can still be used, simply combined with robust standard errors. In Stata, heteroskedasticity-robust standard errors are obtained by appending the option robust to a reg or xtreg command. The method does rest on one assumption: that the residuals are independently distributed.

Parks (1967) proposed feasible generalized least squares (FGLS), generally used to estimate random effects models. The basic idea: first estimate the fixed effects model to obtain an estimate σ̂²_ε of the individual error variance σ²_ε; then estimate the pooled OLS model and use its residuals, together with the σ̂²_ε from the first step, to estimate the variance of the overall error term, σ̂²_μ. The FGLS estimator is asymptotically efficient as N→∞, as T→∞, or as both hold. In Stata, the FGLS command is xtgls. FGLS handles heteroskedasticity more efficiently than "OLS + robust standard errors", especially in large samples; in more general settings, however, "OLS + robust standard errors" is the more robust of the two, because it does not require estimating the form of the conditional variance function.

Beck and Katz (1995) argued that the standard errors FGLS produces are too small. To address this, they proposed panel-corrected standard errors (PCSE) for estimating the OLS coefficients. In Stata, pooled OLS with PCSE is available through xtpcse. PCSE is asymptotically valid only as T→∞, however, and is not precise enough when T/N is small.

Driscoll and Kraay (1998) proposed a nonparametric covariance matrix estimator that is asymptotically valid as N→∞ and yields standard errors that are consistent under both heteroskedasticity and autocorrelation, overcoming PCSE's inaccuracy as N→∞. In Stata, Driscoll-Kraay standard errors come from the command xtscc; note that xtscc only estimates pooled OLS and fixed effects (within) regression models.
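A minimal sketch of the four approaches above, assuming an xtset panel with variables y, x1, x2, and the user-written xtscc package installed (ssc install xtscc):

* fixed effects with heteroskedasticity-robust standard errors
xtreg y x1 x2, fe robust
* FGLS allowing panel-level heteroskedasticity
xtgls y x1 x2, panels(heteroskedastic)
* pooled OLS with panel-corrected standard errors
xtpcse y x1 x2
* fixed effects with Driscoll-Kraay standard errors
xtscc y x1 x2, fe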

IV. Conclusion

The comparison above shows that judging the methods purely on how they treat heteroskedasticity is not enough; the choice should also weigh the sample situation, the model specification, and one's own priorities (such as a preference for robustness or for efficiency).

References:

[1] Huber, P. J. 1967. The behavior of maximum likelihood estimates under nonstandard conditions. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, 221-233. Berkeley, CA: University of California Press.

[2] Eicker, F. 1967. Limit theorems for regressions with unequal and dependent errors. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, ed. L. Le Cam and J. Neyman, 59-82. Berkeley, CA: University of California Press.

[3] White, H. 1980. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48: 817-838.

[4] Parks, R. 1967. Efficient estimation of a system of regression equations when disturbances are both serially and contemporaneously correlated. Journal of the American Statistical Association 62: 500-509.

[5] Beck, N., and J. N. Katz. 1995. What to do (and not to do) with time-series cross-section data. American Political Science Review 89: 634-647.

[6] Driscoll, J., and A. C. Kraay. 1998. Consistent covariance matrix estimation with spatially dependent data. Review of Economics and Statistics 80: 549-560.

Repost with attribution: 数据分析 » Handling Heteroskedasticity in Panel Data with Stata

Lian Yujun's Stata Videos: Beginner, Advanced, and Paper Tutorials (videos + handouts + data)


Lian Yujun's Stata Videos: Beginner, Advanced, and Paper Tutorials (videos + handouts + data)

The videos are not encrypted and play in any ordinary player, as well as on phones and iPads, which makes them very convenient to use.

The complete set of Lian Yujun's beginner and advanced training and academic-paper tutorials, worth 6,900 yuan, now costs only 99 yuan.

If interested, add WeChat: efenxi

(1) STATA Beginner Video Course

5 topics comprising 39 video files, over 40 class hours in total, covering: getting started with STATA, data management, graphing, matrices, and programming.

(2) STATA Advanced Video Course

9 lectures, 48 video files, over 50 class hours in total

(3) STATA Academic-Paper Video Course

The Stata academic-paper series (video course) comes in two parts: part one gives close readings of 13 papers, part two covers writing technique; 61 video files in all.

Advanced video course on Stata in social science research (PPT + companion data + do-files)

Stata value learning bundle


Repost with attribution: 数据分析 » Lian Yujun's Stata Videos: Beginner, Advanced, and Paper Tutorials (videos + handouts + data)
