This is the last post of the series about GMM, short for the Generalized Method of Moments in Econometrics and Statistics. Interested readers are welcome to review my first post Generalized Method of Moments (GMM) in R (Part 1 of 3) for the basic ideas of GMM and its application using the R package gmm, and the second post Generalized Method of Moments (GMM) in R (Part 2 of 3) for the illustration of QMLE. From these two posts one can find:

  • OLS and MLE estimators can be viewed as special cases of GMM estimators, and furthermore as MM estimators since the number of parameters equals the number of moment conditions. However, even when the OLS and MLE estimators are exactly equal to the MM estimators, one still needs to be careful with the standard errors returned by packages/software. For a given estimator, different sampling distributions (variances) can be derived from distinct assumptions about the data generating process. Such assumptions are often subtle and hidden, and users are likely to neglect them when applying the packages/software. For example, OLS routines usually compute standard errors under homoskedasticity by default, while MLE routines usually compute standard errors assuming the distributional assumptions are valid. That is why we need to make sure such underlying assumptions are reliable in empirical studies. Of course, the common strategy is to demonstrate the robustness of the results, so results with robust standard errors are usually presented in papers.
  • Without the knowledge of GMM (and the usage of the gmm package), one can still obtain OLS and MLE estimators with robust standard errors using dedicated packages, such as sandwich in R. This is indeed true to some degree. The discussion of GMM gives a broader picture of the relations among estimators, but one can still just report OLS and MLE results while ignoring GMM. That is also why GMM need NOT be discussed in a first course in Econometrics, or perhaps in any course for those PhD students who only ever rely on OLS and maybe some MLE. A brief discussion of causal inference may help to explain this issue. In this post, causal inference as the basic task of Econometrics (and probably other scientific fields) is discussed first. Then, instrumental variables (IV) and Two-Stage Least Squares (2SLS) are discussed from the viewpoint of GMM.

Causal Inference in Empirical Study

Across the disciplines of social science (e.g. Economics, Psychology, Marketing, Management, etc.), various theories are developed to study the causal effect of some variable \(x\) on the target variable \(y\). In my first textbook, Economics (Samuelson, P., & Nordhaus, W., 2009), the authors illustrate the basic logic of Economics in the first chapter and remind us:

Note: “Remember to hold other things constant when you are analyzing the impact of a variable on the economic system.”

In general, when constructing a theory, only the relevant variables and factors are included in the system, whether such a system is expressed in mathematical form or not. It does not mean the target variable \(y\) ONLY depends on the other variables within such a system; \(y\) may also depend on variables within other theoretical frameworks. Therefore, one should be careful that even though the hypothesis derived from some theoretical framework only involves \(y\) and some \(x\), the other relevant variables NOT shown in such a hypothesis should be treated as held constant. For example, the law of demand says that the demand for a normal good increases if its price decreases; the complete picture of the factors influencing the demand for that good also involves the individual's income, the prices of other goods, etc. These do NOT appear in the statement of the law of demand, but there is consensus that they are held constant for the specified effect.

In an empirical study quantifying or testing some causal effect, all of the relevant factors or variables affecting \(y\), in addition to the \(x\) shown in the hypothesis, should be considered theoretically. Here is a gentle introduction to the operationalized concept of causal inference in empirical study from Hansen's textbook (Hansen, Bruce E., 2021). Suppose the following function describes how \(y\) depends on all its determinants:

\[y=h\left(x_{1}, \boldsymbol{x}_{2}, \boldsymbol{u}\right)\]

where \(x_1\) is the observed focal variable to be studied for causal inference, \(\boldsymbol{x}_2\) are the other observed factors affecting \(y\), and \(\boldsymbol{u}\) are unobserved factors affecting \(y\). Please note that \(\boldsymbol{u}\) are unobserved mainly due to (1) data availability or (2) the limitation of current theories. Therefore, this expression is general enough to describe all determinants of \(y\). Writing \(C\left(x_{1}, \boldsymbol{x}_{2}, \boldsymbol{u}\right)=\nabla_{1} h\left(x_{1}, \boldsymbol{x}_{2}, \boldsymbol{u}\right)\) for the causal effect of \(x_1\) holding \(\boldsymbol{x}_{2}\) and \(\boldsymbol{u}\) fixed, the average causal effect (ACE) of \(x_1\) on \(y\) conditional on \(\boldsymbol{x}_2\) is

\[\begin{aligned} \operatorname{ACE}\left(x_{1}, \boldsymbol{x}_{2}\right) &=\mathbb{E}\left[C\left(x_{1}, \boldsymbol{x}_{2}, \boldsymbol{u}\right) \mid x_{1}, \boldsymbol{x}_{2}\right] \\ &=\int_{\mathbb{R}^{\ell}} \nabla_{1} h\left(x_{1}, \boldsymbol{x}_{2}, \boldsymbol{u}\right) f\left(\boldsymbol{u} \mid x_{1}, \boldsymbol{x}_{2}\right) d \boldsymbol{u} \end{aligned}\]

where \(f\left(\boldsymbol{u} \mid x_{1}, \boldsymbol{x}_{2}\right)\) is the conditional density of \(\boldsymbol{u}\) given \(x_1\) and \(\boldsymbol{x}_2\). Since \(\boldsymbol{u}\) always exists and is unobserved, the usual treatment is to integrate it out via the conditional expectation. That is, the average causal effect (ACE) is the operationalized concept in empirical study. Another highly relevant concept, the Conditional Expectation Function (CEF), is also introduced:

\[\begin{aligned} m\left(x_{1}, \boldsymbol{x}_{2}\right) &=\mathbb{E}\left[h\left(x_{1}, \boldsymbol{x}_{2}, \boldsymbol{u}\right) \mid x_{1}, \boldsymbol{x}_{2}\right] \\ &=\int_{\mathbb{R}^{\ell}} h\left(x_{1}, \boldsymbol{x}_{2}, \boldsymbol{u}\right) f\left(\boldsymbol{u} \mid x_{1}, \boldsymbol{x}_{2}\right) d \boldsymbol{u} \end{aligned}\]

Applying the marginal effect operator, the derivative of CEF with respect to \(x_1\) is

\[\begin{aligned} \nabla_{1} m\left(x_{1}, \boldsymbol{x}_{2}\right) &=\int_{\mathbb{R}^{\ell}} \nabla_{1} h\left(x_{1}, \boldsymbol{x}_{2}, \boldsymbol{u}\right) f\left(\boldsymbol{u} \mid x_{1}, \boldsymbol{x}_{2}\right) d \boldsymbol{u} \\ &+\int_{\mathbb{R}^{\ell}} h\left(x_{1}, \boldsymbol{x}_{2}, \boldsymbol{u}\right) \nabla_{1} f\left(\boldsymbol{u} \mid x_{1}, \boldsymbol{x}_{2}\right) d \boldsymbol{u} \\ &=\operatorname{ACE}\left(x_{1}, \boldsymbol{x}_{2}\right)+\int_{\mathbb{R}^{\ell}} h\left(x_{1}, \boldsymbol{x}_{2}, \boldsymbol{u}\right) \nabla_{1} f\left(\boldsymbol{u} \mid x_{1}, \boldsymbol{x}_{2}\right) d \boldsymbol{u} \end{aligned}\]

It is interesting that the derivative of the CEF, which is the object most commonly discussed in Econometrics and Statistics, is NOT equal to the ACE in general. However, if, conditional on \(\boldsymbol{x}_2\), the random variables \(x_1\) and \(\boldsymbol{u}\) are statistically independent (the Conditional Independence Assumption (CIA)), then

\[\nabla_{1} m\left(x_{1}, \boldsymbol{x}_{2}\right)=\operatorname{ACE}\left(x_{1}, \boldsymbol{x}_{2}\right)\]

i.e. the CEF derivative equals the average causal effect (ACE) of \(x_1\) on \(y\) conditional on \(\boldsymbol{x}_2\). If CIA can be justified, then the following derivation is quite satisfactory and somewhat remarkable. We can define the CEF error \(e\) as:

\[e=y-m(\boldsymbol{x})\]

By construction, this yields the formula:

\[y = m(\boldsymbol{x}) + e\]

The key property of the CEF error is that it has a conditional mean of zero, called conditional mean independence. This property is easily verified:

\[\begin{aligned} \mathbb{E}[e \mid \boldsymbol{x}] &=\mathbb{E}[(y-m(\boldsymbol{x})) \mid \boldsymbol{x}] \\ &=\mathbb{E}[y \mid \boldsymbol{x}]-\mathbb{E}[m(\boldsymbol{x}) \mid \boldsymbol{x}] \\ &=m(\boldsymbol{x})-m(\boldsymbol{x}) \\ &=0 \end{aligned}\]

This fact can be combined with the law of iterated expectations to show that

\(\mathbb{E}[e \mid \boldsymbol{x}]=0\) (conditional mean independence) \(\implies\) \(\mathbb{E}[\boldsymbol{x} e]=\mathbf{0}\)
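
Indeed, the implication follows directly from the law of iterated expectations:

\[\mathbb{E}[\boldsymbol{x} e]=\mathbb{E}[\mathbb{E}[\boldsymbol{x} e \mid \boldsymbol{x}]]=\mathbb{E}[\boldsymbol{x} \mathbb{E}[e \mid \boldsymbol{x}]]=\mathbb{E}[\boldsymbol{x} \cdot 0]=\mathbf{0}\]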

That is, if the CEF \(m(\boldsymbol{x})\) is furthermore assumed to be linear, then \(\boldsymbol{\beta}\) in \(m(\boldsymbol{x})=x_{1} \beta_{1}+x_{2} \beta_{2}+\cdots+\beta_{k}=\boldsymbol{x}^{\prime} \boldsymbol{\beta}\) coincides with the coefficient vector of the linear projection model. That is perfect news, since the OLS estimator is then a consistent estimator of such an ACE.
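
To make this concrete, here is a minimal simulation sketch (my own illustration, not from the original post): when the focal \(x_1\) is randomly assigned so that CIA holds, the OLS slope on \(x_1\) recovers the ACE.

# Minimal sketch (my own illustration): under random assignment of x1,
# CIA holds and the OLS slope consistently estimates the ACE
set.seed(1024)
n <- 100000
x2 <- rnorm(n)                  # observed control
u <- rnorm(n)                   # unobserved factor, independent of x1
x1 <- rbinom(n, 1, 0.5)         # "randomly assigned" focal variable
y <- 1 + 2 * x1 + 0.5 * x2 + u  # true ACE of x1 on y is 2

coef(lm(y ~ x1 + x2))           # slope on x1 should be close to 2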

One point should be noted: it is neither unrealistic nor unscientific for researchers to rely solely on OLS regression or some MLE in their empirical studies to quantify and test the ACE. However, two conditions should be clarified. First, from the above discussion, CIA is the key to justifying the equality between the CEF derivative and the ACE. If specific control methods are implemented to guarantee CIA, then researchers can always obtain such equality. The usual control methods are experimental, especially random assignment. Nowadays, studies in Marketing and (Behavioral) Economics often borrow the experimental methodology employed in Psychology, so that CIA, and thus the independence between the focal \(x_1\) and other unobserved variables, is supported. With random assignment the correlation between \(x_1\) and the unobservables is broken and alternative explanations are ruled out, so that the change in expected \(y\) can be safely attributed to the manipulation of \(x_1\). Second, linearity of the CEF is another key assumption. Consistent with experimental designs, only a few qualitative levels of the focal \(x_1\) are usually considered, so the exact functional form for that qualitative information is not that important. For this reason, the training of Ph.D. students in several business fields that rely on experimental studies usually covers the basics of Econometrics and Statistics but concentrates heavily on experimental design. On the other hand, the hypotheses of interest in those fields usually involve psychological constructs (e.g. anxiety, happiness, etc.) which are NOT directly captured by secondary data. Therefore, the only way to conduct an empirical study of such a hypothesis is to collect primary data in the lab.

The situation becomes complicated for researchers relying on secondary data. In most fields of Economics, the hypotheses usually involve observable variables only. Even though the underlying theories are built with some abstract terms, e.g. utility, the implied hypotheses should be observable. Nowadays, fields such as Behavioral Economics, Mechanism Design, and Game Theory also try lab experiments, but in most cases their focus is still on observable but hard-to-quantify variables, e.g. some systems or mechanisms, not on psychological constructs directly. Therefore, most researchers in Economics need to deal with secondary data, which are high dimensional and usually correlated with each other. That is, inequality between the CEF derivative and the ACE is the common case, and a second or third course in Econometrics is always needed.

Instrumental Variables, Two-Stage Least Squares, and GMM

Based on theoretical frameworks, a structural model or structural equation is the usual object of interest in empirical study, instead of the CEF. The classical example is the study of the effect of education on wages:

\[\log (\text {wage})= \beta \text{education }+e\]

with \(\beta\) the average causal effect of education on wages. Before rushing to the sample and OLS regression, one can perceive that if wages are affected by unobserved ability, and individuals with high ability self-select into higher education, then \(e\) contains unobserved ability, so education and \(e\) will be positively correlated. This simple argument already destroys CIA. That is, \(\beta\) here is NOT the projection coefficient. We say that there is endogeneity in the linear model

\[y_{i}=\boldsymbol{x}_{i}^{\prime} \boldsymbol{\beta}+e_{i} \label{eq1} \tag{1}\]

if \(\boldsymbol{\beta}\) is the parameter of interest and \(\mathbb{E}\left[\boldsymbol{x}_{i} e_{i}\right] \neq \mathbf{0} \label{eq2} \tag{2}\)

Please note that endogeneity just precludes \(\eqref{eq1}\) from being a linear CEF or linear projection model. Given \(y_i\) and \(\boldsymbol{x}_i\) in \(\eqref{eq1}\), we can nearly always construct the linear projection model

\[\begin{aligned} y_{i} &=\boldsymbol{x}_{i}^{\prime} \boldsymbol{\beta}^{*}+e_{i}^{*} \\ \mathbb{E}\left[\boldsymbol{x}_{i} e_{i}^{*}\right] &=\boldsymbol{0} \end{aligned}\]

However, under endogeneity \(\eqref{eq2}\) the projection coefficient \(\boldsymbol{\beta}^{*}\) does not equal the structural parameter \(\boldsymbol{\beta}\):

\[\begin{aligned} \boldsymbol{\beta}^{*} &=\left(\mathbb{E}\left[\boldsymbol{x}_{i} \boldsymbol{x}_{i}^{\prime}\right]\right)^{-1} \mathbb{E}\left[\boldsymbol{x}_{i} y_{i}\right] \\ &=\left(\mathbb{E}\left[\boldsymbol{x}_{i} \boldsymbol{x}_{i}^{\prime}\right]\right)^{-1} \mathbb{E}\left[\boldsymbol{x}_{i}\left(\boldsymbol{x}_{i}^{\prime} \boldsymbol{\beta}+e_{i}\right)\right] \\ &=\boldsymbol{\beta}+\left(\mathbb{E}\left[\boldsymbol{x}_{i} \boldsymbol{x}_{i}^{\prime}\right]\right)^{-1} \mathbb{E}\left[\boldsymbol{x}_{i} e_{i}\right] \\ & \neq \boldsymbol{\beta} \end{aligned}\]

where the final relation holds since \(\mathbb{E}\left[\boldsymbol{x}_{i} e_{i}\right] \neq \mathbf{0}\). Therefore, the OLS estimator can only be consistent for \(\boldsymbol{\beta}^{*}\) but NOT for \(\boldsymbol{\beta}\).
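
A quick simulation sketch (again my own toy example, not from the original post) shows the gap between \(\boldsymbol{\beta}^{*}\) and \(\boldsymbol{\beta}\) numerically: the regressor and the structural error share an unobserved component, and OLS settles at the projection coefficient rather than the structural one.

# Toy example (my own sketch): x and e share the unobserved component u,
# so OLS converges to beta* = beta + Cov(x, e) / Var(x), not to beta
set.seed(1024)
n <- 100000
u <- rnorm(n)           # unobserved "ability"
x <- 1 + u + rnorm(n)   # endogenous regressor: contaminated by u
e <- u + rnorm(n)       # structural error: also contains u
y <- 1 + 2 * x + e      # true structural beta is 2

coef(lm(y ~ x))         # slope settles near 2 + 1/2 = 2.5, not 2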

For such an endogeneity problem, more data, called instrumental variables (IV), are needed. The basic idea is to use instrumental variables to substitute for the endogenous variable in the structural model. Two conditions are required for an IV: (1) it is correlated with the endogenous variable, so the substitution is informative; (2) it, together with the other exogenous variables, is uncorrelated with the structural error \(e_i\), giving the moment conditions:

\[\mathbb{E}\left[e_{i} z_{i}\right]=\mathbb{E}\left[\left(y_{i}-x_{i}^{\prime} \beta\right) z_{i}\right]=0\]

Suppose the total number of IVs and exogenous variables is \(R\) and the number of unknown parameters is \(K\). If \(R=K\) then we are back to the just-identified case, similar to OLS regression. Thus, the MM estimator is available as the zero solution to the system of all \(R\) (\(=K\)) sample moment conditions:

\[\frac{1}{N} \sum_{i=1}^{N}\left(y_{i}-x_{i}^{\prime} \hat{\beta}_{I V}\right) z_{i}=0\]

and we can obtain

\[\hat{\beta}_{I V}=\left(Z^{\prime} X\right)^{-1} Z^{\prime} y\]

which is called the IV estimator for \(\beta\). Given the validity of the moment conditions (i.e. the validity of the IVs), it is consistent for \(\beta\). Sometimes we have more IVs than endogenous variables, so that \(R>K\), which is the over-identified case. Then the following quadratic form should be minimized for the GMM estimator:

\[Q_{N}(\beta)=\left[\frac{1}{N} \sum_{i=1}^{N}\left(y_{i}-x_{i}^{\prime} \beta\right) z_{i}\right]^{\prime} W_{N}\left[\frac{1}{N} \sum_{i=1}^{N}\left(y_{i}-x_{i}^{\prime} \beta\right) z_{i}\right]\]

and the solution is

\[\hat{\beta}_{I V}=\left(X^{\prime} Z W_{N} Z^{\prime} X\right)^{-1} X^{\prime} Z W_{N} Z^{\prime} y\]

and it reduces to the above IV estimator when \(R=K\):

\[\begin{aligned} \hat{\beta}_{I V} &=\left(Z^{\prime} X\right)^{-1} W_{N}^{-1}\left(X^{\prime} Z\right)^{-1} X^{\prime} Z W_{N} Z^{\prime} y \\ &=\left(Z^{\prime} X\right)^{-1} Z^{\prime} y \end{aligned}\]
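
A tiny numerical check (my own sketch with simulated data) confirms this algebra: when \(R=K\), any positive definite \(W_N\) drops out and the quadratic-form solution coincides with \(\left(Z^{\prime} X\right)^{-1} Z^{\prime} y\).

# Numerical check (my own sketch): with R = K the weighting matrix cancels
set.seed(1024)
n <- 1000
z <- rnorm(n)
x <- 0.8 * z + rnorm(n)             # one regressor, one instrument
y <- 1 + 2 * x + rnorm(n)
X <- cbind(1, x); Z <- cbind(1, z)  # K = R = 2 (including the intercept)
W <- diag(2)                        # any positive definite weighting matrix

b_gmm <- solve(t(X) %*% Z %*% W %*% t(Z) %*% X) %*% t(X) %*% Z %*% W %*% t(Z) %*% y
b_iv <- solve(t(Z) %*% X) %*% t(Z) %*% y
cbind(b_gmm, b_iv)                  # identical up to numerical precision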

From the discussion in the previous posts, it is reasonable to always employ the EfficientGMM whenever possible, where the optimal weighting matrix is (up to scale) the inverse of the covariance matrix of the sample moments

\[\frac{1}{N} \sum_{i=1}^{N} e_{i} z_{i}\]

which depends on the assumptions we make about \(e_i\) and \(z_i\). The simplest (but least appealing) assumption is that \(e_i\) is IID (conditionally homoskedastic), so

\[W_{N}^{o p t}=\left(\frac{1}{N} \sum_{i=1}^{N} z_{i} z_{i}^{\prime}\right)^{-1}=\left(\frac{1}{N} Z^{\prime} Z\right)^{-1}\]

and the resulting estimator is

\[\hat{\beta}_{I V}=\left(X^{\prime} Z\left(Z^{\prime} Z\right)^{-1} Z^{\prime} X\right)^{-1} X^{\prime} Z\left(Z^{\prime} Z\right)^{-1} Z^{\prime} y\]

This is the expression found in most textbooks. The estimator is sometimes referred to as the generalized instrumental variables estimator (GIVE) or the two-stage least squares (2SLS) estimator. However, a more realistic and robust assumption for \(e_i\) should allow for heteroskedasticity, so the optimal weighting matrix should be adjusted accordingly. Please refer to Generalized Method of Moments (GMM) in R (Part 1 of 3) for a detailed discussion. The vital point is that under heteroskedasticity the EfficientGMM estimator is NOT equal to the 2SLS estimator, and the EfficientGMM estimator is preferred in terms of efficiency. On my Github specific examples are discussed. Suppose father's education is selected as the IV for education; one can easily get the IV estimator:

# 1 IV is provided
# IV estimator
# (assumptions: the mroz data set, e.g. from the wooldridge package,
#  is available; ivreg() comes from AER and %>% from dplyr/magrittr)
library(AER)   # provides ivreg()
library(dplyr) # provides the pipe %>%

iv_res <- mroz %>%
  ivreg(lwage ~ educ | fatheduc, data = .)

summary(iv_res)

and results are

Call:
ivreg(formula = lwage ~ educ | fatheduc, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0870 -0.3393  0.0525  0.4042  2.0677 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.44110    0.44610   0.989   0.3233  
educ         0.05917    0.03514   1.684   0.0929 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6894 on 426 degrees of freedom
Multiple R-Squared: 0.09344,	Adjusted R-squared: 0.09131 
Wald test: 2.835 on 1 and 426 DF,  p-value: 0.09294 
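
As a cross-check (my own sketch; it assumes the same mroz data frame with columns lwage, educ and fatheduc, and drops the rows with missing lwage just as ivreg does internally), the closed-form \(\left(Z^{\prime} X\right)^{-1} Z^{\prime} y\) reproduces the point estimates:

# Closed-form IV estimator (my own cross-check sketch)
dat <- subset(mroz, !is.na(lwage))         # keep the rows ivreg actually uses
X <- cbind(1, dat$educ)                    # regressors: intercept + educ
Z <- cbind(1, dat$fatheduc)                # instruments: intercept + fatheduc
solve(t(Z) %*% X) %*% t(Z) %*% dat$lwage   # should match the ivreg estimates above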

One can also get the same point estimates using gmm (the standard errors differ only slightly, due to the degrees-of-freedom correction that ivreg applies):

library(gmm) # assuming the gmm package is installed and attached

set.seed(1024)
gmm_iv_iid <- gmm(
  lwage ~ educ, # linear regression formula
  ~fatheduc, # formula specifying the instruments and exogenous variables
  rnorm(2), # starting values (t0)
  data = mroz,
  wmatrix = "optimal",
  vcov = "iid",
  optfct = "nlminb",
  control = list(eval.max = 10000)
)
summary(gmm_iv_iid)

and results are

Call:
gmm(g = lwage ~ educ, x = ~fatheduc, t0 = rnorm(2), wmatrix = "optimal", 
    vcov = "iid", optfct = "nlminb", data = mroz, control = list(eval.max = 10000))


Method:  twoStep 

Coefficients:
             Estimate  Std. Error  t value   Pr(>|t|)
(Intercept)  0.441103  0.445058    0.991114  0.321630
educ         0.059173  0.035060    1.687798  0.091450

J-Test: degrees of freedom is 0 
                J-test                P-value             
Test E(g)=0:    1.92112051932985e-30  ******* 

Similar to the discussion of OLS, one can easily get robust standard errors by

# Under heteroskedasticity
# the IV point estimates are preserved but
# White heteroskedasticity-robust standard errors are applied
# (vcovHC() comes from the sandwich package; library(sandwich) if not attached)
diag(vcovHC(iv_res, type = "HC0"))^0.5

or

# the gmm estimator with MDS assumed gives exactly these robust standard errors
set.seed(1024)
gmm_iv_mds <- gmm(
  lwage ~ educ, # linear regression formula
  ~fatheduc, # formula specifying the instruments and exogenous variables
  rnorm(2), # starting values (t0)
  data = mroz,
  wmatrix = "optimal",
  vcov = "MDS",
  optfct = "nlminb",
  control = list(eval.max = 10000)
)
summary(gmm_iv_mds)

and then

Call:
gmm(g = lwage ~ educ, x = ~fatheduc, t0 = rnorm(2), wmatrix = "optimal", 
    vcov = "MDS", optfct = "nlminb", data = mroz, control = list(eval.max = 10000))


Method:  twoStep 

Kernel:  Quadratic Spectral

Coefficients:
             Estimate  Std. Error  t value   Pr(>|t|)
(Intercept)  0.441103  0.464287    0.950067  0.342078
educ         0.059173  0.036943    1.601749  0.109211

J-Test: degrees of freedom is 0 
                J-test                P-value             
Test E(g)=0:    1.92112051932985e-30  *******  

Suppose another IV, mother's education, is added; then one can easily get the 2SLS estimator:

# More general case with 2 IVs: motheduc + fatheduc

tsls_res <- mroz %>%
  ivreg(lwage ~ educ + exper + expersq |
    exper + expersq + motheduc + fatheduc, data = .)

summary(tsls_res)

and results are


Call:
ivreg(formula = lwage ~ educ + exper + expersq | exper + expersq + 
    motheduc + fatheduc, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0986 -0.3196  0.0551  0.3689  2.3493 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept)  0.0481003  0.4003281   0.120  0.90442   
educ         0.0613966  0.0314367   1.953  0.05147 . 
exper        0.0441704  0.0134325   3.288  0.00109 **
expersq     -0.0008990  0.0004017  -2.238  0.02574 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6747 on 424 degrees of freedom
Multiple R-Squared: 0.1357,	Adjusted R-squared: 0.1296 
Wald test: 8.141 on 3 and 424 DF,  p-value: 2.787e-05 
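
For completeness, the closed-form 2SLS expression from above can also be evaluated by hand (my own sketch, under the same assumptions about the mroz columns as before):

# Closed-form 2SLS (my own cross-check sketch), over-identified case R = 5 > K = 4
dat <- subset(mroz, !is.na(lwage))
X <- cbind(1, dat$educ, dat$exper, dat$expersq)
Z <- cbind(1, dat$exper, dat$expersq, dat$motheduc, dat$fatheduc)
PZ <- Z %*% solve(t(Z) %*% Z) %*% t(Z)                    # projection onto the instrument space
solve(t(X) %*% PZ %*% X) %*% t(X) %*% PZ %*% dat$lwage    # should match the ivreg estimates above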

Similarly, EfficientGMM under homoskedasticity is

set.seed(1024)
gmm_tsls_iid <- gmm(
  lwage ~ educ + exper + expersq, # linear regression formula
  ~exper + expersq + motheduc + fatheduc, # formula specifying the instruments and exogenous variables
  rnorm(5),
  data = mroz,
  wmatrix = "optimal",
  vcov = "iid",
  optfct = "nlminb",
  control = list(eval.max = 10000)
)
summary(gmm_tsls_iid)

and results are

Call:
gmm(g = lwage ~ educ + exper + expersq, x = ~exper + expersq + 
    motheduc + fatheduc, t0 = rnorm(5), wmatrix = "optimal", 
    vcov = "iid", optfct = "nlminb", data = mroz, control = list(eval.max = 10000))


Method:  twoStep 

Coefficients:
             Estimate     Std. Error   t value      Pr(>|t|)   
(Intercept)   0.04810031   0.39845299   0.12071764   0.90391468
educ          0.06139663   0.03128945   1.96221499   0.04973746
exper         0.04417039   0.01336956   3.30380314   0.00095383
expersq      -0.00089897   0.00039980  -2.24852479   0.02454275

J-Test: degrees of freedom is 1 
                J-test   P-value
Test E(g)=0:    0.37807  0.53864

Initial values of the coefficients
  (Intercept)          educ         exper       expersq 
 0.0481003069  0.0613966287  0.0441703929 -0.0008989696 

On the other hand, EfficientGMM under heteroskedasticity is

# However, a more efficient estimator exists:
# EfficientGMM under heteroskedasticity
set.seed(1024)
gmm_tsls_mds <- gmm(
  lwage ~ educ + exper + expersq, # linear regression formula
  ~exper + expersq + motheduc + fatheduc, # formula specifying the instruments and exogenous variables
  rnorm(5),
  data = mroz,
  wmatrix = "optimal",
  vcov = "MDS",
  optfct = "nlminb",
  control = list(eval.max = 10000)
)
summary(gmm_tsls_mds)

and results are

Call:
gmm(g = lwage ~ educ + exper + expersq, x = ~exper + expersq + 
    motheduc + fatheduc, t0 = rnorm(5), wmatrix = "optimal", 
    vcov = "MDS", optfct = "nlminb", data = mroz, control = list(eval.max = 10000))


Method:  twoStep 

Kernel:  Quadratic Spectral

Coefficients:
             Estimate     Std. Error   t value      Pr(>|t|)   
(Intercept)   0.04765346   0.42772970   0.11141022   0.91129106
educ          0.06105225   0.03316993   1.84059009   0.06568165
exper         0.04513614   0.01542081   2.92696239   0.00342290
expersq      -0.00093123   0.00042631  -2.18438828   0.02893373

J-Test: degrees of freedom is 1 
                J-test   P-value
Test E(g)=0:    0.44392  0.50524

Initial values of the coefficients
  (Intercept)          educ         exper       expersq 
 0.0481003069  0.0613966287  0.0441703929 -0.0008989696 
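
To wrap up the comparison, a short sketch (my own) puts the 2SLS results and the EfficientGMM results under MDS side by side, reusing the objects created above:

# Side-by-side comparison (my own sketch): 2SLS vs. EfficientGMM under MDS
round(cbind(
  tsls_coef = coef(tsls_res),
  tsls_se = diag(vcovHC(tsls_res, type = "HC0"))^0.5,  # robust SEs for 2SLS
  gmm_coef = coef(gmm_tsls_mds),
  gmm_se = diag(vcov(gmm_tsls_mds))^0.5
), 4)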

Summary

With this series on GMM, I hope we have also had a brief review of Econometrics. Nowadays, data scientists are highly valued in the job market, and most of the time prediction is the main task of their daily work. Causal inference may seem to be a luxury in industry, so why do we still need to bother with it or with the corresponding domain knowledge? Two aspects may answer this question. First, feature engineering is always needed, and domain knowledge is always helpful for meaningful transformations of the data. Second, from the viewpoint of project evolution, new variables need to be considered, and the hints inspiring those new variables depend on studies of the related causal inference. I guess Econometrics and Statistics are still the basis of modern machine learning; what do you think?

References