Backward-Incompatible Changes: Proposals

Lately, there's been some discussion of syntax changes that would lead to backward incompatibility. We all agree that if compatibility has to be broken, it must be broken only once. Hence, we need to gather every idea anyone may have on modifications to current commands and functions. This page aims to collect all the ideas in one place. Everyone's invited to read, comment, modify, add (you need to register first).

The plan is, roughly, to put out 1.7.7 shortly to close a few bugs that have surfaced recently and then proceed to a backward-incompatible release (2.0?).

Commands

A few commands could be eliminated on the grounds that their effect can be replicated by means of simple scripts. The ones that have emerged so far from the discussion are:

  • arch
  • coint
  • hsk

Allin noticed that we shouldn't overlook the mapping from GUI menu selections to script commands, which is a nice feature of gretl (imperfect as it is in some cases). The argument, "You could do this in just a few lines of scripting so why keep it?" kind of presumes that gretl users write scripts. Certainly, script writing should be encouraged, but as of now many users, presumably, don't have a clue.

We could map from a GUI selection such as "Heteroskedasticity-corrected" to a sequence of appropriate script commands, and scrap 'convenience' commands such as hsk, but getting that right would require some thought.

An intermediate possibility (very similar to an idea suggested by Ignacio in the past) would be packaging a 'standard' set of scripts which get included by default at startup, so that, for example, the "arch" command would be replaced by the following arch function:

function arch(series y, scalar order, list X)
    # initial OLS to obtain the residuals
    ols y X
    series u2 = $uhat^2
    # auxiliary regression of squared residuals on their lags
    ols u2 0 u2(-1 to -order)
    # the fitted values give the weights for WLS
    series w = 1/$yhat
    wls w y X
end function

This is roughly what Stata does to implement a sizeable part of its vocabulary.
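With such a package loaded at startup (and the built-in command retired, so the name is free), the function could be called like any other; for illustration, using one of the sample datasets:

open data4-1
list X = const sqft
arch(price, 1, X)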

Another general issue that Allin raised in reference to "pvalue" is that we currently have several 'instructions' (for want of a better word) that exist in both 'command' and 'function' form. (Try calling up both the online command reference and the online function reference side by side). Allin and Jack would be quite happy to trash the commands in favour of the functions, but how user-friendly is this? One may surmise that using a function demands a 'programmer's approach'; more syntax has to be memorized.

The commands that fall into this category are:

  • dummify
  • lags
  • ldiff
  • orthdev
  • pvalue
  • sdiff

Furthermore, there are two special cases: we have a "logs" command whose nearest corresponding function is "log", and the keyword "var", which means two completely different things as a function and as a command. More thought on this is needed.
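To illustrate the two special cases (here X and Y are lists, x a series):

# command: creates the log series l_x1, l_x2, ... from the list X
logs X
# function: natural log of a single series
series lx = log(x)

# command: estimates a fourth-order vector autoregression
var 4 Y
# function: sample variance of the series x
scalar v = var(x)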

ar/ar1

These two commands could probably be merged into one. However, this would require writing new code, not just removing old commands. "ar1" has options to use Cochrane-Orcutt, Hildreth-Lu or Prais-Winsten, while the more general (with regard to lag structure) "ar" command is restricted to "generalized Cochrane-Orcutt". Simply merging "ar1" into "ar" would mean trashing everything but C-O. Perhaps none of these estimators is very wonderful, but some people may use them for pedagogical purposes, at least.

A possible solution could be moving the --hilu and --pwe options to "ar", with the understanding that those options are only valid when the ar order is 1.
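Under that scheme, Hildreth-Lu estimation would be requested via something like:

ar 1 ; y 0 x --hilu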

arbond

We probably should not difference the dependent variable by default; in the future, a more general treatment of the whole matter will be needed, with the provision of orthogonal deviations and "System GMM" à la Blundell and Bond.

arch

Arguably, this is not needed anymore. Anyone who wants to run a regression with ARCH innovations is probably better off using the garch command instead. This command may have some pedagogical value, but then, a short script is probably even more effective.

Sven argued that this command could be scrapped only if truly identical results can be produced by garch. By this metric, we ought to keep it, since the results are not identical.

However, all the "arch" command does is package into one instruction a sequence of operations that can be very easily replicated via a script. For example, the command

arch 1 y X

is functionally equivalent to

ols y X
u2 = $uhat^2
ols u2 0 u2(-1)
w = 1/$yhat
wls w y X

If the data generating process is

<math> y_t = x_t \beta + \varepsilon_t </math>
<math> \sigma^2_t = \alpha_0 + \sum_{i=1}^p \alpha_i \varepsilon_{t-i}^2 </math>

then, under mild assumptions, the resulting estimator is consistent, though not efficient compared to QMLE. So the benefits of a dedicated command are at least dubious.

arima

This was discussed on the mailing list; Jack's original proposal is reported below.

Basically, we would estimate models of the form <math>A(L) (y_t - x_t \beta) = z_t \gamma + C(L) \epsilon_t</math>

where

<math>A(L) = (1 -\phi_1 L-...-\phi_p L^p) (1-\Phi_1 L^s-...-\Phi_P L^{(sP)}) \Delta^d \Delta_s^D</math>

and

<math>C(L) = (1 + \theta_1 L + ... + \theta_q L^q) (1 + \Theta_1 L^s + ... + \Theta_Q L^{(sQ)})</math>

with the syntax

arima p d q ; P D Q ; y Xlist ; Zlist

Note that the <math>A(L)</math> polynomial contains the difference operators and is applied to <math>y_t</math> and to <math>x_t</math>. This should cover all the concerns raised by Ignacio. By the way, we'd get rid of the unpleasant asymmetry between conditional ML (least squares) and full ML.
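For example, under the proposed syntax an "airline"-type model in which x is differenced along with y and z enters in levels would be written as:

arima 0 1 1 ; 0 1 1 ; y x ; z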

boxplot

It'd be nice to extend the boxplot command to the syntax

boxplot y ; x

where x is a discrete variable, and show boxplots conditional on each value of x.

As Sven noticed, this change would not be backward-incompatible per se, since

boxplot y

would still work. However, under the new syntax it would make little sense to use constructs like

boxplot y (x=0) y (x=1) 

so they may be deprecated.

chow

It would be nice to be able to specify a dummy variable as the parameter, so that it becomes easy to perform a Chow test for, e.g., men vs. women and so on.

Sven's comment: Is this really backward-incompatible or just a new feature?

Allin: a new feature, I'd say, but a worthwhile one.

Allin, later: 'Tis done.
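For the record, the implemented syntax runs along these lines (assuming a 0/1 dummy named male):

ols wage const educ exper
chow male --dummy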

coeffsum

In some cases, it may be interesting to test the equality of the sum of certain coefficients to some nonzero value. For example, in the ADL(1,1) model

<math>y_t = \alpha y_{t-1} + \beta_0 x_t + \beta_1 x_{t-1} + u_t</math>

the hypothesis <math>\alpha + \beta_0 + \beta_1 = 1</math> leads to the so-called homogeneous ECM. It'd be very handy to use

coeffsum 1 y_1 x x_1

On the other hand, the same could be accomplished without breaking backwards compatibility via a construct of the type

coeffsum y_1 x x_1 ; 1

Allin: I see no point in extending coeffsum, since we have the much more flexible "restrict". If there's any point in keeping coeffsum, it is as a very simple convenience command for people who might be put off by "restrict".
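For reference, the hypothesis above can be handled via "restrict" as follows:

ols y const y_1 x x_1
restrict
    b[y_1] + b[x] + b[x_1] = 1
end restrict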

coint and coint2

The "coint" command could perhaps be eliminated, on the grounds of the "do-it-via-a-script" argument. In case we do, we may rename "coint2" to "coitest".

Some have expressed their disagreement with this: Sven pointed out that the Johansen test is asymptotically optimal only if all assumptions about the DGP are met, which should include Gaussian innovations, and that the Johansen test is known to suffer from severe small-sample distortions. For these and other reasons it is perfectly reasonable to have a variety of cointegration tests. Clearly, our array of tests should be as large as possible (Andreas also agrees on this), although this should primarily be achieved through user-contributed function packages.

Moreover, Allin asked whether it is established that the Johansen test dominates the Engle-Granger test. If not, having the E-G test available via a single command is probably worthwhile. Despite the fact that the Johansen test is generally regarded as the "industry standard", the E-G approach does have its merits (see a recent paper by Bayer and Hanck).

An alternative, suggested by Sven, is to consolidate 'coint' and 'coint2' into a single 'coint' command with options that determine which test would be used. So "coint" would do what coint2 does now, while the Engle-Granger syntax would become something like:

coint myvar1 myvar2 --eg

If it is decided that some additional cointegration test should be elevated from user-contributed status to a built-in thing, this would just become a new option:

coint myvar1 myvar2 --best-cointegration-test-ever

This would also open up the possibility of using the 'coint' command for panel data via panel cointegration tests (think coint --ips or coint --lls) in the future.

criteria

Is this needed anymore? Jack suggested it can be removed; Sven agreed, arguing that this sort of thing should only be done via functions and/or accessors, not commands.

Allin agrees.

endif/endloop

We may want to deprecate these in favour of "end if" and "end loop". The advantages are unclear, though.

Allin: As of now we have in fact deprecated "end if" and "end loop" in favor of the versions without spaces. Rationale: if/endif and loop/endloop are nestable syntactical constructions (similar in spirit to if/fi, case/esac in bash). They can be distinguished from non-nestable blocks such as nle, mle, gmm, system and restrict. I feel that "if" and "loop" merit specific terminators, not just "end <something>". I'd vote for dropping support for "end if" and "end loop", to cut down on the checking-of-aliases overhead.

Jack: that would break almost all my existing scripts, but keeping the two alternate forms doesn't really make sense. IMO Allin makes a very good case for dropping the "split" version.
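For the record, the nesting in question looks like this:

loop i=1..10
    if i % 2 == 0
        printf "%d is even\n", i
    endif
endloop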

garch

The parameters "p" and "q" should be swapped; the only reason why we've been using parameter 1 for the GARCH component and parameter 2 for the ARCH component is due to the idiosyncrasies of the code by Fiorentini et al on which our original garch routines were based.

We also ought to handle exogenous regressors for the variance equation. This could be done by allowing an extra list at the end of the command, so that the model

<math> y_t = \pi_0 + \pi_1 x_t + u_t </math>
<math> \sigma^2_t = \omega + \alpha u_{t-1}^2 + \beta \sigma^2_{t-1} + \lambda z_t </math>

would be handled by

garch 1 1 ; y const x ; z

Finally, we should either not include the constant by default in the mean equation, or provide a "--nc" option.

Who agrees: Sven

Allin writes: I've now implemented an "--nc" option for garch, and added a cli "--fcp" option to use the Fiorentini et al code. But I disagree about switching p and q. It's not only FCP, but also Bollerslev, Davidson and MacKinnon and many others who write GARCH(p,q) to indicate order p for the "G" component and order q for the regular ARCH. The list includes Wikipedia (FWIW) and most GARCH-related articles and lecture notes that I came across via JSTOR and googling respectively.

genr

This may be a bit radical, but it could be a good idea to get rid of the "genr" command in favour of its specialised aliases "scalar", "series" and so on, to encourage good programming practice.

Who agrees: Sven (in principle, but it _is_ radical)
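That is, instead of leaving the type to be inferred via "genr", one would declare it explicitly:

genr z = 3            # type of z is inferred
scalar a = 3          # explicitly a scalar
series s = log(y)     # explicitly a series
matrix M = I(3)       # explicitly a matrix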

graph/plot

These two commands could probably be merged into one.

Who agrees: Sven

hausman/hccm

Are these commands still necessary? The Hausman test is output automatically whenever needed (and can be retrieved via "$hausman"), while "hccm" could be subsumed under "ols --robust".

Who agrees: Sven

Allin: I'd happily get rid of hccm (currently it's an alias).

hsk

Jack argued that a simple script can do the same in a few lines and most users could live without this. Sven disagrees; furthermore, the general point raised by Allin on the GUI->script mapping applies.
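For reference, a script along the lines of what "hsk" does might look like this (assuming X is a list of regressors including the constant; the built-in command may differ in detail):

ols y X
series lu2 = log($uhat^2)
# auxiliary regression for the log of the skedastic function
ols lu2 X
# weights are the reciprocals of the fitted variances
series w = 1/exp($yhat)
wls w y X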

lags

Nowadays, a number of syntax constructs are available that make this command probably redundant. Perhaps it could be removed.
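For example, lag ranges can now be written inline:

list LX = x(-1 to -4)    # lags 1 to 4 of x
ols y const y(-1) LX     # lags may also appear directly in a command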

lmtest

This command has expanded over time and now comprises methods that are not, strictly speaking, LM tests. It could be renamed to "diagtest" or something like that.

Who agrees: Sven

Allin: It's now called "modtest". I can't remember right now if "lmtest" is still an alias, but if so I'd happily get rid of it.

normtest/testuhat

Two more candidates for a merger. We could arrange things so that "normtest" without arguments implicitly applies to $uhat.

Who agrees: Sven (if I understand correctly)
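That is, something like the following (the no-argument form being the new part):

ols y 0 x
normtest            # proposed: applies to $uhat
normtest x --jbera  # as now: Jarque-Bera test on the series x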

restrict

The issue of nonlinear restrictions has popped up several times on the mailing lists. One possibility could be that, inside a "restrict" block, "b" becomes an alias for "$coeff" and the testing machinery works more or less like this:

restrict
   [ genr statement ] = [ fixed value ]
   [ genr statement ] = [ fixed value ]
   ...
end restrict

That is, inside the restrict block we accept one or more lines. Each line must have the following structure: on the left-hand side, any valid genr statement which returns a scalar or a column vector (including user-written functions); an equals sign; and a fixed expression on the right, matching the left-hand side in dimension. Then, the Jacobian is computed numerically and the test is carried out via the delta method.
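For example, a nonlinear hypothesis such as <math>\beta_2 \beta_3 = 1</math> would be expressed as:

restrict
    b[2] * b[3] = 1
end restrict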

Possible problems:

  • It becomes difficult, if not impossible, to present the restricted model. In other words, "restrict" becomes essentially a testing device. This may become a problem in those cases when the main object of interest is not the test in itself, but rather the restricted model (as in simultaneous systems or VECMs)
  • What happens if a vector "b" already exists?
  • How do we handle errors in the genr statements?

This proposal combines a new feature with some elements of backward incompatibility, since under the new scheme

restrict
   b[1] = 1
end restrict

would keep working, but the alternative forms

restrict
   b1 = 1
end restrict

and

restrict
   b[myvar] = 1
end restrict

would not.

Allin: Is this still a live issue given the support we now have for nonlinear restrictions in "restrict"?

rhodiff

This command is a relic of the days when "genr" (or whatever we call it) was much more limited. Chances are, it can be safely removed.

Allin: And indeed, it has been removed.

summary

In many cases (notably, cross-section data), the number of missing observations can be even more important to know than the descriptive statistics themselves. True, we have

sum(ok(x))
sum(missing(x))

But it'd be nice to have that in the output of "summary" with more than one variable (with one variable, the number of valid cases is displayed in the header). The problem is, adding one item means revolutionising the output format. What should we do?
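In the meantime, the counts can be scripted, e.g. for a list X:

loop foreach i X
    printf "%12s: %d valid, %d missing\n", "$i", sum(ok($i)), sum(missing($i))
endloop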

Functions

In this case, there are a few changes that could be made without breaking backwards compatibility if we allowed for optional arguments in built-in functions.

Various descriptive statistics

It may be worth simplifying our range of descriptive statistics by treating matrices "by column":

sums and means

We currently have mean(), meanc() and meanr(). We may kill meanc() and meanr() and extend mean() to matrices. Same goes for sum() and sumc().

Of course, the equivalent to meanr(X) would become mean(X')', which is not nice.

second moments

It's probably cleaner to redefine cov() to accept series, list or matrix arguments. If the argument is a series, it would return a scalar; for lists and matrices, a matrix would be returned, thereby making cov() functionally equivalent to what mcov() currently does with matrices. Clearly, in this case mcov() becomes redundant. A similar argument goes for corr().

By the same logic, sdc() could be incorporated into sd().

quantiles and order statistics

Again, by the same logic, minc(), maxc() and so on may go. Moreover, nothing prevents us from extending median() and quantile() to matrices; note that this would not be backward-incompatible.

BFGSmax

It should be possible to provide analytical derivatives if available. The issue is: how?

Proposal: I think the easiest solution would be to extend the mle command with an option, e.g. --only-maximize. When this option is supplied to an mle block, only the BFGS maximization is done, without calculating p-values at the end. Analytical derivatives could then be supplied as in an ordinary mle block. (Chris)

Allin: BFGSmax does now support analytical derivatives.

bkfilt/hpfilt

There was a time when gretl functions could not have more than one argument. As a result, the parameters to the Baxter-King and Hodrick-Prescott filters have to be adjusted via the "set" mechanism. It'd be much neater if one could write

hp_x = hpfilt(x, lambda)

and do away with the set variables "bkbp_limits", "bkbp_k" and "hp_lambda".

With optional arguments, backward compatibility would be retained, to some extent: for example, the HP filter could default to 100 * squared periodicity, so

hp_x = hpfilt(x)

would produce the same result as it does now, provided "hp_lambda" is set to "auto".

Who agrees: Sven

Also Allin. As of now we do have the required extra arguments to both hpfilt and bkfilt, so cleaning up redundant entries under "set" would be fine by me.

dummify

Another case for optional arguments:

X = dummify(x)

may return the same as now, but

X = dummify(x, 3)

would take "3" as the reference category

Allin: that's now done.

Syntax

quotation marks

There are a lot of places in the gretl script language where strings are passed without quoting them, i.e. without putting " around the string. One example is the store command, where the filename argument can be supplied unquoted. If the filename is held in a string variable (say, outputfile), then the variable can only be passed to the store command via a macro (@outputfile). I think that the usage of macros should be discouraged, as it makes script code harder to read, but currently the problem described can only be solved by using macros. So I propose that all strings in gretl script should be enclosed in ", and otherwise be treated as variable names. (Chris)
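For example:

string outputfile = "results.gdt"
store results.gdt x y      # current: unquoted literal filename
store @outputfile x y      # current: macro needed to use the string variable
store outputfile x y       # proposed: outputfile read as a string variable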
