Direct all comments to:
John HendrickxClick here to download a zip file containing the macros and a sample program.
Click here for a pdf version of this document (requires Adobe Acrobat Reader).
This following is a description of DESMAT, a set of SAS macros for generating a design matrix using PROC IML. DESMAT allows different types of parametrizations, such as the difference contrast, helmert contrast, polynomial contrast, to be used for categorical variables in the model. The design matrix can be further manipulated outside DESMAT, then exported to a SAS dataset to be used in a SAS procedure. The macros are designed to modify the output of PROC GENMOD (Generalized Linear Models) to provide information on the type of parametrization and effect labels.
These macros were developed during the course of my PhD research, in order to develop special models for square tables (Hendrickx 1994). An account of its predecessor, GLIMMAT, was presented at SEUGI 92 (Hendrickx 1992).
DESMAT contains the following macros:
The macros are contained on the following files:
The following sample program illustrates some of the features of DESMAT for creating restricted designs for square tables. Note that most of these models can be fitted much more simply by creating suitable variables in the datastep.
When a categorical variable is incorporated in a model, a parameter is defined for each of its categories. Even in the simplest possible model of a grand mean and the effects of a single variable, the model is not identified because of linear dependencies in the design matrix. The model can be estimated nevertheless by usign a generalized inverse, which lets the computer program resolve the linear dependencies. A second method is to impose restrictions of some sort on the parameters of the categorical variables, and in this way remove the linear dependencies. This exposes the root of the problem: in order to identify the model, the parameters have to be redefined as being relative in some sense: relative to a certain reference category for example, or relative to the previous category. A model without such restrictions is in fact incorrectly stated: it asks about absolute rather than relative parameter values, and that question cannot be resolved.
The desired restrictions can be imposed using a contrast
matrix (Bock 1975, Finn 1974). Using a contrast matrix, the
identifiable parameters
are defined as linear
combinations of the unidentifiable model parameters
(j = 1 to k, i < j) of a
variable with k categories:
[1]:
In matrix notation, this would be represented as:
[2]:
If A is a model matrix of less than full rank
corresponding with the
parameters, and X
is the full rank design matrix corresponding with the
parameters, then the following relationship will hold:
[3]:
Given a model matrix A and a contrast matrix C, the full rank design matrix X can therefore be derived by:
[4]:
For example, the simple contrast lets parameters be relative to a certain contrast category q. The identifiable parameters are equal to the corresponding model parameters, minus the model parameter for the contrast category:
[5]:
In this case, equation (2) for a variable with 3 categories and letting the first category be the contrast category would be:
[6]:
The %DESMAT macro starts by creating a model matrix for the variable in the model statement, using the IML function design. This is transformed into a full rank design matrix by creating a contrast matrix using the appropriate user defined function and using equation [4]. For interaction effects, a full rank design matrix with the specified parametrization is generated for each variable separately. The design matrix for the interaction term is the created using the IML function hdir (horizontal direct product).
The type of contrast used does not affect the model fit. The advantage of using contrasts is that parameters can be defined to be relative in a sense most suitable to the application. Of course, often individual parameters are not interpreted, and only the contribution of a term to the model fit is of interest. It is also possible in some cases to transform parameters with one contrast into parameters with another. This can be done for highest order effects; if lower order effects are to be transformed, then the values that are deducted from higher order effect parameters must be added to lower order effect parameters (Long 1984). Directly specifying the design matrix with the desired contrasts is a good deal simpler.
Several types of contrasts are:
The %GL macro (generate levels) is for use in a data step. It generates a new variable with k levels, in blocks of b. The specification is:
variable=%gl(k,b);
The %GL macro can be used to generate the classifying variables if only a table is available. For example, to generate the variables for a 3 by 4 table:
%include desmat; data; input freq @@; row=%gl(3,4); col=%gl(4,1); cards; -----
The %DESMAT macro constructs a design matrix based on a model statement, using the parametrization specified for each factor. Note that PROC IML must be started first before invoking the %DESMAT macro and that the IML storage library DESMAT.IMLDES must be created before running %DESMAT for the first time. The %DESMAT macro has the following options:
%DESMAT(data=&syslast,model=,para=,intrcpt=NO,add=NO);
Usually, the data=, model= and para= options will be specified. An example of a %DESMAT statement:
proc iml;
%desmat(data=first,
model=row col group row*col col*group,
para=dev dev dif sim(3)*sim(5) dev(3)*orp(2));
The options that can be specified in the %DESMAT macro are:
model=term1
term2 ... termn, using at least one
space to separate the model terms. Each term can be a
main effect of a single variable, or an interaction
effect of two or more variables. Interaction effects are
specfied by by joining the variables with asterisks, e.g.
var1*var2. Variables may be character or numerical. The
variables are assumed to be of the class type. A direct
effect variable can be specified by using 'DIR' in the
para= statement. If no model is specified, then no design matrix is generated, but the IML storage library is set to 'DESMAT.IMLDES', and certain macros are defined.
For some parametrizations, a reference category refcat can be specified within brackets. If refcat is ommitted then the first category is used. If the contrast value is invalid, then the deviation contrast is used, with the lowest category as reference category. If no contrast is specified, then the deviation contrast is used, with the highest category as reference category (in this case, the IML function designf is used).
The options for the para= statement are:
%DESMAT creates the following matrices:
A user defined contrast can be included by specifying
use(matrix) or
use(matrix/labels/parlabs)
Matrix specifies the name of a valid contrast matrix. Labels specifies the name of a character vector or scalar containing parameter effect labels. If it is not specified then %DESMAT uses &var(#x), where &var is the variable name and #x is the number of the effect. Parlabs specifies the name of a character vector or scalar containing parametrizion type labels. If it is not specified, then %DESMAT uses use(#y), where y refers to the omitted categories.
The following example shows how to specify a user defined contrast:
* simple contrast for a variable with 5 categories;
* category 1 is used as reference category;
c1={-1 1 0 0 0,
-1 0 1 0 0,
-1 0 0 1 0,
-1 0 0 0 1};
labs='col2':'col5';
pars='simple(1)';
%desmat(data=mobile,model=rowvar colvar,
para=use(c1) use(c1/labs/pars));
%DESMAT will check if the user-defined contrast matrix exists and is numeric, whether it has the correct dimensions, and whether it is non-singular. If it does not fulfill these conditions, then the deviation contrast will be used.
In some cases, the user may want to enter a partial design matrix directly. The name of this matrix with the prefix '!' can be entered as a term in the model statement. For example:
* create a design matrix using the deviation contrast for "rowvar";
* last category is reference category;
use mobile;
read all var{rowvar};
mydes=designf(rowvar};
%desmat(data=mobile,model=!mydes colvar, para=x sim);
Note that a parametrization must be specified as a placeholder for the user defined design matrix, but that its name is ignored. The parametrization option can be used to specify the effect labels and parametrization labels. This is specified as:
foo(labels/parlabs)
The name of the parametrization foo can be anything. Labels specifies the name of a character vector or scalar containing parameter effect labels. If it is not specified then %DESMAT uses &des(#x), where &des is the name of the specified design matrix and x is the number of each effect. Parlabs specifies the name of a character vector or scalar containing parametrizion type labels. If it is not specified, the %DESMAT uses 'specified'. An example of specifying these labels:
use mobile;
read all var{rowvar};
mydes=designf(rowvar};
labs='row1':'row4';
pars='deviance(5)';
%desmat(data=mobile,model=!mydes colvar,
para=any(labs/pars) sim);
After %DESMAT has been run, the program remains in PROC IML. The design matrix can be manipulated in several ways. To create new model terms, the macros %CLASS and %INTRCT are available:
%class(var=,para=);
The macro %CLASS creates three matrices: DESCL, LABCL and PARCL, with the design matrix, effects labels, and parametriztion type label. Use 'run addeff(descl,labcl,parcl)' to add these to DESIGN, LABEL and PZATION respectively.
The macro %INTRCT creates an interaction term.
%intrct(termpar=,intx=0);
The macro %INTRCT also creates three matrices: DESIN, LABIN and PARIN.
Any design matrix, and corresponding effect and parametrization labels can be added to the matrices DESIGN, LABEL and PZATION created by %DESMAT, by specifying:
run addeff(newdes,newlab,newpar);
An effect can be fixed to zero by deleting the column of the design matrix for that parameter. An equality constraint can be imposed on two parameters by adding a column to another, then deleting it. Columns can be deleted from DESIGN by specifying:
run deleff(delvec);
Delvec is a vector containing the numbers of the columns of DESIGN (and the corresponding rows of LABEL and PZATION) which are to be removed.
If there is suspicion that the design matrix is not full rank, then this can be tested and linear dependencies removed by specifying:
run fullrank;
If DESIGN is of full rank, then nothing will happen. If it is not, then the module will report which effects have been removed.
The %EXPORT macro is to be used in PROC IML after a design matrix called 'DESIGN' has been created using %DESMAT. %EXPORT creates a data set from the design matrix, then exits IML and adds the input data set to the new data set. The design matrix is represented by the variables 'PRM1--PRM&ncols', where '&ncols' is a global macro variable defined by %EXPORT containing the number of columns in the design matrix.
Furthermore, the %EXPORT macro sets the global macro variables '&_print_' to 'off' and '&_disk_' to on. Procedures such as PROC GENMOD which use the Output Delivery System (ODS) will suspend printing output, but results will be saved and can be printed using PROC OUTPUT. The %RESULTS macro will create datasets with certain portions of the output, then print them using PROC IML.
The %EXPORT macro will usually be used without arguments. It does have the following options:
%EXPORT(storage=temp,data=desmat);
If the user only wants to export the design matrix any other matrix, then this can also be done using the IML user defined module 'export':
run export(matrix,dataset); The 'export' module will export any 'matrix' to a 'dataset', representing the columns of the matrix by the variables 'PRM1- -PRM&NCOLS' and defining the macro variable '&ncols' as the number of columns the matrix has.
The %PARMS macro is for use in a procedure after having run the %EXPORT macro. It simply inserts 'PRM1--PRM&ncols'. For example:
proc iml; %desmat(data=mobile,model=row col); %export; proc genmod; model freq=%parms / dist=poisson %results;
In a case in which the variables 'row' and 'col' both have 5 categories, the design matrix will have 8 columns and %PARMS will insert 'PRM1--PRM8'. The user can specify any subset of these variables, of course. Note that model terms can be specified before or after using %parms in a model statement.
The %RESULTS macro is for use in the model statement of PROC GENMOD only. It completes the model statement and specifies several options for additional output. It uses 'make' statements to create datasets with sections of output. It then starts PROC IML, opens the storage library defined by %EXPORT, and presents the results, depending on user specified print options. The %RESULTS macro will accept the following arguments:
%results(ndec=3,print=,output=,outest=);
If no arguments are specified, then the model information, model fit and parameter estimates will be printed.
The %RESULTS macro creates the following data sets in 'MAKE' statements: 'modinfo', 'modfit', 'parmest', 'obstats', 'cov', 'corr' and 'iterparm' (the last two are only created in the relevant print= options have been specified). The %RESULTS macro does not convert any variables into permanent matrices, since the data sets remain available for the duration of the session. The parameter estimates are written to a new data set called 'parmest2'. If 'PRINT=BIC' has been specified, then the modified fit statistics are written to the data set 'bic'.
After the %RESULTS macro is completed, the program remains in PROC IML. Portions of the output can be manually reprinted using the following modules:
The number of decimals can be controlled by specifying the IML scalar '_NDEC_'. A quoted character string must be specified as argument for PARADIF. The differences between parameter estimates will be printed in any case, and if the string contains 'STD', 'TRA' or 'PRO' then the standard errors, t- ratios and probability values of these differences are printed as well (upper or lower case is ignored).
The module ANYSET does not accept arguments, since this would complicate passing on IML variables to other modules it calls. It is controlled by specifying the following IML variables:
For example, the %RESULTS macro prints the brief version of the observed statistics by:
_dset_='obstats';
_head_='Observation Statistics';
_cols_={1 2 8 10};
run anyset;
When the option 'PRINT=BIC' is specified, then the model fit statistics are enhanced with three extra measures: BIC, misclassification, and percentage misclassification. BIC (Bayesian Information Criterion) is alternative measure of fit developed by Raftery (1983) for large tables. BIC can be approximated by:
BIC=L**2-df*log(N)
Where 'L**2' is the deviance of the model, 'df' is the degrees of freedom, and 'N' is the sample size.
A BIC of zero or less is preferable to the saturated model. In a comparison of two or more models, the one with the lowest BIC is preferable.
Another method of evaluating the adequacy of a model is by examining the percentage of misclassification (delta). The amount of misclassification is calculated from the values of the dependent variable and the predicted valutes by:
misclass=sum(abs(depend-pred))/2;
Delta can then be calculated by:
delta=misclass*100/sum(depend);