DESMAT: Design Matrix Techniques for Categorical Models Using SAS/IML

Direct all comments to:

John Hendrickx

Click here to download a zip file containing the macros and a sample program.

Click here for a pdf version of this document (requires Adobe Acrobat Reader).

Contents

Introduction

This following is a description of DESMAT, a set of SAS macros for generating a design matrix using PROC IML. DESMAT allows different types of parametrizations, such as the difference contrast, helmert contrast, polynomial contrast, to be used for categorical variables in the model. The design matrix can be further manipulated outside DESMAT, then exported to a SAS dataset to be used in a SAS procedure. The macros are designed to modify the output of PROC GENMOD (Generalized Linear Models) to provide information on the type of parametrization and effect labels.

These macros were developed during the course of my PhD research, in order to develop special models for square tables (Hendrickx 1994). An account of its predecessor, GLIMMAT, was presented at SEUGI 92 (Hendrickx 1992).

DESMAT contains the following macros:

%GL
This macro is used in the data step to generate classification variables when tables data is being used.
%DESMAT
This macro is to be used within PROC IML. It creates a design matrix based on a model statement, with the option of specifying special parametrizations for each variable in each term. Parametrizations include the deviation contrast, simple contrast, difference contrast, polynomials, and user defined contrasts. The design matrix can be manipulated using IML commands, then exported and used within SAS procedures.
%EXPORT
Used in PROC IML to export the design matrix to a SAS data set. It also defines macro variables to facilitate the use of the design matrix in SAS procedures.
%PARMS
Contains the names of the variables created by %EXPORT, for use in a SAS procedure.
%RESULTS
For use in the model statement of PROC GENMOD. This uses PROC IML and the Output Delivery System options of PROC GENMOD to enhance the output with information on parametrization.

The macros are contained on the following files:

IMLDES.SAS
This program must be run once before using the %DESMAT macro for the first time. It creates a set of user defined IML modules in the storage library DESMAT.IMLDES.
DESMAT.SAS
Contains the %GL, %DESMAT, %EXPORT, %PARMS and %RESULTS macros.

The following sample program illustrates some of the features of DESMAT for creating restricted designs for square tables. Note that most of these models can be fitted much more simply by creating suitable variables in the datastep.

On parametrizations

When a categorical variable is incorporated in a model, a parameter is defined for each of its categories. Even in the simplest possible model of a grand mean and the effects of a single variable, the model is not identified because of linear dependencies in the design matrix. The model can be estimated nevertheless by usign a generalized inverse, which lets the computer program resolve the linear dependencies. A second method is to impose restrictions of some sort on the parameters of the categorical variables, and in this way remove the linear dependencies. This exposes the root of the problem: in order to identify the model, the parameters have to be redefined as being relative in some sense: relative to a certain reference category for example, or relative to the previous category. A model without such restrictions is in fact incorrectly stated: it asks about absolute rather than relative parameter values, and that question cannot be resolved.

The desired restrictions can be imposed using a contrast matrix (Bock 1975, Finn 1974). Using a contrast matrix, the identifiable parameters beta[i] are defined as linear combinations of the unidentifiable model parameters theta[j] (j = 1 to k, i < j) of a variable with k categories:

[1]: beta[i]=sum for j=1 to k {c[i,j]*theta[j]}

In matrix notation, this would be represented as:

[2]: ><strong>BETA</strong>=<strong>C</strong>*<strong>THETA</strong>
<p>(ignore this)

If A is a model matrix of less than full rank corresponding with the theta[j] parameters, and X is the full rank design matrix corresponding with the beta[i] parameters, then the following relationship will hold:

[3]: ><strong>A</strong>=<strong>X</strong>*<strong>C</strong>
<p>(ignore this)

Given a model matrix A and a contrast matrix C, the full rank design matrix X can therefore be derived by:

[4]: ><strong>X</strong> =
<strong>A</strong>*T(<strong>C</strong>)*inv(<strong>C</strong>*T(<strong>C</strong>))
<p>(ignore this)

For example, the simple contrast lets parameters be relative to a certain contrast category q. The identifiable parameters are equal to the corresponding model parameters, minus the model parameter for the contrast category:

[5]: beta[i]=theta[i]-theta[q], for i=1 to k, i ne q

In this case, equation (2) for a variable with 3 categories and letting the first category be the contrast category would be:

[6]: 
><pre>
|beta[2]|   |-1  1  0| |theta[1]|
|beta[3]| = |-1  0  1| |theta[2]|
                       |theta[3]|
</pre>
<p>(ignore this)

The %DESMAT macro starts by creating a model matrix for the variable in the model statement, using the IML function design. This is transformed into a full rank design matrix by creating a contrast matrix using the appropriate user defined function and using equation [4]. For interaction effects, a full rank design matrix with the specified parametrization is generated for each variable separately. The design matrix for the interaction term is the created using the IML function hdir (horizontal direct product).

The type of contrast used does not affect the model fit. The advantage of using contrasts is that parameters can be defined to be relative in a sense most suitable to the application. Of course, often individual parameters are not interpreted, and only the contribution of a term to the model fit is of interest. It is also possible in some cases to transform parameters with one contrast into parameters with another. This can be done for highest order effects; if lower order effects are to be transformed, then the values that are deducted from higher order effect parameters must be added to lower order effect parameters (Long 1984). Directly specifying the design matrix with the desired contrasts is a good deal simpler.

Several types of contrasts are:

deviation contrast
This is the usual ANOVA restriction, where the sum of parameters over a variable is equal to zero. For a single variable, each identifiable parameter is equal to its corresponding model parameter, minus the average of the model parameters. One parameter per factor is redundant, since the sum of the parameters is zero. This reference category parameter can be dropped for estimation. Parameters are relative to each other.
simple contrast
Each identifiable parameter is equal to its corresponding model parameter, minus the model parameter in a reference category. Parameters are relative to the reference category.
dummy contrast
This is the restriction used in the GLIM 3.77 program. A model parameter in the reference category is simply dropped. The remaining identifiable parameters are equal to the corresponding model parameters, and the corresponding columns of the design matrix are dummy variables. This contrast is equivalent to the simple contrast: the parameter estimates are identical, but the overall parameter (grand mean) is different. Parameters are relative to the reference category. The generalized inverse used by PROC GENMOD amounts to using the dummy contrast with the highest category as reference category.
backward difference contrast
Each identifiable parameter is equal to the corresponding model parameter, minus the model parameter for the previous category. This contrast will usually be used for factors with ordered categories. Note that in a loglinear model, interaction effects using this contrast are equal to log odds- ratios. Many models with restricted interaction terms used in mobility research are based on this model. Parameters are relative in the sense that only differences are examined.
forward difference contrast
Same as backward difference contrast, except that each identifiable parameter is equal to the corresponding model parameter, minus the model parameter for the next category.
helmert contrast
This contrast assumes ordered categories. Each identifiable parameter is equal to the corresponding model parameter, minus the average of the remaining categories. The last category is of course dropped. This contrast is orthogonal: the inner product of any two rows of the contrast matrix or any two columns of the design matrix is zero. Each parameter is relative to the mean of the remaining parameters.
orthogonal polynomial contrast
This contrast assumes ordered categories. The effects of a variable are decomposed into a linear effect, a quadratic effect, etc., so that the effects of a factor with r levels are modelled as a r-1 degree polynomial. The contrast matrix for the orthogonal polynomial contrast uses the IML function orpol, which generates the design matrix for a single factor plus the main effect. The first column of this matrix is removed, and its transpose is used as the contrast matrix.

The %GL macro

The %GL macro (generate levels) is for use in a data step. It generates a new variable with k levels, in blocks of b. The specification is:

variable=%gl(k,b);

The %GL macro can be used to generate the classifying variables if only a table is available. For example, to generate the variables for a 3 by 4 table:

%include desmat;
data;
input freq @@;
row=%gl(3,4);
col=%gl(4,1);
cards;
-----

The %DESMAT macro

The %DESMAT macro constructs a design matrix based on a model statement, using the parametrization specified for each factor. Note that PROC IML must be started first before invoking the %DESMAT macro and that the IML storage library DESMAT.IMLDES must be created before running %DESMAT for the first time. The %DESMAT macro has the following options:

%DESMAT(data=&syslast,model=,para=,intrcpt=NO,add=NO);

Usually, the data=, model= and para= options will be specified. An example of a %DESMAT statement:

proc iml;
%desmat(data=first,
        model=row col group row*col col*group,
        para=dev dev dif sim(3)*sim(5) dev(3)*orp(2));

The options that can be specified in the %DESMAT macro are:

data= The default is the most recently created SAS dataset.
model=
The model is specified as model=term1 term2 ... termn, using at least one space to separate the model terms. Each term can be a main effect of a single variable, or an interaction effect of two or more variables. Interaction effects are specfied by by joining the variables with asterisks, e.g. var1*var2. Variables may be character or numerical. The variables are assumed to be of the class type. A direct effect variable can be specified by using 'DIR' in the para= statement.

If no model is specified, then no design matrix is generated, but the IML storage library is set to 'DESMAT.IMLDES', and certain macros are defined.

para=
The specification of parametrizations is analogous to the specification of the model. A parametrization option in the 'para=' statement should correspond with each variable in each term of the model statement. Parametrization for an interaction term can be specified as contrast1*contrast2. Only the first three characters of a parametrization option are significant, but values may be longer. If a variable is of the direct type rather than the categorical type, then the parameterization can be specified as 'DIR'.

For some parametrizations, a reference category refcat can be specified within brackets. If refcat is ommitted then the first category is used. If the contrast value is invalid, then the deviation contrast is used, with the lowest category as reference category. If no contrast is specified, then the deviation contrast is used, with the highest category as reference category (in this case, the IML function designf is used).

The options for the para= statement are:

dir
Specifies that the variable is of the direct type. Its values are directly incorporated as a column of the design matrix.
dev(refcat)
Deviation contrast.
sim(refcat)
Simple contrast
dum(refcat)
Dummy variable contrast
dif(dir)
Difference contrast, backward version is default. If the first character of dir is an F then the forward difference contrast is used, otherwise the backward difference contrast is used.
hel
Helmert contrast
orp(degree/metric)
The orthogonal polynomial contrast. Two suboptions are possible, separated by a slash '/' (no spaces). Degree specifies the order of the polynomial to be used, with a maximum of r-1 for a factor with r levels. For example, the parametrization 'orp(1)' will model a linear effect. Metric should be the name of an IML row vector containing the metric to be used instead of the category numbers,
bas
A basic model matrix. The design matrix will not be of full rank, but can be manipulated before parameter estimation takes place.
use(matrix)
A user defined contrast, where matrix specifes the name of the contrast matrix. See below for details.
intrcpt=
Specifying 'intrcpt=yes' adds an intercept term to the design matrix. The default is 'intrcpt=no'.
add=
Specifying 'add=yes' will concatanate the new design matrix to an existing one. Otherwise, the existing design matrix will be replaced. The default is 'add=no'.

%DESMAT creates the following matrices:

DESIGN
An n by m matrix containing the design matrix X.
LABEL
An m by 1 vector containing descriptive labels for the parameter estimates.
PZATION
An m by 1 vector containing descriptive labels of the contrast type used each parameter estimate.

Specifying a user defined contrast

A user defined contrast can be included by specifying

use(matrix) or
use(matrix/labels/parlabs)

Matrix specifies the name of a valid contrast matrix. Labels specifies the name of a character vector or scalar containing parameter effect labels. If it is not specified then %DESMAT uses &var(#x), where &var is the variable name and #x is the number of the effect. Parlabs specifies the name of a character vector or scalar containing parametrizion type labels. If it is not specified, then %DESMAT uses use(#y), where y refers to the omitted categories.

The following example shows how to specify a user defined contrast:

* simple contrast for a variable with 5 categories;
* category 1 is used as reference category;
c1={-1  1  0  0  0,
    -1  0  1  0  0,
    -1  0  0  1  0,
    -1  0  0  0  1};
labs='col2':'col5';
pars='simple(1)';

%desmat(data=mobile,model=rowvar colvar,
        para=use(c1) use(c1/labs/pars));

%DESMAT will check if the user-defined contrast matrix exists and is numeric, whether it has the correct dimensions, and whether it is non-singular. If it does not fulfill these conditions, then the deviation contrast will be used.

Specifing a partial design matrix

In some cases, the user may want to enter a partial design matrix directly. The name of this matrix with the prefix '!' can be entered as a term in the model statement. For example:

* create a design matrix using the deviation contrast for "rowvar";
* last category is reference category;
use mobile;
read all var{rowvar};
mydes=designf(rowvar};

%desmat(data=mobile,model=!mydes colvar, para=x sim);

Note that a parametrization must be specified as a placeholder for the user defined design matrix, but that its name is ignored. The parametrization option can be used to specify the effect labels and parametrization labels. This is specified as:

foo(labels/parlabs)

The name of the parametrization foo can be anything. Labels specifies the name of a character vector or scalar containing parameter effect labels. If it is not specified then %DESMAT uses &des(#x), where &des is the name of the specified design matrix and x is the number of each effect. Parlabs specifies the name of a character vector or scalar containing parametrizion type labels. If it is not specified, the %DESMAT uses 'specified'. An example of specifying these labels:

use mobile;
read all var{rowvar};
mydes=designf(rowvar};
labs='row1':'row4';
pars='deviance(5)';

%desmat(data=mobile,model=!mydes colvar,
para=any(labs/pars) sim);

Modifying the design matrix

After %DESMAT has been run, the program remains in PROC IML. The design matrix can be manipulated in several ways. To create new model terms, the macros %CLASS and %INTRCT are available:

%class(var=,para=);
var=
Name of a variable in the current data set (required).
para=
A parametrization option (see above).

The macro %CLASS creates three matrices: DESCL, LABCL and PARCL, with the design matrix, effects labels, and parametriztion type label. Use 'run addeff(descl,labcl,parcl)' to add these to DESIGN, LABEL and PZATION respectively.

The macro %INTRCT creates an interaction term.

%intrct(termpar=,intx=0);
termpar=
Specifies an interaction term (two-way or greater).
intx=
Specifying a number other than zero lets a previously defined term interact with the terms specified by 'termpar='. This will rarely be used.

The macro %INTRCT also creates three matrices: DESIN, LABIN and PARIN.

Any design matrix, and corresponding effect and parametrization labels can be added to the matrices DESIGN, LABEL and PZATION created by %DESMAT, by specifying:

run addeff(newdes,newlab,newpar);

An effect can be fixed to zero by deleting the column of the design matrix for that parameter. An equality constraint can be imposed on two parameters by adding a column to another, then deleting it. Columns can be deleted from DESIGN by specifying:

run deleff(delvec);

Delvec is a vector containing the numbers of the columns of DESIGN (and the corresponding rows of LABEL and PZATION) which are to be removed.

If there is suspicion that the design matrix is not full rank, then this can be tested and linear dependencies removed by specifying:

run fullrank;

If DESIGN is of full rank, then nothing will happen. If it is not, then the module will report which effects have been removed.

The %EXPORT macro

The %EXPORT macro is to be used in PROC IML after a design matrix called 'DESIGN' has been created using %DESMAT. %EXPORT creates a data set from the design matrix, then exits IML and adds the input data set to the new data set. The design matrix is represented by the variables 'PRM1--PRM&ncols', where '&ncols' is a global macro variable defined by %EXPORT containing the number of columns in the design matrix.

Furthermore, the %EXPORT macro sets the global macro variables '&_print_' to 'off' and '&_disk_' to on. Procedures such as PROC GENMOD which use the Output Delivery System (ODS) will suspend printing output, but results will be saved and can be printed using PROC OUTPUT. The %RESULTS macro will create datasets with certain portions of the output, then print them using PROC IML.

The %EXPORT macro will usually be used without arguments. It does have the following options:

%EXPORT(storage=temp,data=desmat);

storage=
Defines an IML storage library in which currently defined matrices are stored, for use later by %RESULTS. This option allows a permanent library to be defined.
data=
Defines the data set to which the design matrix is to be exported.

If the user only wants to export the design matrix any other matrix, then this can also be done using the IML user defined module 'export':

run export(matrix,dataset); The 'export' module will export any 'matrix' to a 'dataset', representing the columns of the matrix by the variables 'PRM1- -PRM&NCOLS' and defining the macro variable '&ncols' as the number of columns the matrix has.

The %PARMS macro

The %PARMS macro is for use in a procedure after having run the %EXPORT macro. It simply inserts 'PRM1--PRM&ncols'. For example:

proc iml;
%desmat(data=mobile,model=row col);
%export;

proc genmod;
model freq=%parms / dist=poisson %results;

In a case in which the variables 'row' and 'col' both have 5 categories, the design matrix will have 8 columns and %PARMS will insert 'PRM1--PRM8'. The user can specify any subset of these variables, of course. Note that model terms can be specified before or after using %parms in a model statement.

The %RESULTS macro

The %RESULTS macro is for use in the model statement of PROC GENMOD only. It completes the model statement and specifies several options for additional output. It uses 'make' statements to create datasets with sections of output. It then starts PROC IML, opens the storage library defined by %EXPORT, and presents the results, depending on user specified print options. The %RESULTS macro will accept the following arguments:

%results(ndec=3,print=,output=,outest=);

If no arguments are specified, then the model information, model fit and parameter estimates will be printed.

ndec=
The number of decimals to be used in printing the results
print=
A number of options can be specified, separated by spaces.
NOMODINFO
Supresses printing the model information
ITER
Prints the iteration history (the last evaluation of the gradient and of the hessian are not printed).
NOMODFIT
Suppresses printing model fit information.
BIC
Enhances fit information with the BIC coefficient, the misclassification, and the percentage misclassification.
NOPARM
Suppresses printing the parameter estimates
FITTED
A selection of the observation statitics. The observed values, predicted values, (raw) residuals and deviance chi-square residuals are printed.
FITALL
All observation statistics.
COV
The covariance matrix of the parameter estimates.
COR
The correlation matrix of the parameter estimates.
DIFF
A matrix of differences between parameters.
STD
A matrix of the standard errors of these differences.
TRATIO
A matrix of the t-ratios of these differences.
PROB
A matrix of the probabilities of these differences, assuming normality.
output=
Name of a SAS dataset to which the observed statistics (obstats) are to be written. The dataset will contain the following four variables: "Yvar1", "Pred", "Resraw", "Resdev".
outest=
Name of a SAS dataset to which the parameter estimates (parmest) are to be written. The dataset will contain the following eight variables: "NR", "PARM", "DF", "ESTIMATE", "STDERR", "CHISQ", "PVAL".

Accessing model information

The %RESULTS macro creates the following data sets in 'MAKE' statements: 'modinfo', 'modfit', 'parmest', 'obstats', 'cov', 'corr' and 'iterparm' (the last two are only created in the relevant print= options have been specified). The %RESULTS macro does not convert any variables into permanent matrices, since the data sets remain available for the duration of the session. The parameter estimates are written to a new data set called 'parmest2'. If 'PRINT=BIC' has been specified, then the modified fit statistics are written to the data set 'bic'.

After the %RESULTS macro is completed, the program remains in PROC IML. Portions of the output can be manually reprinted using the following modules:

The number of decimals can be controlled by specifying the IML scalar '_NDEC_'. A quoted character string must be specified as argument for PARADIF. The differences between parameter estimates will be printed in any case, and if the string contains 'STD', 'TRA' or 'PRO' then the standard errors, t- ratios and probability values of these differences are printed as well (upper or lower case is ignored).

The module ANYSET does not accept arguments, since this would complicate passing on IML variables to other modules it calls. It is controlled by specifying the following IML variables:

_DSET_
The data set to be printed. The default is the last data set.
_HEAD_
A header. The default is 'Data set is' _dset_.
_COLS_
A vector containing the column numbers to be printed.
_ROWS_
A vector containing the row numbers to be printed.

For example, the %RESULTS macro prints the brief version of the observed statistics by:

_dset_='obstats';
_head_='Observation Statistics';
_cols_={1 2 8 10};
run anyset;

Note on PRINT=BIC

When the option 'PRINT=BIC' is specified, then the model fit statistics are enhanced with three extra measures: BIC, misclassification, and percentage misclassification. BIC (Bayesian Information Criterion) is alternative measure of fit developed by Raftery (1983) for large tables. BIC can be approximated by:

BIC=L**2-df*log(N)

Where 'L**2' is the deviance of the model, 'df' is the degrees of freedom, and 'N' is the sample size.

A BIC of zero or less is preferable to the saturated model. In a comparison of two or more models, the one with the lowest BIC is preferable.

Another method of evaluating the adequacy of a model is by examining the percentage of misclassification (delta). The amount of misclassification is calculated from the values of the dependent variable and the predicted valutes by:

misclass=sum(abs(depend-pred))/2;

Delta can then be calculated by:

delta=misclass*100/sum(depend);

References

Bock, R. Darrel. (1975).
"Multivariate statistical methods in behavioral research." New York: McGraw-Hill.
Finn, Jeremy D. (1974).
"A general model for multivariate analysis." New York: Holt, Rinehart and Winston, Inc.
Hendrickx, John. (1992).
"Using SAS macros and PROC IML to create special designs for Generalized Linear Models." Pp. 634- 655 in SAS Institute, SEUGI '92. Proceedings of the SAS European Users Group International Conference, May 19-22, 1992.
Hendrickx, John. (1994).
"The analysis of religious assortative marriage. An application of design matrix techniques for categorical models." Dissertation Nijmegen University. Amsterdam: Thesis Publishers.
Long, J. Scott. (1984).
"Estimable functions in log-linear models." Sociological Methods & Research 12:339-342.
Raftery, Adrian E. (1986).
"Choosing models for cross-classifications: comment on Grusky and Hauser, ASR, February 1984." American Sociological Review 51: 145-146.

[Home]