Chapter 10

Multivariate analysis

Exploring relationships among three or more variables

In this chapter we shall be concerned with a variety of approaches to the examination of relationships when more than two variables are involved. Clearly, these concerns follow on directly from those of Chapter 8, in which we focused upon bivariate analysis of relationships. In the present chapter, we shall be concerned to explore the reasons for wanting to analyse three or more variables in conjunction; that is, why multivariate analysis is an important aspect of the examination of relationships among variables.

The basic rationale for multivariate analysis is to allow the researcher to discount the alternative explanations of a relationship that can arise when a survey/correlational design has been employed. The experimental researcher can discount alternative explanations of a relationship through the combination of having a control group as well as an experimental group (or through a number of experimental groups) and random assignment (see Chapter 1). The absence of these characteristics, which in large part derives from the failure or inability to manipulate the independent variable in a survey/correlational study, means that a number of potentially confounding factors may exist. For example, we may find a relationship between people’s self-assigned social class (whether they describe themselves as middle or working class) and their voting preference (Conservative or Labour). But there are a number of problems that can be identified with interpreting such a relationship as causal. Might the relationship be spurious? This possibility could arise because people on higher incomes are both more likely to consider themselves middle class and to vote Conservative. Also, even if the relationship is not spurious, does the relationship apply equally to young and old? We know that age affects voting preferences, so how does this variable interact with self-assigned social class in regard to voting behaviour? If the relationship differed between age groups, this would imply that the class–voting relationship is moderated by age. The problem of spuriousness arises because we cannot make some people think they are middle class and others working class and then randomly assign subjects to the two categories. If we wanted to carry out an experimental study to establish whether a moderated relationship exists whereby age moderates the class–voting relationship, we would use a factorial design (see Chapter 9). Obviously, we are not able to create such experimental conditions, so when we investigate this kind of issue through surveys, we have to recognise the limitations of inferring causal relationships from our data. In each of the two questions about the class–voting relationship, a third variable – income and age respectively – potentially contaminates the relationship and forces us to be sceptical about it.

The procedures to be explained in this chapter are designed to allow such contaminating variables to be discounted. This is done by imposing ‘statistical controls’ which allow the third variable to be ‘held constant’. In this way we can examine the relationship between two variables by partialling out and thereby controlling the effect of a third variable. For example, if we believe that income confounds the relationship between self-assigned social class and voting, we examine the relationship between social class and voting for each income level in our sample. The sample might reveal four income levels, so we examine the class–voting relationship for each of these four income levels. We can then ask whether the relationship between class and voting persists for each income level or whether it has been eliminated for all or some of these levels. The third variable (that is, the one that is controlled) is often referred to as the test factor (see, for example, Rosenberg, 1968), but the term test variable is preferred in the following discussion.

The imposition of statistical controls suffers from a number of disadvantages. In particular, it is only possible to control for those variables which occur to you as potentially important and which are relatively easy to measure. Other variables will constitute further contaminating factors, but the effects of these are unknown. Further, the time order of variables collected by means of a survey/correlational study cannot be established through multivariate analysis, but has to be inferred. In order to make inferences about the likely direction of cause and effect, the researcher must look to probable directions of causation (for example, education precedes current occupation) or to theories which suggest that certain variables are more likely to precede others. As suggested in Chapter 1, the generation of causal inferences from survey/correlational research can be hazardous, but in the present chapter we shall largely side-step these problems, which are not capable of easy resolution in the absence of a panel study.

The initial exposition of multivariate analysis will emphasise solely the examination of three variables. It should be recognised that many examples of multivariate analysis, particularly those involving correlation and regression techniques, go much further than this. Many researchers refer to the relationship between two variables as the zero-order relationship; when a third variable is introduced, they refer to the first-order relationship, that is the relationship between two variables when one variable is held constant; and when two extra variables are introduced, they refer to the second-order relationship, when two variables are held constant.

MULTIVARIATE ANALYSIS THROUGH CONTINGENCY TABLES

In this section, we shall examine the potential of contingency tables as a means of exploring relationships among three variables. Four contexts in which such analysis can be useful are provided: testing for spuriousness; testing for intervening variables; testing for moderated relationships; and examining multiple causation. Although these four notions are treated in connection with contingency-table analysis, they are also relevant to the correlation and regression techniques which are examined later.

Testing for spuriousness

The idea of spuriousness was introduced in Chapter 1 in the context of a discussion about the nature of causality. In order to establish that there exists a relationship between two variables it is necessary to show that the relationship is non-spurious. A spurious relationship exists when the relationship between two variables is not a ‘true’ relationship, in that it only appears because a third variable (often called an extraneous variable – see Rosenberg, 1968) causes each of the variables making up the pair. In Table 10.1 a bivariate contingency table is presented which derives from an imaginary study of 500 manual workers in twelve firms. The table seems to show a relationship between the presence of variety in work and job satisfaction. For example, 80 per cent of those performing varied work are satisfied, as against only 24 per cent of those whose work is not varied. Thus there is a difference (d1) of 56 per cent (that is, 80 − 24) between those performing varied work and those not performing varied work in terms of job satisfaction. Contingency tables are not normally presented with the differences between cells inserted, but since these form the crux of the multivariate contingency table analysis, this additional information is provided in this and subsequent tables in this section.

Table 10.1 Relationship between work variety and job satisfaction (imaginary data)

Figure 10.1 Is the relationship between work variety and job satisfaction spurious?

Could the relationship between these two variables be spurious? Could it be that the size of the firm (the test variable) in which each respondent works has ‘produced’ the relationship (see Figure 10.1)? It may be that size of firm affects both the amount of variety of work reported and levels of job satisfaction. In order to examine this possibility, we partition our sample into those who work in large firms and those who work in small firms. There are 250 respondents in each of these two categories. We then examine the relationship between amount of variety in work and job satisfaction for each category. If the relationship is spurious we would expect the relationship between amount of variety in work and job satisfaction largely to disappear. Table 10.2 presents such an analysis. In a sense, what one is doing here is to present two separate tables: one examining the relationship between amount of variety in work and job satisfaction for respondents from large firms and one examining the same relationship for small firms. This notion is symbolised by the double line separating the analysis for large firms from the analysis for small firms.

Table 10.2 A spurious relationship: the relationship between work variety and job satisfaction, controlling for size of firm (imaginary data)

What we find is that the relationship between amount of variety in work and job satisfaction has largely disappeared. Compare d1 in Table 10.1 with both d1 and d2 in Table 10.2. Whereas d1 in Table 10.1 is 56 per cent, implying a large difference between those whose work is varied and those whose work is not varied in terms of job satisfaction, the corresponding percentage differences in Table 10.2 are 10 and 11 per cent for d1 and d2 respectively. This means that when size of firm is controlled, the difference in terms of job satisfaction between those whose work is varied and those whose work is not varied is considerably reduced. This analysis implies that there is not a true relationship between variety in work and job satisfaction, because when size of firm is controlled the relationship between work variety and job satisfaction is almost eliminated. We can suggest that size of firm seems to affect both variables. Most respondents reporting varied work come from large firms ([cell1 + cell5] − [cell3 + cell7]) and most respondents who are satisfied come from large firms ([cell1 + cell2] − [cell3 + cell4]).

What would Table 10.2 look like if the relationship between variety in work and job satisfaction was not spurious when size of firm is controlled? Table 10.3 presents the same analysis but this time the relationship is not spurious. Again, we can compare d1 in Table 10.1 with both d1 and d2 in Table 10.3. In Table 10.1, the difference between those who report variety in their work and those who report no variety is 56 per cent (that is, d1), whereas in Table 10.3 the corresponding differences are 55 per cent for large firms (d1) and 45 per cent for small firms (d2) respectively. Thus, d1 in Table 10.3 is almost exactly the same as d1 in Table 10.1, but d2 is 11 percentage points smaller (that is, 56 − 45). However, this latter finding would not be sufficient to suggest that the relationship is spurious because the difference between those who report varied work and those whose work is not varied is still large for both respondents in large firms and those in small firms. We do not expect an exact replication of percentage differences when we carry out such controls. Similarly, as suggested in the context of the discussion of Table 10.2, we do not need percentage differences to disappear completely in order to infer that a relationship is spurious. When there is an in-between reduction in percentage differences (for example, to around half of the original difference), the relationship is probably partially spurious, implying that part of it is caused by the third variable and the other part is indicative of a ‘true’ relationship. This would have been the interpretation if the original d1 difference of 56 per cent had fallen to around 28 per cent for respondents from both large firms and small firms.
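The comparison of percentage differences described above can be sketched in a few lines of Python. This is a minimal illustration only: the cell counts are hypothetical, chosen so that the overall percentages match the 80 and 24 per cent quoted for Table 10.1, and they are not the cell counts of the chapter's tables. The function name percentage_difference is likewise an invention for this sketch.

```python
def percentage_difference(sat_varied, n_varied, sat_not_varied, n_not_varied):
    """Return d: the percentage satisfied among respondents with varied work
    minus the percentage satisfied among those whose work is not varied."""
    pct_varied = 100.0 * sat_varied / n_varied
    pct_not_varied = 100.0 * sat_not_varied / n_not_varied
    return pct_varied - pct_not_varied

# Hypothetical counts for all 500 respondents: 200 of 250 with varied work
# are satisfied (80%), 60 of 250 with non-varied work are satisfied (24%).
overall = percentage_difference(200, 250, 60, 250)

# The same hypothetical respondents split by the test variable (size of firm).
by_firm_size = {
    "large": percentage_difference(190, 200, 42, 50),
    "small": percentage_difference(10, 50, 18, 200),
}

print(f"zero-order d = {overall:.0f}")
for level, d in by_firm_size.items():
    print(f"d within {level} firms = {d:.0f}")
```

With these invented counts the zero-order difference of 56 percentage points falls to 11 within each category of firm size, which is the pattern the text treats as pointing to a largely spurious relationship.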

Table 10.3 A non-spurious relationship: the relationship between work variety and job satisfaction, controlling for size of firm (imaginary data)

Testing for intervening variables

The quest for intervening variables is different from the search for potentially spurious relationships. An intervening variable is one that is both a product of the independent variable and a cause of the dependent variable. Taking the data examined in Table 10.1, the sequence depicted in Figure 10.2 might be imagined. The analysis presented in Table 10.4 strongly suggests that the level of people’s interest in their work is an intervening variable. As with Tables 10.2 and 10.3, we partition the sample into two groups (this time those who report that they are interested and those who report no interest in their work) and examine the relationship between work variety and job satisfaction for each group. Again, we can compare d1 in Table 10.1 with d1 and d2 in Table 10.4. In Table 10.1 d1 is 56 per cent, but in Table 10.4 d1 and d2 are 13 per cent and 20 per cent respectively. Clearly, d1 and d2 in Table 10.4 have not been reduced to zero (which would suggest that the whole of the relationship was through interest in work), but they are also much lower than the 56 per cent difference in Table 10.1. If d1 and d2 in Table 10.4 had remained at or around 56 per cent, we would conclude that interest in work is not an intervening variable.

Figure 10.2 Is the relationship between work variety and job satisfaction affected by an intervening variable?

Table 10.4 An intervening variable: the relationship between work variety and job satisfaction, controlling for interest in work (imaginary data)

The sequence in Figure 10.2 suggests that variety in work affects the degree of interest in work that people experience, which in turn affects their level of job satisfaction. This pattern differs from that depicted in Figure 10.1 in that, if the analysis supported the hypothesised sequence, it suggests that there is a relationship between amount of variety in work and job satisfaction, but the relationship is not direct. The search for intervening variables is often referred to as explanation and it is easy to see why. If we find that a test variable acts as an intervening variable, we are able to gain some explanatory leverage on the bivariate relationship. Thus, we find that there is a relationship between amount of variety in work and job satisfaction and then ask why that relationship might exist. We speculate that it may be because those who have varied work become more interested in their work, which heightens their job satisfaction.

It should be apparent that the computation of a test for an intervening variable is identical to a test for spuriousness. How, then, do we know which is which? If we carry out an analysis like those shown in Tables 10.2, 10.3 and 10.4, how can we be sure that what we are taking to be an intervening variable is not in fact an indication that the relationship is spurious? The answer is that there should be only one logical possibility, that is, only one that makes sense. If we take the trio of variables in Figure 10.1, to argue that the test variable – size of firm – could be an intervening variable would mean that we would have to suggest that a person’s level of work variety affects the size of the firm in which he or she works – an unlikely scenario. Similarly, to argue that the trio in Figure 10.2 could point to a test for spuriousness would mean that we would have to accept that the test variable – interest in work – can affect the amount of variety in a person’s work. This too makes much less sense than to perceive it as an intervening variable.

One further point should be registered. It is clear that controlling for interest in work in Table 10.4 has not totally eliminated the difference between those reporting varied work and those whose work is not varied in terms of job satisfaction. It would seem, therefore, that there are aspects of the relationship between amount of variety in work and job satisfaction that are not totally explained by the test variable, interest in work.

Testing for moderated relationships

A moderated relationship occurs when a relationship is found to hold for some categories of a sample but not others. Diagrammatically this can be displayed as in Figure 10.3. We may even find the character of a relationship can differ for categories of the test variable. We might find that for one category those who report varied work exhibit greater job satisfaction, but for another category of people the reverse may be true (that is, varied work seems to engender lower levels of job satisfaction than work that is not varied).

Figure 10.3 Is the relationship between work variety and job satisfaction moderated by gender?

Table 10.5 A moderated relationship: the relationship between work variety and job satisfaction, controlling for gender (imaginary data)

Table 10.5 looks at the relationship between variety in work and job satisfaction for men and women. Once again, we can compare d1 (56 per cent) in Table 10.1 with d1 and d2 in Table 10.5, which are 85 per cent and 12 per cent respectively. The bulk of the 56 percentage point difference between those reporting varied work and those reporting that work is not varied in Table 10.1 appears to derive from the relationship between variety in work and job satisfaction being far stronger for men than women and there being more men (300) than women (200) in the sample. Table 10.5 demonstrates the importance of searching for moderated relationships in that they allow the researcher to avoid inferring that a set of findings pertains to a sample as a whole, when in fact it only really applies to a portion of that sample. The term interaction effect is often employed to refer to the situation in which a relationship between two variables differs substantially for categories of the test variable. This kind of occurrence was also addressed in Chapter 9. The discovery of such an effect often inaugurates a new line of inquiry in that it stimulates reflection about the likely reasons for such variations.

The discovery of moderated relationships can occur by design or by chance. When they occur by design, the researcher has usually anticipated the possibility that a relationship may be moderated (though he or she may be wrong of course). They can occur by chance when the researcher conducts a test for an intervening variable or a test for spuriousness and finds a marked contrast in findings for different categories of the test variable.

Multiple causation

Dependent variables in the social sciences are rarely determined by one variable alone, so that two or more potential independent variables can usefully be considered in conjunction. Figure 10.4 suggests that whether someone is allowed participation in decision-making at work also affects their level of job satisfaction. It is misleading to refer to participation in decision-making as a test variable in this context, since it is really a second independent variable. What, then, is the impact of amount of variety in work on job satisfaction when we control the effects of participation?

Again, we compare d1 in Table 10.1 (56 per cent) with d1 and d2 in Table 10.6. The latter are 19 and 18 per cent respectively. This suggests that although the effect of amount of variety in work has not been reduced to zero or nearly zero, its impact has been reduced considerably. Participation in decision-making appears to be a more important cause of variation in job satisfaction. For example, compare the percentages in cells 1 and 3 in Table 10.6: among those respondents who report that they perform varied work, 93 per cent of those who experience participation exhibit job satisfaction, whereas only 30 per cent of those who do not experience participation are satisfied.

Figure 10.4 Multiple causation

Table 10.6 Multiple causation: the relationship between work variety and job satisfaction, controlling for participation at work (imaginary data)

One reason for this pattern of findings is that most people who experience participation in decision-making also have varied jobs, that is (cell1 + cell5) − (cell2 + cell6). Likewise, most people who do not experience participation have work which is not varied, that is (cell4 + cell8) − (cell3 + cell7). Could this mean that the relationship between variety in work and job satisfaction is really spurious, when participation in decision-making is employed as the test variable? The answer is that this is unlikely, since it would mean that participation in decision-making would have to cause variation in the amount of variety in work, which is a less likely possibility (since technological conditions tend to be the major influence on variables like work variety). Once again, we have to resort to a combination of intuitive logic and theoretical reflection in order to discount such a possibility. We shall return to this kind of issue in the context of an examination of the use of multivariate analysis through correlation and regression.

Using SPSS to perform multivariate analysis through contingency tables

Taking the Job Survey data, we might want to examine the relationship between skill and ethnicgp, holding gender constant (that is, as a test variable). Assuming that we want cell frequencies and column percentages, the following sequence would be followed:

➔Analyze ➔Descriptive Statistics ➔Crosstabs... [opens Crosstabs dialog box shown in Box 8.1]

➔skill ➔►button [puts skill in Row[s]: box] ➔ethnicgp ➔►button by Column[s]: [puts ethnicgp in box] ➔gender ➔►button by bottom box [puts gender in box] ➔Cells… [opens Crosstabs: Cell Display subdialog box shown in Box 8.3] [Ensure Observed in the Counts box has been selected and under Percentages ensure Column: has been selected] ➔Continue [closes Crosstabs: Cell Display subdialog box]

➔OK

Two contingency tables crosstabulating skill by ethnicgp will be produced – one for men and one for women. Each table will have ethnicgp going across (that is, as columns) and skill going down (that is, as rows).
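For readers who want to see the logic of this layered crosstabulation outside SPSS, the following is a rough Python sketch using only the standard library. The records and category labels are hypothetical and are not the Job Survey data, and the function names are invented for the illustration; the sketch simply builds one row-by-column table per category of the control variable and converts the counts to column percentages.

```python
from collections import Counter, defaultdict

# Hypothetical records of (skill, ethnicgp, gender); NOT the Job Survey data.
records = [
    ("skilled", "A", "men"), ("unskilled", "A", "men"),
    ("skilled", "B", "men"), ("skilled", "B", "men"),
    ("unskilled", "A", "women"), ("unskilled", "A", "women"),
    ("skilled", "B", "women"), ("unskilled", "B", "women"),
]

def layered_crosstab(rows):
    """Count (row, column) cells separately for each value of the layer."""
    tables = defaultdict(Counter)
    for row_value, col_value, layer_value in rows:
        tables[layer_value][(row_value, col_value)] += 1
    return tables

def column_percentages(table):
    """Express each cell of one layer's table as a percentage of its column."""
    col_totals = Counter()
    for (_, col_value), count in table.items():
        col_totals[col_value] += count
    return {cell: 100.0 * count / col_totals[cell[1]]
            for cell, count in table.items()}

tables = layered_crosstab(records)
for layer, table in tables.items():
    print(layer, dict(table), column_percentages(table))
```

Each entry of tables corresponds to one of the separate contingency tables that the SPSS sequence produces, one per category of the control variable.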

MULTIVARIATE ANALYSIS AND CORRELATION

Although the use of contingency tables provides a powerful tool for multivariate analysis, it suffers from a major limitation, namely that complex analyses with more than three variables require large samples, especially when the variables include a large number of categories. Otherwise, there is the likelihood of very small frequencies in many cells (and indeed the likelihood of many empty cells) when a small sample is employed. By contrast, correlation and regression can be used to conduct multivariate analyses on fairly small samples, although their use in relation to very small samples is limited. Further, both correlation and regression provide easy to interpret indications of the relative strength of relationships. On the other hand, if one or more variables are nominal, multivariate analysis through contingency tables is probably the best way forward for most purposes.

The partial correlation coefficient

One of the main ways in which the multivariate analysis of relationships is conducted in the social sciences is through the partial correlation coefficient. This test allows the researcher to examine the relationship between two variables while holding one or more other variables constant. It allows tests for spuriousness, tests for intervening variables, and multiple causation to be investigated. The researcher must stipulate the anticipated logic that underpins the three variables in question (for example, test for spuriousness) and can then investigate the effect of the test variable on the original relationship. Moderated relationships are probably better examined by computing Pearson’s r for each category of the test variable (for example, for both men and women, or young, middle-aged and old) and then comparing the rs.

The partial correlation coefficient is computed by first calculating the Pearson’s r for each of the pairs of possible relationships involved. Thus, if the two variables concerned are x and y, and t is the test variable (or second independent variable in the case of investigating multiple causation), the partial correlation coefficient computes Pearson’s r for x and y, x and t, and y and t. Because of this, it is necessary to remember that all the restrictions associated with Pearson’s r apply to variables involved in the possible computation of the partial correlation coefficient (for example, variables must be interval).
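For reference, the standard textbook formula for the first-order partial correlation, which the chapter does not print, combines the three Pearson coefficients as follows. The subscript notation is a common convention rather than one taken from this chapter.

```latex
r_{xy \cdot t} = \frac{r_{xy} - r_{xt}\, r_{yt}}{\sqrt{\left(1 - r_{xt}^{2}\right)\left(1 - r_{yt}^{2}\right)}}
```

When the product of the correlations of x and y with t is close to the correlation between x and y, the numerator, and with it the partial correlation, approaches zero.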

There are three possible effects that can occur when partial correlation is undertaken: the relationship between x and y is unaffected by t; the relationship between x and y is totally explained by t; and the relationship between x and y is partially explained by t. Each of these three possibilities can be illustrated with Venn diagrams (see Figure 10.5). In the first case (a), t is only related to x, so the relationship between x and y is unchanged, because t can only have an impact on the relationship between x and y if it affects both variables. In the second case (b), all of the relationship between x and y (the shaded area) is encapsulated by t. This would mean that the relationship between x and y when t is controlled would be zero. What usually occurs is that the test variable, t, partly explains the relationship between x and y, as in the case of (c) in Figure 10.5. In this case, only part of the relationship between x and y is explained by t (the shaded area which is overlapped by t). This would mean that the partial correlation coefficient will be lower than the Pearson’s r for x and y. This is the most normal outcome of calculating the partial correlation coefficient. If the first-order correlation between x and y when t is controlled is considerably less than the zero-order correlation between x and y, the researcher must decide (if he or she has not already done so) whether: (a) the x–y relationship is spurious, or at least largely so; or (b) whether t is an intervening variable between x and y; or (c) whether t is best thought of as a causal variable which is related to x and which largely eliminates the effect of x on y. These are the three possibilities represented in Figures 10.1, 10.2 and 10.4 respectively.

Figure 10.5 The effects of controlling for a test variable

As an example, consider the data in Table 10.7. We have data on eighteen individuals relating to three variables: age, income and a questionnaire scale measuring support for the market economy, which goes from a minimum of 5 to a maximum of 25. The correlation between income and support for the market economy is 0.64. But could this relationship be spurious? Could it be that age should be introduced as a test variable, since we might anticipate that older people are both more likely to earn more and to support the market economy? This possibility can be anticipated because age is related to income (0.76) and to support (0.83). When we compute the partial correlation coefficient for income and support controlling the effects of age, the level of correlation falls to 0.01. This means that the relationship between income and support for the market economy is spurious. When age is controlled, the relationship falls to nearly zero. A similar kind of reasoning would apply to the detection of intervening variables and multiple causation.

Table 10.7 Income, age and support for the market economy (imaginary data)

Case number    Age    Income (£)    Support for market economy
     1          20       9000                11
     2          23       8000                 9
     3          28      12500                12
     4          30      10000                14
     5          32      15000                10
     6          34      12500                13
     7          35      13000                16
     8          37      14500                14
     9          37      14000                17
    10          41      16000                13
    11          43      15500                15
    12          47      14000                14
    13          50      16500                18
    14          52      12500                17
    15          54      14500                15
    16          59      15000                19
    17          61      17000                22
    18          63      16500                18
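The figures in this example can be checked with a short Python sketch that uses only the standard library. It computes the three Pearson coefficients from the Table 10.7 data and then applies the usual first-order partial correlation formula; that formula is the standard textbook one and is assumed here, since the chapter does not print it, and the function names are invented for the illustration.

```python
from math import sqrt

# Table 10.7 (imaginary data): (age, income in pounds, support score).
data = [
    (20, 9000, 11), (23, 8000, 9), (28, 12500, 12), (30, 10000, 14),
    (32, 15000, 10), (34, 12500, 13), (35, 13000, 16), (37, 14500, 14),
    (37, 14000, 17), (41, 16000, 13), (43, 15500, 15), (47, 14000, 14),
    (50, 16500, 18), (52, 12500, 17), (54, 14500, 15), (59, 15000, 19),
    (61, 17000, 22), (63, 16500, 18),
]
age, income, support = (list(column) for column in zip(*data))

def pearson_r(xs, ys):
    """Pearson's r between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def partial_r(r_xy, r_xt, r_yt):
    """First-order partial correlation of x and y, controlling for t."""
    return (r_xy - r_xt * r_yt) / sqrt((1 - r_xt ** 2) * (1 - r_yt ** 2))

r_income_support = pearson_r(income, support)
r_income_age = pearson_r(income, age)
r_support_age = pearson_r(support, age)
r_partial = partial_r(r_income_support, r_income_age, r_support_age)

print(f"income-support r = {r_income_support:.2f}")
print(f"income-age r     = {r_income_age:.2f}")
print(f"support-age r    = {r_support_age:.2f}")
print(f"income-support r, controlling for age = {r_partial:.2f}")
```

Feeding the rounded coefficients quoted in the text (0.64, 0.76 and 0.83) into partial_r gives roughly 0.03 rather than 0.01. The small gap comes from rounding the inputs, and either value is effectively zero.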

Partial correlation with SPSS

Imagine that we want to correlate absence, autonom and satis with each other, but controlling for income. We might think, for example, that the correlation of 0.73 between autonom and satis (see Table 8.7) might be due to income: if people have more autonomy, they are more likely to be given higher incomes, and this may make them more satisfied with their jobs. The following sequence would be followed:

➔Analyze ➔Correlate ➔Partial… [opens Partial Correlations dialog box shown in Box 10.1]

➔absence ➔►button by Variables: box [puts absence in Variables: box] ➔autonom ➔►button by Variables: box [puts autonom in Variables: box] ➔satis ➔►button by Variables: box [puts satis in Variables: box] ➔income ➔►button by Controlling for: box [puts income in Controlling for: box] ➔Two-tailed or One-tailed [depending on which form of Test of Significance you want and ensure box by Display actual significance level has been selected]

➔Options... [opens Partial Correlations: Options subdialog box shown in Box 10.2] ➔either Exclude cases listwise or Exclude cases pairwise [depending on which way of handling missing cases you prefer] ➔Continue [closes Partial Correlations: Options subdialog box] ➔OK

Box 10.1 Partial Correlations dialog box

Box 10.2 Partial Correlations: Options subdialog box

Table 10.8 Matrix of partial correlation coefficients (Job Survey data)

--- PARTIAL CORRELATION COEFFICIENTS ---

Controlling for..    INCOME

              ABSENCE      AUTONOM      SATIS

ABSENCE        1.0000       -.0383      -.1652
              (    0)      (   62)     (   62)
              P= .         P= .764     P= .192

AUTONOM        -.0383       1.0000       .6955
              (   62)      (    0)     (   62)
              P= .764      P= .        P= .000

SATIS          -.1652        .6955      1.0000
              (   62)      (   62)     (    0)
              P= .192      P= .000     P= .

(Coefficient / (D.F.) / 2-tailed Significance)

" . " is printed if a coefficient cannot be computed

The output from this sequence is presented in Table 10.8. Listwise deletion of missing cases was selected in this instance. A useful facility within the Partial procedure is that Pearson’s r can also be computed for all the possible pairs of variables listed in the Partial Correlations dialog box. In the Partial Correlations: Options subdialog box, simply select Zero-order correlations so that a tick appears in the box (if one is not there already).

In fact, there is almost no difference between the correlation of 0.73 between autonom and satis and the 0.70 when these same variables are correlated with income held constant (see Table 10.8). This suggests that the correlation between autonom and satis is unaffected by income. The original bivariate correlation is often referred to as a zero-order correlation (that is, with no variables controlled); when one variable is controlled, as in the case of income in this instance, the resulting correlation is known as a first-order correlation. Still higher order correlations can be obtained. For example, a second-order correlation would entail controlling for two variables, perhaps age and income. To do this within SPSS, simply add the additional variable(s) that you want to control for to the Controlling for: box in the Partial Correlations dialog box (Box 10.1).

REGRESSION AND MULTIVARIATE ANALYSIS

Nowadays regression, in the form of multiple regression, is the most widely used method for conducting multivariate analysis, particularly when more than three variables are involved. In Chapter 8 we previously encountered regression as a means of expressing relationships among pairs of variables. In this chapter, the focus will be on the presence of two or more independent variables.

Consider, first of all, a fairly simple case in which there are three variables, that is two independent variables. The nature of the relationship between the dependent variable and the two independent variables is expressed in a similar manner to the bivariate case explored in Chapter 8. The analogous equation for multivariate analysis is:

y = a + b1x1 + b2x2 + e

where x1 and x2 are the two independent variables, a is the intercept, b1 and b2 are the regression coefficients for the two independent variables, and e is an error term which points to the fact that a proportion of the variance in the dependent variable, y, is unexplained by the regression equation. As in Chapter 8, the error term is ignored since it is not used for making predictions.

In order to illustrate the operation of multiple regression consider the data in Table 10.7. The regression equation for these data is:

support = 5.913 + 0.21262age + 0.000008income

where 5.913 is the intercept (a), 0.21262 is the regression coefficient for the first independent variable, age (x1), and 0.000008 is the regression coefficient for the second independent variable, income ( x2). Each of the two regression coefficients estimates the amount of change that occurs in the dependent variable (support for the market economy) for a one unit change in the independent variable. Moreover, the regression coefficient expresses the amount of change in the dependent variable with the effect of all other independent variables in the equation partialled out (that is, controlled). Thus, if we had an equation with four independent variables, each of the four regression coefficients would express the unique contribution of the relevant variable to the dependent variable (with the effect in each case of the three other variables removed). This feature is of considerable importance, since the independent variables in a multiple regression equation are almost always related to each other.

Thus, every extra year of a person’s age increases support for the market economy by 0.21262, and every extra £1 of income increases support by 0.000008 (that is, by 0.008 for every extra £1,000). Moreover, the effect of age on support is with the effect of income removed, and the effect of income on support is with the effect of age removed. If we wanted to predict the likely level of support for the market economy of someone aged 40 with an income of £17,500, we would substitute as follows:

y = 5.913 + (0.21262) (40) + (0.000008) (17500)

= 5.913 + 8.5048 + 0.14

= 14.56

Thus, we would expect that someone with an age of 40 and an income of £17,500 would have a score of 14.56 on the scale of support for the market economy.
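The substitution can be checked with a few lines of Python. This is only an illustrative sketch: the coefficients are those of the regression equation given above, while the function name is our own.

```python
def predict_support(age, income):
    """Predicted support score from the fitted equation in the text."""
    intercept = 5.913
    b_age = 0.21262       # coefficient for age (in years)
    b_income = 0.000008   # coefficient for income (in pounds)
    return intercept + b_age * age + b_income * income

predicted = predict_support(age=40, income=17500)
print(round(predicted, 2))
```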

While the ability to make such predictions is of some interest to social scientists, the strength of multiple regression lies primarily in its use as a means of establishing the relative importance of independent variables to the dependent variable. However, we cannot say that, simply because the regression coefficient for age is larger than that for income, age is more important to support for the market economy than income. This is because age and income are measured in different units that cannot be directly compared. In order to effect a comparison it is necessary to standardise the units of measurement involved. This can be done by multiplying each regression coefficient by the ratio of the standard deviation of the relevant independent variable to the standard deviation of the dependent variable. The result is known as a standardised regression coefficient or beta weight. This coefficient is easily computed through SPSS. Standardised regression coefficients in a regression equation employ the same standard of measurement and therefore can be compared to determine which of two or more independent variables is the more important in relation to the dependent variable. They essentially tell us by how many standard deviation units the dependent variable will change for a one standard deviation change in the independent variable.
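The conversion from an unstandardised to a standardised coefficient can be sketched in Python as follows. The data are hypothetical and the function names are our own. To keep the example short it uses a single independent variable, in which case the beta weight equals Pearson's r.

```python
import math

def std_dev(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))

def slope(x, y):
    """Unstandardised coefficient b for a one-predictor regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (c - my) for a, c in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def beta_weight(b, sd_x, sd_y):
    """Standardised coefficient: b rescaled by sd(x) / sd(y)."""
    return b * sd_x / sd_y

# Hypothetical data: income in pounds and a score on an attitude scale.
income = [12000, 14000, 15000, 17000, 20000]
score = [9, 12, 11, 15, 16]

b = slope(income, score)                                 # tiny, because of the units
beta = beta_weight(b, std_dev(income), std_dev(score))   # unit-free
```

The unstandardised coefficient is very small only because income is measured in pounds, whereas the beta weight does not depend on the units chosen.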

We can now take an example from the Job Survey data to illustrate some of these points. In the following example we will treat satis as the dependent variable and routine, autonom, age and income as the independent variables. These four independent variables were chosen because they are all known to be related to satis, as revealed by the relevant correlation coefficients. However, it is important to ensure that the independent variables are not too highly related to each other. The Pearson’s r between each pair of independent variables should not exceed 0.80; otherwise the independent variables that show a relationship at or in excess of 0.80 may be suspected of exhibiting multicollinearity. Multicollinearity is usually regarded as a problem because it means that the regression coefficients may be unstable. This implies that they are likely to be subject to considerable variability from sample to sample. In any case, when two variables are very highly correlated, there seems little point in treating them as separate entities. Multicollinearity can be quite difficult to detect where there are more than two independent variables, but SPSS provides some diagnostic tools that will be examined below.

When the previous multiple regression analysis is carried out using the Job Survey data, the following equation is generated:

satis = − 1.93582 + 0.572674autonom + 0.0006209income − 0.168445routine

Table 10.9 Comparison of unstandardised and standardised regression coefficients with satis as the dependent variable

| Independent variables | Unstandardised regression coefficients | Standardised regression coefficients |
|---|---|---|
| autonom | 0.573 | 0.483 |
| income | 0.0006209 | 0.383 |
| routine | −0.168 | −0.217 |
| [intercept] | −1.936 | - |

The variable age was eliminated from the equation by the procedure chosen for including variables in the analysis (the stepwise procedure described below), because it failed to meet the program’s statistical criteria for inclusion. If it had been ‘forced’ into the equation, the impact of age on satis would have been almost zero. Thus, if we wanted to predict the likely satis score of someone with an autonom score of 16, an income of £16,000, and a routine score of 8, the calculation would proceed as follows:

satis = − 1.936 + (0.573)(16) + (0.0006209)(16000) − (0.168)(8)

= − 1.936 + 9.168 + 9.934 − 1.344

= 15.822

However, it is the relative impact of each of these variables on satis that provides the main area of interest for many social scientists. Table 10.9 presents the regression coefficients for the three independent variables remaining in the equation and the corresponding standardised regression coefficients. Although autonom provides the largest unstandardised and standardised regression coefficients, the case of income demonstrates the danger of using unstandardised coefficients in order to infer the magnitude of the impact of independent variables on the dependent variable. The variable income provides the smallest unstandardised coefficient (0.0006209), but the second largest standardised coefficient (0.383). As pointed out earlier, the magnitude of an unstandardised coefficient is affected by the nature of the measurement scale for the variable itself. The variable income has a range from 11,800 to 21,000, whereas a variable like routine has a range of only 4 to 20. When we examine the standardised regression coefficients, we can see that autonom has the greatest impact on satis and income the next highest. The variable routine has the smallest impact which is negative, indicating that more routine engenders less satis. Finally, in spite of the fact that the Pearson’s r between age and satis is moderate (0.35), when the three other variables are controlled, it does not have a sufficient impact on satis to avoid its exclusion through the program’s default criteria for elimination.

We can see here some of the strengths of multiple regression and the use of standardised regression coefficients. In particular, the latter allow us to compare the effects of each of a number of independent variables on the dependent variable. Thus, the standardised coefficient for autonom means that for each one standard deviation change in autonom, there is a change in satis of 0.483 standard deviations, with the effects of income and routine on satis partialled out.

Although we cannot compare unstandardised regression coefficients within a multiple regression equation, we can compare them across equations when the same measures are employed. We may, for example, want to divide a sample into men and women and to compute separate multiple-regression equations for each gender. To do this, we would make use of the Select Cases... procedure. Thus, for example, in the case of the multiple regression analysis we have been covering, the equation for men is:

satis = − 6.771 + 0.596autonom + 0.0007754income

and for women:

satis = − 1.146 + 0.678autonom + 0.0005287income − 0.179routine

Two features of this contrast are particularly striking. First, for men routine has not met the statistical criteria of the stepwise procedure and therefore is removed from the equation (in addition to age, which failed to meet the statistical criteria for both men and women). Second, the negative constant is much larger for men than for women. Such contrasts can provide a useful springboard for further research. Also, it is potentially important to be aware of such subsample differences, since they may have implications for the kinds of conclusion that are generated. However, it must be borne in mind that variables must be identical for such contrasts to be drawn. An alternative approach would be to include gender as an additional independent variable in the equation, since dichotomous variables can legitimately be employed in multiple regression. The decision about which option to choose will be determined by the points that the researcher wishes to make about the data.

One of the questions that we may ask is how well the independent variables explain the dependent variable. In just the same way that we were able to use r2 (the coefficient of determination) as a measure of how well the line of best fit represents the relationship between the two variables, we can compute the multiple coefficient of determination (R2) for the collective effect of all of the independent variables. The R2 value for the equation as a whole is .716, implying that only 28 per cent of the variance in satis (that is, 100 − 71.6) is not explained by the three variables in the equation. In addition, SPSS will produce an adjusted R2. The technical reasons for this variation should not overly concern us here, but the basic idea is that the adjusted version provides a more conservative estimate than the ordinary R2 of the amount of variance in satis that is explained. The adjusted R2 takes into account the number of subjects and the number of independent variables involved. The magnitude of R2 is bound to be inflated by the number of independent variables associated with the regression equation. The adjusted R2 corrects for this by adjusting the level of R2 to take account of the number of independent variables. The adjusted R2 for the equation as a whole is .702, which is just a little smaller than the non-adjusted value.
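The adjustment can be illustrated with a short Python sketch. It uses the usual adjustment formula, which we assume is the one SPSS applies since it reproduces the figures quoted here. The function name is our own, and the number of cases (65) is inferred from the total of 64 degrees of freedom in the analysis of variance section of the SPSS output.

```python
def adjusted_r_squared(r_squared, n_cases, n_predictors):
    """Shrink R2 to allow for the number of predictors and cases."""
    return 1 - (1 - r_squared) * (n_cases - 1) / (n_cases - n_predictors - 1)

# Three independent variables (autonom, income, routine) and 65 cases.
adj = adjusted_r_squared(0.716, n_cases=65, n_predictors=3)
print(round(adj, 3))
```

The gap between R2 and the adjusted R2 widens as more independent variables are added or as the number of cases falls.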

Another aspect of how well the regression equation fits the data is the standard error of the estimate. This statistic allows the researcher to determine the limits of the confidence that he or she can exhibit in the prediction from a regression equation. A statistic that is used more frequently (and which is also generated in SPSS output) is the standard error of the regression coefficient . The standard error of each regression coefficient reflects on the accuracy of the equation as a whole and of the coefficient itself. If successive similar-sized samples are taken from the population, estimates of each regression coefficient will vary from sample to sample. The standard error of the regression coefficient allows the researcher to determine the band of confidence for each coefficient. Thus, if b is the regression coefficient and s.e. is the standard error, we can be 95 per cent certain that the population regression coefficient will lie between b + (1.96 × s.e.) and b − (1.96 × s.e.). This confidence band can be established because of the properties of the normal distribution that were discussed in Chapter 6 and if the sample has been selected randomly. The confidence intervals for each regression coefficient can be generated by SPSS by making the appropriate selection, as in the illustration below. Thus, the confidence band for the regression coefficient for autonom will be between 0.573 + (1.96 × 0.096) and 0.573 − (1.96 × 0.096), that is, between 0.76 and 0.38. This confidence band means that we can be 95 per cent confident that the population regression coefficient for autonom will lie between 0.76 and 0.38. This calculation can be extremely useful when the researcher is seeking to make predictions and requires a sense of their likely accuracy.
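The confidence band for autonom can be reproduced with the following Python sketch, in which the function name is our own. It follows the b ± (1.96 × s.e.) rule given above. The bounds printed by SPSS (.381 and .764) differ slightly in the third decimal place, presumably because the program bases them on the t distribution rather than on 1.96.

```python
def confidence_band(b, se, z=1.96):
    """Approximate 95 per cent band for a regression coefficient."""
    margin = z * se
    return b - margin, b + margin

# Coefficient and standard error for autonom in the final equation.
lower, upper = confidence_band(0.573, 0.096)
print(round(lower, 2), round(upper, 2))
```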

Statistical significance and multiple regression

A useful statistical test that is related to R2 is the F ratio. The F ratio test generated by SPSS is based on the multiple correlation (R) for the analysis. The multiple correlation, which is of course the square root of the coefficient of determination, expresses the correlation between the dependent variable (satis) and all of the independent variables in the equation collectively (that is, autonom, income and routine; age was excluded by the stepwise procedure). The multiple R for the multiple-regression analysis under consideration is 0.846. The F ratio test allows the researcher to test the null hypothesis that the multiple correlation is zero in the population from which the sample (which should be random) was taken. For our computed equation, F = 51.280 (see Table 10.10, ANOVA section, Model 3) and the significance level is 0.000 (which means p < 0.0005), suggesting that it is extremely improbable that R in the population is zero.
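The link between R2, the multiple R and the F ratio can be checked from the sums of squares that SPSS reports. The Python sketch below uses the Model 3 figures from the ANOVA section of Table 10.10; the variable names are our own.

```python
import math

# Model 3 sums of squares and degrees of freedom from Table 10.10.
ss_regression, df_regression = 488.315, 3
ss_residual, df_residual = 193.624, 61

ss_total = ss_regression + ss_residual
r_squared = ss_regression / ss_total       # proportion of variance explained
multiple_r = math.sqrt(r_squared)          # multiple correlation
f_ratio = (ss_regression / df_regression) / (ss_residual / df_residual)

print(round(r_squared, 3), round(multiple_r, 3), round(f_ratio, 2))
```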

Table 10.10 SPSS multiple regression output (Job Survey data)

Variables Entered/Removed (a)

| Model | Variables Entered | Variables Removed | Method |
|---|---|---|---|
| 1 | AUTONOM | | Stepwise (Criteria: Probability-of-F-to-enter <= .050, Probability-of-F-to-remove >= .100). |
| 2 | INCOME | | Stepwise (Criteria: Probability-of-F-to-enter <= .050, Probability-of-F-to-remove >= .100). |
| 3 | ROUTINE | | Stepwise (Criteria: Probability-of-F-to-enter <= .050, Probability-of-F-to-remove >= .100). |

a. Dependent Variable: SATIS

Model Summary

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate | R Square Change | F Change | df1 | df2 | Sig. F Change |
|---|---|---|---|---|---|---|---|---|---|
| 1 | .724(a) | .524 | .516 | 2.27 | .524 | 69.357 | 1 | 63 | .000 |
| 2 | .826(b) | .682 | .672 | 1.87 | .158 | 30.890 | 1 | 62 | .000 |
| 3 | .846(c) | .716 | .702 | 1.78 | .034 | 7.254 | 1 | 61 | .009 |

a. Predictors: (Constant), AUTONOM

b. Predictors: (Constant), AUTONOM, INCOME

c. Predictors: (Constant), AUTONOM, INCOME, ROUTINE

ANOVA (d)

| Model | | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|---|
| 1 | Regression | 357.346 | 1 | 357.346 | 69.357 | .000(a) |
| | Residual | 324.593 | 63 | 5.152 | | |
| | Total | 681.938 | 64 | | | |
| 2 | Regression | 465.288 | 2 | 232.644 | 66.577 | .000(b) |
| | Residual | 216.650 | 62 | 3.494 | | |
| | Total | 681.938 | 64 | | | |
| 3 | Regression | 488.315 | 3 | 162.772 | 51.280 | .000(c) |
| | Residual | 193.624 | 61 | 3.174 | | |
| | Total | 681.938 | 64 | | | |

a. Predictors: (Constant), AUTONOM

b. Predictors: (Constant), AUTONOM, INCOME

c. Predictors: (Constant), AUTONOM, INCOME, ROUTINE

d. Dependent Variable: SATIS

Coefficients (a)

| Model | | B (unstandardised) | Std. Error | Beta (standardised) | t | Sig. | 95% CI for B: Lower Bound | 95% CI for B: Upper Bound | Tolerance | VIF |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | (Constant) | 2.815 | 1.019 | | 2.763 | .007 | .779 | 4.851 | | |
| | AUTONOM | .859 | .103 | .724 | 8.328 | .000 | .653 | 1.065 | 1.000 | 1.000 |
| 2 | (Constant) | -6.323 | 1.846 | | -3.425 | .001 | -10.012 | -2.693 | | |
| | AUTONOM | .685 | .091 | .577 | 7.569 | .000 | .504 | .866 | .881 | 1.136 |
| | INCOME | 6.881E-04 | .000 | .424 | 5.558 | .000 | .000 | .001 | .881 | 1.136 |
| 3 | (Constant) | -1.936 | 2.397 | | -.807 | .423 | -6.730 | 2.858 | | |
| | AUTONOM | .573 | .096 | .483 | 5.974 | .000 | .381 | .764 | .713 | 1.402 |
| | INCOME | 6.209E-04 | .000 | .383 | 5.148 | .000 | .000 | .001 | .843 | 1.186 |
| | ROUTINE | -.168 | .063 | -.217 | -2.693 | .009 | -.294 | -.043 | .716 | 1.397 |

a. Dependent Variable: job satisfaction

Excluded Variables (d)

| Model | | Beta In | t | Sig. | Partial Correlation | Tolerance | VIF | Minimum Tolerance |
|---|---|---|---|---|---|---|---|---|
| 1 | ROUTINE | -.303(a) | -3.234 | .002 | -.380 | .748 | 1.338 | .748 |
| | AGE | .194(a) | 2.244 | .028 | .274 | .947 | 1.056 | .947 |
| | INCOME | .424(a) | 5.558 | .000 | .577 | .881 | 1.136 | .881 |
| 2 | ROUTINE | -.217(b) | -2.693 | .009 | -.326 | .716 | 1.397 | .713 |
| | AGE | -.079(b) | -.861 | .392 | -.110 | .606 | 1.649 | .564 |
| 3 | AGE | .006(c) | .061 | .951 | .008 | .529 | 1.891 | .483 |

a. Predictors in the Model: (Constant), AUTONOM

b. Predictors in the Model: (Constant), AUTONOM, INCOME

c. Predictors in the Model: (Constant), AUTONOM, INCOME, ROUTINE

d. Dependent Variable: job satisfaction

Collinearity Diagnostics (a)

| Model | Dimension | Eigenvalue | Condition Index | Variance Proportions: (Constant) | AUTONOM | INCOME | ROUTINE |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 1.961 | 1.000 | .02 | .02 | | |
| | 2 | 3.894E-02 | 7.096 | .98 | .98 | | |
| 2 | 1 | 2.946 | 1.000 | .00 | .01 | .00 | |
| | 2 | 4.645E-02 | 7.964 | .07 | .97 | .04 | |
| | 3 | 7.837E-03 | 19.387 | .93 | .03 | .96 | |
| 3 | 1 | 3.843 | 1.000 | .00 | .00 | .00 | .00 |
| | 2 | .126 | 5.513 | .00 | .17 | .00 | .29 |
| | 3 | 2.469E-02 | 12.476 | .04 | .79 | .22 | .37 |
| | 4 | 5.629E-03 | 26.131 | .96 | .04 | .78 | .34 |

a. Dependent Variable: SATIS

The calculation of the F ratio is useful as a test of statistical significance for the equation as a whole, since R reflects how well the independent variables collectively correlate with the dependent variable. If it is required to test the statistical significance of the individual regression coefficients, a different test must be used. A number of approaches to this question can be found. Two approaches which can be found within SPSS will be proffered. First, a statistic that is based on the F ratio calculates the significance of the change in the value of R2 as a result of the inclusion of each additional variable in an equation. Since each variable is entered into the equation in turn, the individual contribution of each variable to R2 is calculated and the statistical significance of that contribution can be assessed. In the computation of the multiple-regression equation, a procedure called stepwise for deciding the sequence of the entry of variables into the equation was employed. An explanation of this procedure will be given in the next section, but in the meantime it may be noted that it means that each variable is entered according to the magnitude of its contribution to R2. Thus an examination of the SPSS output (Table 10.10) shows that the variables were entered in the sequence: autonom, income, routine (age was not entered). The contribution of autonom to R2 was 0.524 (see model summary section of Table 10.10). When income was entered the R2 became 0.682, suggesting that this variable added 0.158 (that is, 0.682 − 0.524) to R2. The variable routine added a further 0.034 (that is, 0.716 − 0.682). Clearly, autonom was by far the major contributor to R2. In each case, an F test of the change in R2 shows that the change was statistically significant. The significance levels for the R2 changes as a result of the inclusion of autonom and income were 0.000 in each case; the significance level for the R2 change as a result of the inclusion of routine was 0.009.

Second, SPSS will produce a test of the statistical significance of individual regression coefficients through the calculation of a t value for each coefficient and an associated two-tailed significance test. As the output in Table 10.10 indicates (see the Model 3 rows of the coefficients section), the significance levels for autonom and income were 0.000, and for routine 0.009. These are consistent with the previous analysis using the F ratio and suggest that the coefficients for income, autonom and routine are highly unlikely to be zero in the population.
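The F tests of the change in R2 can be reproduced from the sums of squares in the ANOVA section of Table 10.10. The following Python sketch is illustrative only and the function name is our own. It also shows why the two approaches agree for routine: for the variable entered last, the square of its t value is, within rounding, the F value for the change in R2.

```python
def f_change(ss_reg_before, ss_reg_after, ss_resid_after, df_resid_after):
    """F test for the gain in explained variance from one added variable."""
    gain = ss_reg_after - ss_reg_before
    return gain / (ss_resid_after / df_resid_after)

# Regression and residual sums of squares from Table 10.10.
f_income = f_change(357.346, 465.288, 216.650, 62)    # Model 1 to Model 2
f_routine = f_change(465.288, 488.315, 193.624, 61)   # Model 2 to Model 3

# t value for routine in Model 3 (coefficients section of Table 10.10).
t_routine = -2.693

print(round(f_income, 2), round(f_routine, 2), round(t_routine ** 2, 2))
```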

Multiple regression and SPSS

The regression program within SPSS has quite a large range of options (which can be further enhanced if command syntax is used – see Bryman and Cramer, 1994:243–8) and can generate a large amount of output. In this section it is proposed to simplify these elements as far as possible by dealing with the multiple-regression equation that was the focus of the preceding section and to show how this was generated with SPSS. The output is presented in Table 10.10. The sequence of actions to generate this output is as follows:

➔Analyze ➔Regression ➔Linear... [opens Linear Regression dialog box shown in Box 10.3]

➔satis ➔►button [puts satis in Dependent: box] ➔autonom ➔►button [puts autonom in Independent[s]: box] ➔routine ➔►button [puts routine in Independent[s]: box] ➔age ➔►button [puts age in Independent[s]: box] ➔ income ➔►button [puts income in Independent[s]: box] ➔downward pointing arrow in box by Method: ➔Stepwise ➔Statistics... [opens Linear Regression: Statistics subdialog box shown in Box 10.4]

under Regression Coefficients ➔Confidence intervals ➔Collinearity diagnostics [if not already selected] ➔R squared change ➔Continue [closes Linear Regression: Statistics subdialog box]

➔OK

Box 10.4 Linear Regression: Statistics subdialog box

In this sequence of actions, R squared change was selected because it provides information about the statistical significance of the R2 change as a result of the inclusion of each variable in the equation. Collinearity diagnostics was selected because it generates helpful information about multicollinearity. Confidence intervals was selected because it provides the confidence interval for each regression coefficient. Model fit is a default selection in SPSS and should not normally be de-selected.

The output in Table 10.10 was produced with cases with missing values being omitted on a listwise basis, which is the default within SPSS. Thus, a case is excluded if there is a missing value for any one of the five variables involved in the equation. Missing values can also be dealt with on a pairwise basis, or the mean for the variable can be substituted for a missing value. To change the basis for excluding missing values, click on Options... in the Linear Regression dialog box. The Linear Regression: Options subdialog box opens. In the Missing Values box click on whichever approach to handling missing values is preferred and then click on Continue. You will then be back in the Linear Regression dialog box.

The output in Table 10.10 provides a large amount of regression information. The table with the heading ‘Variables Entered/Removed’ outlines the order in which the variables were included in the analysis. Model 1 includes just autonom, Model 2 includes both autonom and income, and Model 3 includes all the variables that fulfilled the statistical criteria of the stepwise procedure. By implication, age did not meet the criteria. The following elements in the output relate to aspects of multiple regression that have been covered above:

1. Information about the Multiple R, R Square, Adjusted R Square, and the Standard Error of the Estimate is given in the table headed ‘Model Summary’. This tells us, for example, that the R Square is 0.716 once routine has followed autonom and income into the equation, suggesting that around 72 per cent of the variance in satis is explained by these three variables.

2. Also in the ‘Model Summary’ table is information about the R Square Change, showing the amount that each variable contributes to R Square, the F test value of the change, and the associated level of statistical significance.

3. Below the heading ‘ANOVA’ is an analysis of variance table, which can be interpreted in the same way as the ANOVA procedure described in Chapter 7. The analysis of variance table has not been discussed in the present chapter because it is not necessary to an understanding of regression for our current purposes. The information in the table that relates to Model 3 provides the F ratio for the whole equation (51.280), which is shown to be significant at p < 0.0005 (Sig. = .000).

4. In the table with the heading ‘Coefficients’ are the following important bits of summary information for the equation as a whole (Model 3): B (the unstandardised regression coefficient) for each of the three variables and the constant; Std. Error (the standard error of the regression coefficient) for each of the three variables and the constant; Beta (the standardised regression coefficient) for each of the three variables; the t value (t) for each unstandardised regression coefficient; the significance of the t value (Sig.); and the 95 per cent confidence interval for each of the unstandardised coefficients. The information in Table 10.9 was extracted from this section of the output.

5. Information about multicollinearity is given in the table with the heading ‘Coefficients’. This information can be sought in the column Tolerance for Model 3. The Tolerance statistic is derived from 1 minus the squared multiple correlation (R2) for each independent variable, that is, the R2 obtained when that variable is treated as dependent on all of the other independent variables. When the tolerance is low, the multiple correlation is high and there is the possibility of multicollinearity. The tolerances for autonom, routine, and income are 0.713, 0.716, and 0.843 respectively, suggesting that multicollinearity is unlikely. If the tolerance figures had been close to zero, multicollinearity would have been a possibility.
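The relationship between the Tolerance and VIF columns can be illustrated with a short Python sketch. The tolerances are the Model 3 values from Table 10.10 and the variable names are our own. The sketch assumes that tolerance is 1 minus the squared multiple correlation of each independent variable with the others, and that VIF is the reciprocal of tolerance, which reproduces the VIF column to within rounding.

```python
import math

# Tolerance values for Model 3 from the coefficients section of Table 10.10.
tolerance = {"autonom": 0.713, "income": 0.843, "routine": 0.716}

# VIF is the reciprocal of tolerance.
vif = {name: 1 / tol for name, tol in tolerance.items()}

# Multiple correlation of each predictor with the other predictors,
# assuming tolerance = 1 - R squared.
multiple_r = {name: math.sqrt(1 - tol) for name, tol in tolerance.items()}

for name in tolerance:
    print(name, round(vif[name], 2), round(multiple_r[name], 2))
```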

As we have seen, age never enters the equation because it failed to conform to the criteria for inclusion operated by the stepwise procedure. This is one of a number of approaches that can be used in deciding how and whether independent variables should be entered in the equation and is probably the most commonly used approach. Although popular, the stepwise method is none the less controversial because it affords priority to statistical criteria for inclusion rather than theoretical ones. Independent variables are entered only if they meet the package’s statistical criteria (though these can be adjusted) and the order of inclusion is determined by the contribution of each variable to the explained variance. The variables are entered in steps, with the variable that exhibits the highest correlation with the dependent variable being entered at the first step (that is, autonom). This variable must also meet the program’s criteria for inclusion in terms of the required F ratio value. The variable that exhibits the largest partial correlation with the dependent variable (with the effect of the first independent variable partialled out) is then entered (that is, income). This variable must then meet the F ratio default criteria. The same process is repeated for the remaining variables: routine is entered at the third step, whereas age does not meet the necessary criteria and is therefore not included in the equation. In addition, as each new variable is entered, variables that are already in the equation are reassessed to determine whether they still meet the necessary statistical criteria. If they do not, they are removed from the equation.
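The entry decisions described above can be replayed from the ‘Excluded Variables’ section of Table 10.10. The Python sketch below is a simplification and not the SPSS algorithm itself: it takes the partial correlations and significance levels from the output rather than computing them, it models only the entry criterion (probability of F to enter of .050), and it leaves out the reassessment of variables already in the equation. The function and variable names are our own.

```python
# Candidates not yet in the equation at each step, taken from the
# 'Excluded Variables' section of Table 10.10:
# name -> (partial correlation with satis, significance).
excluded_by_step = [
    {"routine": (-0.380, 0.002), "age": (0.274, 0.028), "income": (0.577, 0.000)},
    {"routine": (-0.326, 0.009), "age": (-0.110, 0.392)},
    {"age": (0.008, 0.951)},
]

P_TO_ENTER = 0.05  # entry criterion shown in the output

def replay_stepwise(first_variable, steps, p_to_enter=P_TO_ENTER):
    """Replay the entry decisions; removal checks are not modelled."""
    entered = [first_variable]
    for candidates in steps:
        # Strongest remaining candidate by absolute partial correlation.
        name, (_partial, sig) = max(
            candidates.items(), key=lambda item: abs(item[1][0])
        )
        if sig > p_to_enter:
            break  # the best candidate fails the criterion, so stop
        entered.append(name)
    return entered

order = replay_stepwise("autonom", excluded_by_step)
print(order)
```

Note that age would have met the entry criterion at the first step (significance .028) had income not been the stronger candidate; once income is in the equation, its partial correlation with satis falls away.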

PATH ANALYSIS

The final area to be examined in this chapter, path analysis, is an extension of the multiple regression procedures explored in the previous section. In fact, path analysis entails the use of multiple regression in relation to explicitly formulated causal models. Path analysis cannot establish causality; it cannot be used as a substitute for the researcher’s views about the likely causal linkages among groups of variables. All it can do is examine the pattern of relationships between three or more variables, but it can neither confirm nor reject the hypothetical causal imagery.

The aim of path analysis is to provide quantitative estimates of the causal connections between sets of variables. The connections proceed in one direction and are viewed as making up distinct paths. These ideas can best be explained with reference to the central feature of a path analysis – the path diagram. The path diagram makes explicit the likely causal connections between variables. An example is provided in Figure 10.6 which takes four variables employed in the Job Survey: age, income, autonom and satis. The arrows indicate expected causal connections between variables. The model moves from left to right, implying causal priority to those variables closer to the left. Each p denotes a causal path and hence a path coefficient that will need to be computed. The model proposes that age has a direct effect on satis (p1). But indirect effects of age on satis are also proposed: age affects income (p5) which in turn affects satis (p6); age affects autonom (p2) which in turn affects satis (p3); and age affects autonom (p2) again, but this time autonom affects income (p4) which in turn affects satis (p6). In addition, autonom has a direct effect on satis (p3) and an indirect effect whereby it affects income (p4) which in turn affects satis (p6). Finally, income has a direct effect on satis (p6), but no indirect effects. Thus, a direct effect occurs when a variable has an effect on another variable without a third variable intervening between them; an indirect effect occurs when there is a third intervening variable through which two variables are connected.

Figure 10.6 Path diagram for satis

In addition, income, autonom and satis have further arrows directed to them from outside the nexus of variables. These refer to the amount of unexplained variance for each variable respectively. Thus, the arrow from e1 to autonom (p7) refers to the amount of variance in autonom that is not accounted for by age. Likewise, the arrow from e2 to satis (p8) denotes the amount of error arising from the variance in satis that is not explained by age, autonom and income. Finally, the arrow from e3 to income (p9) denotes the amount of variance in income that is unexplained by age and autonom. These error terms point to the fact that there are other variables that have an impact on autonom, satis and income, but which are not included in the path diagram.

In order to provide estimates of each of the postulated paths, path coefficients are computed. A path coefficient is a standardised regression coefficient. The path coefficients are computed by setting up three structural equations, that is equations which stipulate the structure of hypothesised relationships in a model. In the case of Figure 10.6, three structural equations will be required – one for autonom, one for satis and one for income. The three equations will be:

autonom = x1age + e1 (10.1)

satis = x1age + x2autonom + x3income + e2 (10.2)

income = x1age + x2autonom + e3 (10.3)

The standardised coefficient for age in (10.1) will provide p2. The coefficients for age, autonom and income in (10.2) will provide p1, p3 and p6 respectively. Finally, the coefficients for age and autonom in (10.3) will provide p5 and p4 respectively.

Thus, in order to compute the path coefficients, it is necessary to treat the three equations as multiple-regression equations and the resulting standardised regression coefficients provide the path coefficients. The intercepts in each case are ignored. The three error terms are calculated by taking the R2 for each equation away from 1 and taking the square root of the result of this subtraction.

In order to complete all the paths in Figure 10.6, all the path coefficients will have to be computed. The stepwise procedure should therefore not be used because, if certain variables do not enter the equation due to the program’s default criteria for inclusion and exclusion, no path coefficients can be computed for them. In SPSS, instead of choosing Stepwise in the Method: box (see Box 10.3), choose Enter, which will force all variables into the equation.

Therefore, to compute equation (10.1) the following steps would need to be followed (assuming that listwise deletion of missing cases has already been selected):

➔Analyze ➔Regression ➔Linear... [opens Linear Regression dialog box shown in Box 10.3]

➔autonom ➔►button [puts autonom in Dependent: box] ➔age ➔►button [puts age in Independent[s]: box] ➔downward pointing arrow in box by Method: ➔Enter ➔OK

For equation (10.2):

➔Analyze ➔Regression ➔Linear... [opens Linear Regression dialog box shown in Box 10.3]

➔satis ➔►button [puts satis in Dependent: box] ➔age ➔►button [puts age in Independent[s]: box] ➔autonom ➔►button [puts autonom in Independent[s]: box] ➔income ➔►button [puts income in Independent[s]: box] ➔downward pointing arrow in box by Method: ➔Enter ➔OK

For equation (10.3):

➔Analyze ➔Regression ➔Linear... [opens Linear Regression dialog box shown in Box 10.3]

➔income ➔►button [puts income in Dependent: box] ➔age ➔►button [puts age in Independent[s]: box] ➔autonom ➔►button [puts autonom in Independent[s]: box] ➔downward pointing arrow in box by Method: ➔Enter ➔OK

When conducting a path analysis, the key items to look for in the SPSS output are the standardised regression coefficient for each variable (under the heading Beta in the last section of the table) and the R² (for the error term paths). If we take the results of the third equation, we find that the standardised coefficients for autonom and age are 0.215 and 0.567 respectively and the R² is 0.426. Thus for p4, p5 and p9 in the path diagram (Figure 10.7) we substitute 0.22, 0.57 and 0.76 (the latter being the square root of 1 − 0.42604). All of the relevant path coefficients have been inserted in Figure 10.7.

Since the path coefficients are standardised, it is possible to compare them directly. We can see that age has a very small negative direct effect on satis, but it has a number of fairly pronounced positive indirect effects on satis. In particular, there is a strong sequence that goes from age to income (p5 = 0.57) to satis (p6 = 0.47).

Many researchers recommend calculating the overall impact of a variable like age on satis. This would be done as follows. We take the direct effect of age (−0.08) and add to it the indirect effects. The indirect effects are gleaned by multiplying the coefficients for each path from age to satis. The paths from age to income to satis would be calculated as (0.57)(0.47) = 0.27. For the paths from age to autonom to satis we have (0.28)(0.58) = 0.16. Finally, the sequence from age to autonom to income to satis yields (0.28)(0.22)(0.47) = 0.03. Thus the total indirect effect of age on satis is 0.27 + 0.16 + 0.03 = 0.46. For the total effect of age on satis, we add the direct effect and the total indirect effect, that is, −0.08 + 0.46 = 0.38. This exercise suggests that the indirect effect of age on satis is inconsistent with its direct effect, since the direct effect is slightly negative and the indirect effect is positive. Clearly, an appreciation of the intervening variables income and autonom is essential to an understanding of the relationship between age and satis.

Figure 10.7 Path diagram for satis with path coefficients

The effect of age on satis could be compared with the effect of other variables in the path diagram. Thus, the effect of autonom is made up of the direct effect (0.58) plus the indirect effect of autonom to income to satis, that is, 0.58 + (0.22)(0.47), which equals 0.68. The effect of income on satis is made up only of the direct effect, which is 0.47, since there is no indirect effect from income to satis. Thus, we have three effect coefficients as they are often called (for example, Pedhazur, 1982) – 0.38, 0.68 and 0.47 for age, autonom and income respectively – implying that autonom has the largest overall effect on satis.
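The decomposition just described can be checked with a short Python sketch. The coefficients are those quoted in the text for Figure 10.7; the dictionary layout and the function name total_effect are illustrative assumptions rather than part of any statistical package. The function sums, over every directed route from one variable to another, the product of the path coefficients along that route.

```python
# Path coefficients quoted in the text for Figure 10.7: (cause, effect) -> value.
PATHS = {
    ("age", "satis"): -0.08,      # p1
    ("age", "autonom"): 0.28,     # p2
    ("autonom", "satis"): 0.58,   # p3
    ("autonom", "income"): 0.22,  # p4
    ("age", "income"): 0.57,      # p5
    ("income", "satis"): 0.47,    # p6
}


def total_effect(source, target, paths=PATHS):
    """Direct effect plus all indirect effects of source on target.

    Assumes a recursive model (no feedback loops), so the recursion ends.
    """
    total = 0.0
    for (cause, effect), coefficient in paths.items():
        if cause != source:
            continue
        if effect == target:
            total += coefficient  # direct path
        else:
            # Route continues through an intervening variable.
            total += coefficient * total_effect(effect, target, paths)
    return total


for variable in ("age", "autonom", "income"):
    print(variable, round(total_effect(variable, "satis"), 2))
```

The sketch multiplies unrounded products, whereas the text rounds each indirect effect before adding them; with these coefficients the two approaches agree to two decimal places.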

Sometimes, it is not possible to specify the causal direction between all the variables in a path diagram. In Figure 10.8 autonom and routine are deemed to be correlates; there is no attempt to ascribe causal priority to one or the other. The link between them is indicated by a curved arrow with two heads. Each variable has a direct effect on absence (p5 and p4). In addition, each variable has an indirect effect on absence through satis: autonom to satis (p1) and satis to absence (p3); routine to satis (p2) and satis to absence (p3). In order to generate the necessary coefficients, we would need the Pearson’s r for autonom and routine and the standardised regression coefficients from two equations:

satis = a + x1autonom + x2routine + e1    (10.4)

absence = a + x1autonom + x2routine + x3satis + e2    (10.5)

Figure 10.8 Path diagram for absence

We could then compare the total causal effects of autonom, routine and satis. The total effect would be made up of the direct effect plus the total indirect effect. The total effect of each of these three variables on absence would be:

Total effect of autonom = (p5) + (p1)(p3)

Total effect of routine = (p4) + (p2)(p3)

Total effect of satis = p3

These three total effects can then be compared to establish which has the greatest overall effect on absence. However, with complex models involving a large number of variables, the decomposition of effects using the foregoing procedures can prove unreliable, and alternative methods have to be employed (Pedhazur, 1982).
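One widely used alternative for recursive models is to hold the direct effects in a matrix B and obtain all total effects at once as (I − B)⁻¹ − I. It is offered here as an illustration and is not necessarily the procedure that Pedhazur describes. The sketch below assumes NumPy and reuses the coefficients quoted for Figure 10.7, since no numerical values are given for Figure 10.8; the variable names and matrix layout are illustrative choices.

```python
import numpy as np

variables = ["age", "autonom", "income", "satis"]
index = {name: i for i, name in enumerate(variables)}

# (cause, effect, coefficient) triples quoted in the text for Figure 10.7.
direct_paths = [
    ("age", "satis", -0.08),
    ("age", "autonom", 0.28),
    ("autonom", "satis", 0.58),
    ("autonom", "income", 0.22),
    ("age", "income", 0.57),
    ("income", "satis", 0.47),
]

# B[effect, cause] holds the direct effect of cause on effect.
B = np.zeros((len(variables), len(variables)))
for cause, effect, coefficient in direct_paths:
    B[index[effect], index[cause]] = coefficient

identity = np.eye(len(variables))
# For a model without feedback loops, (I - B)^-1 = I + B + B^2 + ...,
# so subtracting I leaves the sum of all direct and indirect routes.
total = np.linalg.inv(identity - B) - identity
indirect = total - B

for cause in ("age", "autonom", "income"):
    value = total[index["satis"], index[cause]]
    print(f"total effect of {cause} on satis: {value:.2f}")
```

The matrix form gives the same effect coefficients as tracing each route by hand, but it does not require the routes to be listed individually, which is where hand calculation becomes error-prone in larger diagrams.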

Path analysis has become a popular technique because it allows the relative impact of variables within a causal network to be estimated. It forces the researcher to make explicit the causal structure that is believed to underpin the variables of interest. On the other hand, it suffers from the problem that it cannot confirm the underlying causal structure. It tells us what the relative impact of the variables upon each other is, but cannot validate that causal structure. Since a cause must precede an effect, the time order of variables must be established in the construction of a path diagram. We are forced to rely on theoretical ideas and our common-sense notions for information about the likely sequence of the variables in the real world. Sometimes these conceptions of time ordering of variables will be faulty and the ensuing path diagram will be misleading. Clearly, while path analysis has much to offer, its potential limitations should also be appreciated. In this chapter, it has only been feasible to cover a limited range of issues in relation to path analysis and the emphasis has been upon the use of examples to illustrate some of the relevant procedures, rather than a formal presentation of the issues. Readers who require more detailed treatments should consult Land (1969), Pedhazur (1982) and Davis (1985).

EXERCISES

1. A researcher hypothesises that women are more likely than men to support legislation for equal pay between the sexes. The researcher decides to conduct a social survey and draws a sample of 1,000 individuals among whom men and women are equally represented. One set of questions asked directs the respondent to indicate whether he or she approves of such legislation. The findings are provided in Table 10E.1. Is the researcher’s belief that women are more likely than men to support equal pay legislation confirmed by the data in Table 10E.1?

Table 10E.1 The relationship between approval of equal pay legislation and gender

| | Men (%) | Women (%) |
|---|---|---|
| Approve | 58 | 71 |
| Disapprove | 42 | 29 |
| Total | 100 | 100 |
| N = | 500 | 500 |

2. Following on from Question 1, the researcher controls for age and the results of the analysis are provided in Table 10E.2. What are the implications of this analysis for the researcher’s view that men and women differ in support for equal pay legislation?

Table 10E.2 The relationship between approval of equal pay legislation and gender controlling for age

| | Under 35 (%): Men | Under 35 (%): Women | 35 and over (%): Men | 35 and over (%): Women |
|---|---|---|---|---|
| Approve | 68 | 92 | 48 | 54 |
| Disapprove | 32 | 8 | 52 | 46 |
| Total | 100 | 100 | 100 | 100 |
| N = | 250 | 250 | 250 | 250 |

3. What SPSS procedure would be required to examine the relationship between ethnicgp and commit, controlling for gender? Assume that you want ethnicgp going across the table and that you need both frequency counts and column percentages.

4. A researcher is interested in the correlates of the number of times that people attend religious services during the course of a year. On the basis of a sample of individuals, he finds that income correlates fairly well with frequency of attendance (Pearson’s r = 0.59). When the researcher controls for the effects of age the partial correlation coefficient is found to be 0.12. Why has the size of the correlation fallen so much?

5. What SPSS procedure would you need to correlate income and satis, controlling for age? Assume that you want to display actual significance levels and that missing cases are to be deleted listwise.

6. Consider the following regression equation and other details:

y = 7.3 + 2.3x1 + 4.1x2 − 1.4x3

R² = 0.78, F = 21.43, p < 0.01

(a) What value would you expect y to exhibit if x1 = 9, x2 = 22, and x3 = 17?

(b) How much of the variance in y is explained by x1, x2 and x3?

(c) Which of the three independent variables exhibits the largest effect on y?

(d) What does the negative sign for x3 mean?

7. What SPSS procedure would you need to provide the data for the multiple regression equations (10.4) and (10.5)? In considering the commands, you should bear in mind that the information is required for a path analysis.

8. Turning to the first of the two equations referred to in question 7 (that is, the one with satis as the dependent variable):

(a) How much of the variance in satis do the two variables account for?

(b) Are the individual regression coefficients for autonom and routine statistically significant?

(c) What is the standardised regression coefficient for routine?

9. Examine Figure 10.8. Using the information generated for questions 7 and 8, which variable has the largest overall effect on absence – is it autonom, routine or satis?
