Two weeks ago I claimed that women report higher job satisfaction when they work in countries where tech is more male-dominated. And then instead of backing up my claim last week, I got sidetracked by questions of sample size and statistical power.

In a previous blog post I introduced the Kaggle survey on women in tech and I did some basic data cleaning for that survey. To save time and get to the point, I now pick up where I left off. In this post I use linear regression, multilevel regression, and k-fold cross validation techniques to show how job satisfaction varies across countries depending on the percentage of women working in tech.

The survey has a lot of interesting variables, but for today we’re just interested in a subset of the data.

• From the file “multipleChoiceResponses.csv”: female, Country, CompensationAmount, JobSatisfaction, Tenure
• From the file “conversionRates.csv”: originCountry, exchangeRate
library(knitr)

Varname <- names(ks)
Description <- c("Country of residence", "Is respondent female? 1 = female, 0 = male. Nonbinary = removed", "Current total yearly compensation (salary + bonus)", "Three letter country code abbreviation for country of currency", "Satisfaction with current job? (scale 0-10)")
table <- as.data.frame(cbind(Varname, Description))
kable(table[1:5,])
Varname Description
Country Country of residence
female Is respondent female? 1 = female, 0 = male. Nonbinary = removed
CompensationAmount Current total yearly compensation (salary + bonus)
CompensationCurrency Three letter country code abbreviation for country of currency
JobSatisfaction Satisfaction with current job? (scale 0-10)
Varname <- names(cr)
Description <- c("Three letter country code abbreviation for country of currency", "Exchange rate to USD")
table <- as.data.frame(cbind(Varname, Description))
kable(table[1:2,])
Varname Description
originCountry Three letter country code abbreviation for country of currency
exchangeRate Exchange rate to USD

## Preparing the variables

CompensationAmount is (often) given in local currency, so we need to merge the survey responses with the conversionRates.csv file to convert local salary to USD equivalent.

cr$CompensationCurrency <- cr$originCountry # rename variable for matching datasets
cr <- cr[,c("CompensationCurrency","exchangeRate")]

# merge datasets
ks <- merge(ks,cr,by="CompensationCurrency", all=TRUE)
# View(ksi[,c("CompensationAmount","CompensationCurrency","exchangeRate")])

# clean up global environment
rm(cr)

# clean up CompensationAmount
# missing values are sometimes strings instead of NA
is.na(ks$CompensationAmount) <- ks$CompensationAmount==""

# numbers are inconsistently written. Change comma as hundreds separator to "" and make numeric
ks$CompensationAmount <- as.numeric(gsub(",", "", ks$CompensationAmount))
ks$CompensationAmount <- ifelse(ks$CompensationAmount<=0,NA,ks$CompensationAmount) # make negatives and 0 missing, because these are weird income values # multiply income*exchangeRate to get income in USD ks$CompUSD <- ks$CompensationAmount*ks$exchangeRate

# income not normally distributed, take log
ks$CompUSDLog <- log(ks$CompUSD)

JobSatifaction is a factor, but with 10 categories we can treat it like a numeric variable.

## convert from factor to numeric
ks$jobsat <- as.numeric(with(ks, ifelse((ks$JobSatisfaction=="1 - Highly Dissatisfied"), "1",
ifelse((ks$JobSatisfaction=="10 - Highly Satisfied"), "10", ifelse((ks$JobSatisfaction %in% c("I prefer not to share", "")), NA,
ifelse((ks$JobSatisfaction==2),2, ifelse((ks$JobSatisfaction==3),3,
ifelse((ks$JobSatisfaction==4),4, ifelse((ks$JobSatisfaction==5),5,
ifelse((ks$JobSatisfaction==6),6, ifelse((ks$JobSatisfaction==7),7,
ifelse((ks$JobSatisfaction==8),8, ifelse((ks$JobSatisfaction==9),9,NA)))))))))))))

## Country descriptives

Now the data is clean, lets run some country-level descriptives

# Make country level descriptives
Countrydescriptives <- ks %>%
group_by(Country) %>%
summarize(meanfem=mean(female, na.rm=T),sdfem=sd(female, na.rm=T),
meaninc=mean(CompUSDLog, na.rm=T),
meanJobSat=mean(jobsat, na.rm=T),sdJobSat=sd(jobsat,na.rm=T))

# I also want mean of job satisfaction by country and gender. To do this I make two datasets for male and female country averages, then merge
Countryfemdesc <- ks[ks$female==1,] %>% group_by(Country) %>% summarize(meanfemJobSat=mean(jobsat, na.rm=T)) Countrymaledesc <- ks[ks$female==0,] %>%
group_by(Country) %>%
summarize(meanmaleJobSat=mean(jobsat, na.rm=T))
Countrygenderdesc <- merge(Countryfemdesc,Countrymaledesc,by="Country")
rm(Countryfemdesc,Countrymaledesc)

# merge both country level datasets
Countrydescriptives <- merge(Countrydescriptives,Countrygenderdesc,by="Country")

# country descriptives
Countrydescriptives[,c("Country","meanfem","meanJobSat")]
##           Country    meanfem meanJobSat
## 1       Argentina 0.11235955   7.090909
## 2       Australia 0.19377990   6.988764
## 3         Belarus 0.13207547   6.928571
## 4         Belgium 0.10000000   6.500000
## 5          Brazil 0.08225108   6.978022
## 7           Chile 0.13725490   7.307692
## 8           China 0.16479401   5.905109
## 9        Colombia 0.15178571   7.363636
## 10 Czech Republic 0.05660377   7.264706
## 11        Denmark 0.05263158   7.081081
## 12          Egypt 0.28787879   6.500000
## 13        Finland 0.07812500   6.735294
## 14         France 0.13990826   7.009174
## 15        Germany 0.13318777   6.991304
## 16         Greece 0.17500000   7.161290
## 17      Hong Kong 0.17460317   6.032258
## 18        Hungary 0.16923077   6.387097
## 19          India 0.15361559   6.071823
## 20      Indonesia 0.22834646   7.187500
## 21           Iran 0.25454545   6.806452
## 22        Ireland 0.31914894   6.395349
## 23         Israel 0.10576923   7.886792
## 24          Italy 0.11344538   6.786885
## 25          Japan 0.07299270   5.927273
## 26          Kenya 0.17241379   5.611111
## 27       Malaysia 0.30769231   6.120000
## 28         Mexico 0.07936508   7.477612
## 29    Netherlands 0.10552764   6.948980
## 30    New Zealand 0.19178082   6.580645
## 31        Nigeria 0.12500000   6.103448
## 32         Norway 0.09615385   6.750000
## 33       Pakistan 0.17721519   6.204082
## 34    Philippines 0.27848101   6.200000
## 35         Poland 0.11111111   6.766234
## 36       Portugal 0.20430108   7.048780
## 37        Romania 0.25423729   7.200000
## 38         Russia 0.17452007   6.977376
## 39      Singapore 0.20329670   6.379310
## 40   South Africa 0.16000000   6.585714
## 41    South Korea 0.19170984   6.384615
## 42          Spain 0.14511041   6.826087
## 43         Sweden 0.08988764   7.974359
## 44    Switzerland 0.15625000   7.323944
## 45         Taiwan 0.20553360   6.174419
## 46         Turkey 0.21985816   6.365854
## 47        Ukraine 0.16836735   7.032787
## 48 United Kingdom 0.16730038   6.767442
## 49  United States 0.21144520   6.975467
## 50        Vietnam 0.18571429   6.090909
## 51           <NA>        NaN        NaN
# merge with individual level data
ks <- merge(ks,Countrydescriptives,by="Country")
rm(Countrygenderdesc)

In the above table you can search per country to see the proportion of women and the mean job satisfaction.

## Plotting descriptives

The table has a lot of information and it’s not very informative in an of itself. In the following figures I try to make this information more tangible, starting with countries ranked by the proportion of women in tech.

# Subset Countrydescriptives and create confidence intervals for proportion women in tech per country
CDSmall <- Countrydescriptives[order(Countrydescriptives$meanfem, decreasing = TRUE),c("Country","meanfem","sdfem")] CDSmall <- CDSmall[!is.na(CDSmall$Country),]
CDSmall$lbfem <- CDSmall$meanfem - 1.96*(CDSmall$sdfem/(sqrt(length(CDSmall$meanfem)))) ## lower bound confidence interval
CDSmall$ubfem <- CDSmall$meanfem + 1.96*(CDSmall$sdfem/(sqrt(length(CDSmall$meanfem)))) ## upper bound confidence interval
CDSmall <- CDSmall[,c("Country","meanfem","lbfem","ubfem")]

# order by point
CDSmall$Country <- factor(CDSmall$Country, levels=CDSmall[order(CDSmall$meanfem, decreasing = FALSE), "Country"]) # figure p <- ggplot(CDSmall, aes(x=Country, y=meanfem, ymin=lbfem, ymax=ubfem))+ geom_pointrange()+ geom_hline(yintercept = 0, linetype=2)+ coord_flip()+ xlab('Country') p In another figure I sort countries by job satisfaction (ranked highest to lowest). # Subset Countrydescriptives and create confidence intervals for job satisfaction rm(CDSmall) CDSmall <- Countrydescriptives[order(Countrydescriptives$meanJobSat, decreasing = TRUE),c("Country","meanJobSat","sdJobSat")]
CDSmall <- CDSmall[!is.na(CDSmall$Country),] CDSmall$lbjs <- CDSmall$meanJobSat - 1.96*(CDSmall$sdJobSat/(sqrt(length(CDSmall$meanJobSat)))) ## lower bound confidence interval CDSmall$ubjs <- CDSmall$meanJobSat + 1.96*(CDSmall$sdJobSat/(sqrt(length(CDSmall$meanJobSat)))) ## upper bound confidence interval CDSmall <- CDSmall[,c("Country","meanJobSat","lbjs","ubjs")] # order by point CDSmall$Country <- factor(CDSmall$Country, levels=CDSmall[order(CDSmall$meanJobSat, decreasing = FALSE), "Country"])

# figure
p <- ggplot(CDSmall, aes(x=Country, y=meanJobSat, ymin=lbjs, ymax=ubjs))+
geom_pointrange()+
geom_hline(yintercept = 0, linetype=2)+
coord_flip()+
xlab('Country')
p

This is slightly more informative, but lists of countries are still hard to process. Now I graph the proportion of women working in tech in a map.

## GLOBAL MAPS
polygons <- readOGR(dsn = "TM_WORLD_BORDERS_SIMPL-0.3", layer = "TM_WORLD_BORDERS_SIMPL-0.3")
## OGR data source with driver: ESRI Shapefile
## Source: "TM_WORLD_BORDERS_SIMPL-0.3", layer: "TM_WORLD_BORDERS_SIMPL-0.3"
## with 246 features
## It has 11 fields
## Integer64 fields read as strings:  POP2005
# some country names are different in polygons and ks files
levels(polygons$NAME)[levels(polygons$NAME)=="Iran (Islamic Republic of)"] <- "Iran"
levels(polygons$NAME)[levels(polygons$NAME)=="Viet Nam"] <- "Vietnam"

# prep colors for maps
my.palette <- brewer.pal(n = 9, name = "Reds")

polygons2 <- merge(polygons,Countrydescriptives,by.x="NAME",by.y="Country")
breaks.qt <- classIntervals(polygons2$meanfem, n = 9, style = "quantile", intervalClosure = "right") meanfemMap <- spplot(polygons2,"meanfem",main=paste(polygons2$meanfem),sub="Proportion of Women in Tech",
col.regions=my.palette,at=breaks.qt$brks) print(meanfemMap) And a map of average job satisfaction per country polygons3 <- merge(polygons,Countrydescriptives,by.x="NAME",by.y="Country") breaks.qt <- classIntervals(polygons3$meanJobSat, n = 9, style = "quantile", intervalClosure = "right")
JobSatMap <- spplot(polygons3,"meanJobSat",main=paste(polygons3$meanJobSat),sub="Job Satisfaction", col.regions=my.palette,at=breaks.qt$brks)

print(JobSatMap)

In Europe, Canada, and especially South America it looks like there is generally low gender diversity but high job satisfaction. This might be an indication that diversity and job satisfaction are inversely related, but we would need to conduct an analysis to tell for sure.

## Analyses

I will now perform some regressions to see if people are more satisfied with their jobs when they work in less diverse settings.

jobsat.1 <- lm(meanJobSat ~ meanfem, data=Countrydescriptives)
summary(jobsat.1) 
##
## Call:
## lm(formula = meanJobSat ~ meanfem, data = Countrydescriptives)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.10722 -0.25114  0.03638  0.32411  1.03196
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   7.1865     0.1917  37.487   <2e-16 ***
## meanfem      -2.7151     1.1007  -2.467   0.0173 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4865 on 48 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.1125, Adjusted R-squared:  0.09401
## F-statistic: 6.085 on 1 and 48 DF,  p-value: 0.01726

The above analysis shows that the greater the proportion of women, the lower the overall level of job satisfaction, but is this driven by women’s job satisfaction or men’s?

# women's job satisfaction
jobsat.2 <- lm(meanfemJobSat ~ meanfem, data=Countrydescriptives)
summary(jobsat.2) 
##
## Call:
## lm(formula = meanfemJobSat ~ meanfem, data = Countrydescriptives)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.0723 -0.4764  0.1984  0.5399  1.4476
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   6.6230     0.3366  19.675   <2e-16 ***
## meanfem      -1.3411     1.9327  -0.694    0.491
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8542 on 48 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.009932,   Adjusted R-squared:  -0.01069
## F-statistic: 0.4815 on 1 and 48 DF,  p-value: 0.4911

The proportion of women has no significant effect on women’s mean job satisfaction. As shown in the following analysis the proportion of women seems to decrease men’s mean job satisfaction by 2.5 points, which is a huge amount on a scale of 0-10.

# men's job satisfaction
jobsat.3 <- lm(meanmaleJobSat ~ meanfem, data=Countrydescriptives)
summary(jobsat.3) 
##
## Call:
## lm(formula = meanmaleJobSat ~ meanfem, data = Countrydescriptives)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.27368 -0.27544 -0.03701  0.31443  1.17103
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    7.196      0.206  34.941   <2e-16 ***
## meanfem       -2.452      1.183  -2.073   0.0435 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5227 on 48 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.08221,    Adjusted R-squared:  0.06309
## F-statistic: 4.299 on 1 and 48 DF,  p-value: 0.04352

Is this perhaps due to employers paying women less, thus bringing down the overall salary/reimbursement package for all employees?

# control for income
jobsat.4 <- lm(meanmaleJobSat ~ meanfem + meaninc, data=Countrydescriptives)
summary(jobsat.4) 
##
## Call:
## lm(formula = meanmaleJobSat ~ meanfem + meaninc, data = Countrydescriptives)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.12322 -0.31030 -0.02689  0.39221  1.07163
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  5.31294    0.79897   6.650 2.78e-08 ***
## meanfem     -1.62324    1.17671  -1.379   0.1743
## meaninc      0.17495    0.07194   2.432   0.0189 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4978 on 47 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.1848, Adjusted R-squared:  0.1501
## F-statistic: 5.327 on 2 and 47 DF,  p-value: 0.008221

Could be! Indeed, once we control for the mean income of tech workers, the effect of the proportion of women on men’s job satisfaction dissappears. This might imply that employers are discriminating against women by paying them less, and that discrimination brings down the overall salary levels. However, since we are only observing cross-sectional, macro-level phenomena, we can’t say anything about causality.

### Individual level

You’ll notice that in the above analyses, I look at country level averages and find that the average level of job satisfaction is lower in countries where the proportion of women in tech is greater That’s all fine and well, but the question was if individuals experience a tradeoff between job satisfaction and diversity. That means that it’s not enough to look at national averages, we have to consider how gender diversity affects (or is correlated with) individuals’ job satisfaction. That we do with a multilevel analysis.

mljobsat.1 <- lmer(jobsat ~ meanfem + (1 | Country), data=ks)
summary(mljobsat.1) 
## Linear mixed model fit by REML ['lmerMod']
## Formula: jobsat ~ meanfem + (1 | Country)
##    Data: ks
##
## REML criterion at convergence: 26853.8
##
## Scaled residuals:
##      Min       1Q   Median       3Q      Max
## -2.86032 -0.50184  0.07213  0.58402  1.86377
##
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Country  (Intercept) 0.1479   0.3846
##  Residual             4.4752   2.1155
## Number of obs: 6179, groups:  Country, 50
##
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   7.1747     0.1969   36.43
## meanfem      -2.6003     1.1636   -2.23
##
## Correlation of Fixed Effects:
##         (Intr)
## meanfem -0.938

In the above random intercept model we allow the mean job satisfaction per country to vary. We can see from the size of the variance of the intercept (0.1479) relative to its standard deviation (0.3846) that there is not much cross-national variation in job satisfaction. Interesting! (Reported) life satisfaction is known to vary quite a bit between countries. Apparently job satisfaction is less culturally determined, or perhaps tech workers are a particularly international bunch who tend towards one shared global culture.

We can also calculate the interclass correlation, which tells us how similar respondents from the same country are.

(ICC <- 0.1479/(0.1479 + 4.4752))
## [1] 0.03199152

3% of variation in job satisfaction between respondents can be explained by the country level (after controlling for proportion of women working in tech). Incidentally, the variance in job satisfaction explained by country of residence is not much bigger when we don’t include the proportion of women.

The R command lmer() purposefully does not print confidence intervals or p-values out of an ideological objection to basing decisions purely on p-values. Their point is that we should consider the size of the effect as well as its statistical significance. Nonetheless, I find it useful to know if we repeated our test 100 times, how often would our estimated effect be different from 0? We can calculate a confidence interval using the standard error of the estimate, with the result that our confidence interval for the effect of meanfem on jobsat is:

(CI <- c(-2.6003 - 1.96*1.1636 , -2.600 + 1.96*1.1636))
## [1] -4.880956 -0.319344

Thus, people are significantly less satisfied with their jobs when there is a higher proportion of women working in tech. Again we distinguish between men’s satisfaction and women’s:

# women's job satisfaction
mljobsat.2 <- lmer(jobsat ~ meanfem + (1 | Country), data=ks[ks$female==1,]) summary(mljobsat.2)  ## Linear mixed model fit by REML ['lmerMod'] ## Formula: jobsat ~ meanfem + (1 | Country) ## Data: ks[ks$female == 1, ]
##
## REML criterion at convergence: 4023.3
##
## Scaled residuals:
##     Min      1Q  Median      3Q     Max
## -2.6098 -0.6321  0.1232  0.6593  1.7163
##
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Country  (Intercept) 0.1268   0.3561
##  Residual             5.0609   2.2496
## Number of obs: 900, groups:  Country, 50
##
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   6.6839     0.4061  16.458
## meanfem      -1.0444     2.2909  -0.456
##
## Correlation of Fixed Effects:
##         (Intr)
## meanfem -0.959

Here we performed the same analysis but only on the women in our sample. We do see a negative effect of diversity for women, but it’s not significant.

(CI <- c(-1.0444 - 1.96*2.2909 , -1.0444 + 1.96*2.2909))
## [1] -5.534564  3.445764
# men's job satisfaction
mljobsat.3 <- lmer(jobsat ~ meanfem + (1 | Country), data=ks[ks$female==0,]) summary(mljobsat.3)  ## Linear mixed model fit by REML ['lmerMod'] ## Formula: jobsat ~ meanfem + (1 | Country) ## Data: ks[ks$female == 0, ]
##
## REML criterion at convergence: 22509.1
##
## Scaled residuals:
##      Min       1Q   Median       3Q      Max
## -2.90207 -0.50923  0.08274  0.61715  1.86002
##
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Country  (Intercept) 0.1634   0.4042
##  Residual             4.3663   2.0896
## Number of obs: 5207, groups:  Country, 50
##
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   7.1827     0.2084   34.46
## meanfem      -2.3693     1.2414   -1.91
##
## Correlation of Fixed Effects:
##         (Intr)
## meanfem -0.938

For men we also see a negative effect, which is likewise not significant.

(CI <- c(-2.3693 - 1.96*1.2414 , -2.3693 + 1.96*1.2414))
## [1] -4.802444  0.063844

In the country-level analyses we also included income, so let’s do that here, too, but at the individual level.

mljobsat.4 <- lmer(jobsat ~ meanfem + CompUSDLog + (1 | Country), data=ks[ks$female==0,]) summary(mljobsat.4)  ## Linear mixed model fit by REML ['lmerMod'] ## Formula: jobsat ~ meanfem + CompUSDLog + (1 | Country) ## Data: ks[ks$female == 0, ]
##
## REML criterion at convergence: 14758.4
##
## Scaled residuals:
##     Min      1Q  Median      3Q     Max
## -2.9859 -0.5214  0.0794  0.6019  2.1066
##
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Country  (Intercept) 0.1392   0.3731
##  Residual             4.2473   2.0609
## Number of obs: 3434, groups:  Country, 50
##
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  6.06376    0.34351  17.652
## meanfem     -2.75309    1.30205  -2.114
## CompUSDLog   0.11360    0.02514   4.518
##
## Correlation of Fixed Effects:
##            (Intr) meanfm
## meanfem    -0.627
## CompUSDLog -0.778  0.048

It’s a pretty wide standard error again, so let’s compute the confidence interval.

(CI <- c(-2.75309 - 1.96*1.30205 , -2.75309 + 1.96*1.30205))
## [1] -5.305108 -0.201072

Here the proportion of women in tech is significant. Thus while the effect of income seems to overrule the effect of gender diversity as a macro phenomenon, at the individual-level, the positive effect of income actually compensates for some of the negative effect of gender diversity.

### Cross-level interactions

There’s one final way to test whether the gender composition has a different effect for men and women, and that’s with a cross-level interaction. In this model we ask whether the effect of being female varies depending on the proportion of women in tech.

cljobsat <- lmer(jobsat ~ female*meanfem + (1 + female | Country), data=ks)
summary(cljobsat) 
## Linear mixed model fit by REML ['lmerMod']
## Formula: jobsat ~ female * meanfem + (1 + female | Country)
##    Data: ks
##
## REML criterion at convergence: 26530.6
##
## Scaled residuals:
##      Min       1Q   Median       3Q      Max
## -2.86437 -0.50056  0.08275  0.58375  1.92784
##
## Random effects:
##  Groups   Name        Variance  Std.Dev. Corr
##  Country  (Intercept) 1.522e-01 0.390106
##           female      5.904e-05 0.007683 -1.00
##  Residual             4.468e+00 2.113824
## Number of obs: 6107, groups:  Country, 50
##
## Fixed effects:
##                Estimate Std. Error t value
## (Intercept)      7.1898     0.2030   35.42
## female          -0.4102     0.3397   -1.21
## meanfem         -2.5410     1.2028   -2.11
## female:meanfem   1.2928     1.8248    0.71
##
## Correlation of Fixed Effects:
##             (Intr) female meanfm
## female      -0.189
## meanfem     -0.939  0.183
## female:mnfm  0.193 -0.974 -0.200

The interaction between female and meanfem is not significant, but the negative effect of gender diversity on job satisfaction remains. I suspect this model is producing wide confidence intervals because we’ve demanded more of it than the data can handle. But for argument’s sake, let’s run the same model above including income in the equation.

cljobsat.2 <- lmer(jobsat ~ female*meanfem + CompUSDLog + (1 + female | Country), data=ks)
summary(cljobsat.2) 
## Linear mixed model fit by REML ['lmerMod']
## Formula: jobsat ~ female * meanfem + CompUSDLog + (1 + female | Country)
##    Data: ks
##
## REML criterion at convergence: 17254.8
##
## Scaled residuals:
##      Min       1Q   Median       3Q      Max
## -2.92262 -0.51164  0.08275  0.60080  2.18590
##
## Random effects:
##  Groups   Name        Variance  Std.Dev. Corr
##  Country  (Intercept) 0.1305150 0.36127
##           female      0.0001138 0.01067  -1.00
##  Residual             4.3963805 2.09675
## Number of obs: 3985, groups:  Country, 50
##
## Fixed effects:
##                Estimate Std. Error t value
## (Intercept)     6.13603    0.33068  18.556
## female         -0.83466    0.42770  -1.951
## meanfem        -2.96643    1.27412  -2.328
## CompUSDLog      0.10841    0.02372   4.571
## female:meanfem  3.75169    2.29253   1.636
##
## Correlation of Fixed Effects:
##             (Intr) female meanfm CmUSDL
## female      -0.150
## meanfem     -0.647  0.198
## CompUSDLog  -0.767  0.022  0.058
## female:mnfm  0.137 -0.974 -0.216 -0.001

This model produces the same results as the previous cross-level interaction model, with an additional significant and positive effect of income.

## Cross validation

Let’s now compare our three models (jobsat.4, mljobsat.4, and cljobsat.2) which illustrate the effect of gender diversity on job satisfaction after controlling for income. This i do by segmenting the data into 10 subsets, running the model on each subset, and then correlating predicted and observed values.

# jobsat.4
folds <- cvFolds(NROW(Countrydescriptives), K=10)
Countrydescriptives$pred.4 <- rep(0,nrow(Countrydescriptives)) for(i in 1:10){ train <- Countrydescriptives[folds$subsets[folds$which != i], ] #Set the training set validation <- Countrydescriptives[folds$subsets[folds$which == i], ] #Set the validation set newjobsat.4 <- lm(meanmaleJobSat ~ meanfem + meaninc, data=train) #Get your new linear model (just fit on the train data) predjobsat.4 <- predict(newjobsat.4,newdata=validation) #Get the predicitons for the validation set (from the model just fit on the train data) Countrydescriptives[folds$subsets[folds$which == i], ]$pred.4 <- predjobsat.4 #Put the hold out prediction in the data set for later use
}

cor_jobsat.4 <- cor(Countrydescriptives[,c("meanmaleJobSat", "pred.4")], use="complete.obs", method="pearson")

# mljobsat.4
ksmen <- ks[ks$female==0,] ksmen <- ksmen[!is.na(ksmen$Country),]
foldsksmen <- cvFolds(NROW(ksmen), K=10)
ksmen$predml.4 <- rep(0,nrow(ksmen)) for(i in 1:10){ train <- ksmen[foldsksmen$subsets[foldsksmen$which != i], ] validation <- ksmen[foldsksmen$subsets[foldsksmen$which == i], ] newmljobsat.4 <- lmer(jobsat ~ meanfem + CompUSDLog + (1 | Country), data=train) predmljobsat.4 <- predict(newmljobsat.4,newdata=validation) ksmen[foldsksmen$subsets[foldsksmen$which == i], ]$predml.4 <- predmljobsat.4
}

cor_mljobsat.4 <- cor(ksmen[,c("jobsat", "predml.4")], use="complete.obs", method="pearson")

# cljobsat.2
ks <- ks[!is.na(ks$Country),] foldsks <- cvFolds(NROW(ks), K=10) ks$predcl.2 <- rep(0,nrow(ks))

for(i in 1:10){
train <- ks[foldsks$subsets[foldsks$which != i], ]
validation <- ks[foldsks$subsets[foldsks$which == i], ]

newcljobsat.2 <- lmer(jobsat ~ female*meanfem + CompUSDLog + (1 + female | Country), data=train)
predcljobsat.2 <- predict(newcljobsat.2,newdata=validation)

ks[foldsks$subsets[foldsks$which == i], ]$predcl.2 <- predcljobsat.2 } ## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control
## $checkConv, : unable to evaluate scaled gradient ## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control
## \$checkConv, : Model failed to converge: degenerate Hessian with 1 negative
## eigenvalues
cor_cljobsat.2 <- cor(ks[,c("jobsat", "predcl.2")], use="complete.obs", method="pearson")

models <- c("jobsat.4","mljobsat.4","cljobsat.2")
correlations <- c(round(cor_jobsat.4[2,1],2),round(cor_mljobsat.4[2,1],2),round(cor_cljobsat.2[2,1],2))
(corres <- cbind(models,correlations))
##      models       correlations
## [1,] "jobsat.4"   "0.26"
## [2,] "mljobsat.4" "0.18"
## [3,] "cljobsat.2" "0.17"

Here I assess accuracy by comparing the correlation between the predicted outcomes based on the regression model and the observed outcomes. The first model, jobsat.4 that only looked at country-level descriptives, has the highest correlation at approximately .30. The individual level models with a simple multilevel structure are less accurate. Therefore, from purely a predictive standpoint, the simplest model performs the best. However, the models also answer different questions. The first model, jobsat.4, describes how gender diversity affects average job satisfaction (it doesn’t), while the other two models, mljobsat.4 and cljobsat.2, describe how gender diversity affects individual job satisfaction (it does–negatively). Thus, the question of which model to use is not purely a question of which model makes better predictions, but also of what question are we trying to answer.

## Conclusion

I started out by asking if there is a tradeoff between gender diversity and job satisfaction, and it does appear that individual job satisfaction is lower in countries where there are more women working in tech, particularly for men. That’s the bad news. The good news is that these models are quite poor predictors of overall job satisfaction. Luckily for employers and policy makers who want to increase gender diversity, there are many other factors that can compensate for its negative effect on job satisfaction.

This post can be found on GitHub.