Loren on the Art of MATLAB

Turn ideas into MATLAB

Survey Reveals Diversity in the “Learn to Code” Movement

Posted by Loren Shure

Do you use any free "learn to code" websites to teach yourself programming? You may already know how to program in MATLAB, but you may well be learning other skills through MOOCs.

Today's guest blogger, Toshi, analyzed publicly available survey data to understand the demographics of self-taught coders.

Load Data

I came across a TechCrunch article, Free Code Camp survey reveals demographics of self-taught coders, and I got curious because a lot of people seem to be interested in learning how to code, and industry and government are also encouraging this trend. But programming is hard. Who exactly are the people who have taken the plunge? Our own free interactive online programming classes on MATLAB Academy and the gamified MATLAB Cody are also gaining popularity, and I would like to understand what motivates this interest.

The survey was conducted anonymously and published on the web and promoted via social media from March 28 through May 2, 2016, targeting people who are relatively new to programming.

The following analysis shows significant diversity in gender and ethnic mix among self-taught coders, and the possible impact of MOOCs in opening up access for populations under-served by traditional STEM education paths.

I first downloaded the 2016 New Coder Survey results from GitHub. I then unzipped the CSV files into my current folder. There are two files, part 1 and part 2, and we will read them into separate tables. We could perhaps merge them using innerjoin, but I am primarily interested in part 2, and a merge would discard at least 1,000 responses from part 1, given the difference in the number of responses.

warning('off','MATLAB:table:ModifiedVarnames')              % suppress warning
csv = '2016 New Coders Survey Part 1.csv';                  % filename
part1 = readtable(csv);                                     % read into table
part1.Properties.VariableNames = ...                        % format variable names
    regexprep(part1.Properties.VariableNames,'_+$','');     % by removing extra "_"
csv = '2016 New Coders Part 2.csv';                         % filename
part2 = readtable(csv);                                     % read into table
part2.Properties.VariableNames = ...                        % format variable names
    regexprep(part2.Properties.VariableNames,'_+$','');     % by removing extra "_"
warning('on','MATLAB:table:ModifiedVarnames')               % enable warning
part1.SubmitDate_UTC = datetime(part1.SubmitDate_UTC);      % convert date strings to datetime
part2.SubmitDate_UTC = datetime(part2.SubmitDate_UTC);      % convert date strings to datetime
s = sprintf('part1 %d responses from %s thru %s\n', ...     % summary of part1
    height(part1),datestr(min(part1.SubmitDate_UTC)), ...   % count of responses, start date
    datestr(max(part1.SubmitDate_UTC)));                    % and end date
fprintf('%spart2 %d responses from %s thru %s', ...         % summary of part2
    s,height(part2),datestr(min(part2.SubmitDate_UTC)), ... % count of responses, start date
    datestr(max(part2.SubmitDate_UTC)));                    % and end date
part1 15653 responses from 29-Mar-2016 21:23:43 thru 02-May-2016 19:12:44
part2 14625 responses from 29-Mar-2016 21:25:36 thru 02-May-2016 18:35:59
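To see why an inner join would drop responses, here is a minimal sketch on toy tables. The key variable ID is hypothetical; I have not checked which column, if any, links the two survey files.

```matlab
% Toy tables; the key variable ID is hypothetical, not taken from the survey files
p1 = table([1;2;3;4], {'a';'b';'c';'d'}, 'VariableNames', {'ID','Answer1'});
p2 = table([2;3;5],   {'x';'y';'z'},     'VariableNames', {'ID','Answer2'});
merged = innerjoin(p1, p2, 'Keys', 'ID');   % keep only IDs present in both tables
dropped = height(p1) - height(merged);      % rows of p1 with no match in p2
fprintf('%d rows kept, %d rows of p1 dropped\n', height(merged), dropped)
```

Only rows whose key appears in both tables survive, so the larger table always loses its unmatched rows.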

Higher Female Representation Than Expected

Let's start by plotting a histogram of the age distribution. Loren pointed out that we can use the omitnan flag in median to deal with missing values, instead of calling nanmedian.

The histogram shows that a lot of the people who responded to this survey fall into the so-called "millennials" category. It is interesting to see the number of women who responded, considering the often-cited gender gap in STEM fields. It is not clear whether this reflects the true population or whether women are over-represented through self-selection. Or is self-taught programming somehow more appealing to women than traditional instruction?

age = part2.HowOldAreYou;                                   % get age from part2
gender = categorical(part2.What_sYourGender);               % get gender from part2 as categorical
part2.What_sYourGender = gender;                            % update table
figure                                                      % new figure
x = age(age ~= 0 & gender == 'male');                       % subset age by gender
histogram(x)                                                % plot histogram
text(50,550, sprintf('Median Age (Male)   : %d', ...        % annotate
    median(x,'omitnan')))
text(50,470, sprintf('Mode Age (Male)     : %d',mode(x)))   % annotate
hold on                                                     % don't overwrite
x = age(age ~= 0 & gender == 'female');                     % subset age by gender
histogram(x)                                                % plot histogram
text(50,520, sprintf('Median Age (Female): %d', ...         % annotate
    median(x,'omitnan')))
text(50,440, sprintf('Mode Age (Female)  : %d',mode(x)))    % annotate
hold off                                                    % restore default
title('Age Distribution by Gender')                         % add title
xlabel('Age')                                               % add x axis label
ylabel('Count')                                             % add y axis label
legend('Male','Female')                                     % add legend
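The effect of the omitnan flag used above is easy to see in isolation. This is a toy example with made-up ages, not survey data.

```matlab
x = [22 25 NaN 31 40];           % toy ages with one missing value
m_default = median(x);           % NaN, because the missing value propagates
m_omit = median(x, 'omitnan');   % 28, the median of the four remaining values
```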

Mostly Studying in Countries of Citizenship

Since the survey was done online, anyone could participate. Let's check the geographic breakdown. As you would expect, the largest portion of the responses came from the US. You can also see that female responses were 40.59% of male responses in the US, confirming high female representation in the responses.

China is notably missing from the top 10 countries. Perhaps the "learn to code" buzz has not caught on there?

part2.WhichCountryDoYouCurrentlyLiveIn = ...                % convert to categorical
    categorical(part2.WhichCountryDoYouCurrentlyLiveIn);
country = part2.WhichCountryDoYouCurrentlyLiveIn;           % get country of residence
catcount = countcats(country);                              % get category count
cats = categories(country);                                 % get categories
[~, rank] = sort(catcount,'descend');                       % rank category by count
below_top10 = setdiff(cats,cats(rank(1:10)));               % categories below top 10
country = mergecats(country, below_top10, 'Other');         % merge them into other
country = reordercats(country,[cats(rank(1:10));{'Other'}]);% reorder cats by ranking
ratio = sum(country == 'United States of America' & ...     % ratio of female/male in us
    gender == 'female')/sum(country == 'United States of America' & gender == 'male');
figure                                                      % new figure
histogram(country(gender == 'male'))                        % plot histogram
hold on                                                     % don't overwrite
histogram(country(gender == 'female'))                      % plot histogram
hold off                                                    % restore default
ax = gca;                                                   % get current axes handle
ax.XTickLabelRotation = 90;                                 % rotate x tick label
title('Country of Residence by Gender')                     % add title
ylabel('Count')                                             % add y axis label
legend('Male','Female')                                     % add legend
text(1.5, 1900, sprintf('US Female/Male %.2f%%',ratio*100)) % annotate
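The "keep the top N categories and merge the rest into Other" steps above recur throughout this post. Here they are as a sketch on toy data with N = 2. The country labels are made up, and I named the index variable order rather than rank only to avoid shadowing the built-in rank function.

```matlab
c = categorical({'US';'US';'US';'India';'India';'UK';'Canada';'Brazil'}); % toy data
n = 2;                                          % number of categories to keep
counts = countcats(c);                          % count per category
names = categories(c);                          % category names, same order as counts
[~, order] = sort(counts, 'descend');           % rank categories by count
top = names(order(1:n));                        % the n most frequent categories
c = mergecats(c, setdiff(names, top), 'Other'); % fold the rest into 'Other'
c = reordercats(c, [top; {'Other'}]);           % order by rank, 'Other' last
disp(countcats(c)')                             % counts for US, India, Other
```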

You can also visualize migration patterns by mapping countries of citizenship to countries of residence. The number of edges is just 467, meaning only 467 out of all 14,625 responses in part2 are from migrants, and most people live and study in their countries of citizenship. If you take the ratio of immigration over emigration, the US, United Kingdom, Canada, Australia, Germany and Russia enjoy net gains from any brain drain.

part2.WhichCountryAreYouACitizenOf = ...                    % convert to categorical
    categorical(part2.WhichCountryAreYouACitizenOf);
citizenship = part2.WhichCountryAreYouACitizenOf;           % get country of citizenship
tbl = table(cellstr(citizenship),cellstr(country));         % create table of residence and citizenship
tbl(isundefined(citizenship) & isundefined(country),:) = [];% drop empty rows
tbl.(1)(strcmp(tbl.(1),'<undefined>')) = ...                % use residence if citizenship is empty
    tbl.(2)(strcmp(tbl.(1),'<undefined>'));
tbl.(2)(strcmp(tbl.(2),'<undefined>')) = ...                % use citizenship if residence is empty
    tbl.(1)(strcmp(tbl.(2),'<undefined>'));
[tbl, ~, idx] = unique(tbl,'rows');                         % eliminate duplicate rows
w = accumarray(idx, 1);                                     % use count of duplicates as weight
G = digraph(tbl.(1), tbl.(2), w);                           % create a directed graph
indeg = indegree(G);                                        % get in-degrees
ratio = indeg./outdegree(G);                                % get ratio of in-degrees over out-degrees
figure                                                      % new figure
colormap cool                                               % set colormap
w = G.Edges.Weight;                                         % get weights
h = plot(G,'MarkerSize',log(indeg+2),'NodeCData',ratio, ... % plot directional graph
    'EdgeColor',[.7 .7 .7],'EdgeAlpha',0.3,'LineWidth',10*w/max(w));
caxis([0 3])                                                % set color axis scaling
axis([-2.8 3.3 -4.5 3.7])                                   % set axis limits
title({'Migration Pattern'; ...                             % add title
    '467 cases out of 14,625 responses (3.2%)'})
labelnode(h,cats(rank(1:10)),cats(rank(1:10)))              % label top 10 nodes
nlabels = {'Argentina','Azerbaijan','Chile','Congo', ...    % additional nodes to label
    'Cote D''Ivoire','Croatia','Greece','Guyana','Latvia','Lesotho', ...
    'Malta','Other','Paraguay','Philippines','Republic of Serbia','Romania'};
labelnode(h,nlabels,nlabels);                               % label additional nodes
h = colorbar;                                               % add colorbar
ylabel(h, 'in-degrees over out-degree ratio')               % add metric
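The way the edge weights are built above (unique rows plus accumarray) may be easier to follow on a tiny example. The node labels A, B and C are hypothetical stand-ins for countries.

```matlab
from = {'A';'A';'B';'A';'C'};                % toy citizenship, one entry per response
to   = {'B';'B';'B';'C';'A'};                % toy residence
pairs = table(from, to);                     % one row per response
[pairs, ~, idx] = unique(pairs, 'rows');     % distinct (from,to) pairs, sorted
w = accumarray(idx, 1);                      % number of responses per distinct pair
G = digraph(pairs.from, pairs.to, w);        % weighted directed graph
disp(G.Edges)                                % the B to B row is a self-loop
fprintf('in-degree of B: %d, out-degree of A: %d\n', ...
    indegree(G,'B'), outdegree(G,'A'))
```

Note that indegree and outdegree count edges rather than summing weights, and that a row with the same label on both sides becomes a self-loop.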

Ethnically Diverse English Speakers in US

Let's focus on the US. As noted earlier, most new self-taught coders who responded to this survey were US citizens, but their ethnic makeup is very diverse. More than half of the women are ethnic minorities, and so are a third of the men. They are also predominantly English speakers, given the low ratio of immigrants. However, we should note that the survey itself was in English and was promoted via social media in English.

isminority = part2.AreYouAnEthnicMinorityInYourCountry;     % get minority status
figure                                                      % new figure
subplot(1,2,1)                                              % create a subplot
x = gender(country == 'United States of America' ...        % subset
    & isminority == 0);                                     % us non-minority
histogram(x)                                                % plot histogram
hold on                                                     % don't overwrite
x = gender(country == 'United States of America' ...        % subset gender
    & isminority == 1);                                     % by us minority
histogram(x)                                                % plot histogram
hold off                                                    % restore default
title('US Gender by Ethnic Category')                       % add title
ylabel('Count')                                             % add y axis label
legend('Majority','Minority', 'Location','northwest')       % add legend
part2.WhichLanguageDoYouYouSpeakAtHomeWithYourFamily = ...  % convert to categorical
    categorical(part2.WhichLanguageDoYouYouSpeakAtHomeWithYourFamily);
language = part2.WhichLanguageDoYouYouSpeakAtHomeWithYourFamily;% get language
usa = language(country == 'United States of America');      % extract us data
catcount = countcats(usa);                                  % get category count
cats = categories(usa);                                     % get categories
[~, rank] = sort(catcount,'descend');                       % rank category by count
below_top10 = setdiff(cats,cats(rank(1:10)));               % categories below top 10
usa = mergecats(usa, below_top10, 'Other');                 % merge them into other
usa = reordercats(usa,[cats(rank(1:10)); {'Other'}]);       % reorder cats by ranking
ax = gca;                                                   % get current axes handle
ax.XTickLabelRotation = 90;                                 % rotate x tick label
subplot(1,2,2)                                              % create a subplot
histogram(usa(gender(country == ...                         % plot histogram
    'United States of America') == 'male'))                 % subset us language by gender
hold on                                                     % don't overwrite
histogram(usa(gender(country == ...                         % plot histogram
    'United States of America') == 'female'))               % subset us language by gender
hold off                                                    % restore default
title('US Languages by Gender')                             % add title
ylabel('Count')                                             % add y axis label
legend('Male','Female')                                     % add legend

Many Are Highly Educated and Already Employed in the US

We already know that a lot of people who take MOOCs have already earned college degrees and have jobs. This survey shows the same result.

part2.What_sTheHighestDegreeOrLevelOfSchoolYouHaveCompleted = ... % convert to categorical
    categorical(part2.What_sTheHighestDegreeOrLevelOfSchoolYouHaveCompleted);
degree = part2.What_sTheHighestDegreeOrLevelOfSchoolYouHaveCompleted;% get degree
usa = degree(country == 'United States of America');        % extract us data
catcount = countcats(usa);                                  % get category count
cats = categories(usa);                                     % get categories
[~, rank] = sort(catcount,'descend');                       % rank category by count
usa = reordercats(usa,cats(rank));                          % reorder cats by ranking
figure                                                      % new figure
subplot(1,2,1)                                              % create a subplot
histogram(usa(gender(country == ...                         % plot histogram
    'United States of America') == 'male'))                 % subset us degree by gender
hold on                                                     % don't overwrite
histogram(usa(gender(country == ...                         % plot histogram
    'United States of America') == 'female'))               % subset us degree by gender
hold off                                                    % restore default
title('US Degrees by Gender')                               % add title
ylabel('Count')                                             % add y axis label
legend('Male','Female')                                     % add legend
part2.RegardingEmploymentStatus_AreYouCurrently = ...       % convert to categorical
    categorical(part2.RegardingEmploymentStatus_AreYouCurrently);
employment = part2.RegardingEmploymentStatus_AreYouCurrently;% get employment
other = part2.Other;                                        % get other in employment
isstudent = zeros(size(other));                             % set up an accumulator
fun = @(x,y) ~cellfun(@isempty,strfind(lower(x),y));        % anonymous function handle
isstudent(fun(other,'student')) = 1;                        % flag if 'student' is found
isstudent(fun(other,'studying')) = 1;                       % flag if 'studying' is found
isstudent(fun(other,'school')) = 1;                         % flag if 'school' is found
isstudent(fun(other,'university')) = 1;                     % flag if 'university' is found
isstudent(fun(other,'degree')) = 1;                         % flag if 'degree' is found
isstudent(fun(other,'phd')) = 1;                            % flag if 'phd' is found
employment(logical(isstudent)) = 'Student';                 % update employment
usa = employment(country == 'United States of America');    % extract us data
catcount = countcats(usa);                                  % get category count
cats = categories(usa);                                     % get categories
[~, rank] = sort(catcount,'descend');                       % rank category by count
usa = reordercats(usa,cats(rank));                          % reorder cats by ranking
subplot(1,2,2)                                              % create a subplot
histogram(usa(gender(country == ...                         % plot histogram
    'United States of America') == 'male'))                 % subset us employment by gender
hold on                                                     % don't overwrite
histogram(usa(gender(country == ...                         % plot histogram
    'United States of America') == 'female'))               % subset us employment by gender
hold off                                                    % restore default
title('US Employment by Gender')                            % add title
ylabel('Count')                                             % add y axis label
legend('Male','Female')                                     % add legend
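As an aside, the keyword flagging above could also be written with contains. This is a sketch of an alternative, not the approach used in this post, and it assumes MATLAB R2016b or later, where contains is available. The free-text answers below are made up.

```matlab
other = {'Full-time student'; ''; 'Studying for a PhD'; 'Retired'};  % toy answers
keywords = {'student','studying','school','university','degree','phd'};
isstudent = contains(other, keywords, 'IgnoreCase', true);  % true if any keyword appears
disp(isstudent')
```

Passing all keywords at once replaces the six separate assignments, and IgnoreCase replaces the call to lower.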

Many Already Work In Software Development and IT in US

It turns out that many respondents already work in software development and IT, and they come from very diverse academic backgrounds, including both STEM and non-STEM subjects. Since the proportion of women tends to be higher in non-STEM majors, this may explain why we see higher than expected female representation in this survey. It appears that female respondents who studied non-STEM majors as undergraduates are now pursuing careers in software development.

Curiously, we also see many computer science majors, and they tend to be men. Why are people who already have a computer science background teaching themselves programming? Shouldn't they have learned it in school?

part2.WhichFieldDoYouWorkIn = ...                           % convert to categorical
    categorical(part2.WhichFieldDoYouWorkIn);
job = part2.WhichFieldDoYouWorkIn;                          % get job
job = mergecats(job, {'software development and IT', ...    % merge similar categories
    'software development'});
us_job = job(country == 'United States of America');        % extract us data
catcount = countcats(us_job);                               % get category count
cats = categories(us_job);                                  % get categories
[~, rank] = sort(catcount,'descend');                       % rank category by count
below_top10 = setdiff(cats,cats(rank(1:10)));               % categories below top 10
us_job = mergecats(us_job, below_top10, 'Other');           % merge them into other
us_job = reordercats(us_job,[cats(rank(1:10)); {'Other'}]); % reorder cats by ranking
figure                                                      % new figure
subplot(1,2,1)                                              % create a subplot
histogram(us_job(gender(country == ...                      % plot histogram
    'United States of America') == 'male'))                 % subset us job by gender
hold on                                                     % don't overwrite
histogram(us_job(gender(country == ...                      % plot histogram
    'United States of America') == 'female'))               % subset us job by gender
hold off                                                    % restore default
title('US Job Field by Gender')                             % add title
ylabel('Count')                                             % add y axis label
legend('Male','Female')                                     % add legend
part2.WhatWasTheMainSubjectYouStudiedInUniversity = ...     % convert to categorical
    categorical(part2.WhatWasTheMainSubjectYouStudiedInUniversity);
major = part2.WhatWasTheMainSubjectYouStudiedInUniversity;  % get academic major
us_maj = major(country == 'United States of America');      % extract us data
catcount = countcats(us_maj);                               % get category count
cats = categories(us_maj);                                  % get categories
[~, rank] = sort(catcount,'descend');                       % rank category by count
below_top10 = setdiff(cats,cats(rank(1:10)));               % categories below top 10
us_maj = mergecats(us_maj, below_top10, 'Other');           % merge them into other
us_maj = reordercats(us_maj,[cats(rank(1:10)); {'Other'}]); % reorder cats by ranking
subplot(1,2,2)                                              % create a subplot
histogram(us_maj(gender(country == ...                      % plot histogram
    'United States of America') == 'male'))                 % subset us subject by gender
hold on                                                     % don't overwrite
histogram(us_maj(gender(country == ...                      % plot histogram
    'United States of America') == 'female'))               % subset us subject by gender
hold off                                                    % restore default
title('US Academic Major by Gender')                        % add title
ylabel('Count')                                             % add y axis label
legend('Male','Female', 'Location', 'northwest')            % add legend

Here is a quick sanity check. Yes, those computer science majors do seem to work in software development and IT, as indicated by the line width of the edge between them.

tbl = table(cellstr(us_maj),cellstr(us_job));               % create table of us major and us job
tbl(isundefined(us_maj) | isundefined(us_job) | ...         % remove undefined
    us_maj == 'Other',:)= [];                               % remove 'Other' from us major
[tbl, ~, idx] = unique(tbl,'rows');                         % eliminate duplicate rows
w = accumarray(idx, 1);                                     % use count of duplicates as weight
G = digraph(tbl.(1), tbl.(2), w);                           % create a directed graph
figure                                                      % new figure
w = G.Edges.Weight;                                         % get weights
h = plot(G,'Layout','layered','LineWidth',5*w/max(w));      % plot the directed graph
xlim([-2.5 16])                                             % x-axis limits
highlight(h, unique(tbl.(2)),'NodeColor',[.85 .33 .1])      % highlight job nodes
title({'US Majors vs. Job Fields'; ...                      % add title
    'Line Width Varies by Frequency'})
text(-2, 2, 'Majors', 'FontWeight','Bold')                  % annotate
text(-2, 1, 'Job Fields', 'FontWeight','Bold')              % annotate
annotation('arrow',[.2 .2],[.75 .25], 'Color',[0 .45 .75])  % annotate

Academic Background in Software Development and IT

Taking a deep dive into Software Development and IT, we see that not everyone is a computer science major. Diverse academic backgrounds are represented in the industry, and people are probably on different career tracks depending on their background. Women represent a higher proportion in English, Psychology and Other.

us_maj_it = major(country == 'United States of America' ... % subset major by country
    & job == 'software development and IT');                % and job
catcount = countcats(us_maj_it);                            % get category count
cats = categories(us_maj_it);                               % get categories
[~, rank] = sort(catcount,'descend');                       % rank category by count
below_top10 = setdiff(cats,cats(rank(1:10)));               % categories below top 10
us_maj_it = mergecats(us_maj_it, below_top10, 'Other');     % merge them into other
us_maj_it = reordercats(us_maj_it,[cats(rank(1:10)); {'Other'}]);% reorder cats by ranking
figure                                                      % new figure
histogram(us_maj_it(gender(country == ...                   % plot histogram
    'United States of America' & ...                        % subset us subject by job
    job == 'software development and IT') == 'male'))       % and gender
hold on                                                     % don't overwrite
histogram(us_maj_it(gender(country == ...                   % plot histogram
    'United States of America' & ...                        % subset us subject by job
    job == 'software development and IT') == 'female'))     % and gender
hold off                                                    % restore default
ax = gca;                                                   % get current axes handle
ax.XTickLabelRotation = 90;                                 % rotate x tick label
title('US Software Development and IT - Majors by Gender')  % add title
ylabel('Count')                                             % add y axis label
legend('Male','Female', 'Location', 'northwest')            % add legend

Wide Income Gap in Software Development and IT

Let's compare the income range by job field using a box plot. The bottom and top of each box represent the first and third quartiles, the red line in the middle represents the median, and the whiskers extend to the most extreme data points within 1.5 times the interquartile range of the box, which corresponds to roughly +/- 2.7 standard deviations for normally distributed data. Red "+"s mark the outliers.

Compared to other job fields, the income range in software development and IT has a wide spread (as indicated by the elongated box shape and longer whisker), meaning there is good upside potential to do better.

income = part2.AboutHowMuchMoneyDidYouMakeLastYear_inUSDollars;% get income
income = str2double(income);                                % convert to numeric
income(income == 0) = NaN;                                  % don't count zero
us_income = income(country == 'United States of America');  % extract us data
figure                                                      % new figure
boxplot(us_income,us_job)                                   % create a box plot
ylim([0 2*10^5])                                            % set upper limit
title('US Income Distribution by Job Field')                % add title
ax = gca;                                                   % get current axes handle
ax.XTickLabelRotation = 90;                                 % rotate x tick label
ax.YTick = 0:50000:200000;                                  % set y tick to match labels
ax.YTickLabel = {'$0','$50k','$100k','$150k','$200k'};      % set y tick label
ylabel('Annual Income')                                     % add y axis label
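The figure of 2.7 standard deviations quoted for the whiskers matches boxplot's default whisker length of 1.5 times the interquartile range, under the assumption that the data is normally distributed. Here is a quick check of that arithmetic in base MATLAB for a standard normal distribution. Income data is unlikely to be normal, so treat this as a rule of thumb only.

```matlab
% Assumes a normal distribution with standard deviation 1
q3 = -sqrt(2)*erfcinv(2*0.75);      % third quartile of the standard normal, about 0.6745
iqrWidth = 2*q3;                    % interquartile range, about 1.349
upper = q3 + 1.5*iqrWidth;          % upper whisker limit, about 2.698
coverage = erf(upper/sqrt(2));      % share of data between both whisker limits
fprintf('whisker limit = %.3f sigma, coverage = %.1f%%\n', upper, 100*coverage)
```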

What Affects Income in Software Development and IT?

The first factor that may affect income in software development and IT is academic background. The box plot shows that Computer Science and Electrical Engineering give you the most advantage in getting a higher salary. This is probably the motivation behind the self-learning, how-to-code trend: people want to switch to a more lucrative career path or advance more quickly within the same industry.

us_income_it = income(country == 'United States of America' ...% subset income by country
    & job == 'software development and IT');                % and job
figure                                                      % new figure
boxplot(us_income_it,us_maj_it)                             % create a box plot
ylim([0 2*10^5])                                            % set upper limit
title({'US Income Distribution by Major'; ...               % add title
    'in Software Development and IT'})
ax = gca;                                                   % get current axes handle
ax.XTickLabelRotation = 90;                                 % rotate x tick label
ax.YTick = 0:50000:200000;                                  % set y tick to match labels
ax.YTickLabel = {'$0','$50k','$100k','$150k','$200k'};      % set y tick label
ylabel('Annual Income')                                     % add y axis label

Age Factor

Another important factor to consider is age. If you plot age against income in Software Development and IT, you see a wide income gap among younger Computer Science majors. Some 25-year-olds can be earning $0 vs. $110,000 a year. Income also seems to plateau as you age. You can use fit with the exp2 option to fit a two-term exponential curve to the data, which makes the trend easier to see. Perhaps this provides motivation for CS majors to improve their skills and experience as quickly as possible?

us_age_it = age(country == 'United States of America' ...   % subset age by country
    & job == 'software development and IT');                % and job
X = us_age_it(us_maj_it == 'Computer Science');             % subset just CS
Y = us_income_it(us_maj_it == 'Computer Science');          % subset just CS
figure                                                      % new figure
plot(X, Y,'o')                                              % plot data
hold on                                                     % don't overwrite
missingrows = isnan(X) | isnan(Y);                          % find NaNs
X(missingrows) = [];                                        % remove NaNs
Y(missingrows) = [];                                        % remove NaNs
fitresult = fit(X,Y,'exp2');                                % fit to exp2
plot(fitresult)                                             % plot curve
hold off                                                    % restore default
title({'Income by Age Among CS Majors'; ...                 % add title
    'in Software Development and IT'})
xlim([10 60])                                               % set x axis limits
ylim([0 2*10^5])                                            % set y axis limits
ax = gca;                                                   % get current axes handle
ax.YTick = 0:50000:200000;                                  % set y tick
ax.YTickLabel = {'$0','$50k','$100k','$150k','$200k'};      % set y tick label
xlabel('Age')                                               % add x axis label
ylabel('Annual Income')                                     % add y axis label
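For reference, the exp2 library model has the form a*exp(b*x) + c*exp(d*x). Below is a minimal sketch on synthetic data that levels off with age. The numbers are made up, and fit requires Curve Fitting Toolbox.

```matlab
% Requires Curve Fitting Toolbox; the data is synthetic, not from the survey
rng(0)                                         % reproducible noise
x = (20:60)';                                  % toy ages as a column vector
y = 9e4*(1 - exp(-0.1*(x-18))) ...             % toy incomes that level off
    + 5e3*randn(size(x));                      % plus some noise
f = fit(x, y, 'exp2');                         % model: a*exp(b*x) + c*exp(d*x)
disp(coeffnames(f)')                           % names of the four coefficients
plot(f, x, y)                                  % data points with the fitted curve
```

fit expects column vectors without NaNs, which is why the missing rows are removed before fitting in the code above.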

Big Companies Not Preferred

If you look at the kind of employment these respondents are looking for, big companies are the least preferred. Since people are not earning a formal degree, they are probably expecting more flexibility in other employment options.

The most popular option is mid-sized companies, but many people are interested in working for a startup, starting their own businesses, or freelancing. People in software development and IT tend to prefer working for startups or mid-sized companies more. Men tend to prefer starting their own businesses or working for a startup, while women tend to prefer the freelance path.

part2.want_employment_type = ...                            % convert to categorical
    categorical(part2.want_employment_type);
interested_emp = part2.want_employment_type;                % get interested employment type
figure                                                      % new figure
subplot(1,2,1)                                              % create a subplot
us_int_emp_it = interested_emp(country == ...               % subset it by country and job
    'United States of America' & job == 'software development and IT');
us_int_emp_it(isundefined(us_int_emp_it)) = [];             % remove undefined
us_int_emp_it = removecats(us_int_emp_it);                  % remove unused categories
histogram(us_int_emp_it,'Normalization','probability')      % plot histogram
hold on                                                     % don't overwrite
us_int_emp_non_it = interested_emp(country == ...           % subset it by country and job
    'United States of America' & job ~= 'software development and IT');
us_int_emp_non_it(isundefined(us_int_emp_non_it)) = [];     % remove undefined
us_int_emp_non_it = removecats(us_int_emp_non_it);          % remove unused categories
histogram(us_int_emp_non_it,'Normalization','probability')  % plot histogram
hold off                                                    % restore default
title({'US Desired Employment Type';'by Job Field'})        % add title
ax = gca;                                                   % get current axes handle
ax.XTickLabelRotation = 90;                                 % rotate x tick label
ax.YTick = 0:0.1:0.6;                                       % set y tick
ax.YTickLabel = {'0%','10%','20%','30%','40%','50%','60%'}; % set y tick label
legend('Software Dev and IT', 'Others')                     % add legend
ylim([0 0.6])                                               % set y axis limits
subplot(1,2,2)                                              % create a subplot
us_int_emp_m = interested_emp(country == ...                % subset it by country
    'United States of America' & gender == 'male');         % gender
us_int_emp_m(isundefined(us_int_emp_m)) = [];               % remove undefined
us_int_emp_m = removecats(us_int_emp_m);                    % remove unused categories
histogram(us_int_emp_m,'Normalization','probability')       % plot histogram
hold on                                                     % don't overwrite
us_int_emp_f = interested_emp(country == ...                % subset it by country
    'United States of America' & gender == 'female');       % gender
us_int_emp_f(isundefined(us_int_emp_f)) = [];               % remove undefined
us_int_emp_f = removecats(us_int_emp_f);                    % remove unused categories
histogram(us_int_emp_f,'Normalization','probability')       % plot histogram
hold off                                                    % restore default
title({'US Desired Employment Type';'by Gender'})           % add title
ax = gca;                                                   % get current axes handle
ax.XTickLabelRotation = 90;                                 % rotate x tick label
ax.YTick = 0:0.1:0.6;                                       % set y tick
ax.YTickLabel = {'0%','10%','20%','30%','40%','50%','60%'}; % set y tick label
legend('Male', 'Female')                                    % add legend
ylim([0 0.6])                                               % set y axis limits
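
The listing above repeats the same three steps for every group: subset, remove undefined values, and drop unused categories. As a minimal sketch on made-up data (the variable `emp`, the mask `is_male`, and the helpers `clean` and `share` are hypothetical and not part of the survey code), the same steps can be wrapped in anonymous functions. The resulting proportions should correspond to what `histogram` shows with 'probability' normalization once undefined values are removed.

```matlab
% Toy data standing in for part2.want_employment_type (values are made up)
emp = categorical({'startup';'freelance';'';'startup'; ...
    'mid-sized company';'freelance';'startup';''});         % '' becomes <undefined>
is_male = logical([1;0;1;1;0;0;1;0]);                       % hypothetical gender mask

clean = @(c) removecats(c(~isundefined(c)));                % drop undefined, then unused categories
share = @(c) countcats(c) ./ numel(c);                      % proportion of each category

emp_m = clean(emp(is_male));                                % male subset, cleaned
emp_f = clean(emp(~is_male));                               % female subset, cleaned
share_m = share(emp_m);                                     % proportions for male subset
share_f = share(emp_f);                                     % proportions for female subset
```

Because each subset is cleaned separately, the two groups may end up with different category lists, which is also the case in the original listing.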

Dream Jobs

When it comes to the actual jobs people are interested in, most are web development positions. People already in software development and IT tend to prefer roles that demand more technical skill (Back-End Web Development, DevOps, or SysAdmin) over Front-End Web Development or non-development roles such as Product Manager or QA Engineer. In terms of gender, women tend to prefer Front-End Web Development and User Experience Design.

int_job = categorical(strtrim(part2.jobs_interested_in));   % get interested job
catcount = countcats(int_job);                              % get category count
cats = categories(int_job);                                 % get categories
[~, rank] = sort(catcount,'descend');                       % rank category by count
below_top10 = setdiff(cats,cats(rank(1:10)));               % categories below top 10
int_job = mergecats(int_job, below_top10,'Other');          % merge them into other
int_job = reordercats(int_job,[cats(rank(1:10));{'Other'}]);% reorder cats by ranking
figure                                                      % new figure
subplot(1,2,1)                                              % create a subplot
us_int_job_non_it = int_job(country == ...                  % subset int job by country and job
     'United States of America' & job ~= 'software development and IT');
histogram(us_int_job_non_it, 'Normalization','probability') % plot histogram
hold on                                                     % don't overwrite
us_int_job_it = int_job(country == ...                      % subset int job by country and job
     'United States of America' & job == 'software development and IT');
histogram(us_int_job_it, 'Normalization','probability')     % plot histogram
hold off                                                    % restore default
title({'US Jobs Interested In';'By Job Field'})             % add title
legend('All Others','Software Dev & IT')                    % add legend
ax = gca;                                                   % get current axes handle
ax.YTick = 0:0.1:0.5;                                       % set y tick
ax.YTickLabel = {'0%','10%','20%','30%','40%','50%'};       % set y tick label
ylim([0 0.5])                                               % set y axis limits
subplot(1,2,2)                                              % create a subplot
us_int_job_m = int_job(country == ...                       % subset int job by country
     'United States of America' & gender == 'male');        % and gender
histogram(us_int_job_m, 'Normalization','probability')      % plot histogram
hold on                                                     % don't overwrite
us_int_job_f = int_job(country == ...                       % subset int job by country
     'United States of America' & gender == 'female');      % and gender
histogram(us_int_job_f, 'Normalization','probability')      % plot histogram
hold off                                                    % restore default
title({'US Jobs Interested In';'By Gender'})                % add title
legend('Male','Female')                                     % add legend
ax = gca;                                                   % get current axes handle
ax.YTick = 0:0.1:0.5;                                       % set y tick
ax.YTickLabel = {'0%','10%','20%','30%','40%','50%'};       % set y tick label
ylim([0 0.5])                                               % set y axis limits
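
The "keep the top categories and merge the rest into Other" step at the start of the listing above can be tried in isolation. The following is a minimal sketch on made-up job titles (the data and the cutoff `n_keep` are assumptions for illustration; the listing above keeps the top 10).

```matlab
% Made-up answers standing in for part2.jobs_interested_in
jobs = categorical({'Front-End';'Back-End';'Front-End';'QA';'DevOps'; ...
    'Front-End';'Back-End';'Data Science'});
n_keep = 2;                                                 % hypothetical cutoff

counts = countcats(jobs);                                   % count per category
cats = categories(jobs);                                    % category names
[~, rank] = sort(counts,'descend');                         % rank categories by count
top = cats(rank(1:n_keep));                                 % names of the top categories
rest = setdiff(cats, top);                                  % everything below the cutoff
jobs = mergecats(jobs, rest, 'Other');                      % merge the rest into 'Other'
jobs = reordercats(jobs, [top; {'Other'}]);                 % order by rank, 'Other' last
```

With this toy data, 'Front-End' and 'Back-End' are kept and the remaining three answers fall into 'Other'. One thing to watch for in real data is an existing category that is already named 'Other'.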

Student Loan Debt

The survey also asks how much student loan debt respondents carry and how much they have spent learning to code. Over 41% of the US respondents have student loan debt, and the median amount owed is $25,000. People also spend money while learning to code, with a median total spend of $300. Given that so many carry debt, they probably cannot afford to spend more and add to it, which may also be reflected in their more conservative choices of future employment.

debt = part2.AboutHowMuchDoYouOweInStudentLoans_inUSDollars;% get student loan debt
debt = str2double(debt);                                    % convert to numeric
debt(debt == 0) = NaN;                                      % don't count zero
us_debt = debt(country == 'United States of America');      % extract us data
pct_in_debt = sum(~isnan(us_debt))/length(us_debt)*100;     % percentage in debt
median_debt = median(us_debt,'omitnan')/1000;               % median debt in $k, ignoring NaN
figure                                                      % new figure
subplot(1,2,1)                                              % create a subplot
histogram(us_debt)                                          % plot histogram
xlim([0 2*10^5])                                            % set x axis limits
ax = gca;                                                   % get current axes handle
ax.XTick = 0:50000:200000;                                  % set x tick
ax.XTickLabel = {'$0','$50k','$100k','$150k','$200k'};      % set x tick label
xlabel('Amount Owed')                                       % add x axis label
ylabel('Count')                                             % add y axis label
title({'US Student Loan Debt'; ...                          % add title
    sprintf('%.2f%% in Debt (Median $%dk)',pct_in_debt,median_debt)})
subplot(1,2,2)                                              % create a subplot
spend = part2.total_spent_learning;                         % get total spend
spend = str2double(spend);                                  % convert to numeric
spend(spend == 0) = NaN;                                    % don't count zero
us_spend = spend(country == 'United States of America');    % extract us data
histogram(us_spend)                                         % plot histogram
xlim([0 3*10^4])                                            % set x axis limits
ax = gca;                                                   % get current axes handle
ax.XTick = 0:10000:30000;                                   % set x tick
ax.XTickLabel = {'$0','$10k','$20k','$30k'};                % set x tick label
xlabel('Total Spend')                                       % add x axis label
ylabel('Count')                                             % add y axis label
title({'US Spend Learning'; sprintf('Median $%d', ...       % add title
    median(us_spend,'omitnan'))})
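
To make the arithmetic behind the two headline numbers concrete, here is a minimal sketch on made-up answers (the values in `raw` are invented for illustration). It uses `median` with the 'omitnan' flag rather than `nanmedian`, which avoids the Statistics and Machine Learning Toolbox dependency.

```matlab
raw = {'25000';'0';'';'40000';'NA';'10000'};                % made-up survey answers
debt = str2double(raw);                                     % unparsable text becomes NaN
debt(debt == 0) = NaN;                                      % treat zero as "no debt"
pct_in_debt = mean(~isnan(debt))*100;                       % share with positive debt
median_debt = median(debt,'omitnan');                       % median over debtors only
```

Here three of the six answers are positive amounts, so `pct_in_debt` is 50 and `median_debt` is 25000. Note that the denominator includes blank and unparsable answers, so respondents who skipped the question are counted as having no debt; the percentage in the listing above is computed the same way.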

Women Prefer More Welcoming Venues

This survey seems to show more female participation in the "learn to code" movement than in more traditional computer science education. Looking at the types of events women attend, they show a strong preference for gender-specific events like "Girl Develop It" and "Women Who Code". For online resources, you don't see much difference by gender. It appears that the physical presence of men may make women feel unwelcome.

events_attended = part2.attended_event_types;               % get events attended
events_attended = cellfun(@(x) strsplit(x,','), ...         % split by comma
    events_attended,'UniformOutput',false);
events_attended_flatten = strtrim([events_attended{:}]);    % un-nest and trim
[~,ia,ib] = unique(lower(events_attended_flatten));         % get indices of uniques
events = events_attended_flatten(ia);                       % get unique values
count = accumarray(ib,1);                                   % count unique values
events(count < 100) = [];                                   % drop unpopular events
events(strcmpi(events,'none')) = [];                        % drop 'none'
events(cellfun(@isempty,events)) = [];                      % drop empty cell
attended = zeros(size(events_attended,1),length(events));   % set up accumulator
for i = 1:size(events_attended,1)                           % loop over events attended
    attended(i,:) = ismember(events,strtrim( ...            % find intersection between
        events_attended{i}));                               % events and attended events
end
attended_m = sum(attended(country == ...                    % subset attended by country
    'United States of America' & gender == 'male',:));      % and gender
attended_f = sum(attended(country == ...                    % subset attended by country
    'United States of America' & gender == 'female',:));    % and gender
gender_ratio = attended_m ./ sum(attended_m);               % get male ratio by event
gender_ratio = [gender_ratio; attended_f./sum(attended_f)]; % add female ratio
figure                                                      % new figure
subplot(1,2,1)                                              % create a subplot
b = bar(gender_ratio','FaceColor',[0 .45 .75], ...          % create a bar chart
    'FaceAlpha',.6);                                        % with histogram colors
b(2).FaceColor = [.85 .33 .1];                              % with histogram colors
ax = gca;                                                   % get current axes handle
ax.XTickLabel = events;                                     % set x tick label
ax.XTickLabelRotation = 90;                                 % rotate x tick label
ax.YTick = 0:0.1:0.4;                                       % set y tick
ax.YTickLabel = {'0%','10%','20%','30%','40%'};             % set y tick label
title('US Popular Events Attended')                         % add title
legend('Male','Female')                                     % add legend
subplot(1,2,2)                                              % create a subplot
resources_used = part2.learning_resources;                  % get resources used
resources_used = cellfun(@(x) strsplit(x,','), ...          % split by comma
    resources_used,'UniformOutput',false);
resources_used_flatten = strtrim([resources_used{:}]);      % un-nest and trim
[~,ia,ib] = unique(lower(resources_used_flatten));          % get indices of uniques
resources = resources_used_flatten(ia);                     % get unique values
count = accumarray(ib,1);                                   % count unique values
resources(count < 100) = [];                                % drop unpopular resources
resources(cellfun(@isempty,resources)) = [];                % drop empty cell
usage = zeros(size(resources_used,1),length(resources));    % set up accumulator
for i = 1:size(resources_used,1)                            % loop over resources used
    usage(i,:) = ismember(resources,strtrim( ...            % find intersection between
        resources_used{i}));                                % resources and resource used
end
usage_m = sum(usage(country == ...                          % subset usage by country
    'United States of America' & gender == 'male',:));      % and gender
usage_f = sum(usage(country == ...                          % subset usage by country
    'United States of America' & gender == 'female',:));    % and gender
gender_ratio = usage_m ./ sum(usage_m);                     % get male ratio by resource
gender_ratio = [gender_ratio; usage_f ./ sum(usage_f)];     % add female ratio
b = bar(gender_ratio','FaceColor',[0 .45 .75], ...          % create a bar chart
    'FaceAlpha',.6);                                        % with histogram colors
b(2).FaceColor = [.85 .33 .1];                              % with histogram colors
ax = gca;                                                   % get current axes handle
ax.XTickLabel = resources;                                  % set x tick label
ax.XTickLabelRotation = 90;                                 % rotate x tick label
ax.YTick = 0:0.1:0.3;                                       % set y tick
ax.YTickLabel = {'0%','10%','20%','30%'};                   % set y tick label
title('US Popular Resources Used')                          % add title
legend('Male','Female')                                     % add legend
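
The multi-select columns are the trickiest part of the listing above, so here is a minimal sketch of the same idea on made-up answers: split each comma-separated answer, build a respondent-by-option indicator matrix, and normalize the column sums by group. The data, the mask `is_female`, and the variable names are assumptions for illustration. Unlike the listing above, this sketch lower-cases both sides before matching, so differently capitalized variants of the same option are counted together.

```matlab
% Made-up answers standing in for part2.attended_event_types
answers = {'Meetup, Hackathons';'meetup';''; ...
    'Hackathons,Women Who Code';'Women Who Code, Meetup'};
is_female = logical([0;0;1;1;1]);                           % hypothetical gender mask

split = cellfun(@(x) strtrim(strsplit(x,',')), ...          % split by comma and trim
    answers,'UniformOutput',false);
flat = lower([split{:}]);                                   % un-nest and lower-case
options = unique(flat(~cellfun(@isempty,flat)));            % unique non-empty options
picked = false(numel(answers),numel(options));              % respondent x option matrix
for i = 1:numel(answers)                                    % loop over respondents
    picked(i,:) = ismember(options,lower(split{i}));        % mark the options chosen
end
count_f = sum(picked(is_female,:),1);                       % selections by women
count_m = sum(picked(~is_female,:),1);                      % selections by men
share_f = count_f ./ sum(count_f);                          % share of female selections
share_m = count_m ./ sum(count_m);                          % share of male selections
```

As in the listing above, each share is a fraction of all selections made by that group, not a fraction of respondents, so a person who ticks many options contributes more than a person who ticks one.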

To give an example, I encouraged my daughter to join a robotics competition team at her high school. She talked to her friends because she didn't want to be the only girl on the team, and a bunch of girls joined. When she came home from the first team session, I asked her what she had worked on. She said, "We worked on the team web page." It turned out that the boys worked on building the robots and the girls were left out, so they worked on building the team web page instead. When the kits were delivered to the team, the boys just huddled together among themselves and didn't bother to include the girls. The girls were not consciously excluded, but they felt unwelcome anyway. I suspect similar dynamics may be at play in which coding events women choose to attend.

I also wonder whether the female preference for Front-End Web Development and User Experience Design is driven by the same issue.

Summary

Perhaps the most intriguing result of this analysis is that the "learn to code" movement appears to be effective in closing the gender gap in software development and IT, and is embraced by minority communities under-served by traditional educational paths. It also underscores the precarious position those learners face due to the high student loan debt they carry. Ultimately, this survey cannot tell us how many of them actually land their dream job, and hopefully there will be a follow-up to find out whether the "learn to code" movement really delivers on its promise.

Do you use any of those "learn to code" websites or other MOOCs? What are you learning and what motivates you to take those classes? Please share your experience here!


Published with MATLAB® R2016a

6 Comments

Hyder replied on : 1 of 6
I get an error in the line for reading the table. K>> readtable('2016 New Coders Survey Part 1.csv') Error using readtable (line 129) Each line of a text file must have the same number of delimiters. However it works when I use import tool to read the .csv file as a table
Toshi replied on : 2 of 6
Thanks for your comment, Hyder. Which version are you using? readtable in R2016a "automatically detects delimiters, header lines, and variable names in text files" according to the release note, and that's the version I am using. You may need to pass additional format parameter in an earlier version or use import tool as you say.
Kris replied on : 3 of 6
Hi Loren, for the gender question did you mean to calculate the ratio using female / male? Seems that it should be female / total.
Toshi replied on : 4 of 6
Hi Kris, yes, I meant to use the female/male ratio rather than the female/total ratio, because the question I was interested in was not "what percentage of total responses came from women" but "what was the proportion of female responses compared to male responses" as a way to measure how big that proportion was. I could have done it the way you suggest, and that may have been easier to follow, but that was my thinking.
Alexis Chan replied on : 5 of 6
Thanks for the analysis! Some constructive feedback: 1. The histograms with long text labels would be more readable if they are horizontal histograms. 2. Not sure how easy it would be to use chloropleth maps for the migration pattern graph. A quick search gives me this http://www.mathworks.com/help/map/create-chloropleth-map-of-population-density.html Thanks again for the analysis - it helps to have examples
Toshi replied on : 6 of 6
Thanks Alexis for your feedback. Your suggestions make sense! If you come up with a chloropleth map, can you share it?