2020 State of Haskell survey: an attempt at cluster analysis

The State of Haskell is a survey whose goal is to sample the Haskell community’s demographics, tooling preferences, opinions on the language and more. It was devised by Taylor Fausak, who has run it annually since 2017. Every November, the results are published online (2017, 2018, 2019, 2020) in the form of descriptive statistics and plots.

This year I decided to attempt cluster analysis on the data.

(If you are only interested in the results, skip to the section «From clusters to opinions».)

Goals and method

The analysis will proceed in two steps:

  1. check whether it is possible to segment the population of Haskell users into cohesive and mutually exclusive clusters (students, professionals, academics, …);
  2. once we have a result from 1., examine how the distinct segments differ in their opinions on the community/language (does Haskell’s performance meet the needs of academics? Do Windows users think Haskell libraries are well documented?).

As the input for the algorithm, I selected 13 questions/variables from the survey which I feel capture various elements of what «being a Haskeller» means:

☞ NB: due to a bug in the 2020 survey, some entries (763) were not correctly recorded. I excluded those and worked with the rest (N=603).

Finding the clusters

Since most of the variables are categorical or ordinal, we first compute the dissimilarity matrix (taking the survey responses two at a time, how different are they?). This lets us plot a hierarchical dendrogram, where we start at the top with the set of all surveys and progressively reach a finer granularity towards the bottom.
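To make the idea of a dissimilarity matrix concrete, here is a minimal Python sketch in the spirit of Gower's coefficient. The post does not say which measure was actually used, so treat the choice as an assumption; the field names, the `PROFICIENCY` scale and the three sample rows are all hypothetical.

```python
from itertools import combinations

# Hypothetical ordinal scale, lowest to highest (not the survey's exact wording).
PROFICIENCY = ["Beginner", "Intermediate", "Advanced", "Expert"]

def categorical_distance(a, b):
    """Simple matching: 0 if the answers are equal, 1 otherwise."""
    return 0.0 if a == b else 1.0

def ordinal_distance(a, b, levels):
    """Rank difference scaled to [0, 1] by the width of the scale."""
    return abs(levels.index(a) - levels.index(b)) / (len(levels) - 1)

def dissimilarity(x, y):
    """Average per-variable distance; variables missing (None) in either row are skipped."""
    parts = []
    for key in x:
        if x[key] is None or y[key] is None:
            continue
        if key == "proficiency":
            parts.append(ordinal_distance(x[key], y[key], PROFICIENCY))
        else:
            parts.append(categorical_distance(x[key], y[key]))
    return sum(parts) / len(parts) if parts else 0.0

def dissimilarity_matrix(rows):
    """Symmetric matrix of pairwise dissimilarities with a zero diagonal."""
    n = len(rows)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = dissimilarity(rows[i], rows[j])
    return matrix

# Made-up respondents, for illustration only.
rows = [
    {"student": "No", "use_at_work": "Yes, most of the time", "proficiency": "Advanced"},
    {"student": "No", "use_at_work": "No, but I'd like to", "proficiency": "Intermediate"},
    {"student": "Yes, full time", "use_at_work": None, "proficiency": "Beginner"},
]
D = dissimilarity_matrix(rows)
```

Every entry lies between 0 (identical answers) and 1 (nothing in common), which is what a hierarchical clustering step needs as input.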

The greater the vertical distance from the parent node, the bigger the dissimilarity. The tree is fairly well spaced, so splitting it into five branches seems a natural decision.

The green and teal groups hang a long distance from their parents, which means they are most likely unique segments, quite different from their neighbours. We will temporarily label the clusters with Roman numerals: I, II, III, IV and V.

groups
  I  II III  IV   V
224  40  76 200  63
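Cutting a dendrogram into a fixed number of branches is equivalent to running agglomerative clustering and stopping when that many clusters remain. The sketch below does this in plain Python with average linkage; the linkage method is an assumption (the post does not name one), and the toy matrix `D` contains made-up numbers.

```python
def agglomerate(matrix, k):
    """Merge the two closest clusters (average linkage) until only k remain."""
    clusters = [[i] for i in range(len(matrix))]

    def linkage(a, b):
        # Mean pairwise dissimilarity between members of a and members of b.
        return sum(matrix[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = linkage(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters

# Toy dissimilarity matrix for five respondents (made-up numbers).
D = [
    [0.0, 0.1, 0.8, 0.9, 0.7],
    [0.1, 0.0, 0.7, 0.8, 0.9],
    [0.8, 0.7, 0.0, 0.2, 0.6],
    [0.9, 0.8, 0.2, 0.0, 0.5],
    [0.7, 0.9, 0.6, 0.5, 0.0],
]
groups = agglomerate(D, 3)
sizes = [len(g) for g in groups]
```

On the real data the same cut with `k = 5` would yield the five group sizes printed above. This naive loop is fine for a few hundred rows, but a library implementation would be preferable for anything larger.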

Naming the clusters

Now that we have cut the tree into clusters, we need to inspect them to see what is inside and which variables actually helped tell them apart. As an example, Haskellers who use Haskell at work «most of the time» are almost all captured by one cluster (IV):

but it is difficult to see any obvious differences in region division:

As an exploratory tool, let us print the statistical mode for each question/cluster pair:

* How old are you?
I       25 to 34 years old
II      25 to 34 years old
III     18 to 24 years old    # ← slightly younger
IV      25 to 34 years old
V       25 to 34 years old

* What is your gender?
I       Male  
II      Male  
III     Male  
IV      Male  
V       Male  

* region
I       Europe & Central Asia
II      Europe & Central Asia
III     Europe & Central Asia
IV      Europe & Central Asia
V       Europe & Central Asia


# As expected, the demographics of Haskell users are not that varied.
# Cluster III sports lower age. Are they students by any chance?


* Are you a student?
I       No            
II      No            
STU     Yes, full time    # Yes they are, we will relabel the cluster
IV      No                # `STU` for easier recognition.
V       No            

* Do you use Haskell?
I       Yes              
NON     No, but I used to    # Former users are the majority in cluster II,
STU     Yes                  # we will relabel it to `NON`.
IV      Yes              
V       Yes              


* Do you use Haskell at work?
I       No, but I'd like to  
NON     No, but I'd like to  
STU     No, but I'd like to  
WRK     Yes, most of the time    # Another defining feature, another relabel
V       No, but I'd like to      # (Cluster IV ⟶ WRK)


* How long have you been using Haskell?
I       1 year to 2 years        # Cluster I has somewhat experienced users.
NON     1 month to 1 year   
STU     1 month to 1 year   
WRK     10 years to 15 years
V       3 years to 4 years  

* What is the total size of all the Haskell projects you contribute to?
I       <NA>                                   
NON     <NA>                                   
STU     <NA>                                   
WRK     Between 10,000 and 99,999 lines of code
V       <NA>                                   

# Unsurprisingly, professional Haskellers have been using the language
# for longer and on bigger codebases.


* How would you rate your proficiency in Haskell?
I       Intermediate
NON     Intermediate
STU     Intermediate
WRK     Advanced    
V       Intermediate


* devOnWin
I       FALSE 
NON     FALSE 
STU     FALSE 
WRK     FALSE 
V        TRUE 

* targetWin
I       FALSE 
NON     FALSE 
STU     FALSE 
WRK     FALSE        # Cluster V users target and develop on Windows.
WIN      TRUE        # We will relabel the cluster to `WIN`.

* cabalOrStack
I       Stack only
NON     Stack only
STU     Stack only
WRK     Cabal only
WIN     Stack only

* academia
I       FALSE 
NON     FALSE 
STU     FALSE 
WRK     FALSE 
WIN     FALSE 

* Have you contributed to any open source projects?
RES     TRUE  
NON     TRUE  
STU     TRUE          # Not finding any distinctive trait for Cluster I,
WRK     TRUE          # we will relabel it `RES` (for «residual»).
WIN     TRUE  
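The per-cluster modes listed above can be computed along these lines. This is a minimal Python sketch with hypothetical field names and made-up rows. Missing answers (`None`) are counted like any other value, mirroring the `<NA>` modes in the listing.

```python
from collections import Counter, defaultdict

def mode_per_cluster(rows, question):
    """Return {cluster: most common answer}. None (NA) is counted as an answer."""
    answers = defaultdict(Counter)
    for row in rows:
        answers[row["cluster"]][row[question]] += 1
    # On ties, most_common keeps the answer that was seen first.
    return {c: counts.most_common(1)[0][0] for c, counts in answers.items()}

# Made-up respondents, for illustration only.
rows = [
    {"cluster": "III", "student": "Yes, full time"},
    {"cluster": "III", "student": "Yes, full time"},
    {"cluster": "III", "student": "No"},
    {"cluster": "IV", "student": "No"},
    {"cluster": "IV", "student": "No"},
]
modes = mode_per_cluster(rows, "student")
```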

The first coarse classification leaves us with five buckets: professional Haskell users (WRK), students (STU), Windows developers (WIN), former users (NON) and a residual category (RES) of slightly experienced users.

The mode does not tell the whole story, so here are bar plots for every question/cluster combination:

The Residual group loves Stack.

The Residual group has some experience with Haskell at work, but not as much as the WRK cluster.

It is obvious from looking at the Residual cluster that it captures multiple traits. We could tentatively think of its members as Haskellers who are employed and would love to introduce more Haskell at their job (but have not managed to yet).

From clusters to opinions

Now that we have our clusters and have put labels on them, it is time to check what each segment thinks about various Haskell topics. Let us recall the groups once again for anyone starting to read from here:

The survey contains a section named «Feelings» (statements from «I would prefer to use Haskell for my next new project» to «As a candidate, I can easily find Haskell jobs»), where the survey taker indicates how much they agree on a five-point scale (1: Strongly Disagree, … 5: Strongly Agree).

Here is a table with the mean for each question/group. The table is ordered by descending intergroup variance, so the questions where the community is split are listed at the top, while the questions where everyone agrees are placed at the bottom.

.                                                       NON   STU   WRK   WIN   RES
Haskell is critical to my company's success.           2.11  2.5   3.86  2.16  2.72
Haskell is working well for my team.                   2.74  3.43  4.36  3.16  3.41
I have a good understanding of H. best practices.      2.56  2.97  3.71  2.67  2.99
As a hiring manager, I can easily find
qualified Haskell candidates.                          2.15  2.65  3.28  2.31  2.72
As a candidate, I can easily find Haskell jobs.        1.86  2.10  2.79  1.73  2.12
I would recommend using Haskell to others.             3.68  4.53  4.49  3.96  4.27
I think that software written in Haskell is
easy to maintain.                                      3.57  4.03  4.40  3.71  4.13
I would prefer to use Haskell for my next project.     3.77  4.42  4.61  4.16  4.34
I am satisfied with Haskell as a language.             3.55  4.27  4.26  3.88  4.18
I think that Haskell libraries are easy to use.        2.67  3.22  3.36  2.90  2.97
I think that Haskell libraries work well together.     3     3.5   3.66  3.15  3.43
I think Haskell libraries are well documented.         2.5   3     2.91  2.47  2.66
I think Haskell libraries are high quality.            3.31  3.79  3.92  3.67  3.81
I can easily compare competing Haskell libraries.      2.58  2.59  2.92  2.34  2.50
I am satisfied with Haskell's compilers.               3.88  4.15  4.16  3.76  4.25
I can find H. libraries for the things I need.         3.34  3.66  3.75  3.26  3.51
I am satisfied with Haskell's build tools.             3.12  3.38  3.54  3.12  3.45
I can easily reason about the performance of my code.  2.43  2.68  2.80  2.43  2.65
Once my Haskell program compiles, it generally does
what I intended.                                       4     4.18  3.94  3.79  4.06
I think that Haskell libraries provide a stable API.   3.21  3.52  3.44  3.43  3.47
Haskell's performance meets my needs.                  3.87  4.09  4.09  3.87  4.05
I feel welcome in the Haskell community.               3.9   4.07  4.08  4.04  3.88
I am satisfied with Haskell's package repositories.    3.69  3.93  3.80  3.72  3.79
I think that Haskell libraries perform well.           3.68  3.75  3.86  3.62  3.68
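The ordering of the table can be reproduced with a few lines of Python. This sketch assumes that «intergroup variance» means the sample variance of the five group means in each row; the three rows are copied from the table above and deliberately shuffled.

```python
from statistics import variance

# Columns: NON, STU, WRK, WIN, RES (values copied from the table above).
means = {
    "I feel welcome in the Haskell community.": [3.9, 4.07, 4.08, 4.04, 3.88],
    "Haskell is critical to my company's success.": [2.11, 2.5, 3.86, 2.16, 2.72],
    "I am satisfied with Haskell as a language.": [3.55, 4.27, 4.26, 3.88, 4.18],
}

def by_intergroup_variance(table):
    """Sort questions by the variance of their group means, largest first."""
    return sorted(table, key=lambda q: variance(table[q]), reverse=True)

ordered = by_intergroup_variance(means)
for question in ordered:
    print(f"{variance(means[question]):.3f}  {question}")
```

Questions that split the community float to the top, while those where every group answers alike sink to the bottom.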

(Philoustic provided a superb visualisation for this table; see the Bonus section.)

There are some differences, but for most questions the averages are not that distant from each other; this is expected when using a five-point scale.

To make comparison between clusters easier, I will use relative percentages and not absolute values in the graphs (thanks to Philoupap for suggesting this). E.g.:

This means that around 40% of the «Students» segment and around 40% of the «Workers» segment strongly agree with «I am satisfied with Haskell as a language», although the latter cluster, and hence its absolute count, is bigger than the former.
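As a minimal Python sketch of this normalisation, with made-up answer counts: each cluster's answers are divided by that cluster's own total, so clusters of different sizes become directly comparable.

```python
from collections import Counter

def relative_shares(answers):
    """Percentage of each answer within one cluster, ignoring missing (None) answers."""
    counts = Counter(a for a in answers if a is not None)
    total = sum(counts.values())
    return {answer: 100 * n / total for answer, n in counts.items()}

# Made-up answers: the second cluster is twice as large as the first.
students = ["Strongly agree"] * 4 + ["Agree"] * 5 + ["Neutral"] * 1
workers = ["Strongly agree"] * 8 + ["Agree"] * 10 + ["Neutral"] * 2

student_shares = relative_shares(students)
worker_shares = relative_shares(workers)
```

Here both clusters show 40% «Strongly agree», even though the absolute counts are 4 and 8.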

Selected plots and comment:

I do not consider the top two (in intergroup variance) questions to be meaningful: some people picked «Neutral» instead of leaving the answer blank when the question did not apply to them. Still, it is encouraging to see some «Agree» and «Strongly Agree» in the RES group (a portion of which uses some Haskell at work).

The biggest pain points for people who tried Haskell and then abandoned it are understanding what «Best practices» are and the ability to find a Haskell job.

Windows users are less satisfied than other users with Haskell tooling. Sometimes (satisfaction with Haskell compilers and library documentation) they are even less happy than people who abandoned Haskell!

Every group finds it difficult to reason about Haskell code performance; every group finds it difficult to compare competing Haskell libraries. In both cases, professionals fare slightly better, but still fall below a passing mark.

On the bright side, every group, even ex-Haskellers, feels welcome in the Haskell community.

Even the RES group («onboarding» users) is positive about the language and would like to use it in the future. Windows users and ex-users trail.

If you want to check all the graphs, search the construction folder (feelings).

Conclusions

The goal of this analysis was to find groups among Haskell users and examine their preferences. Clustering was successful and led to sensible (if somewhat broad) buckets. Sentiments vary between groups, though not as much as I expected: this could be because of how the sampling was conducted (2020 bug, coarse five-point scale) or because the community genuinely holds similar ideas.

Bonus

Philoustic provided six more overview visualisations for the «Feelings» section. In his own words:

NA scores are ignored.

Score values span between 2 ("strongly agree") and -2 ("strongly disagree")
with 0 for neutral.

I've added a dashed line at the general or cluster mean value.
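(For readers who want to reproduce the scoring: a minimal Python sketch, assuming the survey's 1 to 5 answers are shifted by 3 so that they land on the -2 to 2 scale described above, with `None` standing in for NA. The sample answers are made up.)

```python
from statistics import mean

def centred_mean(scores):
    """Mean of the answers recoded from 1..5 to -2..2, ignoring missing (None) ones."""
    kept = [s - 3 for s in scores if s is not None]
    return mean(kept) if kept else None

# Made-up answers for one question: 5 = strongly agree, 1 = strongly disagree.
answers = [5, 4, None, 3, 1]
dashed_line = centred_mean(answers)
```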

First general:

And then one for each cluster: