Who cares if I listen? A study of dodecaphonic music enjoyment

First, a one-paragraph summary of this article:


Dodecaphony (or twelve-tone serialism) is a composition technique devised by Arnold Schönberg in the first half of the twentieth century. A tenet of dodecaphony is the tone row: an ordered series of the twelve notes of the chromatic scale. These pitches cannot be repeated inside the row, giving the composition its characteristic lack of tonal centre.

Here is an example by Anton Webern, Sehr Langsam:

Twelve-tone proponents laud its new expressive means and its vocabulary, vastly more efficient than that of tonal music. As an example, Theodor Adorno [Adorno1963] states that «[the new music, t]hrough the art of the variation brought to the extreme, […] contains an enormous amount of detail design, it concentrates in every moment much more than what we were used to back in the days». This efficiency comes at a price: Adorno warns the reader to be as scrupulous in his listening «as a watchdog»; the serious amateur who «lets himself be carried away» while listening, as with the old music, «is lost».

Adorno is not alone; talking about «new» music, Babbitt [Babbit1958] writes:

«This music employs a tonal vocabulary which is more “efficient” than that of the music of the past, or its derivatives. This […] does make possible a greatly increased number of pitch simultaneities, successions, and relationships. This increase in efficiency necessarily reduces the “redundancy” of the language, and as a result the intelligible communication of the work demands increased accuracy from the transmitter (the performer) and activity from the receiver (the listener).»

Almost a century after the first dodecaphonic compositions, how does the twelve-tone credo fare? Are listeners experiencing (or ready to experience) an «enormous amount of detail»? Does dodecaphonic enjoyment happen through the understanding of minuscule features? This study tries to shed some light on the contemporary appreciation of dodecaphonic music.

If you want to read the literature review, check Appendix 1 (summary: there is at least one study which questions the ability of educated listeners to tell one tone row from another); otherwise continue reading.

Experiment description

If the words of Adorno (and Babbitt, and others) are true, slight modifications to the performed score will greatly diminish the enjoyment of a piece by a dodecaphonic-appreciating audience; on this kernel we will base our experiment.

The idea is very simple and will take the form of a survey: let us pick two tonal and two short dodecaphonic pieces (TON1, TON2, DOD1, DOD2). After being asked if they appreciate dodecaphonic music, the survey takers will be prompted to listen to and rate the four tracks. The pieces will not be labeled in any way and will be rated on a 6-point Likert scale (online survey page).

Unbeknown to the survey taker, two of the pieces in the survey (one tonal and one dodecaphonic) will be “corrupted” like so: 10% of the notes of the track will be changed to a different pitch (from -4 to +7 semitones, i.e. spanning an octave). The pieces and the notes to be corrupted are to be selected at random for each survey.

Example: an F is shifted to one of the 11 notes from C♯ to C (excluding F itself).
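The corruption step described above can be sketched in a few lines of Python. This is only an illustration, not the script used for the survey: the function name `corrupt`, the representation of a track as a list of MIDI note numbers, the choice of exactly 10% of the notes (rather than each note independently with 10% probability) and the uniform choice among the 11 shifts are all assumptions.

```python
import random

# Allowed shifts in semitones: -4 .. +7 without 0, i.e. 11 possible new pitches.
SHIFTS = [s for s in range(-4, 8) if s != 0]

def corrupt(pitches, rate=0.10, rng=None):
    """Return a copy of `pitches` (MIDI note numbers) with `rate` of them shifted.

    Hypothetical helper: which notes change and by how much is drawn
    uniformly at random, which is an assumption of this sketch.
    """
    if rng is None:
        rng = random.Random()
    corrupted = list(pitches)
    n_changes = round(len(pitches) * rate)
    # Pick distinct note positions, then move each by a non-zero shift.
    for i in rng.sample(range(len(pitches)), n_changes):
        corrupted[i] = pitches[i] + rng.choice(SHIFTS)
    return corrupted

if __name__ == "__main__":
    melody = [65] * 20          # twenty F4 notes (MIDI 65), purely illustrative
    print(corrupt(melody, rng=random.Random(2020)))
```

With the F of the example (MIDI 65), the shifted note always lands between C♯ (61) and C (72). A real script would also have to keep the result inside the valid MIDI range, which this sketch does not check.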

To get a feel for what a corrupted track sounds like, here are one dodecaphonic piece, one tonal piece and two (of the many possible) corresponding corrupted versions:

TON2 (original)

TON2 (corrupted)

DOD1 (original)

DOD1 (corrupted)

Our goal is dual:

To verify these hypotheses, we need to choose an appropriate statistical test and to decide how big we want the sample size to be. If you are interested in the specifications, go to [Appendix 2]; it can be summarised with «we need more or less 30 surveys from dodecaphonic-appreciating listeners to have a reasonable (80%) chance of obtaining a significant result».

If you want to know the chosen pieces and gory implementation details, consult [Appendix 3].

Results: tonal pieces

As we stated, our first analysis will concern the survey takers who have not ticked the «I appreciate dodecaphonic music» box, listening to tonal pieces. Here are two images for each of the tracks: on the top part you can see a bar chart plotting the «enjoyment» answers on the original piece; on the bottom, the same chart but for the corrupted piece.

Figure: TON1 lattice graph, lower enjoyment
Figure: TON2 lattice graph, lower enjoyment

The plots seem to show that corrupting the track (modifying its pitches) lowers the listener’s enjoyment. Some descriptive statistics (37 answers):

Track Corrupted N Median IQR
TON1 No 14 5 1
TON1 Yes 23 4 1
TON2 No 23 5 0
TON2 Yes 14 4 1
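As a rough guide to how the N, Median and IQR columns can be obtained, here is a minimal Python sketch. The ratings in it are made up for illustration (they are not survey data), the 1 to 6 coding of the Likert labels is an assumption, and so is the quartile rule: the "inclusive" method of the standard library is used, which may differ from the rule used for the table.

```python
from statistics import median, quantiles

def describe(ratings):
    """Summarise Likert answers coded 1..6 (hypothetical coding)."""
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = quantiles(ratings, n=4, method="inclusive")
    return {"N": len(ratings), "median": median(ratings), "IQR": q3 - q1}

if __name__ == "__main__":
    made_up = [3, 4, 4, 5, 5, 5, 6]   # illustrative answers, not from data.csv
    print(describe(made_up))
```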

IQR is the interquartile range. Finally, the most important part: the U-test, which checks analytically, for each track, whether the «enjoyment» distribution is “lower” for the corrupted piece than for the original piece.

Track 𝑓 (effect size) p-value
TON1 0,68 0,0295*
TON2 0,71 0,0099*

Remember that 𝑓 (the «common language effect size statistic») is the probability that a randomly picked «original piece» answer has a higher enjoyment score than a randomly picked «corrupted piece» answer.

The important figure here is the p-value: since p is less than the α chosen for the experiment (0,05), there is statistical evidence that the non-dodecaphonic audience enjoys TON1 and TON2 more than their corrupted counterparts; more precisely, TON1 and TON2 are stochastically superior to their corrupted versions.
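For readers who want to see what the test computes, here is a minimal pure-Python sketch of the one-sided U-test and of 𝑓. It is not the code used for the tables: the function names are hypothetical, the ratings are made up, and the p-value uses the normal approximation with tie correction and no continuity correction, so it may differ slightly from what a statistics package reports.

```python
from collections import Counter
from math import erfc, sqrt

def u_statistic(original, corrupted):
    """U1: number of (original, corrupted) pairs won by the original; ties count one half."""
    return sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x in original
        for y in corrupted
    )

def effect_size(original, corrupted):
    """Common language effect size f = U1 / (n * m)."""
    return u_statistic(original, corrupted) / (len(original) * len(corrupted))

def p_value_greater(original, corrupted):
    """One-sided p-value for 'original is rated higher' (normal approximation)."""
    n, m = len(original), len(corrupted)
    total = n + m
    u = u_statistic(original, corrupted)
    # Tie correction: Likert data has many equal answers.
    ties = Counter(original + corrupted).values()
    tie_term = sum(t**3 - t for t in ties) / (total * (total - 1))
    sigma = sqrt(n * m / 12 * (total + 1 - tie_term))
    if sigma == 0:              # every answer identical: no evidence either way
        return 1.0
    z = (u - n * m / 2) / sigma
    return 0.5 * erfc(z / sqrt(2))   # upper tail of the standard normal

if __name__ == "__main__":
    original = [5, 5, 6, 4, 5]    # made-up ratings, not survey data
    corrupted = [4, 3, 4, 5, 4]
    print(effect_size(original, corrupted))   # 0.84
    print(p_value_greater(original, corrupted))
```

A value of 𝑓 around one half means neither version tends to be rated higher; values close to 1 favour the original track.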

Results: dodecaphonic pieces

Here are the bar charts and descriptive statistics for the «dodecaphonic audience/dodecaphonic pieces» pair (34 answers):

Figure: DOD1 lattice graph, non significant
Figure: DOD2 lattice graph, non significant

Track Corrupted N Median IQR
DOD1 No 22 5 1
DOD1 Yes 12 4 1,5
DOD2 No 12 2 2,5
DOD2 Yes 22 4 2

It is interesting to note that the enjoyment of DOD2 (Webern’s Variationen für Klavier) is higher in the corrupted track than in the original one.

U-test results for the dodecaphonic pieces are:

Track 𝑓 (effect size) p-value
DOD1 0,57 0,2492
DOD2 0,29 0,9749

In both cases the p-value is greater than the α chosen for the experiment (0,05): in other words there is no evidence that the dodecaphonic-loving audience enjoys DOD1 or DOD2 more than their corrupted counterparts.


The vision of Adorno and Babbitt does not seem to hold water: a substantial corruption of dodecaphonic scores does not alter the enjoyment of the music by a dodecaphonic-appreciating audience. This does not mean per se that the listeners are victims of autosuggestion (they could still appreciate atonality in general), but it puts a dent in the plethora of musicological explanations of the careful choice of the tone row, «thematic oneness» and similar beliefs.

The results of this experiment are in line with what Francès [Francès1984] himself found more than 70 years ago: if we cannot discern dodecaphonic series in a complex (harmonic, polyphonic) setting, how can we enjoy one tone row more than another? There must be other considerations, unrelated to the aural experience, guiding our taste: the «unité conceptuelle» (e.g. a pleasure in theoretical study and enjoyment of music of historical significance) as opposed to the «unité perceptive» (what we hear, and discern, through our ears).


The limitations of this study are:

Thanks and contact

I would like to thank FSi, morganw, Cæsarcub, brewton, Peder and BrutalSlakt for having debugged the survey and having made it more accessible; Franciman, lortabac, larsen and Tónskáld for their methodological suggestions and comments; Jeff Learman for useful information on how to invoke sox; GoldmanT for letting me use his transcription of Schönberg’s «Suite for Piano»; ibispi and qptain_nemo for proof-reading.

Feedback and questions are welcome: contact me.


14 June 2020: published.

15 June 2020: corrected TON1/TON2 instead of DOD1/DOD2 in the dodecaphonic 𝑓/p table; specified sample size outside of table too (reported by Tónskáld).

28 July 2020: minor typo corrections.

27 November 2020: punctuation correction.

Appendix 1: literature review

An interesting experiment on the perception of dodecaphonic music is found in «La Perception de la musique» [Francès1984]. In «Expérience VI» Francès asked a dodecaphonic composer to produce a number of scores (mere exposition of the row, melodic, harmonic, etc.) starting from two series; the two series differed only in the second half (so the first six notes were the same).

Two groups (laymen and music professionals experienced with twelve-tone music) were then asked to recognise whether each example came from the first or the second series.

As expected, the professionals fared better than the laymen. The more musical elements were introduced on the bare series (rhythm, harmony, polyphony), the higher the number of mistakes: polyphonic segments were misclassified by more than 60% of both laymen and professionals.

This is important: how can we discern the quality of the composition, its «thematic oneness» [Reti1958], if we cannot discern one series from the other? It must have been a touchy subject at the time too, as Francès clarifies to the tested volunteers: «[…] l’expérience n’avait aucun caractère polémique, ne visait qu’à établir des faits perceptifs et non à promouvoir une critique du système sériel» (the experiment had no polemical character; it aimed only at establishing perceptual facts, not at promoting a critique of the serial system).

[Lannoy1972], the only other article I found on the subject, contains correct critical remarks on the work by Francès. First: picking two series which are identical in the first six notes skews the results, as the very beginning of a melody is the most recognisable part. Second: the experiment should have been replicated on tonal compositions as a way to provide an «external criterion» to compare dodecaphonic recognisability with.

Appendix 2: statistical test and power analysis

The data we are handling is non-continuous and we have little idea of the shape of the distribution; this rules out the usual parametric tests. The Wilcoxon-Mann-Whitney U test is an attractive choice as:

We set α to 0,05; as effect size we chose the «common language effect size statistic» (𝑓 = U₁/(n·m), where n and m are the two sample sizes), as it is useful and simple to understand.

Before surveying started, power analysis was conducted following [Shieh2006]:

Figure: power analysis graph on sample size

The assumptions in this graph (computed with R package wmwpow) are:

Then to have a reasonable statistical power (0,8) we need at least 31 observations: 16 from the «original track» population and 15 in the «modified track» one.

In the observed sample we had N=22 and M=12, enough to give us a power greater than 80% (≃0,812).
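To give an intuition of what «power» means here, the following Python sketch estimates it by brute-force simulation. It is a swapped-in technique for illustration only: it is neither the analytic method of [Shieh2006] nor the wmwpow code, the answer distributions (`weights_orig`, `weights_corr`) are invented, and the test inside uses a normal approximation with tie correction. The numbers it prints should therefore not be expected to match the ones above.

```python
import random
from collections import Counter
from math import erfc, sqrt

def p_value_greater(x, y):
    """One-sided U-test p-value (normal approximation with tie correction)."""
    n, m = len(x), len(y)
    total = n + m
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    ties = Counter(x + y).values()
    tie_term = sum(t**3 - t for t in ties) / (total * (total - 1))
    sigma = sqrt(n * m / 12 * (total + 1 - tie_term))
    if sigma == 0:              # every answer identical: never reject
        return 1.0
    return 0.5 * erfc((u - n * m / 2) / sigma / sqrt(2))

def estimate_power(weights_orig, weights_corr, n, m, alpha=0.05, runs=2000, seed=0):
    """Share of simulated surveys in which the test rejects at level alpha."""
    rng = random.Random(seed)
    scale = [1, 2, 3, 4, 5, 6]  # 6-point Likert scale, coded 1..6 (assumption)
    rejections = 0
    for _ in range(runs):
        x = rng.choices(scale, weights=weights_orig, k=n)   # original track answers
        y = rng.choices(scale, weights=weights_corr, k=m)   # corrupted track answers
        if p_value_greater(x, y) < alpha:
            rejections += 1
    return rejections / runs

if __name__ == "__main__":
    # Hypothetical answer distributions: the corrupted track is liked a bit less.
    weights_orig = [1, 1, 2, 4, 6, 4]
    weights_corr = [2, 3, 4, 4, 3, 2]
    print(estimate_power(weights_orig, weights_corr, n=16, m=15))
```

Increasing n and m, or making the two invented distributions more different, raises the estimated power; making them identical brings it down to roughly α.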

Appendix 3: implementation

The chosen pieces for the experiment were:

The MIDI files were truncated at 30 seconds, “corrupted” via a Haskell script (10% of the pitches modified) and synthesised with fluidsynth and the Fluid (R3) General MIDI SoundFont (GM). Webern’s piece needed compression as the loud parts were too loud. For each track, 25 corrupted copies were generated.

The choice of corruption rate started with a look at the few bits of reviewed evidence. In his experiment Francès modified 50% of the notes in the series, but every modification was done to the second part; this setup was correctly criticised by Lannoy, as the incipit is the most recognisable part (see Appendix 1). Even though we do not know the size of the incipit effect, a 10% corruption rate throughout the piece seems to strike a balance between the critique by Lannoy, sample-size considerations and the theories of Babbitt/Adorno.

To collect the answers I posted this survey on numerous classical-music forums (TalkClassical, Good Music Guide, …), on IRC and among friends. The survey answers were collected from 24 May 2020 to 12 June 2020. The file data.csv contains the anonymised answers in tabular form.

The choice of an appropriate Likert scale was not an easy one. On one hand [Leung2011] we know that Likert scales with more points lead to the prized normal distribution; on the other [Nemoto2011] scales with few points are friendlier to the user and result in a higher number of completed surveys. Six points was a middle-of-the-road solution; following beta-testing suggestions, I preferred labels («Strongly dislike», «Dislike», «Slightly dislike», «Slightly like», «Like», «Strongly Like») over numbers.

Literature cited


(translated from) Adorno, T. W. (1982). Il fido maestro sostituto: studi sulla comunicazione della musica (Vol. 431), Giulio Einaudi, p. 57. Original: Der getreue Korrepetitor: Lehrschriften zur musikalischen Praxis. Frankfurt am Main: S. Fischer Verlag.


Babbitt, M. (1958). Who cares if you listen? High Fidelity, 8(2), 38-40.


Francès, R. (1984). La perception de la musique (Vol. 14), Vrin, 140-146.


de Lannoy, C. (1972). Detection and discrimination of dodecaphonic series. Journal of New Music Research, 1(1), 13-27.


Leung, S. O. (2011). A comparison of psychometric properties and normality in 4-, 5-, 6-, and 11-point Likert scales. Journal of Social Service Research, 37(4), 412-421.


Nemoto, T., & Beglar, D. (2014). Likert-scale questionnaires. In JALT 2013 Conference Proceedings (pp. 1-8).


Reti, R. (1958). Tonality, Atonality, Pantonality: A study of some trends in twentieth century music, London: Barrie and Rockliff, 9.


Shieh, G., Jan, S. L., & Randles, R. H. (2006). On power and sample size determinations for the Wilcoxon–Mann–Whitney test. Journal of Nonparametric Statistics, 18(1), 33-43.