(A) Data in the Life: Authorship Attribution of Lennon-McCartney Songs

The songwriting duo of John Lennon and Paul McCartney, the two founding members of the Beatles, composed some of the most popular and memorable songs of the last century. Despite having authored songs under the joint credit agreement of Lennon-McCartney, it is well-documented that most of their songs or portions of songs were primarily written by exactly one of the two. Furthermore, the authorship of some Lennon-McCartney songs is in dispute, with the recollections of authorship based on previous interviews with Lennon and McCartney in conflict. For Lennon-McCartney songs of known and unknown authorship written and recorded over the period 1962-66, we extracted musical features from each song or song portion. These features consist of the occurrence of melodic notes, chords, melodic note pairs, chord change pairs, and four-note melody contours. We developed a prediction model based on variable screening followed by logistic regression with elastic net regularization. Out-of-sample classification accuracy for songs with known authorship was 76\%, with a $c$-statistic from an ROC analysis of 83.7\%. We applied our model to the prediction of songs and song portions with unknown or disputed authorship.


Introduction
The Beatles are arguably one of the most influential music groups of all time, having sold over 600 million albums worldwide. Beyond the initial mania that accompanied their introduction to the UK and Europe in 1962-63, and subsequently to the United States in early 1964, the Beatles' musical and cultural impact continues to be felt. The group has been the focus of academic research to an extent that rivals most classical composers. Heuger (2018) has been maintaining a bibliography that contains over 500 entries devoted to academic research on the Beatles. Some recent examples of scientific study of Beatles music include Cathé (2016) who applied harmonic vectors theory to Beatles songs, Wagner (2003) who analyzed the presence of blues motifs in Beatles music, and Brown (2004)
The idea of using statistical models to predict authorship is one that has been around for over half a century. In one of the first successful attempts at modeling word frequencies, Mosteller and Wallace (1963, 1984) used Bayesian classification models to infer that James Madison wrote all of the 12 disputed Federalist papers. Other works related to authorship attribution include Efron and Thisted (1976) and Thisted and Efron (1987), who address questions related to Shakespeare's writing, and Airoldi, Anderson, Fienberg, and Skinner (2006), who examine authorship attribution of Ronald Reagan's radio addresses.
Typical text analysis relies on constructing word histograms, and then modeling authorship as a function of word frequencies. Basic background on the analysis and modeling of word frequencies can be found in Manning and Schütze (1999), and these models applied to text authorship attribution can be found in Clement and Sharp (2003) and Malyutov (2005). This paper is concerned with using harmonic and melodic information from the corpus of Lennon-McCartney songs from the first part of the Beatles' career to infer authorship of songs by John Lennon and Paul McCartney. It is not unreasonable to assume that Lennon and McCartney songs are distinguishable through musical features. For example, both McCormick (1998) and Hartzog (2016) observed that Lennon songs have melodies that tend not to vary substantially in pitch (illustrative examples include "I Am the Walrus" and "Across the Universe"), whereas McCartney songs tend to have melodies with larger pitch changes (e.g., "Hey Jude" and "Oh Darling"). However, such anecdotal observations may not sufficiently characterize distinctions between Lennon and McCartney; a more scientific approach is necessary. Our analyses attempt to capture distinguishing musical features through a statistical approach.
Previous work applying quantitative methods to distinguish Lennon and McCartney songs is limited. Whissell (1996)  Echo Nest (the.echonest.com). More generally, a variety of statistical methods for inferring authorship from musical information have been published. Cilibrasi, Vitányi, and De Wolf (2004) and Naccache, Borgi, and Ghédira (2008) used Musical Instrument Digital Interface (MIDI) encoding of songs, which contains information on the pitch values, intervals, note durations, and instruments to perform distance-based clustering. Dubnov, Assayag, Lartillot, and Bejerano (2003) developed methods to segment music using incremental parsing applied to MIDI files in order to learn stylistic aspects of music representation. Conklin (2006) also introduced representing melody as a sequence of segments, and modeled musical style through this representation. A different approach was taken by George and Shamir (2014), who converted song data into two-dimensional spectrograms, and used these representations as a means to cluster songs.
Our approach to musical authorship attribution is most closely related to methods applied to genome expression studies and other areas in which the number of predictors is considerably larger than the sample size. In a musical context, we reduce each song to a vector of binary variables indicating the occurrences of specified local musical features. We derive the features based on the entire set of chords that can be played (harmonic content) and the entire set of notes that can be sung by the lead singer (melodic content). From the point of view of melodic sequences of notes or harmonic sequences of chords behaving like text in a document, individual notes and individual chords can be understood as 1-gram representations. The occurrence of individual chords and individual notes form an essential part of a reduction in a song's musical content. To increase the richness of the representation, we also consider 2-gram representations of chord and melodic sequences. That is, we record the occurrence of pairs of consecutive notes and pairs of consecutive chords as individual binary variables. Rather than considering larger n-gram sequences (with n > 2) as a unit of analysis, we extract local contour information of melodic sequences indicating the local shape of the melody line to be a fifth set of variables to represent local features within a song. Using occurrences of pitches in the sung melodies, chords, pitch transitions, harmonic transitions, and contour information of Lennon-McCartney songs with known authorship permits modeling of song authorship as a function of musical content.
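The reduction of a sequence to presence/absence indicators of 1-grams and 2-grams can be illustrated with a minimal sketch. The chord sequence, the vocabulary, and the function names below are hypothetical and chosen only for illustration; they are not the coding scheme used in our analyses.

```python
def presence_features(sequence):
    """Return the set of 1-grams and 2-grams that occur at least once."""
    unigrams = {(item,) for item in sequence}
    bigrams = set(zip(sequence, sequence[1:]))  # pairs of consecutive items
    return unigrams | bigrams


def to_binary_vector(sequence, vocabulary):
    """Map a sequence to 0/1 indicators, one per feature in the vocabulary."""
    present = presence_features(sequence)
    return [1 if feature in present else 0 for feature in vocabulary]


# Hypothetical chord progression and feature vocabulary (illustration only).
chords = ["I", "IV", "V", "I", "IV", "V", "I"]
vocabulary = [
    ("I",), ("ii",), ("IV",), ("V",),                      # 1-grams
    ("I", "IV"), ("IV", "V"), ("V", "I"), ("V", "IV"),     # 2-grams
]
vector = to_binary_vector(chords, vocabulary)
```

Because only occurrence is recorded, repeating the progression any number of times leaves the vector unchanged.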
We developed our modeling approach as a two-step algorithm. First, we kept only musical features that have a sufficiently strong bivariate association with authorship, an application of sure independence screening (Fan, 2007; Fan & Lv, 2008). With the features that remained, we then modeled the authorship attribution as a logistic regression, but estimated the model parameters using elastic net regularization (Friedman, Hastie, & Tibshirani, 2010; Zou & Hastie, 2005), an approach that penalizes the average log-likelihood with a convex combination of a ridge penalty (Le Cessie & Van Houwelingen, 1992) and a lasso penalty (Tibshirani, 1996, 2011). Many other approaches to regularization are possible. For example, Kempfert and Wong (2018), who predict the authorship of Haydn versus Mozart string quartets based on musical features, select their model through subset selection on the Bayesian information criterion (BIC) statistic.
This paper proceeds as follows. We describe the background of the song data collection and formation in Section 2. This is followed in Section 3 by the development of a model for authorship attribution based on a variable screening procedure followed by elastic net logistic regression. The application of the modeling approach is described in Section 4 where we summarize the fit of the model to the corpus of Lennon-McCartney songs of known authorship, and apply the model results for predicting songs of disputed authorship. The paper concludes in Section 5 with a discussion of the utility of our approach to wider musical settings. We provide relevant background on musical notes, scales, note intervals, chords, and song structure in Appendix A.

Song Data
The data used in our analyses consist of melodic and harmonic information based on Lennon-McCartney songs that were written between 1962 and 1966. This period covers the years in which the Beatles toured, before the band's activities centered on studio productions, when their songwriting approaches likely changed significantly. The songs we included in our analyses were from the original UK-released albums Please Please Me, With the Beatles, A Hard Day's Night, Beatles for Sale, Help!, Rubber Soul, and Revolver, as well as all the singles from the same era that were not present in any of these albums.
The essential reference for both the melodic and harmonic content of the songs was Fujita, Hagino, Kubo, and Sato (1993), although the Isophonics online database of chords for The Beatles songs (http://isophonics.net/content/reference-annotations) provided additional points of reference for each song.
The authorship of each Lennon-McCartney song, or whether the authorship credit was in dispute, has been documented in Compton (1988), though for some songs we have found other documentation of song authorship. Aside from recording whether entire songs were written by Lennon versus McCartney, Compton also notes that in many cases songs had multiple sections with possibly different authors. For example, the song "We Can Work It Out" is credited to McCartney as the author, though the bridge section starting with the lyric "Life is very short..." was written by Lennon. In our analyses, we treat these sections as two different units of analysis with different authors. Furthermore, several songs that were acknowledged as full collaborations between Lennon and McCartney were excluded from the corpus of known authorship from which we develop our prediction algorithm. The song "The Word" is such an example of a full collaboration. It is plausible that some of the disputed songs were actually collaborations, but the current information about the songs did not permit these joint attributions. The total number of Lennon-McCartney songs or portions of songs with an undisputed individual author (Lennon or McCartney) was 70. Eight songs or portions of songs in this period were of disputed authorship.
Our process was to manually code each song's harmonic (chord) and melodic progressions.
The song content that serves as the input to our modeling strategy is a set of representations of simple melodic and harmonic patterns within each song in the form of category indicators.
That is, we let each song be represented by a vector of binary variables within the song, where each variable is the presence/absence of a musical feature that could occur in the song. We describe these representations in more detail below. The process to obtain these category indicators involves converting each song's melodic and harmonic content into a usable form. Melody lines were partitioned into phrases which were typically book-ended by rests (silence). An alternative approach would have been to model counts of musical features within songs, which is much more in line with authorship attribution analysis for text documents. A crucial difficulty with this approach is how to address repeated phrases (e.g., verses, choruses) within a song. As an extreme example that is not part of our sample, consider the later-Beatles period McCartney-written song "Hey Jude." The "na na na" fadeout, which lasts roughly four minutes on the recording, is repeated 19 times (Everett, 1999). Keeping these repeated occurrences would likely over-represent the musical ideas suggested by the phrase. We explored models in which feature counts were incorporated, including versions where the counts were capped at an upper limit (i.e., winsorizing the larger counts), and versions involving the transformation of counts to the log scale, but these approaches resulted in worse predictability than our final model. The use of whether a musical feature was present in a song produced better discriminatory power in authorship predictions.
The key of every song was standardized relative to the tonic for songs in a major key, and to the relative major (up a minor third) for songs in a minor key. If a key change occurred in the middle of the song, the harmonic and melodic information from that point onward would be standardized to the modulated key.
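A minimal sketch of this standardization, assuming melodic notes are given as note names and reduced to pitch classes: each note is expressed as a number of semitones above the tonic of the major key, or above the tonic of the relative major when the song is in a minor key. The note-name table and the function name are illustrative assumptions rather than part of our coding procedure.

```python
# Pitch classes of the chromatic scale (sharps only, for brevity).
NOTE_TO_PC = {
    "C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
    "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11,
}


def standardize(notes, key, minor=False):
    """Express notes as semitones above the (relative) major tonic."""
    tonic = NOTE_TO_PC[key]
    if minor:
        # The relative major lies a minor third (3 semitones) above.
        tonic = (tonic + 3) % 12
    return [(NOTE_TO_PC[note] - tonic) % 12 for note in notes]


in_g_major = standardize(["G", "B", "D"], key="G")              # tonic triad
in_a_minor = standardize(["A", "C", "E"], key="A", minor=True)  # vi triad of C
```

A key change within a song would be handled by applying the same function to the later portion with the modulated key.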
We constructed five different sets of musical features within each song, based on the processed melodic and harmonic data for the collection of songs. The first set of features was chord types. Seven diatonic chords, that is, I, ii, iii, IV, V, vi, vii, which are conventionally the building blocks for most popular Western music, were their own categories. The true diatonic chord on the seventh note of the scale is a diminished chord, which was only used once, in "You Won't See Me," while the minor vii was used more often. We therefore took the liberty of using the minor vii instead as our "diatonic" chord on the seventh. Because diminished and augmented chords were used rarely in general, we collapsed all occurrences of non-diatonic major chords along with augmented chords into a single category, and non-diatonic minor chords along with diminished chords into a single category. This resulted in a total of 9 categories. We explored other category divisions, including fewer instances of collapsed categories, but the sparsity of the data across the non-diatonic, augmented, and diminished chords resulted in less reliable predictability. Additionally, we decided to group all seventh and extended chords (e.g., ninth chord, eleventh chord) with their unaltered triad counterparts.
The second set of features consisted of melodic notes. The octave in which a melodic note was sung was ignored in the construction, so that the number of note categories totaled 12 (the number of pitch classes on the chromatic scale).
The third set of features comprised chord transitions, that is, pairs of consecutive chords.
As with individual chord categories, considering all combinations of chord transitions would have resulted in an unnecessarily large number of sparsely counted categories. We collapsed the chord transition categories as follows. Each transition among the tonic, sub-dominant (major fourth), and dominant (major fifth) was its own category. Every other transition from a diatonic chord to another diatonic chord, regardless of the order of the two chords, was its own category. For example, transitions from ii to V were grouped with transitions from V to ii. Transitions that involved the tonic and any non-diatonic chord were grouped into one category, and transitions that involved the dominant and any non-diatonic chord were also all grouped into one category. Chord transitions starting with any non-diatonic chord and ending with a diatonic chord (other than the tonic or dominant) formed one category, and chord transitions starting with a diatonic chord (other than the tonic or dominant) and ending with any non-diatonic chord formed another. Finally, all chord transitions between two non-diatonic chords fell under one category. These collapsings produced a total of 24 chord transition categories. Empty categories from the canon of songs were ignored.
The fourth set of features involved melodic note transitions as pairs of notes. In contrast to the single melodic note categories, we considered the octave of the second note in the pair.
Thus, each melodic note in a pair could be in a three-octave range. In addition, we treated the rests at the start and end of a phrase as notes in constructing note transition categories. Thus a single note at the start or at the end of a phrase was each treated as a note transition. Each start of a phrase on any diatonic note was its own category, and each end of a phrase on any diatonic note was its own category. All notes on the diatonic scale transitioning from or to the tonic was its own category. Any transition from a pitch on the diatonic or pentatonic scale (which includes the flat 3 and flat 7) to another pitch on the diatonic or pentatonic scale, including the same pitch, was its own category, regardless of octave. Upward movements by 2, 3, 4, or 5 notes on the diatonic scale were individual categories, and the corresponding downward movements were their own categories.
We collapsed categories of melodic transitions more aggressively when at least one note in the transition was not on the diatonic scale. All transitions between the two same non-diatonic notes (excluding the flat 3 and flat 7) were collapsed into the same category. All melodic phrases starting on a non-diatonic note were collapsed into the same category, and all melodic phrases ending on a non-diatonic note were collapsed into the same category. A semitone upward or downward movement from a diatonic note to a non-diatonic note formed two distinct categories, as did a semitone upward or downward movement from a non-diatonic note to a diatonic note. All upward movements of at least two semitones involving a non-diatonic note were collapsed into the same category, and all downward movements of at least two semitones were collapsed into the same category. The total number of nonempty categories of melodic transitions under this collapsing scheme was 65. It is worth noting that we had also considered an alternative set of melodic transition variables. These were based to a large extent on grouping upward and downward movements by the size of the interval, but without regard to the musical function of the transition. We consider the main groupings described above to be more musically justifiable because they are more directly connected to the pitches within transition pairs rather than pitch distances.
The last set of features captured local contours in the melodic line of a song. Every consecutive 4-note subset within a melodic phrase (between its start and end) was partitioned into one of 27 different categories according to the direction of each consecutive pair of notes.
For each of the three pairs of consecutive notes in a 4-note melodic sequence, the transitions could be up, down or same if the melodic notes moved up, down, or stayed the same.
Because each consecutive pair across the 4-note sequences allowed three possibilities, the representation consisted of 3 × 3 × 3 = 27 categories. Longer contours (consecutive note subsets of 5 or more notes) would provide greater contour detail, but the number of implied categories would create difficulties in model fitting especially with the relatively low number of songs to use for model-building. The contour representation is an attempt to characterize local features in the melodic line beyond 2-gram representations but without the same level of detail.
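The contour coding can be sketched in a few lines. The sketch below assumes that a phrase is available as a list of numeric pitches (here hypothetical MIDI note numbers); the function names are illustrative only.

```python
from itertools import product


def direction(first, second):
    """Classify the movement between two consecutive pitches."""
    if second > first:
        return "up"
    if second < first:
        return "down"
    return "same"


def contours(phrase):
    """Return the set of 4-note contours that occur within one phrase."""
    found = set()
    for start in range(len(phrase) - 3):
        window = phrase[start:start + 4]
        found.add(tuple(direction(a, b) for a, b in zip(window, window[1:])))
    return found


# All possible contour categories: 3 directions for each of 3 note pairs.
ALL_CONTOURS = list(product(("up", "down", "same"), repeat=3))

# Hypothetical five-note phrase, giving two overlapping 4-note windows.
phrase = [60, 62, 62, 59, 64]
phrase_contours = contours(phrase)
```

A phrase with fewer than four notes contributes no contour features under this sketch.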
The five sets of musical features together total 137 binary variables for each song. Our modeling approach, which relies mainly on cross-validating regularized logistic regression, can result in prediction instability when a feature is shared by very few or very many songs.
We therefore removed 16 features in which five or fewer songs contained the feature, or where 66 or more songs (out of 70) contained the feature. The features shared by 66 or more songs included the tonic chord; melodic notes that included the tonic, second, third and fifth; and the 4-note contour (up, down, down). The features shared by five or fewer songs consisted of the minor seventh chord, chord transition from iii to V, upward and downward melodic transitions by 5 notes on the diatonic scale, repeated flat 3 notes, other repeated non-diatonic notes, upward melodic transition from flat 7 to flat 3, melodic transition between flat 3 and fifth, and melodic transition from flat 7 to fourth. With these exclusions, our analyses used a total of 121 musical features.
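The prevalence filter amounts to a column-wise count on the binary feature matrix. The following is a minimal sketch, assuming the matrix is stored as a list of rows (one per song); the toy matrix and its reduced cut-offs are hypothetical and serve only to show the rule.

```python
def retained_features(X, low=5, high=66):
    """Return indices of features present in more than `low` and fewer than
    `high` songs, i.e., drop features shared by `low` or fewer songs or by
    `high` or more songs."""
    n_features = len(X[0])
    counts = [sum(row[j] for row in X) for j in range(n_features)]
    return [j for j, count in enumerate(counts) if low < count < high]


# Toy example with 4 songs and 3 features, using smaller cut-offs.
X = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 1],
    [1, 0, 0],
]
kept = retained_features(X, low=1, high=4)  # feature 0 is in every song,
                                            # feature 1 is in only one song
```

With the default cut-offs, a feature present in exactly 5 or exactly 66 of the 70 songs is removed, matching the rule stated above.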
We display the most common musical characteristics by category, after the exclusions, in Table 1. Major 4th and major 5th chords are the most common among the 70 songs (after the tonic), and the melodic notes of a 4th and 6th are also common. These notes and chords are understood to be the building blocks of popular Western music. The chord transition from major 5th to tonic is also a common chord change in popular music, is well-represented in early Lennon-McCartney songs, and is often utilized as a harmonic phrase resolution. The most common melody note transitions stay on the diatonic scale, which again is in keeping with Western songwriting. Finally, the two contours listed in Table 1

A model for songwriter attribution
Our approach to modeling authorship involved a two-step process. First, we selected a subset of the 121 musical features that each had a sufficiently strong bivariate association with authorship. Second, conditional on the selected features, we modeled authorship using logistic regression regularized via elastic net penalization (Zou & Hastie, 2005) with tuning parameters optimized by cross-validation. The latter process was implemented in the R package glmnet (Friedman et al., 2010). We describe each step in more detail below.
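For song $i$, $i = 1, \ldots, n$, where $n$ is the number of songs with known authorship in the training data, let

$$y_i = \begin{cases} 1 & \text{if song } i \text{ was written by McCartney,} \\ 0 & \text{if song } i \text{ was written by Lennon.} \end{cases} \tag{1}$$

We let $y = (y_1, \ldots, y_n)$ denote the vector of binary authorship indicators. For $j = 1, \ldots, J$, where $J$ is the total number of dichotomized musical features, let, for each $i = 1, \ldots, n$,

$$x_{ij} = \begin{cases} 1 & \text{if song } i \text{ contains musical feature } j, \\ 0 & \text{otherwise.} \end{cases} \tag{2}$$

We let $X$ denote the $n \times J$ matrix with elements $x_{ij}$, and let $X_j$ denote the $j$-th column of $X$.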
The first step of our procedure is to determine a subset of the index set $\{1, 2, \ldots, J\}$ in which $X_j$ is sufficiently associated with authorship. This can be accomplished by computing odds ratios of the $j$-th binary feature with authorship and retaining features with an odds ratio (or its reciprocal) above a specified threshold. Equivalently, the selection can be performed by retaining features in which tests for significant odds ratios have p-values below a specified level. This pre-processing of features, known as sure independence screening (SIS), has been developed and explored by Fan (2007), Fan and Lv (2008), and Fan and Song (2010).
SIS is more typically employed in settings with a massive number of predictors, but in our setting provides a crude but effective way of reducing the number of features in our final model. Our final model evaluations exhibit better out-of-sample accuracy including SIS as a pre-processing step to modeling than omitting this step, as we describe in Section 4.
To implement SIS in our setting, we computed a p-value of a Pearson chi-squared test for each $j = 1, \ldots, J$, for the significance of the odds ratio in a 2 × 2 contingency table constructed from $y$ and $X_j$. When any cell of a contingency table has a low count, the odds ratio estimate is unstable. The reference distribution of the test statistic in such settings is poorly approximated by a chi-squared distribution, so we instead simulated test statistics 10,000 times from the null distribution according to Hope (1968) to obtain more reliable p-values. This procedure is implemented in the chisq.test function in base R. The p-value for each test was then compared to a pre-specified significance level to determine inclusion for modeling. See below for a detailed discussion about the specified significance level.
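The screening p-value can be illustrated with a minimal Monte Carlo sketch. It is a stand-in for, not a reproduction of, the R routine: the null distribution is generated by permuting the feature column, which keeps both margins of the 2 × 2 table fixed, and the p-value uses the usual add-one Monte Carlo correction. The function names and the toy data are assumptions made for illustration.

```python
import random


def chi2_stat(y, x):
    """Pearson chi-squared statistic for the 2 x 2 table of binary y and x."""
    n = len(y)
    table = [[0, 0], [0, 0]]
    for yi, xi in zip(y, x):
        table[yi][xi] += 1
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][k] + table[1][k] for k in (0, 1)]
    stat = 0.0
    for a in (0, 1):
        for b in (0, 1):
            expected = row_totals[a] * col_totals[b] / n
            if expected > 0:  # skip cells belonging to an empty margin
                stat += (table[a][b] - expected) ** 2 / expected
    return stat


def simulated_p_value(y, x, n_sim=10000, seed=0):
    """Monte Carlo p-value: share of permuted tables at least as extreme."""
    rng = random.Random(seed)
    observed = chi2_stat(y, x)
    shuffled = list(x)
    exceed = 0
    for _ in range(n_sim):
        rng.shuffle(shuffled)  # permuting x preserves both table margins
        if chi2_stat(y, shuffled) >= observed - 1e-12:
            exceed += 1
    return (1 + exceed) / (1 + n_sim)


# Toy data: 10 songs per author, feature unrelated to authorship.
y = [0] * 10 + [1] * 10
x_unrelated = [0, 1] * 10
p_unrelated = simulated_p_value(y, x_unrelated, n_sim=500)
```

A feature would be retained when its simulated p-value falls below the chosen screening threshold.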
Suppose as a result of the variable screening we retained $K$ variables, renumbered $1, \ldots, K$.
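The second step of the procedure involves a logistic regression model of the form

$$\operatorname{logit} \Pr(y_i = 1 \mid x_i) = \beta_0 + x_i^\top \beta, \tag{3}$$

where $x_i = (x_{i1}, \ldots, x_{iK})^\top$, and with model parameters $\beta_0$ and $\beta = (\beta_1, \ldots, \beta_K)^\top$. Given the possibly large number of musical features compared to the number of songs in our data set, we fit our logistic regression model through elastic net regularization. Letting

$$\ell(\beta_0, \beta \mid y, X^*) = \sum_{i=1}^{n} \left[ y_i (\beta_0 + x_i^\top \beta) - \log\left(1 + \exp(\beta_0 + x_i^\top \beta)\right) \right] \tag{4}$$

be the log-likelihood of the model parameters, where $X^*$ is the $n \times K$ matrix of $x_{ij}$ retained from variable screening, elastic net regularization seeks to find estimates of $\beta_0$ and $\beta$, conditional on $\alpha$ and $\lambda$, that minimize

$$-\frac{1}{n}\, \ell(\beta_0, \beta \mid y, X^*) + \lambda \left( \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 + \alpha \lVert \beta \rVert_1 \right), \tag{5}$$

where $\lVert \beta \rVert_2^2 = \sum_{j=1}^{K} \beta_j^2$ and $\lVert \beta \rVert_1 = \sum_{j=1}^{K} |\beta_j|$, and $\lambda \geq 0$ and $0 \leq \alpha \leq 1$ are tuning parameters. When $\alpha = 0$, regularization is of the form of a ridge ($L_2$) penalty, and when $\alpha = 1$ the logistic regression is fit with a Lasso ($L_1$) penalty.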
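To make the roles of the two tuning parameters concrete, the following sketch evaluates an elastic net objective for a given set of coefficients. It assumes the parameterization used by glmnet, in which the ridge term carries a factor of one half; the function name and the toy inputs are hypothetical, and the sketch evaluates the objective only (it does not perform the optimization).

```python
import math


def elastic_net_objective(beta0, beta, X, y, lam, alpha):
    """Negative average log-likelihood plus the elastic net penalty."""
    n = len(y)
    loglik = 0.0
    for xi, yi in zip(X, y):
        eta = beta0 + sum(b * v for b, v in zip(beta, xi))  # linear predictor
        loglik += yi * eta - math.log1p(math.exp(eta))
    ridge = sum(b * b for b in beta)       # squared L2 norm
    lasso = sum(abs(b) for b in beta)      # L1 norm
    penalty = lam * ((1 - alpha) / 2 * ridge + alpha * lasso)
    return -loglik / n + penalty


# With all coefficients at zero the penalty vanishes and every song has
# fitted probability 0.5, so the objective equals log(2).
baseline = elastic_net_objective(0.0, [0.0, 0.0], [[1, 0], [0, 1]], [1, 0],
                                 lam=0.3, alpha=0.5)
```

Setting `alpha=1.0` leaves only the lasso term, and `alpha=0.0` leaves only the ridge term.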
Optimization of the elastic net logistic regression parameters proceeds as follows. We consider the equally-spaced grid of values for α in {0.0, 0.1, . . . , 1.0}. For each candidate value of α, we consider 100 candidate values of λ. The choice of these candidate values is described in Friedman et al. (2010). For these 11×100 = 1100 candidate pairs (α, λ), we perform 5-fold cross-validation using the negative log-likelihoods evaluated at the withheld fold. Each fold is constructed by sampling songs stratified by author so that approximately 20% of Lennon and 20% of McCartney songs are contained in each fold. This approach preserves the balance in authorship within fold relative to the overall sample. We choose the minimizing pair of α and λ, and then minimize the target function in (5) over the coefficients β 0 and β. Zou and Hastie (2005) argued for considering the selection of λ based on a 1 standard error rule commonly used in regularization procedures, but we found in our application that choosing the minimum value resulted in better predictability.
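The stratified fold construction can be sketched as follows. This is one simple way to obtain folds balanced by author, assuming authorship is coded as 0/1; it is not necessarily the exact sampling scheme used in our implementation, and the function name is hypothetical.

```python
import random


def stratified_folds(y, n_folds=5, seed=0):
    """Assign each song to a fold so that every author is spread evenly."""
    rng = random.Random(seed)
    folds = [None] * len(y)
    for label in sorted(set(y)):
        indices = [i for i, value in enumerate(y) if value == label]
        rng.shuffle(indices)
        # Deal the shuffled songs of this author round-robin into the folds.
        for position, i in enumerate(indices):
            folds[i] = position % n_folds
    return folds


# 70 songs: 31 by McCartney (coded 1) and the remaining 39 by Lennon (coded 0).
y = [0] * 39 + [1] * 31
folds = stratified_folds(y)
```

Each fold then holds 7 or 8 Lennon songs and 6 or 7 McCartney songs, roughly 20% of each author's total.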
A natural extension to regularized logistic regression is to include interactions among the predictors. Among the difficulties of including all interaction terms in a regularized regression is that the likely higher degree of sparsity among the interactions compared with the individual features makes it difficult to identify the important interactions. Furthermore, high correlations among the variables can negatively impact selection. Work aimed at discovering important interactions in a more principled manner has been explored. Ruczinski, Kooperberg, and LeBlanc (2003, 2004) developed logic regression, a procedure that finds Boolean combinations of binary predictors in an approach similar to Bayesian CART (Chipman, George, & McCulloch, 1998). Logic regression prevents overfitting through the reduction of model complexity in growing the number of Boolean combinations that are formed. Procedures such as those by Bien, Taylor, and Tibshirani (2013) and Lim and Hastie (2015) involve building interactions only when the main effect terms are selected, and this is carried out by taking advantage of the group-lasso (Yuan & Lin, 2006). We explored these extensions to our approach, based on having already eliminated the rarely-occurring or frequently-occurring musical features, but found that out-of-sample predictability was worse than using only the additive effects of our features. An argument could be made that including interactions would better account for sets of musical features that are highly correlated. However, the extra flexibility associated with including interactions results in greater variance in the predictions that degrades our model's performance.
Rather than specifying a single significance level threshold for variable screening followed by regularized logistic regression, our selection procedure considered five different significance level thresholds: 1.0 (no variable screening), 0.75, 0.50, 0.25, and 0.10. We discuss in Section 5 the rationale for only four additional thresholds. We performed leave-one-out cross-validation in the following manner to choose the best threshold. Let $X_{(i)}$ and $y_{(i)}$ denote the predictor matrix and response vector with observation $i$ deleted. First, for a fixed threshold $t \in \{1.0, 0.75, 0.50, 0.25, 0.10\}$, we performed variable screening on $X_{(i)}$ followed by fitting elastic net logistic regression of $y_{(i)}$ based on the retained features (with 5-fold cross-validation within the $n - 1$ songs to obtain the elastic net parameter estimates). The out-of-sample predicted probability $\hat{p}_i^{(t)}$ for observation $i$ and threshold $t$ is then computed given $x_i$ from the fitted logistic regression. The negative log-likelihood for threshold $t$ is computed as

$$\mathrm{LL}^{(t)} = -\sum_{i=1}^{n} \left[ y_i \log \hat{p}_i^{(t)} + (1 - y_i) \log\left(1 - \hat{p}_i^{(t)}\right) \right].$$

The threshold $t = t_{\mathrm{opt}}$ with the minimum $\mathrm{LL}^{(t)}$ is the one chosen by this procedure. Once $t_{\mathrm{opt}}$ is determined, variable screening is performed using this threshold based on all $n$ observations followed by performing regularized logistic regression on the remaining features.
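The final selection step reduces to comparing negative log-likelihoods of leave-one-out predictions across thresholds. The sketch below assumes those predictions have already been computed; the probabilities shown are hypothetical values for two songs, and the function names are illustrative.

```python
import math


def negative_log_likelihood(y, p, eps=1e-12):
    """Negative Bernoulli log-likelihood of outcomes y given probabilities p."""
    total = 0.0
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1 - eps)  # guard against log(0)
        total -= yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
    return total


def choose_threshold(y, loo_probs_by_threshold):
    """Pick the screening threshold whose LOO predictions fit best."""
    return min(
        loo_probs_by_threshold,
        key=lambda t: negative_log_likelihood(y, loo_probs_by_threshold[t]),
    )


# Hypothetical leave-one-out predictions for two songs at two thresholds.
y = [1, 0]
loo_probs = {1.0: [0.5, 0.5], 0.25: [0.9, 0.2]}
t_opt = choose_threshold(y, loo_probs)
```

In this toy case the threshold 0.25 is chosen because its predictions are closer to the observed authorship.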

Model implementation and results
We applied our approach to authorship attribution developed in Section 3 to the corpus of 70 Lennon-McCartney songs based on the musical features described in Section 2. We first describe model summaries applied to the 70 Lennon-McCartney songs in the training data.
These summaries are based on a leave-one-out predictive analysis. We then fit our model to the full 70 songs, and use the results to make predictions on the songs and song portions that are of disputed authorship or are known to be collaborative.

Predictive validity and leave-one-out model summaries
A common approach to predictive validity in machine learning is to divide a data set into modeling, validation, and calibration subsets. Typically a model is constructed and validated iteratively on the first two subsets of the sample, and predictive properties of the approach are summarized on the withheld calibration set. See Draper (2013) for a good overview of this approach, which the author terms "calibration cross-validation." Given the small number of observations (songs) in our sample, our predictive accuracy would suffer by withholding a substantial calibration set, so instead we summarized our algorithm's quality of calibration through leave-one-out cross-validation. Specifically, we withheld one song at a time, and with the remaining 69 songs we performed the procedure described in Section 3. That is, with 69 songs at a time, we first optimized the choice of the p-value threshold for SIS through leave-one-out cross-validation (with a 68-versus-1 split to compute the out-of-sample negative log-likelihood), then with the variables selected based on the optimized p-value threshold we fit a logistic regression via elastic net regularization on the 69 songs (using 5-fold cross-validation to estimate the tuning parameters). Finally, based on the logistic regression fit, the probability estimate of the withheld song was computed. This process was performed for all 70 songs to obtain out-of-sample predictions for each song with known authorship.  Table 2, and for the 31 songs known to be written by McCartney in Table 3. In addition to the simple classification results, we performed a receiver operating characteristic (ROC) curve analysis on the out-of-sample probability predictions for the 70 songs and fragments. The results of the analysis, which was performed using the pROC library in R (Robin et al., 2011), are summarized in Figure 2. The c-statistic (or area under the ROC curve, AUC) is 0.837, which indicates a strong level of predictive discrimination.
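The c-statistic has a direct pairwise interpretation: it is the proportion of (McCartney song, Lennon song) pairs in which the McCartney song receives the higher predicted probability, with ties counted as one half. A minimal sketch of this computation follows; it illustrates the definition and is not the pROC implementation, and the four predictions are hypothetical.

```python
def c_statistic(y, p):
    """Area under the ROC curve via pairwise comparison of predictions."""
    positives = [pi for yi, pi in zip(y, p) if yi == 1]
    negatives = [pi for yi, pi in zip(y, p) if yi == 0]
    score = 0.0
    for a in positives:
        for b in negatives:
            if a > b:
                score += 1.0      # correctly ordered pair
            elif a == b:
                score += 0.5      # tie counts as half
    return score / (len(positives) * len(negatives))


# Hypothetical out-of-sample probabilities for four songs.
y = [1, 1, 0, 0]
p = [0.9, 0.4, 0.5, 0.1]
auc = c_statistic(y, p)  # 3 of the 4 pairs are correctly ordered
```

A value of 0.5 corresponds to predictions that order the two authors no better than chance, and 1.0 to perfect separation.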
For each of the 70 applications of optimized variable screening followed by regularized logistic regression based on 69 songs at a time, we recorded the optimal variable screening p-value threshold. We discovered that among the p-value thresholds in the candidate set, the signif- notion that Lennon composed songs in a more traditional "rock-and-roll" style. In general, these results suggest that the greater complexity in McCartney's music is a distinguishing feature exhibited by the coefficients in Table 4 that are positive.
In addition to the coefficients, we report a measure of variable importance in the third column of Table 4. Our measure has close connections to an early approach developed in the context of random forests (Breiman, 2001). In particular, the importance of a variable can be assessed by randomly permuting its values across observations, and then computing an overall measure of model performance. The lower the performance measure after permuting the variable, the more important the variable. For our approach, randomly permuting the values of a musical feature across songs is effectively equivalent to having the feature removed because sure independence screening should eliminate the feature in the first step of our prediction algorithm. Thus, our variable importance measure was computed as follows.
First, we removed the musical feature whose importance we wanted to assess. We then applied our out-of-sample procedure from Section 4.1 and computed 70 leave-one-out predicted probabilities. We performed an ROC analysis on these probabilities and the known authorship of the 70 songs and summarized the c-statistics in the third column of Table 4.
Lower values of the c-statistic indicate greater variable importance. The c-statistic without eliminating any features is 0.837, but some of the values in Table 4
We applied the fit of our model to make predictions for eight songs or song portions with disputed authorship, and for 11 known to be collaborations. The prediction probabilities were derived by applying the fitted logistic regression to the songs of unknown and collaborative authorship. We accompanied the probability predictions with approximate 95% confidence intervals calculated in the following manner. For each song of disputed or collaborative authorship, we computed 70 probability predictions based on leaving out each one of the 70 songs in our training sample. An approximate 95% confidence interval is constructed from the 2.5th and 97.5th percentiles of the 70 probability predictions for each song. It is worth noting that these intervals are conservative because one fewer song is used than with the corresponding point prediction. The probability predictions and corresponding confidence intervals are displayed in Tables 5 and 6. We also display the distributions over the 70 predicted probabilities for each disputed song as density estimates in Figure 4.
For the songs and fragments of disputed authorship, all of the probabilities are lower than 0.5, suggesting that each individually had a higher probability of being written by Lennon. The 95% confidence intervals are mostly less than 0.5, though "Wait" and the bridge of "In My Life" have confidence intervals that cross 0.5. The density plots in Figure 4 demonstrate the substantial uncertainty in the probability prediction for the bridge of "In My Life" and to a lesser extent for "Wait." In most instances, the conclusions based on our model seem to match up with the suspected authorship, as discussed by Compton (1988). According to Compton, the song "Ask Me Why," which Lennon sang, was likely written by Lennon. Similarly, "Do You Want to Know a Secret?" was one that Lennon recalled having written and then given to George Harrison to sing. In "A Hard Day's Night," the verse and chorus are known to have been written by Lennon (Rybaczewski, 2018; Wiener, 1986), but McCartney seemed to remember having collaborated, perhaps with the bridge, which he sang. While McCartney wrote most and possibly all of "Michelle," Lennon claimed in some interviews that he came up with the bridge on his own, but in other interviews asserted that the bridge was a collaboration with McCartney (Compton, 1988). "Wait" is also suspected to have been written by Lennon according to Compton (1988), though in Miles (1998)
Table 5, most of the collaborative songs in Table 6 were inferred to be mostly matching the style of Lennon. While four songs were inferred to be written more in McCartney's style, two exceptions are worth noting. The songs "Baby's in Black" and "The Word," according to Compton (1988), were both entirely collaborative, with Lennon having claimed that "The Word" was mostly his work. It is curious, in particular, that "The Word" is inferred with near certainty of being McCartney-authored. One feature of the song is the predominance of the flat third. This McCartney-like motif may be responsible for the high probability that the song is inferred to be written by McCartney. The other two songs, "From Me to You" and "She Loves You," were also more likely to be McCartney-authored. Compton (1988) reported that the former was claimed to be entirely collaborative, and that the latter was initiated by McCartney even though the song was written collaboratively.
Two of the collaborations are worthy of comment. While Lennon and McCartney co-wrote "She Loves You," Lennon remembered that "it was Paul's idea" (Compton, 1988), and the probability indicates that the song is weighted towards McCartney. On the other hand, our model's probability prediction for "I Want to Hold Your Hand," which was written "eyeball to eyeball" (Compton, 1988), is that the song is much more characteristic of Lennon's style. Indeed, in one of the Jann Wenner interviews (Wenner, 2009), Lennon opined about the beauty of the song's melody, and picked out that song along with his song "Help!" as the two Beatles songs he might have wanted to re-record. Perhaps the song was special to him because it bore much more of his imprint.
Of all Lennon-McCartney songs, "In My Life" has probably garnered the greatest amount of speculation about its true author. Rolling Stone magazine considered it to be the 23rd greatest song of all time (Rolling Stone, 2011). Our model produces a probability of 18.9% that McCartney wrote the verse, and a 43.5% probability that McCartney wrote the bridge, with a large amount of uncertainty about the latter. Because it is known that Lennon wrote the lyrics, it would not be surprising that he also wrote the music. Lennon claimed (Compton, 1988) that McCartney helped with the bridge, but that was the extent of his contribution.
Considering the verse and the bridge separately, the verse is much more consistent stylistically with Lennon's songwriting, and the bridge less so. That the bridge's probability of McCartney authorship is closer to 0.5 may be indicative of the collaborative nature of this part of the song, as suggested by Lennon.
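The approximate 95% intervals that accompany the predictions for disputed and collaborative songs are percentile intervals over the 70 leave-one-out predictions. A minimal Python sketch of that calculation follows. The interpolation rule between order statistics is an assumption, since the paper does not state which quantile definition was used, and the 70 input values are hypothetical.

```python
from typing import Sequence, Tuple


def percentile(values: Sequence[float], q: float) -> float:
    """Quantile (0 <= q <= 1) with linear interpolation between order statistics."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    h = (len(ordered) - 1) * q          # fractional position in the sorted list
    lo = int(h)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (h - lo) * (ordered[hi] - ordered[lo])


def loo_interval(preds: Sequence[float], level: float = 0.95) -> Tuple[float, float]:
    """Percentile interval over leave-one-out predictions for a single song."""
    tail = (1.0 - level) / 2.0
    return percentile(preds, tail), percentile(preds, 1.0 - tail)


# Hypothetical stand-in for the 70 predictions obtained for one disputed song,
# each computed after dropping a different training song.
fake_preds = [0.10 + 0.005 * i for i in range(70)]
lower, upper = loo_interval(fake_preds)
print(round(lower, 4), round(upper, 4))
```

An interval that straddles 0.5, as reported for "Wait" and the bridge of "In My Life," signals that the predicted author changes depending on which training song is withheld.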

Discussion
The approach to authorship attribution for Lennon-McCartney songs we developed in this paper has connections to methods used in attribution analysis of text documents. One important difference is that typical text analysis models rely on the relative frequencies of occurrence of words or word combinations. In a musical context, where repeats of musical features are intrinsic to a song's construction, the relative frequencies of the occurrence of the musical "words" may obscure their importance in characterizing an author's composition style. Another difference from typical text analysis problems is that songs include more than just one text stream. For our work, we specifically included songs' melodic note sequence and chord sequence as two streams in parallel. Our particular choice in the representation and analysis of Lennon-McCartney songs of the early Beatles period seemed to be sufficient in recovering a song's author with greater than 75% accuracy, and with a high level of discrimination (c-statistic of 0.837 from the ROC analysis).
Our model predictions, particularly for the songs with disputed authorship, seem to be sup-
In typical text analyses, the choice of "stop" words, i.e., the ones used in analyses to distinguish authorship style, is often made subjectively or at least by convention. The analogous decision in a musical context is arguably much more difficult, as the complexity of choices is far greater. In our work, we needed to make many subjective decisions that influenced the construction of musical features. Such decisions included what constituted the beginning and ending of melodic phrases, whether a key change (modulation) should reset the tonic of the song, whether ad-libbed vocals should be considered part of the melodic line, how to include dual melody lines that were sung in harmony, and so on. Our guiding principle was to make choices that could be viewed as the most conservative in the sense of having the least impact on the information in the data. For example, we omitted melodic information from ad-libbed vocals, and made phrasings of melodic lines as long as possible, as shorter lines introduced extra "rests" as part of the melodic transitions. Also, when it was not clear in cases of dual melody lines which was the main melody, we included both melody lines.
It is worth noting that the model developed here was not our first attempt. We explored variations of the presented approach before arriving at our final model, including versions that permitted interactions, alternative variable selection procedures such as recursive feature elimination and stepwise variable selection, models for the musical features as a function of authorship that were inverted using Bayes' rule, random forests, and several others.
A danger in exploring too many models, especially with our small sample size and without a true test/holdout set, is the potential to overfit. This concern may not be apparent in the presentation of our analytic summaries, which was the culmination of a series of model investigations. The concern of overfitting limited some of our explorations. For example, after having modest success using elastic net logistic regression without any variable preprocessing, we inserted variable screening parameterized by a p-value threshold based only on four threshold values. Using a greater range of thresholds, especially after having learned that elastic net alone was a promising approach, and that we were tuning the model parameters based on the same leave-one-out validation data, would have had the potential to produce overfitted predictions. We suspect that our final model, however, does not suffer from overfitting concerns in any appreciable way. First, the approach we present is actually fairly simple: the removal of musical features based on bivariate relationships with the response followed by regularized logistic regression. More complex procedures might raise questions about their generalizability. Second, we were cautious about optimizing the prediction algorithm and calibrating the predictability using out-of-sample criteria. For example, probability predictions involved leaving out data (one song at a time) to optimize the p-value threshold for variable screening, followed by leaving out portions of data (20% of the data that remained) to optimize the elastic net tuning parameters; and this entire procedure was performed leaving out one song at a time when making predictions for the songs of known authorship. This cascading application of cross-validation mitigates some of the natural concerns about possible overfitting.
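The cascading use of cross-validation described above can be sketched as two nested leave-one-out loops. The Python outline below is an illustration under stated assumptions, not the authors' implementation. `toy_fit_predict` is a hypothetical stand-in for "screening followed by elastic net logistic regression" (it screens on a difference in class means rather than a p-value), and the 5-fold tuning of the elastic net parameters is assumed to happen inside whatever `fit_predict` function is supplied.

```python
import math
from typing import Callable, List

Row = List[float]
# fit_predict(train_X, train_y, new_x, threshold) -> P(label == 1)
FitPredict = Callable[[List[Row], List[int], Row, float], float]


def neg_log_lik(p: float, y: int, eps: float = 1e-9) -> float:
    """Negative log-likelihood of one held-out label under probability p."""
    p = min(max(p, eps), 1.0 - eps)
    return -math.log(p if y == 1 else 1.0 - p)


def without(items: list, i: int) -> list:
    """Copy of items with element i removed."""
    return items[:i] + items[i + 1:]


def inner_loo_loss(X, y, threshold, fit_predict: FitPredict) -> float:
    """Inner loop: leave-one-out loss for one candidate screening threshold."""
    return sum(
        neg_log_lik(fit_predict(without(X, i), without(y, i), X[i], threshold), y[i])
        for i in range(len(X))
    )


def nested_loo(X, y, thresholds, fit_predict: FitPredict) -> List[float]:
    """Outer loop: withhold each song, tune on the rest, then predict it."""
    preds = []
    for i in range(len(X)):
        X_tr, y_tr = without(X, i), without(y, i)
        best = min(thresholds, key=lambda t: inner_loo_loss(X_tr, y_tr, t, fit_predict))
        preds.append(fit_predict(X_tr, y_tr, X[i], best))
    return preds


def toy_fit_predict(X, y, x_new, threshold):
    """Hypothetical model: keep features whose class means differ by at
    least `threshold`, then score the new row against the class midpoints."""
    score = 0.0
    for j in range(len(x_new)):
        ones = [row[j] for row, lab in zip(X, y) if lab == 1]
        zeros = [row[j] for row, lab in zip(X, y) if lab == 0]
        if not ones or not zeros:
            continue
        m1, m0 = sum(ones) / len(ones), sum(zeros) / len(zeros)
        if abs(m1 - m0) >= threshold:  # crude screening step
            score += (m1 - m0) * (x_new[j] - (m1 + m0) / 2.0)
    return 1.0 / (1.0 + math.exp(-score))


# Tiny made-up data set: feature 0 separates the classes, feature 1 is noise.
toy_X = [[1.0, 0.0], [1.0, 1.0], [0.9, 0.0], [0.0, 1.0], [0.1, 0.0], [0.0, 0.0]]
toy_y = [1, 1, 1, 0, 0, 0]
print(nested_loo(toy_X, toy_y, [0.0, 0.5], toy_fit_predict))
```

The key property is that the withheld song never influences the threshold chosen for its own prediction, which is what makes the reported accuracy an out-of-sample quantity.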
Our particular modeling approach does permit extensions to address wider sets of songwriter attribution applications. Our model assumes only two authors, but this is easily extended to multiple songwriters in larger applications by modeling authorship in a multinomial logit model, for example. Another extension of our model can address changes in an author's style over time. Our application to Lennon-McCartney songs focused on a time period where the songwriters' musical styles were not changing in profound ways. To include larger spans of time where a songwriter's style may be changing, one possibility is to assume a stochastic process on the musical feature effects for each author, such as through an autoregressive process. Such an approach acknowledges that an author's style is likely to evolve gradually over time and with an uncertain trajectory. This approach would be straightforward to implement in a Bayesian setting, though implementing such a model in conjunction with variable screening would involve methodological challenges.
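One way to write down the two extensions mentioned above is given below. The notation is hypothetical and not taken from the paper: $x_i$ is the vector of musical features for song $i$, $K$ is the number of candidate songwriters, $\beta_k$ is the vector of feature effects for songwriter $k$, and $t$ indexes time periods. A first-order autoregression is used as one simple instance of the stochastic process the text describes.

```latex
% Multinomial logit for K songwriters (songwriter K is the reference category)
\Pr(y_i = k \mid x_i)
  = \frac{\exp(\alpha_k + x_i^{\top}\beta_k)}
         {\sum_{l=1}^{K} \exp(\alpha_l + x_i^{\top}\beta_l)},
  \qquad \alpha_K = 0,\ \beta_K = 0 .

% Gradually evolving style: AR(1) process on the feature effects
\beta_{k,t} = \phi\,\beta_{k,t-1} + \varepsilon_{k,t},
  \qquad \varepsilon_{k,t} \sim \mathrm{N}(0, \sigma^2 I),
```

With $K = 2$ the first expression reduces to ordinary logistic regression, and setting $\phi = 1$ in the second gives a random walk on the feature effects.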
Several other limitations are worth mentioning. Our approach assumes that each song or (more relevantly) song portion contains sufficiently rich detail to capture musical information for distinguishing authorship. Shorter song fragments would have a scarcity of features, and probability predictions are expected to be less reliable. Furthermore, if the goal of this work was to make the most accurate predictions of a song's author, then our approach could clearly be improved by incorporating readily available additional information. Lyric content, information on a song's structure, use of rhythm, song tempo, time signature, and the identity of a song's actual singer or singers are all likely to be highly predictive and distinguishing of a song's authorship. Our decision to ignore this extra information is consistent with the larger goal of being able to establish the stylistic fingerprint of a songwriter based solely on a corpus of songs' musical content, using Lennon-McCartney songs as a sandbox for understanding the potential for this approach. Ultimately, the reduction of a songwriter's musical content into low-dimensional representations, such as a vector of musical feature effects, is the first step towards establishing musical signatures for songwriters that can be used for further analysis. For example, with many songwriters' styles characterized in a reduced form, it is possible to establish influence networks to learn about the diffusion of the creative process in popular music. With recent improvements in technology to convert audio information into formats amenable to the type of analysis we developed in this paper (Casey et al., 2008;Fu, Lu, Ting, & Zhang, 2011), larger-scale analyses of songwriters' styles are a potential area of exploration.

A Musical Background
A justification for the musical features chosen requires an understanding of Western popular music. Middle C, often denoted as C4, has frequency 261.6 Hz, and the well-known equally tempered 12-tone chromatic scale starting on note C4 is the sequence of notes C4, C#4, D4, D#4, E4, F4, F#4, G4, G#4, A4, A#4, B4, where each successive note is derived from the previous one by multiplying the frequency by $2^{1/12}$. In the above sequence, notes preceding the "4" (i.e., C, C#, D, D#, E, F, F#, G, G#, A, A#, B) are the pitches, and the number 4 refers to the octave of the note. The continuation of the sequence above is the same set of pitches, but at the next higher octave, that is, C5, C#5, D5, and so on. The 12 notes can also be visualized in a piano diagram.
The diatonic scale permeates much of Western music, and most popular songs (or portions of songs) can be analyzed to be based on a diatonic scale starting at a specific note belonging to one of the 12 pitch classes; the lowest note of the diatonic scale is called the major key, or just the key, of the song, and the note itself is the tonic of that key. Songs are also often found in a "minor" key, based on a minor scale. For our purposes, we associate, as is often done, the minor key with the major key three semitones up, as they share the same seven notes. This particular definition of a minor key is often called the natural minor, and is the relative minor of the associated major key. For example, the key of A minor (as a natural minor) consists of the notes (A, B, C, D, E, F, G), which are the same as those in the major key of C (C, D, E, F, G, A, B), so that A minor is the relative minor associated with C major. Because the major key and relative minor share the same notes on the diatonic scale, in our work we classify songs as being in the major key as a proxy for the diatonic notes.
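The relationships in this paragraph can be checked with a few lines of Python. The sketch below starts from the 261.6 Hz value for C4 given in the text. The whole-step and half-step pattern of the major scale (2, 2, 1, 2, 2, 2, 1 semitones) is the standard definition and is supplied here as an assumption, since it is not spelled out in the surviving text. All function names are hypothetical.

```python
PITCHES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
C4_HZ = 261.6                        # frequency of middle C as given in the text
MAJOR_STEPS = [2, 2, 1, 2, 2, 2, 1]  # semitone steps of a major (diatonic) scale


def frequency(semitones_above_c4: int) -> float:
    """Equal temperament: each semitone multiplies the frequency by 2**(1/12)."""
    return C4_HZ * 2 ** (semitones_above_c4 / 12)


def note_name(semitones_above_c4: int) -> str:
    """Pitch plus octave number, e.g. 12 semitones above C4 is C5."""
    octave = 4 + semitones_above_c4 // 12
    return f"{PITCHES[semitones_above_c4 % 12]}{octave}"


def major_scale(tonic: str) -> list:
    """The seven pitches of the major scale built on the given tonic."""
    idx = PITCHES.index(tonic)
    scale = [tonic]
    for step in MAJOR_STEPS[:-1]:    # the last step returns to the tonic
        idx = (idx + step) % 12
        scale.append(PITCHES[idx])
    return scale


def relative_minor(major_tonic: str) -> str:
    """Tonic of the natural minor key three semitones below the major tonic."""
    return PITCHES[(PITCHES.index(major_tonic) - 3) % 12]


print(note_name(9), round(frequency(9), 1))   # A4 comes out close to 440 Hz
print(major_scale("C"), relative_minor("C"))
```

Twelve successive multiplications by $2^{1/12}$ double the frequency, which is why C5 sits at exactly twice the frequency of C4.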
With a given key of a song, non-diatonic notes are usually specified by their relation to the tonic. So, for example, in the key of C, the flat third and flat seventh are E♭ and B♭ (and they could, equivalently, be called the raised second and raised sixth, as well).
A note transition or an interval is a pair of notes, where the size of the interval depends on the number of semitones between them. Some sample intervals include:
• unison is between two identical notes (e.g., C4 → C4).
• a major second consists of two notes where the second is two semitones (whole tone) up from the first (e.g., C4 → D4, F4 → G4).
• a major third consists of two notes where the second is four semitones (two whole tones) up from the first (e.g., C4 → E4, F4 → A4).
• a perfect fourth consists of two notes where the second is five semitones up from the first (e.g., D4 → G4).
• a perfect fifth consists of two notes where the second is seven semitones up from the first (e.g., A4 → E5).
• a major sixth consists of two notes where the second is nine semitones up from the first (e.g., D4 → B4).
• a major seventh consists of two notes where the second is 11 semitones up from the first (e.g., F4 → E5).
• an octave consists of two notes where the second is 12 semitones up from the first (e.g., C4 → C5).
The minor second, third, sixth, and seventh intervals arise by lowering the second note of the corresponding major interval by a semitone. For example, C → E♭ is a minor third.
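The interval definitions listed above amount to a lookup from a semitone count to a name. The following Python sketch encodes that lookup for ascending intervals of up to one octave. The helper names are hypothetical, and notes are written with sharps only, so E♭ is entered as its enharmonic equivalent D#.

```python
PITCHES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

INTERVAL_NAMES = {
    0: "unison",
    1: "minor second",
    2: "major second",
    3: "minor third",
    4: "major third",
    5: "perfect fourth",
    6: "augmented fourth",   # enharmonically a diminished fifth
    7: "perfect fifth",
    8: "minor sixth",
    9: "major sixth",
    10: "minor seventh",
    11: "major seventh",
    12: "octave",
}


def semitone_index(note: str) -> int:
    """Convert a note such as 'C#4' to a semitone count (octave digit last)."""
    pitch, octave = note[:-1], int(note[-1])
    return 12 * octave + PITCHES.index(pitch)


def interval_name(first: str, second: str) -> str:
    """Name of the ascending interval from `first` up to `second`."""
    semitones = semitone_index(second) - semitone_index(first)
    if not 0 <= semitones <= 12:
        raise ValueError("only ascending intervals up to an octave are covered")
    return INTERVAL_NAMES[semitones]


print(interval_name("A4", "E5"))
print(interval_name("C4", "D#4"))
```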
For intervals of a fourth and fifth, the term diminished applies when the top note of the corresponding interval is decreased by a semitone, and the term augmented applies when raising the top note a semitone. As an example, the interval C → G# is an augmented fifth in the key of C. In our choice of note transitions within pop songs, the diatonic notes (always relative to the key) have prime importance, with special emphasis on diatonic transitions to and from the tonic, transitions between small steps on the diatonic scale (which are fairly common in melody writing), and transitions along the pentatonic/blues scale.
• C major, the I major chord (the tonic), consisting of notes C, E, and G.
• D minor, the ii minor chord, consisting of notes D, F, and A.
• E minor, the iii minor chord, consisting of notes E, G, and B.
• F major, the IV major chord (the subdominant), consisting of notes F, A, and C.
• G major, the V major chord (the dominant), consisting of notes G, B, and D.
• A minor, the vi minor chord, consisting of notes A, C, and E.
The chorus typically has greater musical and emotional intensity than the verse, and contains identical lyrics across repeats within the song. It is common for songs to have a third musical section inserted between an occurrence of the chorus and a subsequent verse, called the bridge section. This section musically functions as a connector between the chorus and verse, and may even undergo a modulation, that is, resetting the song to a different key, if only temporarily. Other types of sections may appear in typical pop/rock music (e.g., intro, pre-chorus, outro), but the verse, chorus, and bridge are nearly universal components of a song.
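Returning briefly to harmony, the diatonic chords of C major listed earlier in this appendix can be generated mechanically by stacking every other note of the major scale on each scale degree. The Python sketch below does this under the standard major-scale step pattern. The function names are hypothetical, and the routine also produces the chord on the seventh degree (B diminished), which does not appear in the list above.

```python
PITCHES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
MAJOR_STEPS = [2, 2, 1, 2, 2, 2, 1]  # semitone steps of a major scale
# Chord quality from the two stacked intervals (root to third, third to fifth).
QUALITY = {(4, 3): "major", (3, 4): "minor", (3, 3): "diminished"}


def major_scale_indices(tonic: str) -> list:
    """Semitone positions of the seven scale degrees, starting at the tonic."""
    idx = PITCHES.index(tonic)
    out = [idx]
    for step in MAJOR_STEPS[:-1]:
        idx += step
        out.append(idx)
    return out


def diatonic_triads(tonic: str) -> list:
    """(root, quality, notes) for the triad built on each scale degree."""
    scale = major_scale_indices(tonic)
    # Extend one octave up so thirds can be stacked past the seventh degree.
    extended = scale + [s + 12 for s in scale]
    triads = []
    for degree in range(7):
        root, third, fifth = extended[degree], extended[degree + 2], extended[degree + 4]
        quality = QUALITY[(third - root, fifth - third)]
        notes = [PITCHES[n % 12] for n in (root, third, fifth)]
        triads.append((notes[0], quality, notes))
    return triads


for root, quality, notes in diatonic_triads("C"):
    print(root, quality, notes)
```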
More details about the basics of melodic and harmonic structure of popular music can be found in Benward (2014) and Middleton (1990).