Research Article
Is additive coding useful for morphological phylogenetic analyses? An empirical evaluation
Danilo César Ament‡, Eduardo A.B. Almeida§
‡ Universidade Federal de Lavras, Minas Gerais, Brazil
§ Universidade de São Paulo, Ribeirão Preto, Brazil
Open Access

Abstract

We address an old but still controversial question of morphological phylogenetics: whether additive (or ordered) coding is beneficial to properly extracting phylogenetic information from phenotypical variation. To empirically evaluate the value of the additive coding, we compared the impact of multistate additive, non-additive, and binary codings for 14 quantitative characters in a phylogenetic analysis of a genus of phorid flies (Diptera). First, we compared which of these morphological codings were most effective for the morphological matrix to approximate the results of a molecular data set. We then compared which morphological coding strategies yielded the best Bayesian posterior probabilities when concatenated to molecular data. We also calculated consistency and retention indices for each binary element of the additive characters and contrasted these results to a measure of phylogenetic signal. Overall, these indices were lower for additive characters than for the others but still indicate reasonable accommodation in the tree. Additive coding outperformed the multistate non-additive coding by recovering higher Bayesian posterior probabilities in the concatenated dataset. Additive coding was also among the best coding strategies for the morphological matrix to approximate the phylogenetic signal from an independent source of evidence—i.e., molecular results. Therefore, quantitative information coded as additive had reasonable phylogenetic congruence with other data and improved the phylogenetic results of morphological data in most cases. These results support the use of additive coding for phylogenetic analysis and encourage other similar empirical evaluations aiming to explore the generality of the benefits of this coding method.

Keywords

Homology, quantitative characters, methodology, morphology, ordered, phylogeny

1. Introduction

Morphology continues to occupy a very important place in phylogenetic literature amidst the growing importance of phylogenomics to the interpretation of relationships among taxa (Giribet 2015). On the one hand, Bayesian analyses of morphological data are in a ‘new golden age’ (Wright 2019). Models and priors once created for molecular probabilistic analyses have recently been reconsidered in the light of particularities of morphological characters (Lewis 2001; Klopfstein et al. 2015; Wright et al. 2016; Rosa et al. 2019, and others). On the other hand, studies have been questioning the fit of morphology to the common mechanisms assumed by model-based probabilistic analyses (Goloboff et al. 2018). This dispute was extensively discussed using simulations that differed in supporting parsimony (Goloboff et al. 2017) or Bayesian analyses (Wright and Hillis 2014; O’Reilly et al. 2016; Puttick et al. 2017) as the best optimality criteria for morphological characters.

However, some other fundamental aspects of interpreting morphological data and applying them have received less attention. One of these aspects includes deciding how to delimit homology hypotheses that will be incorporated into the analysis through character coding. Morphological characters and their states are not purely ‘observable data,’ as often referred to, but hypotheses that aim to recover the reality of continuity of information among morphological conditions, i.e., homology. Frequently, for the same set of morphological conditions, there are different ways to delimit homology hypotheses and code them as states and characters (e.g., Forey and Kitching 2000). The choice of a particular ‘coding strategy’ for a specific character differs substantially from the treatment of molecular data for which state delimitation is straightforward and precedes character delimitation. The coding strategy chosen with its homology hypotheses, states, and character delimitations can substantially affect the analytical results (Brazeau 2011; Simões et al. 2017), given that the analysis is itself the search for the tree that best accommodates the homology hypotheses (Farris 1983).

One of the main still open questions regarding coding strategies deals with the best way to delimit homology hypotheses in a character with a quantitative nature. Examples include characters accounting for the number of vertebrae in certain taxa, the proportionality of a given structure (i.e., the width-to-length ratio), or relative size of a structure, such as fins with different degrees of development in relation to body size. In such quantitative cases, three main coding schemes have been proposed: binarization (i.e., split the morphological variation into two states), multistate non-additive (or unordered), and multistate additive codings (or ordered).

Additive coding, the focus of this paper, provides a system of differential transition costs among states that sets a higher transition cost between states scored with more discrepant numbers, i.e., non-adjacent character states. This system is generally applied in cases where the morphological information is interpreted to have a continuity of information at the state level (used for state delimitation) and additionally among two or more states. Additive coding has frequently been used in meristic characters (i.e., those with counts) or numerical values such as raw measurements or proportions / ratios. Continuous (Goloboff et al. 2006) and aligned landmark data (Goloboff and Catalano 2011) can be considered types of additive characters using the same logic but with adjustments that exempt them from the need of information discretization.
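The differential transition costs described above can be illustrated with a minimal sketch. It assumes the standard parsimony convention in which the cost between two additive states equals the absolute difference of their scores, while every change between non-additive states costs one step; the function name `step_matrix` is illustrative only.

```python
def step_matrix(n_states, ordered):
    """Return an n x n matrix of parsimony transition costs.

    ordered=True  -> additive coding: cost grows with the distance between states
    ordered=False -> non-additive coding: every change costs one step
    """
    return [
        [abs(i - j) if ordered else int(i != j) for j in range(n_states)]
        for i in range(n_states)
    ]


additive = step_matrix(3, ordered=True)
non_additive = step_matrix(3, ordered=False)

for label, matrix in (("additive", additive), ("non-additive", non_additive)):
    print(label)
    for row in matrix:
        print("  ", row)
```

Under this convention a change between states 0 and 2 costs two steps in the additive matrix but only one step in the non-additive matrix.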

In Bayesian inference, additive information may be incorporated to the analysis by attributing to a character an instantaneous rate matrix in which the only transitions allowed are the ones between adjacent states. In a three-state character, for example, only transitions between 0 and 1 and between 1 and 2 would be allowed, with the instantaneous rate between states 0 and 2 set as 0. A lineage with state 0 would reach state 2 only by passing through the intermediary state 1 first (Ronquist et al. 2012).
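As a rough illustration of the three-state example above, the sketch below builds an instantaneous rate matrix in which only adjacent states exchange directly. It assumes a single, unscaled rate shared by all permitted transitions; the function name and the `rate` parameter are hypothetical and do not reproduce the internals of any particular program.

```python
def mk_rate_matrix(n_states, ordered, rate=1.0):
    """Build an unscaled instantaneous rate matrix Q for an Mk-type model.

    ordered=True  -> only transitions between adjacent states have a nonzero rate
    ordered=False -> all transitions share the same rate
    """
    q = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        for j in range(n_states):
            if i == j:
                continue
            allowed = abs(i - j) == 1 if ordered else True
            q[i][j] = rate if allowed else 0.0
        # Diagonal entries make every row sum to zero.
        q[i][i] = -sum(q[i])
    return q


q_ordered = mk_rate_matrix(3, ordered=True)
q_unordered = mk_rate_matrix(3, ordered=False)
print(q_ordered)    # the rate between states 0 and 2 is zero
print(q_unordered)
```

Although the instantaneous rate between states 0 and 2 is zero in the ordered matrix, a lineage can still move from 0 to 2 over a finite branch by passing through state 1.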

In contrast, some authors argue that evolution through intermediates is not a necessary assumption of the additive characters in the context of parsimony and that they could be alternatively seen as sets of contingent hypotheses of homology (Wilkinson 1992; Mickevich and Lipscomb 1991). A transition between 0 and 2 would be possible in an additive character, but it would require a higher cost as it violates more ideas of homology. In this view, additive coding could essentially not rely on extra assumptions other than the central ideas of phylogenetic inference, i.e., homology and how to accommodate it into trees (Hennigian auxiliary principle—shared information should be “always reason for suspecting kinship” and its “origin by convergence should not be assumed a priori”, Hennig 1966: p. 121).

The distinct aspects of the additive characters have received different levels of attention in the literature. Numerous studies have addressed the question of how to discretize states (or not to) for quantitative morphological features (e.g., Garcia-Cruz and Sosa 2006; Goloboff et al. 2006). Some studies discussed the logical basis and theoretical arguments in favor of or against additive coding (Pimentel and Riggins 1987; Hauser and Presch 1991; Mickevich and Lipscomb 1991; Wilkinson 1992; Grant and Kluge 2003). Finally, to our knowledge, only three studies evaluated the phylogenetic signal of additive characters (discretized or not) by comparing their information to other sources of phylogenetic evidence (to molecular characters in Hendrixson and Bond 2009 and Smith and Hendricks 2013, or to other types of morphological characters in Goloboff et al. 2006).

Across these studies, the merits of additive coding remain controversial. Among the 30 morphological phylogenetic analyses published in the journal Cladistics from volume 33 (2017) to volume 37 (2021), only 23% (7) coded at least one character as additive (while 17 did not use this coding for any character and six did not specify their coding strategies). Considering that these researchers probably found at least a few characters with a quantitative nature, we list two possible reasons for their choice not to code them as additive: they could have considered that additive characters bear unjustified evolutionary assumptions (following some of the previously cited studies that argue against this coding strategy), or they doubted the potential of this coding to contain higher phylogenetic signal than the other coding strategies.

Although we believe that the theoretical basis of additive coding still warrants discussion, a proper theoretical evaluation would require a lengthy treatment beyond the scope of the present paper. Instead, our focus here is on the potential empirical value of this coding method. We present the fourth study designed to evaluate in depth the empirical information of additive coding. We compare this and other coding methods by analyzing their congruence with other morphological characters and molecular data in an empirical phylogenetic analysis. As in the three previous studies, our analyses show that quantitative information coded as additive characters has higher levels of homoplasy than most other morphological characters. However, our analyses also show that the additive characters were important for the morphological data to approximate the results of the molecular information. We also find that additive coding yields better posterior probabilities than non-additive coding in a concatenated morphological/molecular matrix.

2. Material and Methods

All analyses of this study use the data from the phylogenetic analysis of the phorid fly genus Coniceromyia Borgmeier (Diptera: Phoridae) of Ament et al. (2021). The matrix of this study combined 77 morphological characters and gene fragments of nuclear and mitochondrial regions (arginine kinase, cytochrome oxidase I, 16S rDNA, and NADH1 dehydrogenase). The morphological characters were scored for almost all 133 taxa in the matrix, while the molecular information could be obtained for 84 of these taxa. All morphological characters and DNA sequence data used herein were obtained from Ament et al. (2021); DCA was responsible for recoding morphological characters in the various approaches described below. The main character matrices, genes sampled per species and trees used in the analyses described ahead were deposited in Zenodo as supplementary files (https://doi.org/10.5281/zenodo.15632525; Appendix 1 [Data Resources]).

2.1. Characters selected for additive coding and criteria for establishing transition costs

The first 14 morphological characters of Ament et al. (2021) deal with quantitative scenarios of four different types: ratios of structure measurements (characters 1–5, illustrated in Fig. 1C–E, G), seta counts on a particular structure (character 10, Fig. 1B), geometric morphometric characters describing the curvature of wing veins (characters 7–8, 12–14, Fig. 1F), or characters with other types of morphological continuity (characters 9, 11, Fig. 1H). These characters were included in the analysis of Ament et al. (2021) due to their conspicuousness, high interspecific variation in the genus, and the apparent correspondence of their variation to taxonomic groupings. Some conditions are particularly notable for Coniceromyia taxonomists, for example, the drastically elongated foremetatarsus of some species, or the shortened costa or widened space between wing veins M2 and CuA1 found in other taxa (characters 1, 4, and 5, respectively). Furthermore, these characters exhibit considerable intraspecific uniformity, which often allows their information to be used in species diagnosis (e.g., diagnosis of C. inflata, C. crassivena, and C. litopoda in Ament et al. 2020).

Figure 1. 

Illustrations of the 14 morphological characters of Ament et al. (2021) dealing with quantitative scenarios. These characters were evaluated using different coding strategies in this study. A Character 9, shape of the first flagellomere of the antenna, including the states globular, conical, and conical elongated (respectively illustrated from left to right). B Different conditions of the character 10, number of elongated setae on the foretibia (left to right, two, four and three, respectively). C Character 1, Width/length of the foremetatarsus. D Character 2, Width/length of the foretarsomere 2. E Character 4, length of the costal vein in relation to wing length, and Character 5, distance between CuA1 and M2/distance between M2 and M1. F Characters 7–8, 12–14, curvatures of the wing veins M1, M2 and CuA1 measured through morphometric analyses. G Character 3, hind femur height/length ratio. H Character 11, distribution of the basal group of strong setulae of the hind femur. Figure modified from Ament et al. (2021).

We considered the information of these characters adequate for evaluating the additive coding by recognizing their clear quantitative nature (characters 1–8, 10, 12–14) or differential continuity of information among their states (characters 9, 11). None of the other characters of the analysis met these criteria. The transition costs among states were established as inversely proportional to their continuity of information, i.e. lower transition costs were attributed between states with higher morphological affinities (Fig. 2C).

Figure 2. 

A Example of one of the 14 quantitative characters studied, and its morphological conditions coded as three distinct states. B–F Most coding strategies used to incorporate this character information into the phylogenetic analyses.

The morphometric characters describe the curvature of three wing veins. We established 15 equally spaced landmarks along each wing vein, and their geometric morphometric distances were calculated through principal component analysis (PCA) (Fig. 1F). The values of PC1 and PC2 describing the curvature of wing veins were used as the continuous values to be discretized and included in the analyses as additive. We considered this an interesting way to treat each wing vein as a whole and to capture the precise differences in curvature among the diverse wing veins of Coniceromyia (Fig. 1F; figs 13–16 in Ament et al. 2020).

Notably, principal components are understood to have problematic interpretations in certain comparative analyses (Uyeda et al. 2015). Some analyses are prone to be misled by the PCA sampling of the multivariate pattern and artefactually conclude that the principal components evolved via ‘early bursts’ processes (Uyeda et al. 2015). These problems with PCs are demonstrated to happen when using this information directly in the analyses (as PC values) and are especially relevant in analyses that focus on character rates (Uyeda et al. 2015). Our use of principal components differs from these problematic cases as PCA is employed here as a shape descriptor prior to the discretization of the information and in the context of a phylogenetic analysis.

The five measurement ratios and six geometric morphometric characters were discretized into six states of equal range (Ronquist et al. 2012). We are aware of the benefits of treating continuous characters as such (Goloboff et al. 2006) and that applying this method could have different results for the parsimony analyses. However, we considered our simplified approach sufficient for retrieving the general information of additiveness while maintaining the comparability of the different analyses.
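The equal-range discretization can be sketched as follows. This is only an illustration under stated assumptions: the observed range of a character is split into six intervals of equal width, the maximum value is assigned to the last state, and the input values are invented for the example rather than taken from the matrix.

```python
def discretize_equal_range(values, n_states=6):
    """Assign each continuous value to one of n_states bins of equal width.

    States are numbered 0 .. n_states - 1 so that they can be treated as an
    additive (ordered) character afterwards.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_states
    states = []
    for v in values:
        if width == 0:          # no variation: everything falls in state 0
            states.append(0)
            continue
        k = int((v - lo) / width)
        # The maximum value would land in bin n_states; keep it in the last state.
        states.append(min(k, n_states - 1))
    return states


# Hypothetical width/length ratios for six taxa (illustrative values only).
ratios = [0.25, 0.30, 0.45, 0.70, 0.80, 1.0]
print(discretize_equal_range(ratios))
```

Because the bins have equal width rather than equal occupancy, some states can remain empty when the values are clustered, as state 2 does in this example.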

2.2. Approaches to deal with the information of the selected characters

As detailed below, we compared six ways to deal with the 14 selected characters:

(1) removal from the analysis. This approach aimed to interpret the phylogenetic signal of the remaining 63 morphological characters and how incorporating the 14 focal characters in the different analyses may enhance or decrease this signal.

(2) regular multistate additive coding (or ordered). The characters were set in MrBayes by activating the command ‘ctype ordered: character numbers,’ and in TNT by activating the additivity in the character settings (Goloboff and Catalano 2016). In parsimony, this coding assigns lower transition costs to morphological conditions with greater continuity of information (Fig. 2C). In Bayesian analyses, character ordering restricts state transitions by allowing only transitions between adjacent states (Fig. 2F).

(3) decomposed additive character. The decomposed additive coding is the separation of the additive character information into binary characters following the proposal by Farris et al. (1970) (Fig. 2D). The binary characters originating from the decomposition of a single additive character will be referred to as sub-characters hereafter. The information of a three-state additive character may be decomposed into two binary sub-characters (Fig. 2D), while a six-state additive character is decomposable into five binary sub-characters. This coding is expected to differ from the regular additive coding in the analyses that deal with characters according to different weights (implied weights parsimony) or rates (Bayesian analysis), given that the sub-characters of the same additive character may be individually interpreted differently by these analyses.

(4) regular multistate non-additive coding (unordered). In parsimony, this coding allows transitions between any states and attributes the same cost to all transitions (Fig. 2B). In Bayesian inference, this coding also allows all transitions between states and assumes an equal transition rate for all of them (Fig. 2E).

(5) best binary coding. We examined the accommodation of the binary sub-characters of each decomposed additive character in the tree resulting from the analysis that included only taxa with both morphological and molecular information. Based on the retention indices (RIs), we selected the most informative binary sub-character for this tree and used it as the only representation of that additive character. The purpose of this coding was to approximate the best possible binary coding for the quantitative information.

(6) second-best binary coding. We followed the previous procedure but selected the sub-character with the second-best retention index.
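The decomposition used in approach (3) can be sketched as below, assuming the usual additive binary scheme in which sub-character k is scored 1 whenever the original state is greater than k. The function names are illustrative.

```python
def decompose_additive(state, n_states):
    """Decompose one additive state into n_states - 1 binary sub-characters."""
    return [1 if state > k else 0 for k in range(n_states - 1)]


def hamming(a, b):
    """Number of binary sub-characters in which two codings differ."""
    return sum(x != y for x, y in zip(a, b))


# A three-state additive character becomes two binary sub-characters.
for state in range(3):
    print(state, decompose_additive(state, 3))

# A six-state additive character becomes five binary sub-characters.
print(decompose_additive(3, 6))
```

With this scheme the number of sub-characters that differ between two taxa equals the additive cost between their original states, which is why the decomposed and regular additive codings are equivalent under equal-weights parsimony and can differ only when sub-characters are weighted or modeled individually.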

2.3. Analytical parameters

Equal weights and implied weighting parsimony analyses were run with TNT (Goloboff and Catalano 2016) using the four ‘new technology’ search algorithms in combination and under default options, stopping after reaching the best score 50 times. Implied weighting analyses used a K value of 3. Bayesian analyses were run with MrBayes (Ronquist et al. 2012) in Cipres (Miller et al. 2010). Each Bayesian analysis ran for 70 million generations using the Mk model (Lewis 2001) with the gamma distribution parameter for accommodating differences in character rates and the ‘variable’ coding specification to correct for bias of morphological character selection. The default MrBayes priors were used, except that overall rates were allowed to vary across partitions with “ratepr=variable”. Model parameters statefreq, revmat, shape, pinvar, and tratio were unlinked across partitions. All Bayesian runs were checked for a standard deviation of split frequencies under 0.01, a potential scale reduction factor (PSRF) close to 1, and an effective sample size above 100 for all parameters, to ensure that they reached a reliable exploration of the parameters. Stepping-stone analyses used 50 steps and the same parameters as above. The input files with the matrix and parameters can be found in Appendix 1.

2.4. Analysis 1

Comparison of morphological trees to the one obtained from the molecular data (Fig. 3)

Morphological matrices with the six codings for the 14 quantitative characters and including all taxa were analyzed under three optimality criteria: Bayesian, equal weights parsimony, and implied weighting parsimony—and compared to the results of the molecular data analyzed under the Bayesian approach. Three different partitioning strategies were applied to the morphological data in the Bayesian analyses: (1) unpartitioned, (2) one partition for the quantitative characters and another for the remaining characters, and (3) the homoplasy-based partitioning (Rosa et al. 2019). These comparisons aimed to investigate which morphological coding approach best aligns with the result of the molecular data, assuming this congruence as a proxy for phylogenetic accuracy.

Figure 3. 

Schematic representation of the original data and the three analyses performed in this study. The original morphological and molecular matrices, Tree A and Tree B are provided as supplementary data at https://doi.org/10.5281/zenodo.15632525.

Homoplasy-based partitioning (Rosa et al. 2019) is a method for categorizing characters into partitions with similar evolutionary rates. First, an implied weighting parsimony analysis is performed to assess the homoplasy levels of the characters. The range of homoplasy levels observed is divided into equal intervals, and the characters are clustered accordingly. In a subsequent Bayesian analysis, each cluster is treated as a partition with its own rate multiplier; however, tree topology and branch lengths are linked across all of them. Herein, we divided the characters into six partitions according to their homoplasy levels as described previously.
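The clustering step can be sketched as follows. This is a simplified illustration, not the implementation of Rosa et al. (2019): it assumes that each character already has a numerical homoplasy score from the implied weighting analysis, and the character names and scores below are hypothetical.

```python
def partition_by_homoplasy(scores, n_partitions=6):
    """Group characters into partitions spanning equal intervals of homoplasy.

    scores: dict mapping a character label to its homoplasy score.
    Returns a dict mapping partition index -> list of character labels;
    intervals that contain no character are omitted.
    """
    lo, hi = min(scores.values()), max(scores.values())
    width = (hi - lo) / n_partitions
    partitions = {}
    for name, score in scores.items():
        k = 0 if width == 0 else min(int((score - lo) / width), n_partitions - 1)
        partitions.setdefault(k, []).append(name)
    return dict(sorted(partitions.items()))


# Hypothetical homoplasy scores for six characters.
example_scores = {"c1": 0.0, "c2": 0.1, "c3": 0.3, "c4": 0.5, "c5": 0.75, "c6": 0.72}
print(partition_by_homoplasy(example_scores))
```

Each resulting group would then be declared as a separate partition with its own rate multiplier in the Bayesian analysis.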

The tree obtained from the molecular data was used as a benchmark for phylogenetic signal. Certainly, the result of the molecular analysis may change whenever more genes are added to this dataset. However, the molecular tree is a relevant comparison to the morphological one as the molecular analysis had relevant taxon and gene sampling and followed a detailed protocol for obtaining and analyzing the data. This resulted in a tree with high posterior probability values for most clades (Appendix 1). Moreover, the molecular dataset is expected to be independent of the morphological one and there are no reasons to suspect that artefactual molecular results would consistently favor a specific morphological coding scheme. Therefore, the high congruence between molecules and morphology we found in some of our analyses is mainly understood as shared phylogenetic signal.

The resulting trees were summarized into one tree through the command ‘allcompat’ in MrBayes, or as a strict consensus in parsimony analyses in TNT. Although the ‘allcompat’ summarization of trees generally is not the way Bayesian analysis deals with uncertainties (Brown et al. 2017), we understand that for the present purposes, the complete representation of the trees sampled, including clades present in less than 50% of the tree sample, allows a better understanding of the impact each coding had on the analysis.

The resultant tree had all taxa not present in the molecular tree removed using the ‘drop.tip’ function of the R package ape (Paradis and Schliep 2018). These two trees were then compared through quartet similarity distances calculated using the R package Quartet (Smith 2019). Quartets were chosen for tree-topological comparisons, as this metric provides a complete description of the similarities, dissimilarities, and the degree of resolution differences between two trees (Day 1986; Estabrook et al. 1985; Smith 2019). This similarity metric can account for similarities in the relative position of taxa, even when rogue taxa (i.e., those with highly variable phylogenetic positions) are present, which can make comparisons challenging. Notably, the quartet similarity metric does not suffer from the issue known to adversely affect Robinson-Foulds distances (Robinson and Foulds 1981), i.e., high distances between trees that differ only by one or a few taxa positioned in distant parts of these trees (Goloboff et al. 2017).
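The logic of a quartet comparison can be illustrated with a small pure-Python sketch. It is not the code of the Quartet package: trees are written as nested tuples of taxon names, only the taxa present in both trees are compared, and a quartet is counted as unresolved when either tree fails to resolve it. All names and the toy trees are illustrative.

```python
from itertools import combinations


def leaves(tree):
    """Return the set of taxon names below a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clades(tree):
    """Collect the leaf set under every internal node of the tree."""
    if isinstance(tree, str):
        return []
    found = [leaves(tree)]
    for child in tree:
        found.extend(clades(child))
    return found


def quartet_state(tree_clades, quartet):
    """Return the split (e.g. AB|CD) a tree implies for four taxa, or None."""
    q = frozenset(quartet)
    for clade in tree_clades:
        inside = q & clade
        if len(inside) == 2:      # this branch separates two taxa from the other two
            return frozenset([inside, q - inside])
    return None                   # the quartet sits in a polytomy


def compare_quartets(tree_a, tree_b):
    """Count quartets that are shared, different, or unresolved between two trees."""
    taxa = sorted(leaves(tree_a) & leaves(tree_b))
    clades_a, clades_b = clades(tree_a), clades(tree_b)
    counts = {"shared": 0, "different": 0, "unresolved": 0}
    for quartet in combinations(taxa, 4):
        state_a = quartet_state(clades_a, quartet)
        state_b = quartet_state(clades_b, quartet)
        if state_a is None or state_b is None:
            counts["unresolved"] += 1
        elif state_a == state_b:
            counts["shared"] += 1
        else:
            counts["different"] += 1
    return counts


tree_1 = (("A", "B"), ("C", ("D", "E")))
tree_2 = (("A", "C"), ("B", ("D", "E")))
star = ("A", "B", "C", "D", "E")
print(compare_quartets(tree_1, tree_2))
print(compare_quartets(tree_1, star))
```

In the toy example the two resolved trees disagree only on the quartets that contain A, B, and C together, while the comparison with the star tree returns every quartet as unresolved.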

2.5. Analysis 2

Comparison of posterior probabilities of the concatenated datasets (Fig. 3)

The multistate non-additive and additive codings were also compared through Bayesian posterior probabilities (BPPs). The morphological datasets with these codings were concatenated to the molecular dataset, and, for each of these concatenated matrices, the posterior probability was calculated through stepping stones sampling (Xie et al. 2011). This comparison aimed to investigate which morphological coding is best accommodated with itself, with the other 63 morphological characters, and with the molecular data. By using the posterior probabilities, this accommodation considers other elements besides the tree topology, such as branch lengths, character rates, and other parameters estimated by the analyses. The posterior probabilities were compared using a Bayes factor test: K was first computed as the ratio Pr(D|M1)/Pr(D|M2) (the prior probabilities of the two models being the same) and then interpreted following the table provided by Kass and Raftery (1995).
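Because stepping-stone estimates of Pr(D|M) are usually reported on the natural-log scale, the ratio K is most conveniently obtained from the difference of the two log values. The sketch below shows this arithmetic with the log10 categories of Kass and Raftery (1995). The input values are hypothetical and are not results of this study, and the function names are illustrative.

```python
import math


def log10_bayes_factor(ln_ml_1, ln_ml_2):
    """log10 of K = Pr(D|M1) / Pr(D|M2), from natural-log marginal likelihoods."""
    return (ln_ml_1 - ln_ml_2) / math.log(10)


def kass_raftery_label(log10_k):
    """Verbal category for evidence in favor of M1 (log10 scale)."""
    if log10_k < 0:
        return "favors model 2"
    if log10_k < 0.5:
        return "not worth more than a bare mention"
    if log10_k < 1:
        return "substantial"
    if log10_k < 2:
        return "strong"
    return "decisive"


# Hypothetical ln marginal likelihoods for two competing models.
value = log10_bayes_factor(-1000.0, -1002.0)
print(round(value, 3), kass_raftery_label(value))
```

Working with the difference of logs avoids exponentiating very negative numbers, which would underflow to zero for likelihoods of this magnitude.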

2.6. Analysis 3

Accommodation of the additive information in the tree (Fig. 3)

This analysis aims to explore how well the information of additive characters is accommodated in one of our better-supported trees, resulting from the analysis that includes only taxa for which we had both morphological and molecular data (Appendix 1). To assess the accommodation, we chose simple consistency and retention indices (CI and RI) which could be compared to those of binary and multistate non-additive characters. As the name suggests, the consistency index (CI) measures the consistency of characters with a given tree, varying from 1 to smaller values that approach 0 (the lower the value, the higher the inferred homoplasy of the data or discordance with the phylogenetic hypothesis). This index is calculated as follows: CI=m/s, where “m” is the minimum number of steps of a character and “s” is the observed number of steps of that character in the given tree. Similarly, the retention index (RI) measures potential and realized synapomorphy of a character on a given tree; it can be calculated using the formula RI=(g-s)/(g-m), where “g” is the maximum number of steps possible for a character. When the calculations are made jointly for all characters in the dataset, the results are referred to as the ensemble indices and are represented by capital letters (i.e., CI and RI).
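The two formulas can be checked with a short worked example. The sketch assumes a binary character, for which the minimum number of steps m is 1 and the maximum number of steps g equals the number of taxa bearing the rarer state. The scores and the observed step count are invented for illustration.

```python
def consistency_index(m, s):
    """CI = m / s (minimum possible steps over observed steps)."""
    return m / s


def retention_index(m, s, g):
    """RI = (g - s) / (g - m); undefined (None) when g equals m."""
    if g == m:
        return None
    return (g - s) / (g - m)


def binary_min_max_steps(states):
    """Return (m, g) for a binary character scored as 0/1 across taxa."""
    n0, n1 = states.count(0), states.count(1)
    return 1, min(n0, n1)


# Hypothetical binary sub-character: 6 taxa with state 0, 4 taxa with state 1,
# requiring s = 2 steps on the reference tree.
scores = [0] * 6 + [1] * 4
m, g = binary_min_max_steps(scores)
s = 2
print(consistency_index(m, s), retention_index(m, s, g))
```

Here the character changes twice where one change would suffice, giving a CI of 0.5, while two of the three possible extra steps are avoided, giving an RI of about 0.67.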

The traditional CI and RI, however, do not properly measure the information of the additive characters. These indices measure only the state-delimiting homology hypotheses and not the ideas of continuity among states particular to additive coding. Therefore, we propose a new way to use these indices to measure the accommodation of additive characters in a tree: to decompose each multistate additive character into binary sub-characters and to calculate their consistency and retention indices. The mean of these indices for each category of sub-character was compared to the mean of the binary and multistate non-additive characters.

Additionally, we measured the phylogenetic signal of the additive characters and compared it with that of the other characters using Blomberg’s K index (Blomberg et al. 2003). This statistic compares the observed character states at the tips of the reference tree with the states expected if the character evolved through Brownian motion (Blomberg et al. 2003). If the resultant K value is higher than 1, closely related species are more similar than expected under a Brownian motion model. We used the package ‘picante’ (Kembel et al. 2010) in RStudio (Posit team 2025) to calculate Blomberg’s K.

3. Results

3.1. Which coding strategy of morphological data better approaches results achieved with molecular data (Analysis 1)?

Considering all analyses performed, the Bayesian analysis of the morphological matrix without the 14 characters recovered a result of intermediate congruence with the molecular data (Fig. 4A, white horizontal line). The analyses that included the 14 characters varied with considerably better or worse approximations of the molecular tree (blue bars compared to the white line in Fig. 4A; Table 1). In general, the Bayesian analyses without data partitioning and with partitioning based on homoplasy resulted in a tree slightly less congruent with the molecular hypothesis. The parsimony analyses varied between recovering the tree most congruent with the molecular result (when all 14 characters were coded as additive) to largely unresolved trees that hinder comparison to the molecular results. Clearly, the coding strategy of the 14 quantitative characters impacted the parsimony analyses considerably more than the Bayesian analyses (Fig. 4A).

Figure 4. 

Comparison of the trees obtained using the different treatments of the morphological quantitative data and the tree obtained by the Bayesian analysis of the molecular data. ‘Morph data in four or more partitions’ refers to the homoplasy-based partitioning (Rosa et al. 2019). A Quartets in the morphological tree shared with the molecular tree (blue), which are different (orange), and which are unresolved (gray); the white line results from the Bayesian analysis without the 14 quantitative characters and was drawn to facilitate comparison. B Percentage of ‘accurately’ resolved quartets by the number of resolved quartets. The line is the best fit exponential regression calculated from the data. The colors represent the different treatments (i.e., blue, analyses without the 14 characters; red, analyses with the binary codings; orange, analyses with the additive coding; green, analyses with the non-additive multistate coding; purple, analyses with the 14 characters decomposed into binary sub-characters). — Abbreviations: BI, Bayesian inference; EW, equal weights parsimony; IW, implied weights parsimony.

Table 1.

All treatments for morphological data, phylogenetic analyses, and the data and results exploration performed in analysis 1. Columns 2–4 indicate the number of quartets shared between the tree resultant from the analysis of the morphological dataset and the tree from the Bayesian analysis of the molecular data. Column 5 indicates the percentage of resolved quartets congruent to the molecular results. — Abbreviations: BI, Bayesian Inference; part. Rosa et al. 2019, homoplasy-based partitioning suggested by Rosa et al.

Analyses | Quartets shared with the molecular tree | Quartets different from the molecular tree | Unresolved quartets | Percentage of correctly resolved quartets
BI, 1 partition, without 14 chars | 100880 | 77485 | 0 | 57
BI, 1 partition, addit. decomposed | 102199 | 76166 | 0 | 57
BI, 1 partition, additive | 97899 | 80466 | 0 | 55
BI, 1 partition, non-additive | 98524 | 79841 | 0 | 55
BI, 1 partition, best binary | 103286 | 75079 | 0 | 58
BI, 1 partition, second best binary | 112082 | 66283 | 0 | 63
BI, 2 partitions, addit. decomposed | 92478 | 85887 | 0 | 52
BI, 2 partitions, additive | 107660 | 70705 | 0 | 60
BI, 2 partitions, non-additive | 106462 | 71903 | 0 | 60
BI, 2 partitions, best binary | 105067 | 73298 | 0 | 59
BI, 2 partitions, second best binary | 106956 | 71409 | 0 | 60
BI, partit. Rosa et al. 2019, addit. decomposed | 98627 | 79738 | 0 | 55
BI, partit. Rosa et al. 2019, additive | 100595 | 77770 | 0 | 56
BI, partit. Rosa et al. 2019, non-additive | 92806 | 85559 | 0 | 52
BI, partit. Rosa et al. 2019, best binary | 106056 | 72309 | 0 | 59
BI, partit. Rosa et al. 2019, second best binary | 96721 | 81644 | 0 | 54
Parsimony without 14 chars | 33707 | 10423 | 134235 | 76
Parsimony additive | 118917 | 59322 | 126 | 67
Parsimony non-additive | 110243 | 62919 | 5203 | 64
Parsimony best binary | 17848 | 6515 | 154002 | 73
Parsimony second best binary | 16178 | 6982 | 155205 | 70
Impl Weigh without 14 chars | 74963 | 51906 | 51496 | 59
Impl Weigh additive decomposed | 102085 | 73852 | 2428 | 58
Impl Weigh additive | 102085 | 73852 | 2428 | 58
Impl Weigh non-additive | 89229 | 89136 | 0 | 50
Impl Weigh best binary | 111673 | 40734 | 25958 | 73
Impl Weigh second best binary | 87783 | 63205 | 27377 | 58
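
The percentage in the last column of Table 1 can be recomputed from the quartet counts. The sketch below assumes that the percentage is the number of shared quartets divided by the number of resolved quartets (shared plus different), with unresolved quartets excluded; this assumption reproduces the tabulated values for the three rows used here. The function name is illustrative only.

```python
def percent_correct(shared, different):
    """Percentage of resolved quartets that match the molecular tree.

    Assumption: unresolved quartets are excluded from the denominator.
    """
    resolved = shared + different
    return round(100 * shared / resolved)


# Three rows copied from Table 1: (shared, different, unresolved).
rows = {
    "BI, 1 partition, additive": (97899, 80466, 0),
    "Parsimony additive": (118917, 59322, 126),
    "Impl Weigh best binary": (111673, 40734, 25958),
}

for label, (shared, different, unresolved) in rows.items():
    total = shared + different + unresolved
    print(f"{label}: {percent_correct(shared, different)}% of resolved "
          f"quartets shared ({total} quartets in total)")
```

Because unresolved quartets are left out of the denominator, a poorly resolved tree can score a high percentage while sharing few quartets in absolute terms, which is why the counts and the percentage are best read together.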

Among the Bayesian analyses, the two-partition treatments generally yielded the results most congruent with the molecular ones. The decomposed additive characters were not more congruent with the molecular information than the regular additive coding. The additive coding generally gave the second most congruent result among the treatments, behind one of the binary codings, except in the equal-weights parsimony analysis, where the additive coding gave the result most congruent with the molecular data among all analyses. The best binary coding yielded the most congruent results in some analyses, such as those using homoplasy-based partitioning and implied weighting. However, the second-best binary coding in these analyses had a considerably less congruent result, even compared to the regression line of all the analyses (Fig. 4B). The non-additive coding was the treatment that departed most from the molecular results, except in the equal-weights parsimony analysis (Fig. 4A).

Three equal-weights parsimony analyses produced the trees with the lowest resolution, with less than a quarter of the total clades resolved, but also with a high percentage of clades congruent with those found in the molecular results (Figs 4A, B). Such a severe trade-off of resolution for accuracy results in a tree with several polytomies, which would not be satisfactory for most researchers who want to study the evolution of the genus. However, two other parsimony analyses yielded a tree with high resolution and a high percentage of clades resolved according to the molecular tree: the equal-weights additive coding and the implied-weighting best binary coding.

3.2. Which morphological coding has the best posterior probabilities when concatenated with the molecular data (Analysis 2)?

When morphological and molecular datasets were concatenated and the morphological matrix was divided into two partitions (one for quantitative and the other for the remaining characters), stepping-stone sampling recovered the model treating the quantitative characters as ordered with a higher marginal likelihood than the model treating them as unordered (ln marginal likelihood of –30365.03, in contrast to –30504.00). The Bayes factor calculated for this difference indicated substantial evidence favoring the additive coding as the best model for treating the data (log10 K value of 0.995).
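
As a minimal sketch of how two stepping-stone estimates can be compared, the snippet below takes the two reported values at face value as natural-log marginal likelihoods and converts their difference to the scales commonly used for Bayes factors. The variable names are illustrative. The result is a direct difference of the two logarithms, so it need not coincide with the log10 K quoted in the text if that figure was derived under a different convention.

```python
import math

# Reported ln marginal likelihoods from the stepping-stone runs.
ln_ml_ordered = -30365.03     # quantitative characters treated as ordered
ln_ml_unordered = -30504.00   # quantitative characters treated as unordered

# ln Bayes factor in favor of the ordered model.
ln_bf = ln_ml_ordered - ln_ml_unordered

# The same quantity on the log10 scale and on the 2*ln scale.
log10_bf = ln_bf / math.log(10)
two_ln_bf = 2 * ln_bf

print(f"ln BF (ordered vs unordered) = {ln_bf:.2f}")
print(f"log10 BF                     = {log10_bf:.2f}")
print(f"2 ln BF                      = {two_ln_bf:.2f}")
```

A positive value favors the ordered model; how strong the support is judged to be depends on which interpretive scale is adopted.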

3.3. How well accommodated is the additive information in a total-evidence tree (Analysis 3)?

The among-state homologies (measured as sub-characters) of the 14 studied characters had reasonable accommodation in the reference tree, as shown by consistency and retention indices (Fig. 5A). Still, these indices are considerably lower than those of the binary and multistate non-additive characters studied. In general, the sub-characters of the measurement ratios and geometric morphometric PC1 characters had similar mean values for CI and RI, around 20 each (Fig. 5A). Both types of characters also had higher CIs and lower RIs for more asymmetrical sub-characters (one state/five state homologies) than for more symmetrical ones (three state/three state homologies) (Fig. 5B). The additive characters with fewer states resulted in higher retention index values (‘other additive’ in Fig. 5A), sometimes even comparable to those of binary and multistate non-additive characters.
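
For reference, the two indices can be sketched for a single binary sub-character as follows. The code uses the standard definitions CI = m/s and RI = (g − s)/(g − m), where m is the minimum possible number of steps, g the maximum possible number of steps on any tree, and s the number of steps observed on the tree. Values are scaled to 0–100 to match the figures quoted in the text. The taxon counts and step count in the example are hypothetical and are not taken from the study's matrix.

```python
def binary_min_max_steps(n_state0, n_state1):
    """Minimum and maximum possible steps for a binary character.

    With both states present, at least one change is needed (m = 1) and
    at most one change per taxon bearing the rarer state (g = min count).
    """
    if n_state0 < 1 or n_state1 < 1:
        raise ValueError("both states must be present")
    return 1, min(n_state0, n_state1)


def consistency_index(min_steps, observed_steps):
    """CI = m / s, scaled to 0-100."""
    return 100 * min_steps / observed_steps


def retention_index(min_steps, max_steps, observed_steps):
    """RI = (g - s) / (g - m), scaled to 0-100."""
    if max_steps == min_steps:
        raise ValueError("RI is undefined for an uninformative character")
    return 100 * (max_steps - observed_steps) / (max_steps - min_steps)


# Hypothetical sub-character: 30 taxa with state 0, 10 with state 1,
# changing 4 times on the tree.
m, g = binary_min_max_steps(30, 10)
s = 4
print(f"CI = {consistency_index(m, s):.2f}")
print(f"RI = {retention_index(m, g, s):.2f}")
```

Because g depends on the frequency of the rarer state, asymmetrical splits leave less room between m and g, so a few extra steps depress RI more sharply than they would for a symmetrical split.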

Figure 5. 

Consistency and retention indices (CI and RI) and phylogenetic signal (Blomberg’s K) of the different types of characters in Tree A (i.e., tree resultant from the analysis of the combined morphological and molecular information). CI and RI were calculated for the additive characters as the mean of each of their sub-character indices; bars indicate standard errors. A Indices of the main types of characters. B Indices of the different sub-character types in comparison to the other additive, binary, and multistate characters in the matrix.

Among all the sub-character categories, only the more asymmetrical sub-characters (one sixth and five sixths of the variation coded as distinct states) of the PC1 morphometric data were poorly accommodated by the tree (mean RI = 6.25; Fig. 5B). The other categories have RI values ranging from 13 to 62, which reflect reasonable accommodation and grouping information. Notably, the multistate non-additive characters, which had the highest indices, deal with a very different scenario than the focal characters of this study. These non-additive characters do not meet the criteria we used to consider the application of additive coding, i.e. quantitative information or continuity between states, and therefore were not evaluated using other coding strategies.
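
The relationship between a six-state additive character and its binary sub-characters can be illustrated with a short sketch. It assumes the usual additive binary decomposition, in which sub-character t separates states at or below threshold t from those above it, so the first sub-character is a one state/five state split and the third is a three state/three state split. The function names are illustrative, and the decomposition used in the study may differ in detail.

```python
def decompose_additive(state, n_states):
    """Return the n_states - 1 binary sub-characters for one ordered state."""
    # Sub-character t is scored 1 when the state lies above threshold t.
    return [1 if state > t else 0 for t in range(n_states - 1)]


def ordered_cost(a, b):
    """Steps between two states under additive (ordered) coding."""
    return abs(a - b)


def unordered_cost(a, b):
    """Steps between two states under non-additive (unordered) coding."""
    return 0 if a == b else 1


def hamming(u, v):
    """Number of binary sub-characters that differ between two codes."""
    return sum(x != y for x, y in zip(u, v))


N_STATES = 6
codes = {s: decompose_additive(s, N_STATES) for s in range(N_STATES)}
for state, code in codes.items():
    print(state, code)

# The decomposed codes reproduce the ordered step count for every state pair.
equivalent = all(
    hamming(codes[a], codes[b]) == ordered_cost(a, b)
    for a in range(N_STATES)
    for b in range(N_STATES)
)
print("decomposition matches ordered cost:", equivalent)
print("state 0 to state 5:", ordered_cost(0, 5), "ordered step(s) vs",
      unordered_cost(0, 5), "unordered step(s)")
```

Each binary sub-character carries one among-state homology statement, which is why CI and RI can be computed per sub-character and then averaged for the additive character as a whole.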

The phylogenetic signals measured through Blomberg’s K statistic gave results similar to those of the consistency and retention indices: measurement ratios and geometric morphometric characters had the lowest Ks (mean of 0.64 and 0.63, respectively), followed by other additive (mean 0.77), binary (mean 0.95), and multistate non-additive characters (mean 1.04) (Fig. 5A). All character categories had high variance of K values, with additive characters varying from 0.38 to 1.30, binary characters from 0.16 to 4.11, and multistate non-additive from 0.35 to 3.05 (Appendix 1). Only one character of ‘other additive,’ seven ‘binary,’ and five ‘non-additive’ characters had K values higher than 1, implying that their observed states are more similar among closely related species than if these characters evolved under Brownian motion.

4. Discussion

Assessing phylogenetic accuracy or the quality of the phylogenetic signal of a set of characters is challenging because the optimal hypothesis for a given empirical dataset may (often) be incongruous with available molecular benchmarks. In this context, congruence is often interpreted as indicating an approximation of a correct result that emerges from a shared evolutionary history (Nixon & Carpenter 1996). Overall, the morphological quantitative characters of our dataset enabled the morphology-based topologies to approximate the topology derived from molecular data. We interpret this as evidence that the quantitative information in these characters captures valuable phylogenetic signal. Quantitative information should thus be considered in phylogenetic analysis, especially considering that morphological matrices often have a limited number of characters and could benefit from including relevant data (Wright & Hillis 2014). The results also showed the importance of the treatment of these quantitative characters, with some coding strategies yielding results that were much more congruent with the molecular ones than others (Fig. 4).

The additive coding provided the best or second-best approximation to the molecular results in most analyses (sometimes tied with other coding methods, Fig. 4). The “best” and “second-best” decomposed binary codings (binary divisions with the highest retention indices) varied in their approximation of the molecular tree, sometimes with more congruent results than the additive coding, other times with less congruent results (Fig. 4). Notably, the binary codings used splits of the variation into states that were designed a priori to maximize the synapomorphy potential of these characters (i.e., with higher retention indices). Even with these “ideal” binary splits, either one or both binary codings were outperformed by the additive coding in approximating the molecular results in most analyses. Two additional considerations are necessary to interpret the results of these binary codings: (a) less-ideal splits of the binary characters (with lower retention indices) could lead to results that are less congruent with the molecular tree and (b) in most analyses there is no way to know a priori which binary splits would contain more synapomorphic (or phylogenetic) information. Based on the varying results of our binary codings and on their inherent uncertainty during state delimitation, we conclude that the additive coding is a more consistent way to ensure that the phylogenetic information from quantitative data is incorporated into the analysis. As our empirical results show, the additive coding does not depend on an ideal scenario to achieve a more congruent approximation to the molecular tree.

The high levels of homoplasy (low consistency indices, CIs; Fig. 5) we found for the additive characters likely influenced how some analyses weighted their information. The equal-weights parsimony analysis allows additive characters to have more power in selecting the optimal trees because this analysis neither downweights characters with high homoplasy levels (as in implied weights parsimony) nor interprets characters according to estimated transition rates (as in model-based probabilistic analyses). Interestingly, the equal-weights parsimony analysis with additive coding produced a highly resolved tree that was the most congruent with the molecular data (Fig. 4). This suggests that the information of the additive characters may not have contributed to the other analyses with as much power as in the equal-weights approach, possibly due to the differential weighting applied to the various characters. The additive coding was more congruent with the molecular data than the non-additive in almost all analyses (Fig. 4), and it also yielded better results than the non-additive as assessed by Bayesian model comparison.
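
The downweighting mentioned above can be sketched with the cost function commonly used to describe implied weighting, in which a character with h extra steps contributes h/(h + k) to the tree score instead of h. The concavity constant k = 3 below is only an illustrative value and is not a setting reported in this section; the function names are likewise illustrative.

```python
def implied_weight_cost(extra_steps, k=3.0):
    """Cost of a character with `extra_steps` of homoplasy under implied weights.

    Assumption: the usual concave form h / (h + k); k = 3 is illustrative.
    """
    return extra_steps / (extra_steps + k)


def equal_weight_cost(extra_steps):
    """Under equal weights every extra step costs exactly one unit."""
    return float(extra_steps)


# Cost of adding one more step to a character that already has h extra steps.
marginal = [implied_weight_cost(h + 1) - implied_weight_cost(h) for h in range(8)]

for h, delta in enumerate(marginal):
    print(f"extra steps {h} -> {h + 1}: "
          f"implied weights +{delta:.3f}, equal weights +1.000")
```

Under this function, each additional step in an already homoplastic character changes the tree score less and less, so highly homoplastic additive characters have progressively less influence on tree selection than they do under equal weights.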

The complete information of the additive characters (measured as sub-characters) had more conflicting accommodation than the information conveyed by other characters, as shown by consistency and retention indices (Fig. 5). Still, as suggested by CI and RI, almost all categories of additive sub-characters had reasonable accommodation in the tree and grouping information (Fig. 5). The indices show that the additive characters had homoplastic but potentially helpful information. Additive characters also had the lowest mean phylogenetic signal, with K values ranging from 0.38 to 1.30 (Fig. 5A). Notably, other types of morphological characters also had low phylogenetic signal (K under 1, Fig. 5A). These low scores could be related to a possible poor fit of morphological characters to branch lengths, which enter the estimation of Blomberg’s K (discussed further below).

The limited number of studies investigating the phylogenetic information of additive characters hinders our understanding of their general aspects and usefulness for phylogenetic reconstruction. So far, only three other studies have been designed to evaluate the phylogenetic information conveyed by additive characters, comparing it to other sources of evidence: Goloboff et al. (2006), Hendrixson and Bond (2009), and Smith and Hendricks (2013). Our results align with those of these three studies, indicating four likely recurrent aspects of these characters that merit further empirical investigations, as explored in the topics below.

a) Additive characters may have, in general, high indices of homoplasy

High levels of homoplasy in the additive characters are demonstrated herein through low values of consistency and retention indices. Similarly, the findings of Hendrixson and Bond (2009) suggest that homoplastic evolution may account for the low phylogenetic signal observed in these characters.

b) Additive characters may have significant topological congruence with other morphological information and molecular data

Despite the high level of homoplasy of additive characters, the present study, along with Goloboff et al. (2006) and Smith and Hendricks (2013), has demonstrated their congruence with other data sources. Quantitative information in the form of additive characters has been shown to enhance the congruence of phylogenetic information of morphological and molecular data in some analyses of this study, as well as to increase clade support when using exclusively morphological data (Goloboff et al. 2006), and to recover relationships of certain taxa similar to those obtained from other sources of data (Smith and Hendricks 2013). In our analyses, incorporating additive characters into a dataset with other morphological characters yielded similar or better (but never worse) results regarding similarity to the molecular tree. This congruence with other data sources is evidence for a shared phylogenetic signal.

In contrast, Hendrixson and Bond (2009) found that continuous additive characters had low phylogenetic signal and low likelihood scores in a molecular reference tree. Our analyses also showed relatively low phylogenetic signal for additive characters measured through Blomberg’s K statistic (Fig. 5A). In addition to other factors, both measures account for the fit of these characters to branch lengths estimated from the molecular data. The poor fit to branch lengths may significantly influence the low scores of additive characters in these evaluations, as elaborated in the following topic.

c) Additive characters may not fit well to common mechanisms in the form of branch lengths (as morphological characters in general)

Goloboff et al. (2018) demonstrated that empirical sets of morphological data poorly fit common mechanisms represented by branch lengths. Although Goloboff et al. (2018) did not specifically evaluate additive characters, it is reasonable to expect that they may exhibit this pattern. If this is the case, it is important to investigate the extent to which poor fit to branch lengths affects the scores for the additive characters in analyses that consider this (e.g., Hendrixson and Bond 2009) and other forms of optimizing and evaluating their information, such as parsimony analysis.

d) Some additive characters may have more phylogenetic information than others

Hendrixson and Bond (2009) demonstrated that some additive characters might be significantly more congruent with molecular data than others, highlighting the importance of character selection in the analysis. Goloboff et al. (2006) also indicated that certain quantitative characters could contain more useful phylogenetic information for the analysis. Our results showed additive characters with Blomberg’s K values ranging from 0.38 to 1.30 (Table 2), also confirming that they may vary considerably in phylogenetic information. However, there is still no clear way to a priori detect which quantitative data contain more useful phylogenetic information and therefore could contribute more to the analysis in the form of additive characters.

In our analyses, we selected characters with a quantitative nature that could be relevant to include in the analysis, following the criterion of including characters with an apparent taxonomic relevance, as done by Smith and Hendricks (2013). This criterion is hard to describe and has likely been implemented in different ways by various researchers. In our case, we first observed a high level of interspecific variation in the information we would eventually include as quantitative characters, much higher than most phenotypic variation in the genus (except terminalia features and coloration). We also knew that the variation of this quantitative information is, in many cases, appropriate for species delimitation (e.g., frequently used in species diagnosis in Ament et al. 2020) and that some of the modifications of these characters are exclusive to some species of the genus (i.e. trait conditions that do not occur in other species of the genus, nor in any closely related genera). Finally, we observed that some conditions of these characters are present in species that share other characteristics, either morphological features or similar geographical distributions.

5. Conclusions

Our study demonstrates that quantitative information coded as additive characters has the empirical potential to improve the phylogenetic results, at least in some scenarios and according to the measurements employed here. Additive coding outperformed multistate non-additive coding, as assessed by Bayesian statistics and the congruence of phylogenetic signal from an independent source of phylogenetic evidence. This coding strategy should, therefore, be considered more often in phylogenetic analyses than it is today, especially when using parsimony. The potential of this coding approach can be readily evaluated in Bayesian inference by comparing ordered and unordered models using Bayes factors. Despite the evidence favoring additive coding, this approach has rarely been addressed in depth in the literature, and further empirical evaluations are necessary to better understand the generality of its potential. Until progress is made in these discussions, a reasonable solution to explore the potential of additive characters could be to investigate the impact of this coding method in analyses using different optimality criteria, as done in this study.

6. Declarations

Author contributions. DCA: data curation. DCA and EABA: conceptualization, formal analysis, investigation, methodology, project administration, resources, validation, visualization, writing original draft, writing review & editing, funding acquisition.

Competing interests. The authors have declared that no competing interests exist.

Funding. D.C.A. was supported by scholarships granted by Coordenação de Aperfeiçoamento de Pessoal de Nível Superior—Brasil (CAPES)—Finance Code 001. This research was partly supported by the São Paulo Research Foundation (FAPESP grants 2018/09666-5, 2019/09215-6) and by the Brazilian National Council of Technological and Scientific Development—CNPq grant 310111/2019-6 to E.A.B.A.

Ethical aspects. There are no ethical notes or aspects to declare.

Permissions. There are no permissions to declare.

7. Acknowledgments

We thank Dr. Brendon Boudinot, Dr. April M. Wright, and an anonymous reviewer for their valuable suggestions on our manuscript. We are also thankful to Diego S. Porto and Felipe V. Freitas for discussions about the manuscript ideas and help with the analyses. DCA's research was possible in part thanks to Centro de Biodiversidade e Patrimônio Genético (UFLA, Lavras, Brazil).

8. References

  • Ament DC (2025) Supplementary files of Ament & Almeida (submitted) Is additive coding useful for morphological phylogenetic analyses? An empirical evaluation [Data set]. Zenodo. https://doi.org/10.5281/zenodo.15632525
  • Ament DC, Kung Giar-Ann, Brown BV (2020) Forty-one new species of Coniceromyia Borgmeier (Diptera: Phoridae), an identification key, and new distributional records for the species of the genus. Zootaxa 4830(1): 001–061. https://doi.org/10.11646/zootaxa.4830.1.1
  • Ament DC, Hash JM, Almeida EAB (2021) Remarkable sexually dimorphic features of Coniceromyia (Diptera: Phoridae): evolution in the light of the phylogeny and comparative evidence about their function. Biological Journal of the Linnean Society 132: 521–538. https://doi.org/10.1093/biolinnean/blaa217
  • Blomberg SP, Garland Jr T, Ives AR (2003) Testing for phylogenetic signal in comparative data: behavioral traits are more labile. Evolution 57: 717–745.
  • Brown JW, Parins-Fukuchi C, Stull GW, Vargas OM, Smith SA (2017) Bayesian and likelihood phylogenetic reconstructions of morphological traits are not discordant when taking uncertainty into consideration: a comment on Puttick et al. Proceedings of the Royal Society B 284: 1–3. https://doi.org/10.1098/rspb.2017.0986
  • Estabrook GF, McMorris FR, Meacham CA (1985) Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Systematic Zoology 34(2): 193–200. https://doi.org/10.2307/24133
  • Farris JS (1983) The logical basis of phylogenetic analysis. In: Platnick N, Funk V (Eds) Advances in Cladistics II, New York: Columbia University Press, 7–36.
  • Farris JS, Kluge AG, Eckardt MJ (1970) A numerical approach to phylogenetic systematics. Systematic Zoology 19: 172–189. https://doi.org/10.2307/2412452
  • Forey P, Kitching I (2000) Experiments in coding multistate characters. In: Scotland RW, Pennington RT (Eds) Homology and systematics: coding characters for phylogenetic analysis. London: Taylor & Francis, 54–80.
  • Goloboff PA, Catalano S (2016) TNT version 1.5, including a full implementation of geometric morphometrics. Cladistics 32: 221–238. https://doi.org/10.1111/cla.12160
  • Goloboff PA, Torres A, Arias S (2017) Weighted parsimony outperforms other methods of phylogenetic inference under models appropriate for morphology. Cladistics 34: 407–437. https://doi.org/10.1111/cla.12205
  • Goloboff PA, Pittman M, Pol D, Xu X (2018) Morphological data sets fit a common mechanism much more poorly than DNA sequences and call into question the Mkv Model. Systematic Biology 68(3): 494–504. https://doi.org/10.1093/sysbio/syy077
  • Hendrixson BE, Bond JE (2009) Evaluating the efficacy of continuous quantitative characters for reconstructing the phylogeny of a morphologically homogeneous spider taxon (Araneae, Mygalomorphae, Antrodiaetidae, Antrodiaetus). Molecular Phylogenetics and Evolution 53: 300–313. https://doi.org/10.1016/j.ympev.2009.06.001
  • Hennig W (1966) Phylogenetic Systematics. Univ. of Illinois Press, Urbana.
  • Källersjö M, Albert VA, Farris JS (1999) Homoplasy increases phylogenetic structure. Cladistics 15: 91–93.
  • Kembel SW, Cowan PD, Helmus MR, Cornwell WK, Morlon H, Ackerly DD, Blomberg SP, Webb CO (2010) Picante: R tools for integrating phylogenies and ecology. Bioinformatics 26: 1463–1464.
  • Klopfstein S, Vilhemsen L, Ronquist F (2015) A nonstationary Markov model detects directional evolution in hymenopteran morphology. Systematic Biology 64(6): 1089–1103. https://doi.org/10.1093/sysbio/syv052
  • Koch NM, Soto IM, Ramírez MJ (2015) First phylogenetic analysis of the family Neriidae (Diptera), with a study on the issue of scaling continuous characters. Cladistics 31: 142–165. https://doi.org/10.1111/cla.12084
  • Miller MA, Pfeiffer W, Schwartz T (2010) “Creating the CIPRES Science Gateway for inference of large phylogenetic trees” in Proceedings of the Gateway Computing Environments Workshop (GCE), 14 Nov. 2010, New Orleans, LA pp 1–8.
  • O’Reilly J, Puttick M, Parry L, Tanner A, Tarver J, Fleming J, Pisani D, Donoghue P (2016) Bayesian methods outperform parsimony but at the expense of precision in the estimation of phylogeny from discrete morphological data. Biology Letters 12: 20160081. https://doi.org/10.1098/rsbl.2016.0081
  • Pimentel RA, Riggins R (1987) The nature of cladistic data. Cladistics 3: 201–209.
  • Posit team (2025) RStudio: Integrated Development Environment for R. Posit Software, PBC, Boston, MA. http://www.posit.co
  • Puttick M, O’Reilly J, Tanner A, Fleming J, Clark J, Holloway L, Lozano-Fernandez J, Parry L, Tarver J, Pisani D (2017) Uncertain-tree: Discriminating among competing approaches to the phylogenetic analysis of phenotype data. Proceedings of the Royal Society B 284: 20162290. https://doi.org/10.1098/rspb.2016.2290
  • Ronquist F, Teslenko M, Van der Mark P, Ayres DL, Darling A, Höhna S, Larget B, Liu L, Suchard MA, Huelsenbeck JP (2012) MrBayes 3.2: Efficient Bayesian phylogenetic inference and model choice across a large model space. Systematic Biology 61(3): 539–542. https://doi.org/10.1093/sysbio/sys029
  • Rosa BB, Melo GAR, Barbeitos MS (2019) Homoplasy-based partitioning outperforms alternatives in Bayesian analysis of discrete morphological data. Systematic Biology 68(4): 657–671. https://doi.org/10.1093/sysbio/syz001
  • Simões TR, Caldwell MW, Palci A, Nydam RL (2017) Giant taxon-character matrices: quality of character constructions remains critical regardless of size. Cladistics 33: 198–219. https://doi.org/10.1111/cla.12163
  • Smith UE, Hendricks JR (2013) Geometric Morphometric Character Suites as Phylogenetic Data: Extracting Phylogenetic Signal from Gastropod Shells. Systematic Biology 62(3): 366–385. https://doi.org/10.1093/sysbio/syt002
  • Wagner GP (2001) The Character Concept in Evolutionary Biology. Academic Press, San Diego, 622 pp.
  • Wright AM (2019) A systematist’s guide to estimating Bayesian phylogenies from morphological data. Insect Systematics and Diversity 3(3): 1–14. https://doi.org/10.1093/isd/ixz006
  • Wright AM, Hillis D (2014) Bayesian analysis using a simple likelihood model outperforms parsimony for estimation of phylogeny from discrete morphological data. PLoS One 9: e109210. https://doi.org/10.1371/journal.pone.0109210
  • Wright A, Lloyd G, Hillis D (2016) Modeling character change heterogeneity in phylogenetic analyses of morphology through the use of priors. Systematic Biology 65(4): 602–611. https://doi.org/10.1093/sysbio/syv122
  • Xie W, Lewis PO, Fan Y, Kuo L, Chen M (2011) Improving Marginal Likelihood Estimation for Bayesian Phylogenetic Model Selection. Systematic Biology 60(2): 150–160. https://doi.org/10.1093/sysbio/syq085

Appendix 1

Data resources

Authors: Ament DC, Almeida EAB (2025)

Supplementary files deposited in Zenodo (https://doi.org/10.5281/zenodo.15632525).

Explanation notes

Molecular matrix.nex—Matrix of 2477 molecular characters for Coniceromyia species and outgroups from Ament et al. (2021), genes Arginine Kinase, cytochrome oxidase I, 16S rDNA, and NADH1 dehydrogenase, used to obtain tree B. This matrix was used in ‘Analysis 1’ of this paper and concatenated to the morphological data in ‘Analysis 2.’

Morph matrix 1_quantitative chars with 6 states.ss—Matrix of 77 morphological characters for Coniceromyia species and outgroups from Ament et al. (2021), quantitative characters discretized into six states. This matrix was used in ‘Analysis 1’ of this paper and concatenated to the molecular data in ‘Analysis 2.’

Morph matrix 2_quantitative chars decomposed into binary characters.ss—Matrix of morphological characters for Coniceromyia species and outgroups from Ament et al. (2021) with the quantitative characters decomposed into binary characters. Matrix used in ‘Analysis 1’ and ‘Analysis 3’ of this paper.

Morph matrix 3_quantitative chars best binary opt.ss—Matrix of 77 morphological characters for Coniceromyia species and outgroups from Ament et al. (2021) with the quantitative characters represented as their best binary coding (as explained in methods). Matrix used in ‘Analysis 1’ of this paper.

Morph matrix 4_quantitative chars 2best binary opt.ss—Matrix of 77 morphological characters for Coniceromyia species and outgroups from Ament et al. (2021) with the quantitative characters represented as their second best binary coding (as explained in methods). Matrix used in ‘Analysis 1’ of this paper.

Tree A obtained with all data analyzed.tre—Tree resultant from the analysis including only taxa for which we had both morphological and molecular data, used in ‘Analysis 3’ of this paper.

Tree A_All data.tif—Tree resultant from the analysis including only taxa for which we had both morphological and molecular data, used in ‘Analysis 3’ of this paper.

Tree B obtained with only molecular data.tre—Tree resultant from the analysis of the molecular matrix used in ‘Analysis 1’ of this paper.

Tree B_Molecular data.tif—Tree resultant from the analysis of the molecular matrix used in ‘Analysis 1’ of this paper.

SSMolecMorph2PartCharsAdditiv.nex—File used for stepping stones calculation of the concatenated morphological and molecular matrix with the quantitative characters treated as additive.

SSMolecMorph2PartCharsNonAdditiv.nex— File used for stepping stones calculation of the concatenated morphological and molecular matrix with the quantitative characters treated as nonadditive.

Phylogenetic signal Blomberg K.xlsx— Blomberg’s K index (Blomberg et al., 2003) calculated for the morphological characters of this study in the Tree A.

Molecular data_genes sampled per species.xlsx— Molecular data used in the analyses of this paper, including genes per species and Genbank accession numbers.

Copyright notice: This dataset is made available under the Open Database License (http://opendatacommons.org/licenses/odbl/1.0). The Open Database License (ODbL) is a license agreement intended to allow users to freely share, modify, and use this dataset while maintaining this same freedom for others, provided that the original source and author(s) are credited.
