Research Article
Corresponding author: Danilo César Ament (danament@gmail.com)
Academic editor: Brendon Boudinot
© 2025 Danilo César Ament, Eduardo A.B. Almeida.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Abstract
We address an old but still controversial question of morphological phylogenetics: whether additive (or ordered) coding is beneficial to properly extracting phylogenetic information from phenotypical variation. To empirically evaluate the value of the additive coding, we compared the impact of multistate additive, non-additive, and binary codings for 14 quantitative characters in a phylogenetic analysis of a genus of phorid flies (Diptera). First, we compared which of these morphological codings were most effective for the morphological matrix to approximate the results of a molecular data set. We then compared which morphological coding strategies yielded the best Bayesian posterior probabilities when concatenated to molecular data. We also calculated consistency and retention indices for each binary element of the additive characters and contrasted these results to a measure of phylogenetic signal. Overall, these indices were lower for additive characters than for the others but still indicate reasonable accommodation in the tree. Additive coding outperformed the multistate non-additive coding by recovering higher Bayesian posterior probabilities in the concatenated dataset. Additive coding was also among the best coding strategies for the morphological matrix to approximate the phylogenetic signal from an independent source of evidence—i.e., molecular results. Therefore, quantitative information coded as additive had reasonable phylogenetic congruence with other data and improved the phylogenetic results of morphological data in most cases. These results support the use of additive coding for phylogenetic analysis and encourage other similar empirical evaluations aiming to explore the generality of the benefits of this coding method.
Keywords: Homology, quantitative characters, methodology, morphology, ordered, phylogeny
Morphology continues to occupy a very important place in phylogenetic literature amidst the growing importance of phylogenomics to the interpretation of relationships among taxa (
However, some other fundamental aspects of interpreting morphological data and applying them have received less attention. One of these aspects includes deciding how to delimit homology hypotheses that will be incorporated into the analysis through character coding. Morphological characters and their states are not purely ‘observable data,’ as often referred to, but hypotheses that aim to recover the reality of continuity of information among morphological conditions, i.e., homology. Frequently, for the same set of morphological conditions, there are different ways to delimit homology hypotheses and code them as states and characters (e.g.,
One of the main open questions regarding coding strategies concerns the best way to delimit homology hypotheses in a character with a quantitative nature. Examples include characters accounting for the number of vertebrae in certain taxa, the proportionality of a given structure (e.g., the width-to-length ratio), or the relative size of a structure, such as fins with different degrees of development in relation to body size. In such quantitative cases, three main coding schemes have been proposed: binarization (i.e., splitting the morphological variation into two states), multistate non-additive (or unordered) coding, and multistate additive (or ordered) coding.
Additive coding, the focus of this paper, provides a system of differential transition costs among states that sets a higher transition cost between states scored with more discrepant numbers, i.e., non-adjacent character states. This system is generally applied in cases where the morphological information is interpreted to have a continuity of information at the state level (used for state delimitation) and additionally among two or more states. Additive coding has frequently been used in meristic characters (i.e., those with counts) or numerical values such as raw measurements or proportions / ratios. Continuous (
In Bayesian inference, additive information may be incorporated to the analysis by attributing to a character an instantaneous rate matrix in which the only transitions allowed are the ones between adjacent states. In a three-state character, for example, only transitions between 0 and 1 and between 1 and 2 would be allowed, with the instantaneous rate between states 0 and 2 set as 0. A lineage with state 0 would reach state 2 only by passing through the intermediary state 1 first (
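The contrast between an ordered and an unordered rate matrix can be sketched as follows. This is a minimal illustration, assuming equal rates for all permitted transitions; the function names and the unit rate are hypothetical and do not reproduce any particular program's implementation.

```python
def ordered_rate_matrix(n_states, rate=1.0):
    """Rate matrix in which only transitions between adjacent states are allowed."""
    q = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states - 1):
        q[i][i + 1] = rate  # e.g., 0 -> 1
        q[i + 1][i] = rate  # e.g., 1 -> 0
    for i in range(n_states):
        q[i][i] = -sum(q[i])  # each row of a rate matrix sums to zero
    return q


def unordered_rate_matrix(n_states, rate=1.0):
    """Rate matrix in which every state can change directly into any other."""
    q = [[0.0 if i == j else rate for j in range(n_states)] for i in range(n_states)]
    for i in range(n_states):
        q[i][i] = -sum(q[i])
    return q


# Three-state example (assumed unit rate): the 0 <-> 2 entries are zero
# only in the ordered matrix.
Q_ORDERED = ordered_rate_matrix(3)
Q_UNORDERED = unordered_rate_matrix(3)

if __name__ == "__main__":
    for row in Q_ORDERED:
        print(row)
```

In the ordered matrix the entries for 0 to 2 and 2 to 0 are zero, so a lineage can only reach state 2 from state 0 through state 1.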
In contrast, some authors argue that evolution through intermediates is not a necessary assumption of the additive characters in the context of parsimony and that they could be alternatively seen as sets of contingent hypotheses of homology (
The distinct aspects of the additive characters have received different levels of attention in the literature. Numerous studies have addressed the question of how to discretize states (or not to) for quantitative morphological features (e.g.,
Across these studies, the merits of additive coding remain controversial. Among the 30 morphological phylogenetic analyses published in the journal Cladistics from volume 33 (2017) to volume 37 (2021), only 23% (7) coded at least one character as additive (while 17 did not use this coding for any character and six did not specify their coding strategies). Considering that these researchers probably found at least a few characters with a quantitative nature, we list two possible reasons for their choice of not coding them as additive: they could have considered that additive characters bear unjustified evolutionary assumptions (following some of the previously cited studies that argue against this coding strategy), or they doubted the potential of this coding to contain higher phylogenetic signal compared to the other coding strategies.
Although we believe that the theoretical basis of additive coding still warrants discussion, a proper theoretical evaluation would require a lengthy treatment beyond the scope of the present paper. Instead, our focus here is on the potential empirical value of this coding method. We present the fourth study designed to evaluate in depth the empirical information of the additive coding. We compare this and other coding methods by analyzing their congruence with other morphological characters and molecular data in an empirical phylogenetic analysis. Like the three previous studies, our analyses show that quantitative information coded as additive characters has higher levels of homoplasy compared to most other morphological characters. However, our analyses also show that the additive characters were important for the morphological data to approximate the results of the molecular information. We also find that additive coding yields better posterior probabilities compared to non-additive coding in a concatenated morphological/molecular matrix.
All analyses of this study use the data from the phylogenetic analysis of the phorid fly genus Coniceromyia Borgmeier (Diptera: Phoridae) of
The first 14 morphological characters of
Illustrations of the 14 morphological characters of
We considered the information of these characters adequate for evaluating the additive coding by recognizing their clear quantitative nature (characters 1–8, 10, 12–14) or differential continuity of information among their states (characters 9, 11). None of the other characters of the analysis met these criteria. The transition costs among states were established as inversely proportional to their continuity of information, i.e. lower transition costs were attributed between states with higher morphological affinities (Fig.
The morphometric characters describe the curvature of three wing veins. We established 15 equally spaced landmarks along each wing vein, and their geometric morphometric distances were calculated through principal component analysis (PCA) (Fig.
Notably, principal components are understood to have problematic interpretations in certain comparative analyses (Uyeda et al. 2015). Some analyses are prone to be misled by the PCA sampling of the multivariate pattern and artefactually conclude that the principal components evolved via ‘early bursts’ processes (Uyeda et al. 2015). These problems with PCs are demonstrated to happen when using this information directly in the analyses (as PC values) and are especially relevant in analyses that focus on character rates (Uyeda et al. 2015). Our use of principal components differs from these problematic cases as PCA is employed here as a shape descriptor prior to the discretization of the information and in the context of a phylogenetic analysis.
The five measurement ratios and six geometric morphometric characters were discretized into six states of equal range (
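Discretization into states of equal range can be sketched as below. This is a minimal illustration under stated assumptions: the input values are hypothetical, and the handling of bin boundaries (the maximum value assigned to the last state) is one possible choice, not necessarily the one used in the original analysis.

```python
def discretize_equal_range(values, n_states=6):
    """Assign each value to one of n_states bins of equal width.

    The bins span the observed minimum to the observed maximum; the maximum
    itself is placed in the last bin (an assumed boundary convention).
    """
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0 for _ in values]  # no variation: a single state
    width = (highest - lowest) / n_states
    states = []
    for value in values:
        index = int((value - lowest) / width)
        states.append(min(index, n_states - 1))
    return states


# Hypothetical measurement ratios for six species (not data from the study).
EXAMPLE_RATIOS = [1.0, 1.25, 2.0, 2.75, 3.5, 4.0]
EXAMPLE_STATES = discretize_equal_range(EXAMPLE_RATIOS)

if __name__ == "__main__":
    print(EXAMPLE_STATES)
```

With these hypothetical ratios each bin is 0.5 wide, so the two smallest values share state 0 and the two largest share state 5.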
As detailed below, we compared six ways to deal with the 14 selected characters:
(1) removal from the analysis. This approach aimed to interpret the phylogenetic signal of the remaining 63 morphological characters and how incorporating the 14 focal characters in the different analyses may enhance or decrease this signal.
(2) regular multistate additive coding (or ordered). The characters were set in MrBayes by activating the command ‘ctype ordered: character numbers,’ and in TNT by activating the additivity in the character settings (
(3) decomposed additive character. The decomposed additive coding is the separation of the additive character information into binary characters following the proposal by
(4) regular multistate non-additive coding (unordered). In parsimony, this coding allows transitions between any states and attributes the same cost to all transitions (Fig.
(5) best binary coding. We examined the accommodation of the binary sub-characters of each decomposed additive character in the tree resulting from the analysis that included only taxa with both morphological and molecular information. Based on the retention indices (RIs), we selected the most informative binary sub-character for this tree and used it as the only representation of that additive character. The purpose of this coding was to approximate the best possible binary coding for the quantitative information.
(6) second-best binary coding. We followed the previous procedure but selected the sub-character with the second-best retention index.
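The decomposition used in coding (3) can be sketched as follows. This is an illustrative threshold-based scheme (often called additive binary coding) and assumes that each sub-character splits the ordered states at one cut point; the function names are hypothetical, and missing data are not handled.

```python
def decompose_additive(state, n_states):
    """Recode one ordered state as n_states - 1 binary sub-characters.

    Sub-character k is scored 1 when the state lies above cut point k.
    """
    return [1 if state > k else 0 for k in range(n_states - 1)]


def hamming(code_a, code_b):
    """Number of binary sub-characters in which two codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))


# Six ordered states yield five binary sub-characters.
CODES = {state: decompose_additive(state, 6) for state in range(6)}

if __name__ == "__main__":
    for state, code in CODES.items():
        print(state, code)
```

Under this scheme the number of sub-characters differing between two states equals the difference between their state numbers, which is the additive transition cost. The first and last sub-characters are the most asymmetrical ones, separating one sixth of the range from the remaining five sixths.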
Equal weights and implied weighting parsimony analyses were run with TNT (
Comparison of morphological trees to the one obtained from the molecular data (Fig.
Morphological matrices with the six codings for the 14 quantitative characters and including all taxa were analyzed under three optimality criteria (Bayesian, equal weights parsimony, and implied weighting parsimony) and compared to the results of the molecular data analyzed under the Bayesian approach. Three different partitioning strategies were applied to the morphological data in the Bayesian analyses: (1) unpartitioned, (2) one partition for the quantitative characters and another for the remaining characters, and (3) the homoplasy-based partitioning (
Schematic representation of the original data and the three analyses performed in this study. The original morphological and molecular matrices, Tree A and Tree B are provided as supplementary data at https://doi.org/10.5281/zenodo.15632525.
Homoplasy-based partitioning (
The tree obtained from the molecular data was used as a benchmark for phylogenetic signal. Certainly, the result of the molecular analysis may change whenever more genes are added to this dataset. However, the molecular tree is a relevant comparison to the morphological one as the molecular analysis had relevant taxon and gene sampling and followed a detailed protocol for obtaining and analyzing the data. This resulted in a tree with high posterior probability values for most clades (Appendix
The resulting trees were summarized into one tree through the command ‘allcompat’ in MrBayes, or as a strict consensus in parsimony analyses in TNT. Although the ‘allcompat’ summarization of trees generally is not the way Bayesian analysis deals with uncertainties (
All taxa not present in the molecular tree were removed from the resulting tree using the ‘drop.tip’ routine in the R package ape (
Comparison of posterior probabilities of the concatenated datasets (Fig.
The multistate non-additive and additive codings were also compared through Bayesian posterior probabilities (BPPs). The morphological datasets with these codings were concatenated to the molecular dataset, and, for each of these concatenated matrices, the posterior probability was calculated through stepping stones sampling (
Accommodation of the additive information in the tree (Fig.
This analysis aims to explore how well the information of additive characters is accommodated in one of our better-supported trees, resulting from the analysis that includes only taxa for which we had both morphological and molecular data (Appendix
The traditional CI and RI, however, do not properly measure the information of the additive characters. These indices measure only the state-delimiting homology hypotheses and not the ideas of continuity among states particular to additive coding. Therefore, we propose a new way to use these indices to measure the accommodation of additive characters in a tree: to decompose each multistate additive character into binary sub-characters and to calculate their consistency and retention indices. The mean of these indices for each category of sub-character was compared to the mean of the binary and multistate non-additive characters.
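The calculation of CI and RI for a single binary sub-character can be sketched as follows. This is a minimal illustration, assuming a fully resolved rooted tree written as nested pairs, Fitch optimization to count steps, and no missing data; the tree, tip names, and scorings are hypothetical.

```python
def fitch_steps(tree, states):
    """Minimum number of state changes on a rooted binary tree (Fitch optimization)."""
    steps = 0

    def visit(node):
        nonlocal steps
        if not isinstance(node, tuple):
            return {states[node]}  # tip: its observed state
        left, right = visit(node[0]), visit(node[1])
        shared = left & right
        if shared:
            return shared
        steps += 1  # no shared state: one change is required
        return left | right

    visit(tree)
    return steps


def ci_ri_binary(tree, states):
    """Consistency and retention indices of one variable binary character."""
    counts = {}
    for value in states.values():
        counts[value] = counts.get(value, 0) + 1
    if len(counts) != 2:
        raise ValueError("expected a binary character with both states present")
    min_steps = 1                      # m: fewest steps on any tree
    max_steps = min(counts.values())   # g: most steps on any tree
    observed = fitch_steps(tree, states)
    ci = min_steps / observed
    ri = None if max_steps == min_steps else (max_steps - observed) / (max_steps - min_steps)
    return ci, ri


# Hypothetical six-tip tree and two hypothetical sub-character scorings.
TREE = ((("A", "B"), ("C", "D")), ("E", "F"))
CLEAN = {"A": 1, "B": 1, "C": 0, "D": 0, "E": 0, "F": 0}
HOMOPLASTIC = {"A": 1, "B": 0, "C": 1, "D": 0, "E": 0, "F": 0}

if __name__ == "__main__":
    print(ci_ri_binary(TREE, CLEAN), ci_ri_binary(TREE, HOMOPLASTIC))
```

Averaging these indices over the sub-characters of one additive character gives the per-character means described above; RI is undefined (returned as None here) when the rarer state occurs in a single tip.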
Additionally, we measured and compared the phylogenetic signal of additive characters with those of others using Blomberg’s K index (
Considering all analyses performed, the Bayesian analysis of the morphological matrix without the 14 characters recovered a result of intermediate congruence with the molecular data (Fig.
Comparison of the trees obtained using the different treatments of the morphological quantitative data and the tree obtained by the Bayesian analysis of the molecular data. ‘Morph data in four or more partitions’ refers to the homoplasy-based partitioning (
All treatments for morphological data, phylogenetic analyses, and the data and results exploration performed in analysis 1. Columns 2–4 indicate the number of quartets shared between the tree resultant from the analysis of the morphological dataset and the tree from the Bayesian analysis of the molecular data. Column 5 indicates the percentage of resolved quartets congruent to the molecular results. — Abbreviations: BI, Bayesian Inference; part.
| Analyses | Quartets shared with the molecular tree | Quartets different from the molecular tree | Unresolved quartets | Percentage of correctly resolved quartets |
| BI, 1 partition, without 14 chars | 100880 | 77485 | 0 | 57 |
| BI, 1 partition, addit. decomposed | 102199 | 76166 | 0 | 57 |
| BI, 1 partition, additive | 97899 | 80466 | 0 | 55 |
| BI, 1 partition, non-additive | 98524 | 79841 | 0 | 55 |
| BI, 1 partition, best binary | 103286 | 75079 | 0 | 58 |
| BI, 1 partition, second best binary | 112082 | 66283 | 0 | 63 |
| BI, 2 partitions, addit. decomposed | 92478 | 85887 | 0 | 52 |
| BI, 2 partitions, additive | 107660 | 70705 | 0 | 60 |
| BI, 2 partitions, non-additive | 106462 | 71903 | 0 | 60 |
| BI, 2 partitions, best binary | 105067 | 73298 | 0 | 59 |
| BI, 2 partitions, second best binary | 106956 | 71409 | 0 | 60 |
| BI, partit. | 98627 | 79738 | 0 | 55 |
| BI, partit. | 100595 | 77770 | 0 | 56 |
| BI, partit. | 92806 | 85559 | 0 | 52 |
| BI, partit. | 106056 | 72309 | 0 | 59 |
| BI, partit. | 96721 | 81644 | 0 | 54 |
| Parsimony without 14 chars | 33707 | 10423 | 134235 | 76 |
| Parsimony additive | 118917 | 59322 | 126 | 67 |
| Parsimony non-additive | 110243 | 62919 | 5203 | 64 |
| Parsimony best binary | 17848 | 6515 | 154002 | 73 |
| Parsimony second best binary | 16178 | 6982 | 155205 | 70 |
| Without 14 chars, Impl Weigh | 74963 | 51906 | 51496 | 59 |
| Impl Weigh additive decomposed | 102085 | 73852 | 2428 | 58 |
| Impl Weigh additive | 102085 | 73852 | 2428 | 58 |
| Impl Weigh non-additive | 89229 | 89136 | 0 | 50 |
| Impl Weigh best binary | 111673 | 40734 | 25958 | 73 |
| Impl Weigh second best binary | 87783 | 63205 | 27377 | 58 |
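The last column of the table can be recomputed from the quartet counts as sketched below, assuming that the percentage is taken over resolved quartets only (shared plus different) and rounded to the nearest integer. The three rows are copied from the table; the function name is hypothetical. Every row of the table sums to 178365 quartets, which equals the number of quartets for 47 tips; that tip count is an inference from the totals, not a figure stated in the text.

```python
import math


def percent_congruent(shared, different):
    """Percentage of resolved quartets that agree with the molecular tree."""
    return round(100 * shared / (shared + different))


# (shared, different, unresolved) quartet counts copied from the table above.
ROWS = {
    "BI, 1 partition, without 14 chars": (100880, 77485, 0),
    "Parsimony without 14 chars": (33707, 10423, 134235),
    "Parsimony additive": (118917, 59322, 126),
}

PERCENTAGES = {name: percent_congruent(s, d) for name, (s, d, _) in ROWS.items()}
TOTALS = {name: sum(counts) for name, counts in ROWS.items()}

if __name__ == "__main__":
    for name in ROWS:
        print(name, PERCENTAGES[name], TOTALS[name], math.comb(47, 4))
```

Because unresolved quartets are excluded from the denominator, a poorly resolved tree can still score a high percentage, as in the equal-weights parsimony analysis without the 14 characters.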
Among the Bayesian analyses, the two-partition treatments generally yielded the results most congruent with the molecular ones. The decomposed additive characters did not yield results more congruent with the molecular information than the regular additive coding. The additive coding generally yielded the second most congruent result among the treatments, behind one of the binary codings, except in the equal-weights parsimony analysis, where the additive coding yielded the result most congruent with the molecular data among all analyses. The best binary coding yielded the most congruent results in some analyses, such as the ones using partitioning based on homoplasy and implied weighting. However, the second-best binary coding in these analyses had a considerably less congruent result, even compared to the regression line of all the analyses (Fig.
Three equal-weights parsimony analyses resulted in the trees with the lowest resolution, with less than a quarter of the total clades resolved, but also with a high percentage of clades congruent with those found in the molecular results (Figs
When morphological and molecular datasets were concatenated and the morphological matrix was divided into two partitions (one for quantitative and the other for the remaining characters), stepping stones sampling recovered the model assuming the quantitative characters as ordered with higher posterior probability than the model assuming them as unordered (ln marginal likelihood of –30365.03, in contrast to –30504.00). The Bayes factor calculated from this difference (138.97 ln units, i.e., a log10 K value of approximately 60.4) indicated decisive evidence favoring the additive coding as the best model for treating the data.
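A Bayes factor comparison of two models from their ln marginal likelihoods can be sketched as follows. This is a minimal illustration assuming the standard definition, in which the ln Bayes factor is the difference between the two ln marginal likelihoods; the two values are the ones reported above, and the function and variable names are hypothetical.

```python
import math


def log10_bayes_factor(ln_ml_model_a, ln_ml_model_b):
    """log10 of the Bayes factor K in favor of model A over model B."""
    ln_k = ln_ml_model_a - ln_ml_model_b  # ln K is a difference of ln marginal likelihoods
    return ln_k / math.log(10)            # convert from natural log to log10


# ln marginal likelihoods reported for the two concatenated analyses.
LN_ML_ORDERED = -30365.03
LN_ML_UNORDERED = -30504.00

LN_K = LN_ML_ORDERED - LN_ML_UNORDERED
LOG10_K = log10_bayes_factor(LN_ML_ORDERED, LN_ML_UNORDERED)

if __name__ == "__main__":
    print(LN_K, LOG10_K)
```

A positive value favors the ordered model; the same function can be reused for any pair of stepping stones estimates obtained from the same data.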
The among-state homologies (measured as sub-characters) of the 14 studied characters had reasonable accommodation in the reference tree as shown by consistency and retention indices (Fig.
Consistency and retention indices (CI and RI) and phylogenetic signal (Blomberg’s K) of the different types of characters in Tree A (i.e., tree resultant from the analysis of the combined morphological and molecular information). CI and RI were calculated for the additive characters as the mean of each of their sub-character indices; bars indicate standard errors. A Indices of the main types of characters. B Indices of the different sub-character types in comparison to the other additive, binary, and multistate characters in the matrix.
Among all the sub-character categories, only the more asymmetrical sub-characters (one sixth and five sixths of the variation coded as distinct states) of the PC1 morphometric data were poorly accommodated by the tree (mean RI = 6.25; Fig.
The phylogenetic signals measured through Blomberg’s K statistic had similar results to the consistency and retention indices—measurement ratios and geometric morphometric characters had the lowest Ks (mean of 0.64 and 0.63, respectively) followed by other additive (mean 0.77), binary (mean 0.95), and multistate non-additive characters (mean 1.04) (Fig.
Assessing phylogenetic accuracy or the quality of the phylogenetic signal of a set of characters is challenging because the optimal hypothesis for a given empirical dataset may (often) be incongruous with available molecular benchmarks. In this context, congruence is often interpreted as indicating an approximation of a correct result that emerges from a shared evolutionary history (Nixon & Carpenter 1996). Overall, the morphological quantitative characters of our dataset enabled the morphology-based topologies to approximate the topology derived from molecular data. We interpret this as evidence that the quantitative information in these characters captures valuable phylogenetic signal in the analysis. Quantitative information should thus be considered in phylogenetic analysis, especially considering that morphological matrices often have a limited number of characters and could benefit from including relevant data (
The additive coding provided the best or second-best approximation to the molecular results in all analyses (sometimes tied with other coding methods, Fig.
The high levels of homoplasy (low consistency indices—CIs, Fig.
The complete information of the additive characters (measured as sub-characters) had more conflicting accommodation than the information conveyed by other characters, as shown by consistency and retention indices (Fig.
The limited number of studies investigating the phylogenetic information of additive characters hinders our understanding of their general aspects and usefulness for phylogenetic reconstruction. So far, only three other studies have been designed to evaluate the phylogenetic information conveyed by additive characters, comparing it to other sources of evidence:
a) Additive characters may have, in general, high indices of homoplasy
High levels of homoplasy in the additive characters are demonstrated herein through low values of consistency and retention indices. Similarly, the findings of
b) Additive characters may have significant topological congruence with other morphological information and molecular data
Despite the high level of homoplasy of additive characters, the present study, along with
In contrast,
c) Additive characters may not fit well to common mechanisms in the form of branch lengths (as morphological characters in general)
d) Some additive characters may have more phylogenetic information than others
In our analyses, we selected characters with a quantitative nature that could be relevant to include in the analysis, following the criterion of including characters with an apparent taxonomic relevance, as done by
Our study demonstrates that quantitative information coded as additive characters has the empirical potential to improve the phylogenetic results, at least in some scenarios and according to the measurements employed here. Additive coding outperformed multistate non-additive coding, as assessed by Bayesian statistics and the congruence of phylogenetic signal from an independent source of phylogenetic evidence. This coding strategy should, therefore, be considered more often in phylogenetic analyses than it is today, especially when using parsimony. The potential of this coding approach can be readily evaluated in Bayesian inference by comparing ordered and unordered models using Bayes factors. Despite the evidence favoring additive coding, this approach has rarely been addressed in depth in the literature, and further empirical evaluations are necessary to better understand the generality of its potential. Until progress is made in these discussions, a reasonable solution to explore the potential of additive characters could be to investigate the impact of this coding method in analyses using different optimality criteria, as done in this study.
Author contributions. DCA: data curation. DCA and EABA: conceptualization, formal analysis, investigation, methodology, project administration, resources, validation, visualization, writing original draft, writing review & editing, funding acquisition.
Competing interests. The authors have declared that no competing interests exist.
Funding. D.C.A. was supported by scholarships granted by Coordenação de Aperfeiçoamento de Pessoal de Nível Superior—Brasil (CAPES)—Finance Code 001. This research was partly supported by the São Paulo Research Foundation (FAPESP grants 2018/09666-5, 2019/09215-6) and by the Brazilian National Council of Technological and Scientific Development—CNPq grant 310111/2019-6 to E.A.B.A.
Ethical aspects. There are no ethical notes or aspects to declare.
Permissions. There are no permissions to declare.
We thank Dr. Brendon Boudinot, Dr. April M. Wright, and an anonymous reviewer for their valuable suggestions on our manuscript. We are also thankful to Diego S. Porto and Felipe V. Freitas for discussions about the manuscript ideas and help with the analyses. DCA's research was made possible in part by Centro de Biodiversidade e Patrimônio Genético (UFLA, Lavras, Brazil).
Data resources
Authors: Ament DC, Almeida EAB (2025)
Supplementary files deposited in Zenodo (https://doi.org/10.5281/zenodo.15632525).
Explanation notes
Molecular matrix.nex—Matrix of 2477 molecular characters for Coniceromyia species and outgroups from
Morph matrix 1_quantitative chars with 6 states.ss—Matrix of 77 morphological characters for Coniceromyia species and outgroups from
Morph matrix 2_quantitative chars decomposed into binary characters.ss—Matrix of morphological characters for Coniceromyia species and outgroups from
Morph matrix 3_quantitative chars best binary opt.ss—Matrix of 77 morphological characters for Coniceromyia species and outgroups from
Morph matrix 4_quantitative chars 2best binary opt.ss—Matrix of 77 morphological characters for Coniceromyia species and outgroups from
Tree A obtained with all data analyzed.tre—Tree resultant from the analysis including only taxa for which we had both morphological and molecular data, used in ‘Analysis 3’ of this paper.
Tree A_All data.tif—Tree resultant from the analysis including only taxa for which we had both morphological and molecular data, used in ‘Analysis 3’ of this paper.
Tree B obtained with only molecular data.tre—Tree resultant from the analysis of the molecular matrix used in ‘Analysis 1’ of this paper.
Tree B_Molecular data.tif—Tree resultant from the analysis of the molecular matrix used in ‘Analysis 1’ of this paper.
SSMolecMorph2PartCharsAdditiv.nex—File used for stepping stones calculation of the concatenated morphological and molecular matrix with the quantitative characters treated as additive.
SSMolecMorph2PartCharsNonAdditiv.nex— File used for stepping stones calculation of the concatenated morphological and molecular matrix with the quantitative characters treated as nonadditive.
Phylogenetic signal Blomberg K.xlsx— Blomberg’s K index (
Molecular data_genes sampled per species.xlsx— Molecular data used in the analyses of this paper, including genes per species and Genbank accession numbers.
Copyright notice: This dataset is made available under the Open Database License (http://opendatacommons.org/licenses/odbl/1.0). The Open Database License (ODbL) is a license agreement intended to allow users to freely share, modify, and use this dataset while maintaining this same freedom for others, provided that the original source and author(s) are credited.