학술논문

Tamaño del efecto y su intervalo de confianza y meta-análisis en Psicología
Document Type
Dissertation/Thesis
Source
Subject
effect size
p-value
psychologists
knowledge
statistical education
meta-analysis
UNESCO::PSICOLOGÍA
UNESCO::MATEMÁTICAS::Estadística
Language
Spanish; Castilian
Abstract
Evidence-Based Practice (EBP) is defined as "the integration of the best available research with clinical expertise in the context of patient characteristics, culture, and preferences" (APA, Presidential Task Force on Evidence-Based Practice, 2006, p. 273). By definition, EBP relies on the use of scientific research in decision making in an effort to provide the best possible services in clinical practice (Babione, 2010; Sánchez-Meca & Botella, 2010). Consequently, EBP requires new skills from professionals, such as the ability to critically evaluate and rank the quality of the evidence (psychological research) in order to provide the best possible service to patients by integrating the best evidence with professional experience and judgment and with patients' preferences (Sackett et al., 2000).

Within this process of critical evaluation of the evidence, it is crucial to know and understand Null Hypothesis Significance Testing (NHST) as a data-analysis tool, given that this procedure is very widely used in Psychology (Cumming et al., 2007). For example, these authors found that 97% of the articles published in Psychology journals use NHST. Consequently, knowing how to interpret p-values is a core competence for professionals in Psychology and in any discipline where statistical inference is applied. The p-value linked to the result of a statistical test is the probability of observing that result, or a more extreme one, if the null hypothesis were true (Kline, 2013). The definition is clear and precise; however, misconceptions about the p-value remain numerous and recurrent (Badenes-Ribera, Frías-Navarro, & Pascual-Soler, 2015; Falk & Greenbaum, 1995; Haller & Krauss, 2002; Kühberger et al., 2015; Oakes, 1986; Wasserstein & Lazar, 2016). The most common misconceptions about the p-value are the "inverse probability fallacy", the "replication fallacy", the "effect size fallacy" and the "clinical or practical significance fallacy" (Carver, 1978; Cohen, 1994; Harrison et al., 2009; Kline, 2013; Nickerson, 2000; Wasserstein & Lazar, 2016).

The "inverse probability fallacy" is the false belief that the p-value indicates the probability that the null hypothesis (H0) is true, given the data (Pr(H0|Data)). It consists of confusing the probability of the result, assuming that the null hypothesis is true, with the probability of the null hypothesis, given the data (Kline, 2013; Wasserstein & Lazar, 2016). The "replication fallacy" links the p-value to the degree of replicability of the result: it is the false belief that the p-value indicates how replicable the result is, and its complement, 1 - p, is often interpreted as the exact probability of replication (Carver, 1978; Nickerson, 2000). The "effect size fallacy" relates statistical significance to the size of the detected effect. Specifically, it involves the false belief that the p-value provides direct information about the effect size (Carver, 1978), that is, assuming that the smaller the p-value, the larger the effect size. However, the p-value does not convey the magnitude of an effect. The effect size can only be determined by directly estimating its value with the appropriate statistic and its confidence interval (Cumming, 2012; Cumming et al., 2012; Kline, 2013; Wasserstein & Lazar, 2016).
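To make the "effect size fallacy" concrete, the following minimal sketch (illustrative Python code with NumPy and SciPy, not part of the thesis) draws samples with the same population effect size at several sample sizes: the p-value shrinks as n grows while the estimated effect size does not, so p cannot be read as a measure of the magnitude of an effect.

# Minimal illustrative sketch (not part of the thesis): with the same population
# effect size, the p-value shrinks as the sample size grows, while the estimated
# effect size (Cohen's d) stays roughly constant, so p is not a measure of magnitude.
import numpy as np
from scipy import stats

def cohens_d(x, y):
    """Standardized mean difference using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(2024)
for n in (20, 200, 2000):
    x = rng.normal(0.2, 1.0, n)   # population effect size d = 0.2 in every case
    y = rng.normal(0.0, 1.0, n)
    res = stats.ttest_ind(x, y)
    print(f"n per group = {n:4d}  d = {cohens_d(x, y):5.2f}  p = {res.pvalue:.4f}")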
The "clinical or practical significance fallacy" is the false belief that the p-value indicates the importance of the findings (Nickerson, 2000; Wasserstein & Lazar, 2016). Under this fallacy, a statistically significant effect is interpreted as an important effect. Nevertheless, a statistically significant result does not indicate that the result is important, in the same way that a non-statistically significant result might still be important.

Given these misconceptions about the p-value and other criticisms of the use and abuse of NHST (e.g., Monterde-i-Bort et al., 2010; Wasserstein & Lazar, 2016), the American Psychological Association (APA, 2001, 2010a) strongly recommended reporting effect sizes (ES) and their confidence intervals (CIs), which, taken together, clearly convey the importance of the research findings (Ferguson, 2009). There are dozens of effect size measures available (Henson, 2006; Kline, 2013). Nevertheless, they can be classified into two broad groups: measures of mean differences and measures of strength of relations (Frías-Navarro, 2011b; Kline, 2013; Rosnow & Rosenthal, 2009). The former are based on the standardized group mean difference (e.g., Cohen's d, Glass's g, Hedges' g, Cohen's f); the latter are based on the proportion of variance accounted for or on the correlation between two variables (e.g., R²/r², η², ω²). The most frequently reported ES measures are the unadjusted R², Cohen's d, and η² (e.g., Peng & Chen, 2014). These statistics have been criticized for bias (they tend to be positively biased), lack of robustness to outliers, and instability under violations of statistical assumptions (Grissom & Kim, 2012; Kline, 2013; Wang & Thompson, 2007).

Finally, within this context of change and methodological advances, systematic and meta-analytic reviews of studies have gained considerable relevance and prevalence in the most prestigious journals (APA, 2010a; Borenstein et al., 2009). Meta-analytic studies offer several advantages over narrative reviews: meta-analysis involves a scientifically based research process that depends on the rigor and transparency of each of the decisions made during its elaboration, and it can provide a definitive answer about the nature of an effect when there are contradictory results (Borenstein et al., 2009). Meta-analyses yield more precise ES estimates, they make it possible to assess the stability of effects, and they help researchers to contextualize the ES values obtained in their own studies (Cumming et al., 2012). Nevertheless, meta-analytic studies are not free of bias; publication bias, in particular, is one of the greatest threats to the validity of meta-analytic reviews. For example, Ferguson and Brannick (2011) analyzed 91 meta-analytic studies published in American Psychological Association and Association for Psychological Science journals and found evidence of publication bias in 26 of them (41% of the meta-analyses in which publication bias could be assessed). The consequence of publication bias is an overestimation of the effect size (Borenstein et al., 2009; Sánchez-Meca & Marín-Martínez, 2010). Therefore, researchers and readers of meta-analytic studies (such as practitioner psychologists) should know methods for detecting this bias. In this regard, the funnel plot is a graphical method frequently used to detect publication bias in the health sciences (Sterne et al., 2005).
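The following minimal sketch (illustrative Python code with simulated data, not part of the thesis) shows the mechanism just described: when only statistically significant studies are "published", the inverse-variance pooled effect overestimates the true effect, which is the pattern a funnel plot is designed to reveal as asymmetry.

# Minimal sketch (illustrative, not thesis data): selective publication of
# statistically significant studies inflates the pooled effect size estimate.
import numpy as np
from scipy import stats

def simulate_studies(true_d=0.2, n_per_group=30, n_studies=200, seed=7):
    """Simulate two-group studies; return per-study effect estimates, SEs and p-values."""
    rng = np.random.default_rng(seed)
    d_hat, se, pvals = [], [], []
    for _ in range(n_studies):
        x = rng.normal(true_d, 1.0, n_per_group)
        y = rng.normal(0.0, 1.0, n_per_group)
        sd_pooled = np.sqrt((x.var(ddof=1) + y.var(ddof=1)) / 2)
        d = (x.mean() - y.mean()) / sd_pooled
        d_hat.append(d)
        se.append(np.sqrt(2.0 / n_per_group + d**2 / (4.0 * n_per_group)))  # common approximation to SE of d
        pvals.append(stats.ttest_ind(x, y).pvalue)
    return np.array(d_hat), np.array(se), np.array(pvals)

def pooled_effect(d_hat, se):
    """Fixed-effect (inverse-variance) pooled estimate."""
    w = 1.0 / se**2
    return np.sum(w * d_hat) / np.sum(w)

d_hat, se, pvals = simulate_studies()
published = pvals < 0.05                      # crude publication filter
print(f"pooled d, all studies:      {pooled_effect(d_hat, se):.2f}")
print(f"pooled d, 'published' only: {pooled_effect(d_hat[published], se[published]):.2f}")
# Plotting d_hat against se for the 'published' subset would show the asymmetric
# funnel that funnel plots are used to detect.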
Therefore, research is needed on the degree of methodological knowledge that academic and practitioner psychologists have about the methodological quality of evidence (psychological research), knowledge that is required for the proper implementation of the EBP approach. This kind of research can shed light on these issues and guide the development of training programs.

Objectives
The first purpose of this work was to identify the statistical reasoning errors that Spanish academic psychologists and Spanish practitioner psychologists make when presented with the results of a statistical inference test. To this end, two questions were analyzed: first, the prevalence of the most common misconceptions about the p-value, and second, the extent to which p-values are correctly interpreted. The second purpose was to analyze what Spanish academic psychologists and Spanish practitioner psychologists know about ES, their CIs, and meta-analysis, given that reporting them is one of the main recommendations proposed by the APA (2010a) to improve statistical practice and to favor the accumulation of knowledge and the replication of findings. Finally, to check whether the results obtained with Spanish academic psychologists regarding misconceptions about the p-value and the level of knowledge of effect sizes, confidence intervals and meta-analysis are reliable, a replication study was carried out with a sample of Chilean and Italian academic psychologists.

Method
Procedure
Several cross-sectional studies were carried out through online surveys. For this purpose, the e-mail addresses of Spanish, Chilean and Italian academic psychologists were obtained by consulting the websites of the universities in these countries. Potential participants were invited to complete a survey through a CAWI (Computer Assisted Web Interviewing) system, and a follow-up message was sent to non-respondents two weeks later. Data collection took place during the 2013-2014 academic year for the Spanish sample and from March to May 2015 for the Chilean and Italian samples. Regarding the sample of Spanish practitioner psychologists, an e-mail was sent to the Spanish Psychological Associations inviting their members to participate in an online survey on professional practice in Psychology. Potential participants were invited to complete the survey through a CAWI system, and a follow-up message was sent three weeks later. Data collection took place from May to September 2015.

Participants
The sample of Spanish academic psychologists consisted of 472 participants. The mean number of years the professors had spent at the university was 13.56 (SD = 9.27). Men represented 45.8% (n = 216) and women 54.2% (n = 256). The sample of Chilean and Italian academic psychologists comprised 194 participants, of whom 159 were Italian and 35 were Chilean. Of the 159 Italian participants, 45.91% were men and 54.09% were women, with a mean age of 47.65 years (SD = 10.47); the mean number of years they had spent in academia was 12.90 (SD = 10.21). Of the 35 Chilean academic psychologists, men represented 45.71% of the sample and women 54.29%; the mean age was 43.60 years (SD = 9.17), and the mean number of years spent in academia was 15 (SD = 8.61). Finally, the sample of Spanish practitioner psychologists consisted of 77 participants (68.8% women and 31.2% men, mean age 41.44 years, SD = 9.42).
Instrument
The instrument was a survey divided into two sections. The first section included items on sex, age, years of experience as an academic psychologist, area of knowledge within Psychology, and type of university (public/private). In addition, for the Spanish practitioner psychologists the first section included items on years of experience as a practitioner psychologist, clinical setting (public or private), and degree of familiarity with the EBP movement. The second section included items on knowledge of methodological issues associated with EBP, such as misconceptions about the p-value, level of knowledge about effect size statistics, confidence intervals, meta-analytic studies, and checklists of the methodological quality of studies.

Data analysis
All of the studies included descriptive statistics (frequencies and percentages) for the variables under evaluation, together with confidence intervals (CIs) for the percentages. The CIs for percentages were calculated using score methods based on the work of Newcombe (2012). All analyses were performed with the statistical program IBM SPSS v. 20 for Windows.

Results and conclusions
The findings indicate that the comprehension of many statistical concepts continues to be problematic among Spanish academic and practitioner psychologists, as well as among Chilean and Italian academic psychologists. These methodological errors and this poor methodological knowledge have been, and continue to be, a direct threat to the proper implementation of EBP in professional practice and to the production of valid scientific knowledge. Regarding misconceptions about the p-value, the "inverse probability fallacy" was the most frequently observed misinterpretation among Spanish, Italian and Chilean academic psychologists. This means that some academic psychologists confuse the probability of obtaining a result, or a more extreme one, if the null hypothesis were true (Pr(Data|H0)) with the probability that the null hypothesis is true given the data (Pr(H0|Data)). In addition, Spanish, Italian and Chilean academic psychologists from the area of Methodology were not immune to erroneous interpretations of the p-value, which can hinder the statistical training of students and facilitate the transmission of these false beliefs, as well as their perpetuation (Haller & Krauss, 2002; Kirk, 2001; Kline, 2013; Krishnan & Idris, 2014). These findings are consistent with previous studies (Haller & Krauss, 2002; Lecoutre et al., 2003; Monterde-i-Bort et al., 2010). On the other hand, the "clinical or practical significance fallacy" was the most frequently observed misinterpretation among Spanish practitioner psychologists. Nevertheless, a statistically significant result does not indicate that the result is important, in the same way that a non-statistically significant result might still be important (Nickerson, 2000; Wasserstein & Lazar, 2016). Clinical significance refers to the practical or applied value or importance of the effect of an intervention, that is, whether it makes any real (e.g., genuine, palpable, practical, noticeable) difference to the clients or to others with whom they interact in everyday life (Kazdin, 1999, 2008). Statistical significance tests have a purpose and respond to some problems and not to others. A statistical significance test does not speak to the importance of a result, its replicability, or even the probability that the result was due to chance (Carver, 1978).
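The point that p is not the probability that the null hypothesis is true (or that the result "was due to chance") can be illustrated with a minimal simulation sketch (illustrative Python code with assumed parameter values, not part of the thesis): when true effects are rare, a large share of the results that reach p < .05 nevertheless come from true null hypotheses.

# Minimal sketch (illustrative, not thesis data): among results with p < .05,
# the proportion coming from true null hypotheses can be far larger than .05,
# so the p-value is not Pr(H0 | data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_tests, n_per_group = 5000, 30
prior_h1 = 0.10                              # assume only 10% of tested effects are real
sig_from_null, sig_total = 0, 0

for _ in range(n_tests):
    effect_is_real = rng.random() < prior_h1
    mu = 0.5 if effect_is_real else 0.0      # H1: d = 0.5; H0: d = 0
    x = rng.normal(mu, 1.0, n_per_group)
    y = rng.normal(0.0, 1.0, n_per_group)
    if stats.ttest_ind(x, y).pvalue < 0.05:
        sig_total += 1
        sig_from_null += not effect_is_real

print(f"share of significant results coming from true nulls: "
      f"{sig_from_null / sig_total:.2f}")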
A p-value may indicate whether an effect is present, but it does not reveal the size of the effect or its clinical/practical significance (Ferguson, 2009; Sullivan & Feinn, 2012). The effect size can only be determined by directly estimating its value with the appropriate statistic and its confidence interval (Cohen, 1994; Cumming, 2012; Kline, 2013; Wasserstein & Lazar, 2016). Nevertheless, interpreting a statistically significant result as important or useful, confusing the alpha significance level with the probability that the null hypothesis is true, relating the p-value to the magnitude of the effect, and believing that the probability of replicating a result is 1 - p are erroneous interpretations, or false beliefs, that continue to exist among academic and practitioner psychologists, as the results of the studies conducted show. These misconceptions are problems of interpretation and not a problem of NHST itself (Leek, 2014). Behind these erroneous interpretations lie beliefs and attributions about the meaning of significant results. Therefore, it is necessary to improve the statistical education and training of psychologists and the content of statistics textbooks in order to guarantee high-quality training of future professionals (Babione, 2010; Cumming, 2012; Kline, 2013; Haller & Krauss, 2002). Problems in understanding the p-value influence the conclusions that professionals draw from their data (Hoekstra et al., 2014), jeopardizing the quality of the results of psychological research (Frías-Navarro, 2011a). The value of the evidence depends on the quality of the statistical analyses and their interpretation (Faulkner et al., 2008).

On the other hand, most of the participants reported using meta-analytic studies in their professional practice and having adequate knowledge about them, including effect size statistics. Nevertheless, they acknowledged having poor knowledge of graphical displays for meta-analyses, such as the forest plot and the funnel plot, which may lead to misinterpretation of results and, therefore, to bad practice, given that most of the participants said they used meta-analytic studies in their professional practice. As several authors point out, the graphical presentation of results is an important part of a meta-analysis and has become the primary tool for presenting the results of multiple studies on the same research question (Anzures-Cabrera & Higgins, 2010; Borenstein et al., 2009; Botella & Sánchez-Meca, 2015). The forest plot and the funnel plot are graphics used in meta-analytic studies to present pooled effect size estimates and publication bias, respectively. Publication bias is an important threat to the validity of meta-analytic studies, since meta-analytically derived estimates can be inaccurate, typically overestimated. The funnel plot is used as a publication bias detection method in the health sciences (Sterne et al., 2005). Therefore, researchers, academics, and practitioners must be adequately acquainted with funnel plots, a basic tool in meta-analytic studies for detecting publication bias and heterogeneity of effect sizes. With regard to the types of effect size statistics they know, the participants mentioned to a greater degree the effect size statistics from the family of standardized mean differences and η² (parametric effect size statistics).
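Since the standardized mean difference is the effect size family the participants mentioned most often, a minimal sketch follows (illustrative Python code, not part of the thesis) of Cohen's d together with Hedges' small-sample correction, which addresses the positive bias mentioned earlier.

# Minimal sketch (illustrative, not part of the thesis): Cohen's d and Hedges' g.
# Cohen's d is positively biased in small samples; Hedges' g multiplies it by the
# approximate correction factor J = 1 - 3 / (4*df - 1), with df = n1 + n2 - 2.
import numpy as np

def cohens_d_and_hedges_g(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    pooled_sd = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / df)
    d = (x.mean() - y.mean()) / pooled_sd
    g = d * (1.0 - 3.0 / (4.0 * df - 1.0))   # small-sample bias correction
    return d, g

print(cohens_d_and_hedges_g([5.1, 6.0, 5.5, 6.2], [4.8, 5.0, 5.4, 4.9]))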
Nevertheless, these parametric effect size statistics have been criticized for their lack of robustness against outliers and departures from normality, and for their instability under violations of statistical assumptions (Algina et al., 2005; Grissom & Kim, 2012; Kline, 2013; Peng & Chen, 2014; Wang & Thompson, 2007). There are theoretical reasons and empirical evidence indicating that outliers and violations of statistical assumptions are common in practice (Erceg-Hurn & Mirosevich, 2008; Grissom & Kim, 2001). The findings suggest that most of the Spanish academic psychologists, Spanish practitioner psychologists, and Italian and Chilean academic psychologists do not know the alternatives to parametric effect size statistics, such as non-parametric statistics (e.g., the Spearman correlation), the robust standardized mean difference (based on trimmed means and winsorized variances), the probability of superiority (PS), the number needed to treat (NNT), or the area under the ROC curve (AUC) (Erceg-Hurn & Mirosevich, 2008; Ferguson, 2009; Grissom & Kim, 2012; Keselman et al., 2008; Kraemer & Kupfer, 2006; Peng & Chen, 2014; Wilcox, 2010; Wilcox & Keselman, 2003); a brief sketch of one of these alternatives, the probability of superiority, is given below. As Erceg-Hurn and Mirosevich (2008) pointed out, this might be due to a lack of exposure to these methods: "the psychology statistics curriculum, journal articles, popular textbooks, and software are dominated by statistics developed before the 1960s" (op. cit., p. 593).

Concerning the methodological quality checklists, again most of the participants reported not having knowledge about them. Nevertheless, this is an expanding field, and there are currently checklists for primary studies (e.g., CONSORT), for meta-analytic studies (e.g., AMSTAR) and for network meta-analytic studies (e.g., PRISMA-NMA). On the other hand, the analysis of the researchers' behavior regarding their methodological practices indicates that the Spanish, Chilean and Italian academic psychologists who could name an effect size statistic presented a profile closer to good statistical and research design practices. Nevertheless, three issues raise concerns about the knowledge that both groups of academics have about effect size and the validity of statistical conclusions in general: they wrongly associate effect size with the importance of a finding (the clinical or practical significance fallacy), a high proportion of them continue to use p-value expressions that revolve around the oracle of the alpha value, and they do not know the purpose of planning statistical power a priori in a study. Finally, two developments that have fostered the scientific debate on statistical procedures and progress toward statistical reform and toward greater transparency and quality of studies, namely the open debate on the uses and abuses of statistical significance tests (which began almost at the start of their use) and the development of checking tools such as checklists (CONSORT, STROBE, PRISMA, etc.), continue to be unknown to a high proportion of Spanish academic psychologists, Spanish practitioner psychologists, and Italian and Chilean academic psychologists. Therefore, the present work provides evidence of the need for statistical training, given the problems related to adequately interpreting the results obtained with the NHST procedure and the poor knowledge of effect size statistics, meta-analytic studies, and methodological quality checklists among Spanish academic psychologists, Spanish practitioner psychologists, and Italian and Chilean academic psychologists.
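As referenced above, a minimal sketch (illustrative Python code with made-up scores, not part of the thesis) of the probability of superiority, a non-parametric effect size: the probability that a randomly chosen score from one group exceeds a randomly chosen score from the other, with ties counted as one half.

# Minimal sketch (illustrative, not thesis data): the probability of superiority (PS),
# estimated as the proportion of all (x, y) pairs in which the treatment score
# exceeds the control score, counting ties as 0.5.
import numpy as np

def probability_of_superiority(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    greater = (x[:, None] > y[None, :]).sum()
    ties = (x[:, None] == y[None, :]).sum()
    return (greater + 0.5 * ties) / (len(x) * len(y))

print(probability_of_superiority([7, 9, 8, 10], [6, 7, 5, 8]))   # -> 0.875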
EBP requires adequate knowledge of the fundamentals of research methodology in order to critically evaluate the evidence that studies include in their reports. Problems in understanding the p-value, effect size statistics, and meta-analytic studies influence the conclusions that professionals draw from the data, which jeopardizes the quality of the results of psychological research and the proper implementation of EBP in professional practice. As Faulkner et al. (2008) point out, the value of the evidence depends on the quality of the statistical analyses and their interpretation. Therefore, the interpretation of findings is a quality filter that cannot be subject to erroneous beliefs or poor interpretations of the statistical procedure.

Nevertheless, several limitations of this series of studies must be acknowledged. For instance, the low response rate might affect the representativeness of the samples and, therefore, the generalizability of the findings to academic and practitioner psychologists. At the same time, it is possible that the participants who responded to the survey felt more confident about their statistical knowledge than those who did not respond; should this be the case, the results might underestimate the barriers to EBP. In addition, the findings of the research on misconceptions about the p-value agree with the results of previous studies on this topic in samples of academic psychologists and Psychology undergraduates (Badenes-Ribera, Frías-Navarro & Pascual-Soler, 2015; Falk & Greenbaum, 1995; Haller & Krauss, 2002; Kühberger et al., 2015; Monterde-i-Bort et al., 2010; Oakes, 1986). Furthermore, the findings on the level of knowledge of effect sizes and meta-analytic studies in the samples of Spanish psychologists (both practitioners and academics) were consistent with the results obtained on these topics in the Italian and Chilean samples.

All of this leads us to conclude that psychologists need adequate training in order to improve professional practice. EBP requires professionals to critically evaluate the findings of psychological research. To do so, training is necessary in statistical concepts, research design methodology, and the interpretation of the results of statistical inference tests and meta-analytic studies. For example, statistics textbooks should include a section on the current debate about, and criticisms of, the NHST procedure, in terms of whether statistical significance tests are the best way to advance the body of valid scientific knowledge. Moreover, they should add information about how to calculate and report the effect size and its confidence interval, both for statistically significant results and for non-significant ones. Finally, authors should give examples of how to decide whether a result has practical or clinical importance (Gliner et al., 2002). In addition, statistical software programs should be updated to include in their menus techniques such as the estimation of confidence intervals for parametric effect size statistics, and the estimation of effect size statistics that are more resistant to extreme values (outliers) and to violations of the assumptions of parametric tests (normal distribution and homogeneity of variance), that is, modern robust effect size statistics and their confidence intervals.
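As an illustration of the kind of routine being recommended, the following minimal sketch (illustrative Python code, not part of the thesis and not an existing SPSS feature) computes a percentile-bootstrap confidence interval for the difference between 20% trimmed means, a robust alternative to the raw mean difference that is insensitive to a single extreme outlier.

# Minimal sketch (illustrative, not thesis data): percentile-bootstrap CI for the
# difference of 20% trimmed means, a robust effect size with its confidence interval.
import numpy as np
from scipy import stats

def trimmed_diff_ci(x, y, trim=0.2, n_boot=2000, conf=0.95, seed=0):
    """Point estimate and percentile bootstrap CI for trim_mean(x) - trim_mean(y)."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)
    boots = np.empty(n_boot)
    for b in range(n_boot):
        bx = rng.choice(x, size=len(x), replace=True)
        by = rng.choice(y, size=len(y), replace=True)
        boots[b] = stats.trim_mean(bx, trim) - stats.trim_mean(by, trim)
    lo, hi = np.percentile(boots, [(1 - conf) / 2 * 100, (1 + conf) / 2 * 100])
    return stats.trim_mean(x, trim) - stats.trim_mean(y, trim), (lo, hi)

rng = np.random.default_rng(1)
x = np.append(rng.normal(1.0, 1.0, 30), 8.0)   # one extreme outlier in the first group
y = rng.normal(0.0, 1.0, 30)
print(trimmed_diff_ci(x, y))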
There are several websites that offer routines/programs for computing general or specific effect size estimators and their confidence intervals (see Frías-Navarro, 2011b; Fritz et al., 2012; Grissom & Kim, 2012; Kline, 2013; Peng et al., 2013). To conclude, the purpose of these studies has been, above all, to emphasize the need for statistical re-education among practitioner and academic psychologists, to disseminate the use of checklists as a tool for assessing the methodological quality of studies, and to motivate the development of manuals that conceptually describe statistical tests and point out the consequences of bad statistical practice for the accumulation of scientific knowledge. The purpose has also been to note the need to incorporate modern robust effect size statistics into statistical programs such as SPSS.

Currently there is an open scientific and social debate that could change the course of researchers' statistical practices. For example, during the last three years, criticism of the classical statistical inference procedure, based on the probability value p and the dichotomous decision to retain or reject the null hypothesis, has hardened (Allison et al., 2016; Nuzzo, 2014; Wasserstein & Lazar, 2016). In addition, the low proportion of replication studies, publication bias leading to an overestimation of the magnitude of effects, questionable research practices (QRPs) aimed at finding statistically significant results (so-called p-hacking), such as recording many response variables and deciding which to report after the analysis, reporting only statistically significant results, removing outliers, and increasing the sample size until statistical significance is reached, as well as outright fraud, are also current issues of discussion (Earp & Trafimow, 2015; Ioannidis, 2005a; Kepes et al., 2014). This study has tried to contribute to this debate by providing evidence of the current state of affairs regarding the knowledge and practices of academic and professional psychologists in relation to methodology and research design. The findings of the present work provide empirical evidence of the inappropriate behaviors surrounding the process of statistical inference that researchers have studied for decades, such as misinterpretations and misuse of statistical inference techniques due to the fallacies that surround the p-value and effect sizes. Academics, scientists and professionals are not immune to such beliefs, and the problem has not been resolved despite the recommendations and warnings repeatedly detailed in scientific publications. Statistical re-education to correct these interpretation errors, together with an Evidence-Based Statistical Practice oriented toward the conscious and explicit use of all the elements surrounding the process of statistical inference, is essential for critically interpreting the results of statistical inference. The literature on statistical thinking and its teaching has an entire line of research open on this issue (Beyth-Maron et al., 2008; Garfield et al., 2008; Garfield & Franklin, 2011; Garfield et al., 2011), to which this investigation can be added, highlighting its importance, its validity, and its implications for the development and transmission of scientific knowledge.
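As a closing illustration of one of the questionable research practices mentioned above, increasing the sample size until statistical significance is reached, a minimal simulation sketch (illustrative Python code, not part of the thesis) shows how this form of optional stopping inflates the Type I error rate well beyond the nominal 5%.

# Minimal sketch (illustrative, not thesis data): "peeking" and adding participants
# until p < .05 inflates the Type I error rate even though the null hypothesis is true.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
false_positives, n_sims = 0, 1000
for _ in range(n_sims):
    x = list(rng.normal(0, 1, 10))           # H0 is true: both groups have mean 0
    y = list(rng.normal(0, 1, 10))
    while len(x) <= 50:
        if stats.ttest_ind(x, y).pvalue < 0.05:
            false_positives += 1
            break
        x.append(rng.normal(0, 1))           # add one participant per group and re-test
        y.append(rng.normal(0, 1))

print(f"empirical Type I error with optional stopping: {false_positives / n_sims:.3f}")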