As the scale of scientific research grows, academic evaluation becomes an ever harder problem, yet it is work that has to be done, because governments, research institutions, funding agencies and individual scholars all need it. In the past, evaluation relied almost entirely on peer review, since there were no other indicators worth trusting; that has changed, and evaluation now depends more and more on quantitative metrics.
Quantitative indicators seem more objective than peer review as a basis for evaluation, but there is no absolutely reliable metric for judging research: citation counts and the like mostly reflect how hot a topic is, not how good the work is. Just as a disease with no specific cure ends up with many drugs to choose from, the lack of any single metric that can fully evaluate research is exactly why the number of metrics keeps multiplying; when one drug fails, you fall back on combination therapy. These indicators are rarely universally applicable. They were created with healthy, constructive intentions, but they are often misused. Take the journal impact factor: it is an important indicator for judging academic journals, yet many institutions, Chinese ones in particular, treat it as a measure of the quality of an individual paper. That is clearly absurd. Which journal a paper ends up in involves a good deal of chance. Journals do differ in quality and are roughly comparable in broad strokes, and the number of papers in top journals is an internationally recognized reference point for academic standing; if you can publish, on merit, more than ten research articles in one field in CNS (Cell, Nature, Science), your standing speaks for itself. But a journal's impact factor is by no means an accurate measure of academic quality; the impact factor of the journal a paper appears in can serve only as one important reference. A paper's own citation record is relatively more informative, though even that is not an absolute measure of research quality.
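For context (the post itself never spells this out), the standard two-year impact factor used by Journal Citation Reports is the number of citations a journal receives in year Y to items it published in years Y-1 and Y-2, divided by the number of citable items it published in those two years. A minimal sketch, with invented numbers, makes the arithmetic explicit:

# Two-year journal impact factor, stated as a tiny function.
# The figures below are hypothetical, purely for illustration.
def impact_factor(citations_to_prev_two_years, citable_items_prev_two_years):
    return citations_to_prev_two_years / citable_items_prev_two_years

print(round(impact_factor(12_000, 900), 1))  # -> 13.3

Because this is a single average over everything a journal published in a two-year window, and citation distributions are heavily skewed, it cannot stand in for the quality of any particular paper, which is precisely the misuse described above.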
A recent article in Nature takes up this problem once again and reviews the broad history of quantitative research evaluation.
Before 2000, only the CD-ROM edition of the Science Citation Index (SCI), produced by the Institute for Scientific Information in the United States, was used, by a handful of specialists, for bibliometric analysis. In 2002 Thomson Reuters launched the web version of the SCI, making the tool far easier to use. Other companies then built similar evaluation platforms: Elsevier launched Scopus in 2004, and Google Scholar appeared in beta the same year. All of these tools take citations as their basic indicator; they differ in the scope of the literature they cover. The SCI counts only citations from SCI-indexed publications to other SCI-indexed publications; a citation from a non-indexed source is simply invisible to the system. Scopus covers a larger set of sources, but is still limited to what it indexes. Google Scholar imposes essentially no restriction and counts every citation it can find. In terms of coverage, Google Scholar is the best and the SCI the worst; in terms of accuracy, Google Scholar is the worst and the SCI the best; in terms of timeliness, Google Scholar is the best and the SCI the worst.
There are also tools that use these web-based data to compare the output and impact of research institutions, such as InCites (built on SCI data) and SciVal (built on Scopus), as well as software for analysing an individual's citations from Google data, such as Publish or Perish, released in 2007.
In 2005, Jorge Hirsch, a physicist at the University of California, San Diego, proposed the h-index (also called the h factor), a measure of an individual's scholarly influence computed from the citation-ranked list of all of his or her publications. The "h" stands for "high citations": a researcher's h-index is the largest number h such that h of his or her papers have each been cited at least h times. The h-index reflects a person's scholarly achievement reasonably well: the higher the h-index, the greater the influence of that person's papers. For example, an h-index of 20 means that, among all the papers the person has published, 20 have each been cited at least 20 times. Determining someone's h-index is easy: on the SCI website (other databases work too, though they give different values), retrieve all of that person's indexed papers, sort them by citation count in descending order, and go down the list until a paper's rank exceeds its citation count; that rank minus one is the h-index. The journal impact factor, by comparison, began to attract wide attention from 1995. Put simply, the h-index gauges an individual's influence, while the impact factor gauges a journal's.
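To make the counting rule above concrete, here is a minimal sketch in Python; the citation counts are invented for illustration, and real values would come from the SCI, Scopus or Google Scholar (each of which can give a different h).

# Minimal sketch: compute an h-index from a list of per-paper citation counts.
def h_index(citations):
    """Return the largest h such that at least h papers have >= h citations."""
    ranked = sorted(citations, reverse=True)        # most-cited paper first
    h = 0
    for rank, cites in enumerate(ranked, start=1):  # rank is 1-based
        if cites >= rank:
            h = rank        # this paper still counts towards h
        else:
            break           # rank now exceeds citations; the previous rank is h
    return h

# Hypothetical citation counts for ten papers:
print(h_index([48, 33, 30, 22, 15, 9, 7, 6, 3, 1]))  # prints 7

The loop stops at the first paper whose rank exceeds its citation count, which is exactly the "sort and check" procedure described above.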
Later came evaluation tools built on social platforms: F1000Prime in 2002, Mendeley in 2008 and Altmetric.com in 2011 (Altmetric is backed by Macmillan Science and Education, the parent company of the Nature group).
Bibliometricians, social scientists and research administrators have already noticed how quantitative evaluation metrics are now being misused. University league tables such as the Shanghai Ranking and the Times Higher Education ranking, for example, rest on data of questionable accuracy and on misapplied indicators.
Some recruiters ask applicants for their h-index, and many universities set thresholds for the h-index and for the number of papers in high-impact journals when filling positions. Many scientists likewise display their h-index and their count of high-impact papers prominently on their CVs; this is most common in biomedicine. Supervisors push their PhD students to publish in high-impact journals, because that is the best calling card for making their way in the academic world.
Some universities in Scandinavia and China allocate research funding or bonuses on the basis of a single number: the impact factor of the journal in which a paper is published.
In many settings peer review still plays an important role, but the misuse of quantitative indicators has become widespread and pernicious.
The article goes on to present the Leiden Manifesto's ten principles for research metrics; to help avoid misuse, the original text is reproduced below:
We therefore present the Leiden Manifesto, named after the conference at which it crystallized (see http://sti2014.cwts.nl). Its ten principles are not news to scientometricians, although none of us would be able to recite them in their entirety because codification has been lacking until now. Luminaries in the field, such as Eugene Garfield (founder of the ISI), are on record stating some of these principles [3, 4]. But they are not in the room when evaluators report back to university administrators who are not expert in the relevant methodology. Scientists searching for literature with which to contest an evaluation find the material scattered in what are, to them, obscure journals to which they lack access.
Ten principles
1) Quantitative evaluation should support qualitative, expert assessment. Quantitative metrics can challenge bias tendencies in peer review and facilitate deliberation. This should strengthen peer review, because making judgements about colleagues is difficult without a range of relevant information. However, assessors must not be tempted to cede decision-making to the numbers. Indicators must not substitute for informed judgement. Everyone retains responsibility for their assessments.
2) Measure performance against the research missions of the institution, group or researcher. Programme goals should be stated at the start, and the indicators used to evaluate performance should relate clearly to those goals. The choice of indicators, and the ways in which they are used, should take into account the wider socio-economic and cultural contexts. Scientists have diverse research missions. Research that advances the frontiers of academic knowledge differs from research that is focused on delivering solutions to societal problems. Review may be based on merits relevant to policy, industry or the public rather than on academic ideas of excellence. No single evaluation model applies to all contexts.
3) Protect excellence in locally relevant research. In many parts of the world, research excellence is equated with English-language publication. Spanish law, for example, states the desirability of Spanish scholars publishing in high-impact journals. The impact factor is calculated for journals indexed in the US-based and still mostly English-language Web of Science. These biases are particularly problematic in the social sciences and humanities, in which research is more regionally and nationally engaged. Many other fields have a national or regional dimension — for instance, HIV epidemiology in sub-Saharan Africa.
This pluralism and societal relevance tends to be suppressed to create papers of interest to the gatekeepers of high impact: English-language journals. The Spanish sociologists that are highly cited in the Web of Science have worked on abstract models or study US data. Lost is the specificity of sociologists in high-impact Spanish-language papers: topics such as local labour law, family health care for the elderly or immigrant employment [5]. Metrics built on high-quality non-English literature would serve to identify and reward excellence in locally relevant research.
4) Keep data collection and analytical processes open, transparent and simple. The construction of the databases required for evaluation should follow clearly stated rules, set before the research has been completed. This was common practice among the academic and commercial groups that built bibliometric evaluation methodology over several decades. Those groups referenced protocols published in the peer-reviewed literature. This transparency enabled scrutiny. For example, in 2010, public debate on the technical properties of an important indicator used by one of our groups (the Centre for Science and Technology Studies at Leiden University in the Netherlands) led to a revision in the calculation of this indicator [6]. Recent commercial entrants should be held to the same standards; no one should accept a black-box evaluation machine.
Simplicity is a virtue in an indicator because it enhances transparency. But simplistic metrics can distort the record (see principle 7). Evaluators must strive for balance — simple indicators true to the complexity of the research process.
5) Allow those evaluated to verify data and analysis. To ensure data quality, all researchers included in bibliometric studies should be able to check that their outputs have been correctly identified. Everyone directing and managing evaluation processes should assure data accuracy, through self-verification or third-party audit. Universities could implement this in their research information systems and it should be a guiding principle in the selection of providers of these systems. Accurate, high-quality data take time and money to collate and process. Budget for it.
6) Account for variation by field in publication and citation practices. Best practice is to select a suite of possible indicators and allow fields to choose among them. A few years ago, a European group of historians received a relatively low rating in a national peer-review assessment because they wrote books rather than articles in journals indexed by the Web of Science. The historians had the misfortune to be part of a psychology department. Historians and social scientists require books and national-language literature to be included in their publication counts; computer scientists require conference papers be counted.
Citation rates vary by field: top-ranked journals in mathematics have impact factors of around 3; top-ranked journals in cell biology have impact factors of about 30. Normalized indicators are required, and the most robust normalization method is based on percentiles: each paper is weighted on the basis of the percentile to which it belongs in the citation distribution of its field (the top 1%, 10% or 20%, for example). A single highly cited publication slightly improves the position of a university in a ranking that is based on percentile indicators, but may propel the university from the middle to the top of a ranking built on citation averages [7].
7) Base assessment of individual researchers on a qualitative judgement of their portfolio. The older you are, the higher your h-index, even in the absence of new papers. The h-index varies by field: life scientists top out at 200; physicists at 100 and social scientists at 20–30 (ref. 8). It is database dependent: there are researchers in computer science who have an h-index of around 10 in the Web of Science but of 20–30 in Google Scholar [9]. Reading and judging a researcher's work is much more appropriate than relying on one number. Even when comparing large numbers of researchers, an approach that considers more information about an individual's expertise, experience, activities and influence is best.
8) Avoid misplaced concreteness and false precision. Science and technology indicators are prone to conceptual ambiguity and uncertainty and require strong assumptions that are not universally accepted. The meaning of citation counts, for example, has long been debated. Thus, best practice uses multiple indicators to provide a more robust and pluralistic picture. If uncertainty and error can be quantified, for instance using error bars, this information should accompany published indicator values. If this is not possible, indicator producers should at least avoid false precision. For example, the journal impact factor is published to three decimal places to avoid ties. However, given the conceptual ambiguity and random variability of citation counts, it makes no sense to distinguish between journals on the basis of very small impact factor differences. Avoid false precision: only one decimal is warranted.
9) Recognize the systemic effects of assessment and indicators. Indicators change the system through the incentives they establish. These effects should be anticipated. This means that a suite of indicators is always preferable — a single one will invite gaming and goal displacement (in which the measurement becomes the goal). For example, in the 1990s, Australia funded university research using a formula based largely on the number of papers published by an institute. Universities could calculate the 'value' of a paper in a refereed journal; in 2000, it was Aus$800 (around US$480 in 2000) in research funding. Predictably, the number of papers published by Australian researchers went up, but they were in less-cited journals, suggesting that article quality fell [10].
10) Scrutinize indicators regularly and update them. Research missions and the goals of assessment shift and the research system itself co-evolves. Once-useful metrics become inadequate; new ones emerge. Indicator systems have to be reviewed and perhaps modified. Realizing the effects of its simplistic formula, Australia in 2010 introduced its more complex Excellence in Research for Australia initiative, which emphasizes quality.
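To make the percentile-based normalization of principle 6 concrete, here is a minimal sketch; the two field citation distributions and the percentile bands (top 20%, 10% and 1%) are invented purely for illustration, not taken from the manifesto.

from bisect import bisect_right

def percentile_band(citations, field_distribution, bands=(0.80, 0.90, 0.99)):
    """Score a paper by the percentile band it reaches within its own field:
    0 = below the 80th percentile, 1 = top 20%, 2 = top 10%, 3 = top 1%."""
    dist = sorted(field_distribution)
    pct = bisect_right(dist, citations) / len(dist)   # empirical percentile
    return sum(pct >= b for b in bands)

# Hypothetical field distributions: maths papers gather far fewer citations
# than cell-biology papers, so the same raw count means different things.
maths = [0, 1, 1, 2, 3, 3, 4, 6, 9, 40]
cell_biology = [2, 5, 8, 12, 20, 30, 45, 60, 90, 400]

print(percentile_band(9, maths))         # 2 -> reaches the top 10% of its field
print(percentile_band(9, cell_biology))  # 0 -> unremarkable in cell biology

With a percentile scheme like this, a single extreme outlier cannot dominate the score the way it dominates a citation average, which is the point the manifesto makes about rankings.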
Source: http://blog.sciencenet.cn/blog-41174-884979.html (Sun Xuejun's blog on ScienceNet; please credit the source when reposting).