Draft Report
Task 2.1 of the SOEIS project
Rome Meeting
June 17-19, 1999
WEBOMETRICS AND THE SELF-ORGANIZATION OF THE EUROPEAN INFORMATION SOCIETY
Moses A. Boudourides, Beatrice Sigrist and Philippos D. Alevizos
mboudour@upatras.gr
bsigrist@wb.unizh.ch
ABSTRACT: Virtual space is a concrete material development of the information society. What could be revealed about this social structure? In parallel to bibliometrics and scientometrics, the first technique applied to books, the second to scientific articles, we explore webometrics as a methodology for the World-Wide Web. We first present the technique and show how it can be used, and then take the European and the global information society as two examples to illustrate the methodology of webometrics. Moreover, we demonstrate the “triple-helix-ness,” that is, the inter-relationship and connectivity between universities, governments and industries, in both Europe and the US. Among other things, we conclude that in the US government is by far the most triple-helixed institution, while in Europe universities seem to be the most triple-helixed.
INTRODUCTION
Over the last two decades we have witnessed the increasing significance of electronic communication and, in particular, of the Internet. Digital libraries, large distributed databases, and other networked computerized resources tend to extend and replace traditional printed media. A common and easily accessed place on the Internet where these resources are situated is the World-Wide Web (referred to from now on as the web). Seeing the web as a global book or a distributed collection of books necessitates the study of the communication patterns emerging in the new medium.
Bibliometrics is the quantitative study of patterns in written communication, as in books, journals and other printed material. When it refers to scientific production and communication, it is usually called scientometrics. This field was initiated in the sixties by the pioneering work of Derek de Solla Price, Maurice Goldsmith and Eugene Garfield. The common sources of data for this analysis are the Science Citation Index (SCI) and the Social Science Citation Index (SSCI), both of the Institute for Scientific Information (ISI). Scientometric techniques were developed and used to discern the intellectual structure of science.
In this article, we explore and document a feasibility study of bibliometrics on the web. The aim is to test the extent to which scientometric analyses, such as cocitation analysis, can yield coherent, interpretable results at the level of the web. In a sense, the web represents Cyberspace and it develops a social structure. What could be revealed about this structure? In what follows, we explore whether webometrics could be a methodology to analyze this structure. By webometrics we refer to the quantitative study of electronic communication realized by the highly linked web. We illustrate our exploration into webometrics by two examples, one on the self-organization of the European Information Society and another on the Triple Helix constitution of institutions, i.e., universities, governments, and industries, in both Europe and the US.
CITATIONS
Analysis of citations is common in the sociology of science. Approaches to citations (citation patterns or citation behavior) allow us to derive maps of the structure of scientific specialties or disciplines and help construct typologies of different varieties of references and citations by content analysis (Gilbert 1977). Citations thus serve to explore the structure of science. The primary idea goes back to Derek de Solla Price, who documented the growth of scientific literature in his book Little Science, Big Science (1963). This book became a classic, suggesting that science is not a unified whole, but a mosaic of specialty areas. This new understanding fostered an effort to map the intellectual structure of science. The techniques for this analysis were taken from bibliometrics. Bibliometrics refers to ‘all quantitative aspects and models of sciences communication, storage, dissemination and retrieval of scientific information’ (Wormell 1998, p. 259). Bibliometrics applied to scientific articles is called scientometrics.
In the classical scientometric technique of citation analysis, the aim is to reveal the patterns of scientific communication and productivity that are embedded in the succession of references (citations) originating from scientific publications. An early extension of this method was bibliographic coupling, that is, counting the number of references that a given pair of documents has in common. Special attention is given to the emergence, change, connectedness, and codification of specialty areas. The connectedness of specialty areas is revealed by cocitation analysis. Henry Small and his colleagues at the Institute for Scientific Information (ISI) were among the first to develop the techniques of cocitation analysis for scientific material (Small 1973).
Small and Griffith have proposed an analysis of citations for the purpose of exploring the specialty structure of science. They suggest that cocitation (the citation of two documents by a third) “identifies relationships between papers which are regarded as important by authors in the specialty and provides a natural and quantitative way to group or cluster the cited documents” (Small and Griffith in Gilbert 1977, p. 118). Cocitation analysis is based on the assumption that “the greater the number of times that a pair of documents are cited together, the more likely it is that they are related in content” (Bellardo 1980, p. 231). Thus, cocitation analysis reveals how knowledge is structured “by jointly tapping the individual perceptions of all the authors whose work has been examined” (Gilbert 1977, p. 119).
An essential step in advancing the methods of citation analysis was taken by White and Griffith (1981), who introduced a new tool: author cocitation. Cocited author retrieval is actually retrieval based on the intersection of two oeuvres. The assumption of author cocitation is that the more frequently two authors are cited together, and the more similar their patterns of cocitation with others, the closer the relationship between them. Cocited author searches are performed by specifying two authors and retrieving any paper that cites both. Author cocitation analysis, therefore, is closely related to the cocited document clustering and mapping techniques associated with Small, Griffith and co-workers (Small and Griffith 1974, Griffith et al. 1974, etc.). Clusters generated by this type of analysis are believed to reflect research networks or specialties. Both approaches rely on the Institute for Scientific Information (ISI) citation database. For both analyses, cluster analysis and multidimensional scaling are used to create two-dimensional maps of these citations indicating a scientific structure.
Further developments were confronted by the criticisms of citation/cocitation analyses as expressed by actor-network theorists. They have argued that citation/cocitation analyses embody the assumptions of the institutional sociology of science, because they focus on social relations and networks as suggested only by citation patterns. In contrast, the actor-network theorists developed an alternative analysis that, they believe, focuses more on the question of the content of science rather than its social relations or institutional side. Their proposal is to develop a scientometric analysis of cowords, i.e., co-occurring key words, usually a description of a document that appears in an abstract, title, or key word listing. Coword analysis is used to describe a network of interactions in science and, therefore, it does not privilege human actors in quite the same way that citation analysis might (e.g. Leydesdorff 1997, Fujigaki 1998, Fujigaki and Nagata 1998).
Why are we interested in citations? An interesting argument is that citations are a serious measure of persuasion and attention. In scientometrics, references would be convincing neither as a matter of property rights or priority claims, nor as an argument that need not be argued again, but as references of persuasion. Scientific papers could be interpreted as tools of persuasion (Gilbert 1977). “A scientist is rewarded through recognition for producing results which are seen as new, important, and true. These qualities are not self-evident to the readers of the paper. Authors show how their results advance other papers work. These papers have already been accepted as valid science and provide a measure of persuasive support for the newly announced findings” (Gilbert 1977, p. 116). To persuade readers of the “scientificity” and newness of one’s results, the best way is to cite the most acknowledged, the most believed author. While citing is persuading, linking is fighting for attention. The term “to pay attention” incorporates a notion of payment. As Michael Goldhaber (1997, p. 182 in Jones 1999, p. 20) points out, the resource that is scarce and desirable in virtual space is attention. Thus, it seems truly ironic that the technology of the Internet poses great difficulty when we seek to understand its social interconnectedness.
In scientometrics a common estimate of scientific productivity is given by the so-called Price law, according to which, half of the scientific papers in a given research field are contributed by the square root of the total number of scientific authors. This is similar to the older Lotka’s law, which holds that the number of authors who produce n papers is inversely proportional to the square of n. Furthermore, there is Bradford’s law (or distribution) representing the pattern of unequal references across journals in a bibliography. In a sense, according to Bradford’s law for a search on a specific topic, a large number of the relevant articles will be concentrated in a small number of journal titles. In principle, one should expect to be able to extend these types of laws in webometrics.
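As a rough numerical illustration of the first two laws, the following Python sketch evaluates them on made-up figures. The function names and the sample counts (100 single-paper authors, 400 authors in a field) are hypothetical and serve only to show the arithmetic; real author distributions follow these laws only approximately.

```python
import math


def lotka_expected_authors(single_paper_authors, n):
    """Expected number of authors with n papers under Lotka's law.

    The law states that this number is inversely proportional to n squared,
    so it equals the number of single-paper authors divided by n**2.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    return single_paper_authors / n ** 2


def price_core_size(total_authors):
    """Size of the prolific core under Price's law: the square root of all authors."""
    return math.sqrt(total_authors)


# Hypothetical field in which 100 authors wrote exactly one paper each.
for n in (1, 2, 5, 10):
    print(n, lotka_expected_authors(100, n))

# Hypothetical field with 400 authors: the core that would contribute
# half of the papers has sqrt(400) members.
print(price_core_size(400))
```

An analogous check in webometrics would replace authors by hosts and papers by web pages or hyperlinks, which is the kind of extension the paragraph above anticipates.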
In scientometrics one usually observes an "immediacy effect," the pattern whereby recent publications are cited more frequently than older ones. This effect is usually estimated by two measures: Price’s index and the citation half-life. The situation in webometrics appears to be quite different because aging effects are of relatively minor importance over the rather restricted period for which web resources can be observed. We believe that a number of other variables (such as common language and culture, geographical and sociocultural affinities, etc.) might play a differentiating role in webometrics similar to that of time obliteration in scientometrics. (We plan to explore this idea in future work.)
INFORMETRICS
In the 1980s the term informetrics was proposed for research in a broad sense. While bibliometrics and scientometrics refer to all quantitative aspects and models of printed media and sciences, informetrics is not limited to media or scientific communication (Almind and Ingwersen 1997, p. 404). Nor is it restricted to scientific research. Rather, it is considered usable for tasks such as issue management, gathering of business intelligence and research evaluation (Almind and Ingwersen 1997, p. 424). Informetrics is, thus, an emerging subfield of information science, which is based on the combination of advances in information retrieval and quantitative studies of information flows.
The application of informetric methods to the field of electronic communication goes back to William Paisley in 1990 (Almind and Ingwersen 1997, p. 404). The idea is to carry over to the web the traditional bibliometric applications, which include the study of communication patterns, the identification of research fronts, historical studies of the development of a discipline or domain, and the evaluation of research activities of countries, institutions or individuals (Ingwersen and Christensen 1997, p. 205). The web is implemented by the HyperText Transfer Protocol (HTTP) and the HyperText Markup Language (HTML) as a formatting tool. However, it is possible to use the HTTP commands and the HTML codes to search and retrieve information and, thus, also to perform informetric analysis (Almind and Ingwersen 1997, p. 405-6). The idea of regarding the web as a citation network, in parallel to bibliometrics and scientometrics, is quite new. Although different methods of searching distributed databases are already widely used in an information retrieval context, very little is yet known about semantic searches and content processing of the kind attempted by webometrics.
How can informetric methods be used on the web? In fact, the opportunities for using informetric methods are not yet well elaborated. Informetric analyses are still at a stage of experimentation. On the one hand, instead of using scientific citation (ISI) based data, the applicability of online methodological approaches is being tested. Compared to databases, online methods would be fast, inexpensive and reproducible, would provide instant results, and would allow for the direct combination of domain-dependent and ISI-dependent databases (Ingwersen and Christensen, 1997, p. 206; Christensen, Ingwersen & Wormell 1997). On the other hand, it seems possible to use informetric methods on the web for both qualitative and quantitative analyses. A general informetric study could analyze web pages and their visibility as case studies (Wormell 1998, p. 263) or e-mail use and web surveys (Jones 1999). For quantitative analysis, generally, the scientometric scheme of author(s) - article - references now becomes, in webometrics, web server or host (a root URL) - web page or site - hyperlinks. Moreover, the standard tools for locating webometric data are provided by the many existing search engines (for example, Alta Vista, Lycos, Yahoo, Infoseek, Excite, etc.).
A first examination of web pages was done at the University of California, Berkeley (Woodruff et al. 1996, p. 1) for the following characteristics: document size, number and types of tags, attributes, file extensions, protocols, ports, number of in-links, and ratio of document size to number of tags and attributes. The authors then summarized, from a more limited set of documents, the ten most used tags, the ten most common HTML errors, etc. The distribution of documents in the data set by domain was: Other: 41%, Com: 20%, Edu: 27%, Gov: 4%, Net: 4%, Org: 3%, Mil: 1%, where “other” includes all domains other than the given top-level domains (Woodruff et al. 1996, p. 4). Besides their results, the authors draw three conclusions from their experience. First, dealing with large data sets on the order of millions of documents is difficult and time-consuming; second, the web changes exceptionally quickly; and, last, longitudinal studies examining trends would be interesting in order to find out which characteristics change fairly quickly and which more slowly, and how they react to newly introduced tools (Woodruff et al. 1996, p. 14).
Another early attempt to systematically apply bibliometric methods to the Internet studied the behavior of this interactive medium. Bar-Ilan examined the reactions to the “mad cow disease” in the Usenet newsgroups with emphasis on volumes and distributions (Bar-Ilan 1997, p. 29). “Bibliometric laws set a formal framework for studying large volumes of literature” (Bar-Ilan 1997, p. 31). The purpose of her study is to examine the applicability of bibliometric laws to the new medium of “publication” and communication. The example taken is the discussion of the mad cow disease in the Usenet newsgroups, a worldwide electronic bulletin board system. Alta Vista was used to locate the newsgroups discussing topics related to the mad cow disease. The type of services provided by Alta Vista determined the applied methodology. In analogy to bibliometrics, in which the growth function of scientific literature indicates trends in scientific activity, the growth function of news items indicates changes in the attitude and interest of the discussants. Furthermore, “if the news items showed a concentration similar to that of papers in scientific journals, it would indicate that there is a natural distribution of messages among relevant newsgroups, even though no policy is enforced” (Bar-Ilan 1997, p. 31). Newsgroups have no refereeing process, nor do they have the limits that scientific journals place on the number of papers published per issue and on the number of issues per year. However, newsgroup members also compete for attention and refrain from posting too many messages to a given newsgroup so as not to bother the readers (Bar-Ilan 1997, p. 46).
A number of studies conducted around the Royal School of Library and Information Sciences in Copenhagen, Denmark, have pursued an interesting quantitative line of investigation that follows, as far as possible, the same steps as citation studies of scientific databases. These studies focus on the application of informetric methods to the web and, in particular, they investigate the adaptation to the web of the impact factor, the most common instrument in informetrics. In bibliometrics, the impact factor evaluates international scientific journals in order to measure their impact in the field. On the one hand, these Danish studies explore how the international impact factor could be extended and which analyses can be regarded as reliable. For example, in her study titled “how international are international journals?” Wormell introduces a new dimension of quantitative analysis to the impact factor analysis. She bases the analysis of seven international journals not only on correlations between the geographic distribution patterns of authors, citations and subscriptions, but additionally examines the knowledge export of each journal to other disciplines. Wormell thus tests for internationality not only by visibility and impact but also by the newly introduced variable “knowledge export” (Wormell 1998a). On the other hand, the Danish concern is to test the reliability of the impact factor on the web. In his study, Ingwersen (1998) tested national, sector, and institutional impact factors on the web for their feasibility and reliability. His study shows high confidence for national and sector domains, while institutional impact factors would be less reliable (Ingwersen 1998, p. 243).
WEBOMETRICS
Almind and Ingwersen (1997) describe the use of traditional informetric methods as a starting point for analysis on the web as generally conceivable for any kind of statistical aspects (language, word, phrase frequencies), characteristics of authors, their productivity and the degree of their collaboration, as well as citation analysis for the distribution over authors, institutions, and for the measure of growth of a subject or a database, and concomitant growth of new concepts, definition and measurement of information and types and characteristics of retrieval performance measures.
Among the first studies to explore the explosive growth of the web and “bibliometrics on the web” are those of Abraham (1996, 1996a) and Larson (1996). Their examination must be interpreted as the first attempt to apply cocitation analysis to the vast and growing hypertext network of a virtual space. Taking the web as “the distributed digital libraries of the future” (Larson 1996), they adapt the tools and techniques developed for the analysis of intellectual structure in paper-based libraries to this network-based environment.
Abraham (1996, 1996a) has undertaken several steps into what he calls “webometry,” aiming at the construction of cognitive maps and mathematical models of the web. Among them, an interesting technique he describes is a complete, step-by-step procedure for measuring the connectivity of a sub-web of the web.
For an exploration into the intelligence and morphogenesis of the web, Abraham selects groups of domains as the nodes of a network and uses the Alta Vista search engine to indicate the number of pages at each node and the number of links between one node and another. Abraham considers the domain names as the nodes and the connections as the links. “Given two domains, that is, nodes, we must determine all links from any page of the first domain, to any page of the second domain” (Abraham 1996). Next, he calculates a relative density and records all pairs of nodes in a matrix. As an illustration he takes the University of California system, which contains 9 universities and, correspondingly, 9 URLs. From the searches with Alta Vista (using an advanced query of the form “host:U(i) AND link:U(j)”), he obtains the raw data for the connectivity matrix. He records the number of HTML documents containing one or more links to any document of another node, for which Alta Vista indicates the connectivity number. The results form a 9 by 9 matrix, the synergy matrix, for which he adopts a simple gray scale to represent the synergy data (Abraham 1996).
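The query pattern described above can be sketched in Python as follows. This is only an illustration of how the pairwise queries and the resulting matrix fit together: the three host names are an arbitrary sample, `count_hits` is a hypothetical callable standing in for whatever search service returns the number of matching pages, and leaving the diagonal at zero is an assumption of this sketch. The relative density mentioned by Abraham is not computed here.

```python
def connectivity_queries(hosts):
    """Return {(i, j): query} for every ordered pair of distinct hosts."""
    queries = {}
    for i, source in enumerate(hosts):
        for j, target in enumerate(hosts):
            if i != j:
                # Pages on host `source` that link to host `target`.
                queries[(i, j)] = f"host:{source} AND link:{target}"
    return queries


def connectivity_matrix(hosts, count_hits):
    """Fill an n-by-n matrix with hit counts; the diagonal is left at 0 (assumption)."""
    n = len(hosts)
    matrix = [[0] * n for _ in range(n)]
    for (i, j), query in connectivity_queries(hosts).items():
        matrix[i][j] = count_hits(query)
    return matrix


# Illustrative sample of hosts; a real study would list all nodes of the sub-web.
hosts = ["berkeley.edu", "ucla.edu", "ucsd.edu"]


def fake_count_hits(query):
    """Hypothetical stand-in for a search engine: returns a dummy number."""
    return len(query)


for row in connectivity_matrix(hosts, fake_count_hits):
    print(row)
```

Because links are directed, the matrix built this way is in general not symmetric, unlike the cocitation matrices discussed later.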
Larson's study (1996) is based both on an analysis of over 30 gigabytes of web pages, collected by a web-crawler program used to generate the indexes for a network search engine, and on the use of the Alta Vista search engine. On a set of Earth Science related web sites, he conducts a cocitation analysis and then examines the statistical characteristics of web documents and their links. Larson first selects the core set of items, retrieves cocitation frequency information for this core set, compiles the raw cocitation frequency matrix, and uses correlation analysis to convert the raw frequencies into correlation coefficients. Next, he runs a multivariate analysis program (multidimensional scaling or MDS) on the correlation matrix and, subsequently, tries to interpret the resulting maps (Larson 1996).
Larson produced his core set of web sites, focused on geographic information systems, earth sciences and satellite remote sensing, by the following very interesting sampling technique. First, he submitted the following search to Alta Vista: “link:pubweb.parc.xerox.com/map AND link:xtreme.gsfc.nasa.gov.” This search aimed to find a set of web documents containing links to both the Xerox Map browser and the home page for NASA's Advanced Very High Resolution Radiometer remote sensing projects (Larson 1996). This resulted in a set of 115 web pages, of which 43 were retained after individual home pages, bibliography pages and the like had been excluded. From these 43 pages all of the links to other pages were extracted. This resulted in 7209 alphabetically sorted individual URLs. Larson eliminated links that occurred in fewer than 3 of the citing documents as well as citations appearing to be outside the boundary of the set. Since he considered the resulting set of 332 potential candidates still too large, only a hotlist of the “best” sites, based on his judgement, was retained, reducing the core set to 34 sites (Larson 1996). From this core set of web sites he produced his raw cocitation matrix. He programmed a “web robot” to carry out the many searches needed for the raw cocitation matrix. The robot queried the web for 5 hours.
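The frequency filter in this sampling procedure (dropping links that occur in fewer than 3 citing documents) can be sketched as follows. The page names and URLs are invented for illustration, and counting each citing page at most once per URL is an assumption of this sketch; the judgement-based steps of the procedure are not modelled.

```python
from collections import Counter


def frequently_cited(links_by_page, min_citing_pages=3):
    """Keep URLs that are linked from at least `min_citing_pages` distinct pages."""
    counts = Counter()
    for links in links_by_page.values():
        # A page counts once per URL, even if it repeats the same link.
        counts.update(set(links))
    return sorted(url for url, n in counts.items() if n >= min_citing_pages)


# Hypothetical citing pages and the URLs extracted from them.
pages = {
    "page1": ["http://a.example/", "http://b.example/", "http://a.example/"],
    "page2": ["http://a.example/", "http://c.example/"],
    "page3": ["http://a.example/", "http://b.example/"],
    "page4": ["http://b.example/"],
}

print(frequently_cited(pages))
```

With this toy input, the URL cited by a single page is dropped and the two URLs cited by three pages each are kept.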
METHODOLOGY
We have been using the Alta Vista search engine for our webometric analyses. The following table describes the search commands we have been employing:
| Keyword | Function |
|---|---|
| host:name | Finds pages on a specific computer. The search host:altavista.digital.com would find pages on the AltaVista computer, and host:dilbert.unitedmedia.com would find pages on the computer called dilbert at unitedmedia.com. |
| link:URLtext | Finds pages with a link to a page with the specified URL text. Use link:altavista.digital.com to find all pages linking to AltaVista. |
| text:text | Finds pages that contain the specified text in any part of the page other than an image tag, link, or URL. The search text:cow9 would find all pages with the term cow9 in them. |
In addition to these Alta Vista commands we have been using Alta Vista’s advanced search capabilities together with the following Boolean operators:
| Keyword | Action |
|---|---|
| AND | Finds only documents containing all of the specified words or phrases. Mary AND lamb finds documents with both the word Mary and the word lamb. |
| OR | Finds documents containing at least one of the specified words or phrases. Mary OR lamb finds documents containing either Mary or lamb. The found documents could contain both, but do not have to. |
In this way, we have done a first exploration of webometrics, which is described in the subsequent sections. In fact, we have applied webometrics to two concrete case studies. The first refers to the self-organization of the European Information Society and the second to the measure of the “Triple-Helixness” (co-evolution of the three institutions: university, government, and industry). In these webometric case studies, our data from the web were processed by two different statistical methods. In the first case study we used a multidimensional scaling (MDS) analysis, and in the second a multiple correspondence analysis.
At this point, we should remark that carrying out a comprehensive data collection involves difficulties and shortcomings. Generally, it is acknowledged that the better the search engine, the better the search (Almind and Ingwersen 1997, p. 423). Many agree that Alta Vista already offers comparatively sophisticated search possibilities. The usual problems encountered in web searches are mainly related to the lack of structure in web pages. First, web pages have no enforced conformity of form and content, and the search parameters cannot be strictly defined (e.g., europe, europa). Therefore, time-consuming preparatory work is necessary to carry out a quantitative webometric analysis, and it may not be easy to sort or identify the web pages one wishes to investigate (Almind and Ingwersen 1997, p. 408). In particular, clusters are difficult to find if they are not cited or do not contain citations or, more likely, are not cited outside the cluster or group (Almind and Ingwersen 1997, p. 424). Second, all citations are retrospective, whereas the web changes constantly in real time (Almind and Ingwersen 1997, p. 406).
A WEBOMETRIC MULTIDIMENSIONAL SCALING ANALYSIS OF
THE SELF-ORGANIZATION OF
THE EUROPEAN INFORMATION SOCIETY
DATA COLLECTION AND DATA PROCESSING
In this case study, our aim is to investigate how connected and inter-related certain institutions participating in the self-organization of the European information society are. For this purpose, and in order to perform a first webometric analysis that can later be tested on a bigger set of data, we decided to include in our sample entities from European and world-wide organizations, from educational institutions participating in the dynamics of the information society, and from research institutions specialized in complexity and self-organization theories.
Moreover, since our investigation is carried out on the web, we have focused our attention on certain web servers at the level of sub-domains, which from now on will be denoted as sdws. Clearly, a sdws contains a large number of web servers, all of which belong to the same sub-domain. For example, uva.nl is the sdws of the University of Amsterdam, which constitutes a sub-domain of the Netherlands (nl); the sdws uva.nl includes the web server www.uva.nl of the whole University of Amsterdam, the web server www.psy.uva.nl of the Psychology Department, the web server www.chem.uva.nl of the Chemistry Department, etc. Thus the sdws uva.nl both represents and comprises, at the higher local level, all of the web servers at the University of Amsterdam.
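The notion of a sdws can be made concrete with a small membership test. The following Python sketch, in which the function name `belongs_to_sdws` is hypothetical, treats a host as part of a sdws when its name is the sub-domain itself or ends with a dot followed by the sub-domain; this suffix rule is an assumption about how a search engine would match such names, not something the search service is known to guarantee.

```python
def belongs_to_sdws(hostname, sdws):
    """Return True if `hostname` is the sub-domain `sdws` or a server below it."""
    hostname = hostname.lower().rstrip(".")
    sdws = sdws.lower()
    # Requiring the leading dot avoids false matches such as "notuva.nl".
    return hostname == sdws or hostname.endswith("." + sdws)


for host in ("www.uva.nl", "www.psy.uva.nl", "www.chem.uva.nl", "www.vu.nl"):
    print(host, belongs_to_sdws(host, "uva.nl"))
```

The first three servers fall under uva.nl, while www.vu.nl belongs to a different sdws.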
Coming now to the specific sdws, which we have considered in this webometric case study, we have chosen the following ones:
| SDWS | Code Name |
|---|---|
| unizh.ch or unilu.ch | CH |
| uniroma1.it or unibo.it | IT |
| uva.nl or vu.nl | NL |
| surrey.ac.uk | UK |
| uni-bielefeld.de | DE |
| go.jp | JP |
| duth.gr or upatras.gr | GR |
| eu.int or cordis.lu | EU |
| unesco.org | UN |
| santafe.edu or lanl.org | CO |
As one can see, we have assembled 10 sdws or, better, groups of sdws. The first seven come from the European universities participating in the SOEIS project together with the Japanese university collaborating on this project; the eighth represents the European Union, the ninth UNESCO, and the tenth two well-known sdws devoted to complexity and self-organization theories. (The “or” syntax will be made clear in the search commands given below.)
Of course, this is only a first exploration of a webometric analysis of the self-organization of the European information society, and to test our methodology we have restricted ourselves to just 10 sdws. In a subsequent, more detailed analysis, we intend to include more sdws and even to use Larson’s sampling technique described above.
Having obtained a characteristic core set of sdws in the area of the self-organization of the European information society, what comes next is to produce a raw cocitation matrix. To do so, we have searched the web using the Alta Vista command of the form:
link:sdws(i) AND link:sdws(j), for i,j = 1, ..., 10,
where in case of a group of sdws we should have at least one of the above terms:
link:sdws(m) OR link:sdws(n).
In this way, the 10x10 matrix of raw cocitation data was produced. Since this matrix is symmetrical, its lower triangular form was:
| | CH | IT | NL | UK | DE | JP | GR | EU | UN |
|---|---|---|---|---|---|---|---|---|---|
| IT | 624 | | | | | | | | |
| NL | 999 | 1282 | | | | | | | |
| UK | 330 | 476 | 1351 | | | | | | |
| DE | 648 | 377 | 872 | 308 | | | | | |
| JP | 713 | 576 | 1535 | 639 | 407 | | | | |
| GR | 100 | 200 | 293 | 183 | 106 | 147 | | | |
| EU | 470 | 910 | 1260 | 499 | 535 | 2830 | 160 | | |
| UN | 100 | 199 | 398 | 166 | 73 | 870 | 24 | 2251 | |
| CO | 240 | 169 | 726 | 244 | 177 | 662 | 122 | 211 | 55 |
To be more specific the number at the position (i,j) in the above matrix denotes the number of web pages containing links to both the i-th and the j-th sdws’s.
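The construction of the search commands for all pairs of sdws groups can be sketched in Python as follows. The grouping dictionary repeats the table of sdws given earlier; the helper names `link_term` and `cocitation_query` are hypothetical, and the use of parentheses to group the OR terms is an assumption about the advanced query syntax rather than something stated in the text.

```python
from itertools import combinations

# Groups of sdws and their code names, as listed in the table of sdws.
GROUPS = {
    "CH": ["unizh.ch", "unilu.ch"],
    "IT": ["uniroma1.it", "unibo.it"],
    "NL": ["uva.nl", "vu.nl"],
    "UK": ["surrey.ac.uk"],
    "DE": ["uni-bielefeld.de"],
    "JP": ["go.jp"],
    "GR": ["duth.gr", "upatras.gr"],
    "EU": ["eu.int", "cordis.lu"],
    "UN": ["unesco.org"],
    "CO": ["santafe.edu", "lanl.org"],
}


def link_term(sdws_group):
    """One link: term, or an OR of several terms when the group has more than one sdws."""
    terms = [f"link:{sdws}" for sdws in sdws_group]
    if len(terms) == 1:
        return terms[0]
    return "(" + " OR ".join(terms) + ")"  # parentheses are an assumed grouping syntax


def cocitation_query(group_a, group_b):
    """Query for pages that link to both groups."""
    return f"{link_term(group_a)} AND {link_term(group_b)}"


# One query per unordered pair of groups: 10 groups give 45 pairs.
queries = {
    (a, b): cocitation_query(GROUPS[a], GROUPS[b])
    for a, b in combinations(GROUPS, 2)
}

print(len(queries))
print(queries[("NL", "UK")])
```

Only unordered pairs are generated because the cocitation count for (i, j) is the same as for (j, i), which is why the lower triangle of the matrix suffices.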
RESULTS
The raw cocitation matrix was entered in the SAS system first to be converted to the following correlation matrix:
| | CH | IT | NL | UK | DE | JP | GR | EU | UN | CO |
|---|---|---|---|---|---|---|---|---|---|---|
| CH | 1.000 | | | | | | | | | |
| IT | 0.813 | 1.000 | | | | | | | | |
| NL | 0.789 | 0.751 | 1.000 | | | | | | | |
| UK | 0.859 | 0.895 | 0.967 | 1.000 | | | | | | |
| DE | 0.920 | 0.940 | 0.690 | 0.820 | 1.000 | | | | | |
| JP | 0.319 | 0.692 | 0.412 | 0.439 | 0.511 | 1.000 | | | | |
| GR | 0.769 | 0.822 | 0.849 | 0.896 | 0.700 | 0.284 | 1.000 | | | |
| EU | 0.262 | 0.189 | 0.282 | 0.271 | 0.044 | 0.555 | -0.141 | 1.000 | | |
| UN | 0.154 | 0.488 | 0.448 | 0.193 | 0.276 | 0.941 | 0.077 | 0.991 | 1.000 | |
| CO | 0.791 | 0.714 | 0.693 | 0.869 | 0.677 | 0.325 | 0.658 | 0.445 | 0.123 | 1.000 |
The purpose of the above step was to convert raw co-occurrence counts into values that can be treated as a measure of topical proximity between sdws. Sdws with high correlations can be seen as similar to one another.
The next step was to process the above correlation matrix with the SAS Multidimensional Scaling (MDS) procedure. MDS is a statistical technique for uncovering “hidden structure” in data. Its output is a spatial representation of the data. In fact, we obtain the following 3 projections of our data on three two-dimensional planes:
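The conversion from raw counts to correlations can be sketched in plain Python. The sketch rebuilds the full symmetric matrix from the reported lower triangle and correlates the cocitation profiles of two sdws. How the undefined diagonal is handled is not stated in the text; skipping, for each pair (i, j), the two cells in columns i and j is an assumption of this sketch, as are all function names.

```python
import math

LABELS = ["CH", "IT", "NL", "UK", "DE", "JP", "GR", "EU", "UN", "CO"]

# Lower triangle of the raw cocitation matrix (rows IT to CO), as reported.
LOWER = [
    [624],
    [999, 1282],
    [330, 476, 1351],
    [648, 377, 872, 308],
    [713, 576, 1535, 639, 407],
    [100, 200, 293, 183, 106, 147],
    [470, 910, 1260, 499, 535, 2830, 160],
    [100, 199, 398, 166, 73, 870, 24, 2251],
    [240, 169, 726, 244, 177, 662, 122, 211, 55],
]


def full_symmetric(lower):
    """Expand a lower triangle into a full matrix; None marks the undefined diagonal."""
    n = len(lower) + 1
    matrix = [[None] * n for _ in range(n)]
    for i, row in enumerate(lower, start=1):
        for j, value in enumerate(row):
            matrix[i][j] = matrix[j][i] = value
    return matrix


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def profile_correlation(matrix, i, j):
    """Correlate rows i and j, skipping columns i and j (assumed diagonal handling)."""
    keep = [k for k in range(len(matrix)) if k not in (i, j)]
    return pearson([matrix[i][k] for k in keep], [matrix[j][k] for k in keep])


M = full_symmetric(LOWER)
i, j = LABELS.index("CH"), LABELS.index("IT")
print(round(profile_correlation(M, i, j), 3))
```

Under this assumption the CH and IT profiles correlate at about 0.813, which agrees with the corresponding entry of the correlation matrix; the other entries can be checked in the same way.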

Plot 1

Plot 2

Plot 3
Two important conclusions can be drawn from Plots 1 and 2: (i) The European university and complexity sdws are clustered in a different area than the group formed by the European Union, UN and Japanese university sdws. (ii) The axis corresponding to Dimension 1 could be interpreted as indicating a decreasing transition from more complex sdws like EU and UN to less complex university sdws like GR.
A WEBOMETRIC MULTIPLE CORRESPONDENCE ANALYSIS OF THE TRIPLE HELIX DYNAMICS
DATA COLLECTION AND DATA PROCESSING
In this case study our aim was a webometric analysis of 112 sdws (web servers at the level of sub-domains), each of which was characterized by 5 properties or categorical variables. The first 3 variables indicate whether a sdws possesses links to Triple Helix institutions: the first refers to Governmental links, the second to University links and the third to Industrial links. The fourth variable is the number of incoming links, i.e., links towards the considered sdws from the rest of the Internet (or, better, the web). Evidently, the fourth variable is related to the traffic towards the sdws, or its visibility on the web. The fifth and last variable describes the specific type of the considered sdws. In summary, these were the variables for our 112 data points (sdws): Government, University, Industry, Links and Type.
Using the Alta Vista search command host:"host" AND link:"link", we determined whether a sdws possesses links to the Triple Helix, in which case we assigned the value 1 to it, or does not possess such links, in which case we assigned the value 2 to it. Moreover, with the command link:"host" we determined the number of incoming links to a sdws. Finally, the type of a sdws was characterized by its contents. In this way, the following data were obtained:
| Domains | Government | University | Industry | Links | Type |
|---|---|---|---|---|---|
| unizh.ch | 1 | 1 | 1 | 23163 | u-e |
| uniroma1.it | 2 | 1 | 1 | 10426 | u-e |
| unibo.it | 1 | 1 | 1 | 23371 | u-e |
| surrey.ac.uk | 1 | 1 | 1 | 25390 | u-e |
| uni-bielefeld.de | 1 | 2 | 1 | 11777 | u-e |
| duth.gr | 1 | 2 | 2 | 3501 | u-e |
| upatras.gr | 1 | 1 | 1 | 3316 | u-e |
| uva.nl | 1 | 1 | 1 | 33596 | u-e |
| vu.nl | 1 | 1 | 1 | 30571 | u-e |
| harvard.edu | 1 | 1 | 1 | 201464 | u-us |
| jhu.edu | 1 | 1 | 1 | 81124 | u-us |
| mit.edu | 1 | 1 | 1 | 470010 | u-us |
| princeton.edu | 1 | 1 | 1 | 102454 | u-us |
| ucla.edu | 1 | 1 | 1 | 120479 | u-us |
| berkeley.edu | 1 | 1 | 1 | 360740 | u-us |
| santafe.edu | 1 | 1 | 2 | 9481 | u-us |
| duke.edu | 1 | 1 | 1 | 92543 | u-us |
| yale.edu | 1 | 1 | 1 | 112080 | u-us |
| cornell.edu | 1 | 1 | 1 | 220283 | u-us |
| ntua.gr | 1 | 1 | 1 | 14047 | u-e |
| sorbonne.fr | 2 | 2 | 2 | 268 | u-e |
| uni-heidelberg.de | 1 | 1 | 1 | 30184 | u-e |
| cam.ac.uk | 1 | 1 | 1 | 68843 | u-e |
| ethz.ch | 1 | 1 | 1 | 69975 | u-e |
| univie.ac.at | 1 | 1 | 1 | 43288 | u-e |
| stanford.edu | 1 | 1 | 1 | 169489 | u-us |
| caltech.edu | 1 | 1 | 1 | 57474 | u-us |
| umn.edu | 1 | 1 | 1 | 138605 | u-us |
| columbia.edu | 1 | 1 | 1 | 104825 | u-us |
| arizona.edu | 1 | 1 | 1 | 90344 | u-us |
| wsj.com | 2 | 2 | 1 | 46715 | m-us |
| iht.com | 1 | 2 | 1 | 4479 | m-us |
| washingtonpost.com | 1 | 1 | 1 | 117325 | m-us |
| latimes.com | 1 | 1 | 1 | 69828 | m-us |
| nytimes.com | 2 | 2 | 2 | 72045 | m-us |
| chicagotribune.com | 1 | 2 | 2 | 14954 | m-us |
| ft.com | 2 | 2 | 2 | 32716 | m-us |
| newsweek.com | 1 | 1 | 2 | 8465 | m-us |
| boston.com | 1 | 1 | 1 | 42638 | m-us |
| csmonitor.com | 1 | 1 | 1 | 40364 | m-us |
| nzz.ch | 1 | 2 | 2 | 7061 | m-e |
| lastampa.it | 2 | 2 | 2 | 4091 | m-e |
| guardian.co.uk | 2 | 2 | 2 | 15626 | m-e |
| faz.de | 2 | 2 | 2 | 3063 | m-e |
| lemonde.fr | 2 | 2 | 2 | 11744 | m-e |
| the-times.co.uk | 2 | 2 | 2 | 22292 | m-e |
| elpais.es | 2 | 2 | 2 | 12117 | m-e |
| gelderlander.nl | 2 | 2 | 2 | 855 | m-e |
| volkskrant.nl | 2 | 2 | 2 | 4134 | m-e |
| berlinermorgenpost.de | 2 | 2 | 2 | 1 | m-e |
| microsoft.com | 1 | 1 | 1 | 2657192 | i |
| sun.com | 1 | 1 | 1 | 198681 | i |
| linuxworld.com | 2 | 2 | 1 | 22705 | i |
| ibm.com | 1 | 1 | 1 | 251551 | i |
| apple.com | 1 | 1 | 1 | 210052 | i |
| hp.com | 1 | 1 | 1 | 109584 | i |
| sgi.com | 1 | 1 | 1 | 79137 | i |
| toshiba.com | 2 | 2 | 1 | 165 | i |
| ericsson.com | 2 | 2 | 2 | 7069 | i |
| nokia.com | 2 | 2 | 1 | 15841 | i |
| compaq.com | 1 | 1 | 1 | 57743 | i |
| philips.com | 2 | 1 | 2 | 29495 | i |
| sony.com | 2 | 1 | 1 | 96230 | i |
| amazon.com | 2 | 2 | 2 | 838950 | i |
| xerox.com | 2 | 2 | 1 | 122 | i |
| jnj.com | 2 | 2 | 2 | 2729 | i |
| shell.com | 1 | 2 | 1 | 3946 | i |
| swissre.com | 2 | 1 | 1 | 812 | i |
| bankofamerica.com | 2 | 2 | 2 | 231 | i |
| bankofny.com | 2 | 2 | 2 | 1085 | i |
| ubs.ch | 2 | 2 | 2 | 941 | i |
| csg.ch | 2 | 2 | 2 | 103 | i |
| unesco.org | 1 | 1 | 2 | 20907 | ngo |
| oecd.org | 1 | 2 | 1 | 16606 | ngo |
| greenpeace.org | 1 | 2 | 1 | 20045 | ngo |
| wwf.org | 1 | 1 | 1 | 4239 | ngo |
| amnesty.org | 1 | 1 | 1 | 15832 | ngo |
| un.org | 1 | 1 | 1 | 43320 | ngo |
| hrw.org | 1 | 1 | 1 | 7282 | ngo |
| ihf-hr.org | 1 | 1 | 2 | 502 | ngo |
| humanweb.org | 1 | 2 | 2 | 955 | ngo |
| redcross.org | 1 | 1 | 1 | 10752 | ngo |
| ifaw.org | 1 | 2 | 2 | 585 | ngo |
| nthp.org | 2 | 2 | 2 | 1766 | ngo |
| oxfam.org | 2 | 2 | 2 | 1236 | ngo |
| wilpf.org | 1 | 2 | 2 | 115 | ngo |
| apc.org | 1 | 1 | 1 | 109242 | ngo |
| eff.org | 1 | 1 | 1 | 102171 | ngo |
| igc.org | 1 | 1 | 1 | 43767 | ngo |
| oneworld.org | 1 | 1 | 1 | 33376 | ngo |
| activistnet.org | 1 | 2 | 1 | 1598 | ngo |
| msf.org | 2 | 2 | 2 | 2038 | ngo |
| snb.ch | 2 | 2 | 2 | 228 | g-e |
| bancaditalia.it | 2 | 2 | 2 | 351 | g-e |
| dnb.nl | 2 | 2 | 2 | 804 | g-e |
| bankofengland.co.uk | 2 | 2 | 2 | 2297 | g-e |
| bundesbank.de | 2 | 2 | 2 | 2281 | g-e |
| bankofgreece.gr | 2 | 2 | 2 | 13 | g-e |
| bofi.fi | 2 | 2 | 2 | 581 | g-e |
| banque-france.fr | 2 | 2 | 2 | 931 | g-e |
| bde.es | 2 | 2 | 2 | 1155 | g-e |
| oenb.co.at | 2 | 2 | 2 | 226 | g-e |
| whitehouse.gov | 1 | 1 | 1 | 5243 | g-us |
| nsf.gov | 1 | 1 | 1 | 10456 | g-us |
| dhhs.gov | 1 | 1 | 1 | 10353 | g-us |
| usda.gov | 1 | 1 | 1 | 40893 | g-us |
| nasa.gov | 1 | 1 | 1 | 147749 | g-us |
| epa.gov | 1 | 1 | 1 | 202381 | g-us |
| ftc.gov | 1 | 1 | 1 | 12177 | g-us |
| ustreas.gov | 1 | 1 | 1 | 36134 | g-us |
| dol.gov | 1 | 1 | 1 | 13649 | g-us |
| state.gov | 1 | 1 | 1 | 36037 | g-us |
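The query construction and the 1/2 coding described before the table could be sketched as follows. The helper names are hypothetical, the quotation marks in the report's command are read here as placeholder marks (an assumption), and the link pattern `example.gov` is invented for illustration, since the report does not list the link targets that were actually queried.

```python
def triple_helix_query(host, link_pattern):
    """Search string for pages on `host` that contain links to `link_pattern`."""
    return f"host:{host} AND link:{link_pattern}"

def incoming_links_query(host):
    """Search string for pages anywhere on the web that link to `host`."""
    return f"link:{host}"

def presence_code(hit_count):
    """The report's coding: 1 if links were found, 2 if none were found."""
    return 1 if hit_count > 0 else 2

# Hypothetical usage; "example.gov" is a made-up link pattern.
query = triple_helix_query("unizh.ch", "example.gov")
incoming = incoming_links_query("unizh.ch")
codes = [presence_code(hits) for hits in (57, 0, 3)]
```

The hit counts passed to `presence_code` are made-up numbers; in the study they would have been the result counts returned by the search engine for each query.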
Denoting Gov=Government, Uni=University and Ind=Industry, the three Triple Helix variables take the following six categories (respectively):
Breaking the range of the number of links towards the considered sdws into regular subintervals, the Incoming Links variable takes the following three categories:
Moreover, the types of sdws were codified in the following categories:
RESULTS
The above data were processed with the Minitab software in order to carry out a multiple correspondence analysis. In this analysis, the categories of our variables are represented by their coordinates on a two-dimensional plane. The closer they are, the more they are inter-related. In this way, we can observe clusters of their correlations. Moreover, the axes of the plane diagrams can be interpreted through the relative positions of the variables along them.
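Multiple correspondence analysis usually starts from an indicator matrix with one 0/1 column per category of each variable. The sketch below shows only this preparatory coding step, not the analysis itself, for four rows taken from the data table above; the function name `indicator_matrix` is hypothetical, and nothing here reproduces what Minitab does internally.

```python
def indicator_matrix(rows, variables):
    """Expand categorical variables into one 0/1 column per observed category."""
    categories = []
    for var in variables:
        seen = sorted({row[var] for row in rows})
        categories.extend((var, value) for value in seen)
    matrix = [[1 if row[var] == value else 0 for var, value in categories]
              for row in rows]
    return categories, matrix

# Four sdws's from the data table (1 = links present, 2 = links absent).
rows = [
    {"Domain": "unizh.ch", "Government": 1, "University": 1, "Industry": 1},
    {"Domain": "uniroma1.it", "Government": 2, "University": 1, "Industry": 1},
    {"Domain": "duth.gr", "Government": 1, "University": 2, "Industry": 2},
    {"Domain": "sorbonne.fr", "Government": 2, "University": 2, "Industry": 2},
]
variables = ["Government", "University", "Industry"]
categories, matrix = indicator_matrix(rows, variables)
```

Each row of `matrix` contains exactly one 1 per variable, so the three Triple Helix variables yield six columns, matching the six categories mentioned above.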

Plot 4
Plot 4 gives the result of a multiple correspondence analysis, in which the principal variables were the Triple Helix variables and the other two were the complementary variables. This result is represented on the plane of the first two factorial axes. From the relative positions of the 6 Triple Helix categories, we can clearly deduce the following:
Let us also remark that the categories of Incoming Links grow along the horizontal axis, similarly to the development of the Triple Helix activities.
Furthermore, the vertical axis of Plot 4 yields some minor variations, which tend to support the following:

Plot 5
Plot 5 gives the result of a second multiple correspondence analysis, in which the principal variables were the Triple Helix and the Incoming Links, while the types were the complementary variable. This result is represented on the plane of the first two factorial axes. Here we observe the same information that was extracted from Plot 4. However, this plot gives a stronger interpretation of the vertical axis in terms of the categories of Medium and High Links:
In addition, another interesting result manifested in this plot refers to the relative position of certain types of sdws with respect to positive or negative Triple Helix activity:
REFERENCES
Abraham, Ralph, H. (1996a): Webometry. Chronotopography of the World Wide Web. http://www.vismath.org/ralph/articles/MS%2389.Web3/
Abraham, Ralph, H. (1996): Webometry: measuring the complexity of the World Wide Web. http://www.vismath.org/ralph/articles/MS%2385.Web1/
Almind, Tomas, C.; Ingwersen, Peter (1997): Informetric analyses on the world wide web: Methodological Approaches to 'Webometrics'. In: Journal of Documentation, Vol 53:4, 404-426.
Bar-Ilan, Judit (1997): The 'Mad Cow Disease', Usenet newsgroups and bibliometric laws. In: Scientometrics, Vol 39:1, 29-55.
Bellardo, T. (1980): The use of co-citations to study science. Library Research, 2, 231-237.
Callon, M.; Courtial, J.P.; Laville, F. (1991): Co-word Analysis as a tool for describing the network of interactions between basic and technological research: The case of polymer chemistry. In: Scientometrics, Vol 22:1, 155-205.
Cano, V.; Lind, N.C. (1991): Citation life cycles of ten citation classics. In: Scientometrics, Vol 22:2, 297-312.
Christensen, F.Hjortgaard; Ingwersen, P.; Wormell, Irene (1997): Online Determination of the Journal Impact Factor and its international Properties. In: Scientometrics, Vol 40:3, 529-540.
Fujigaki, Yuko (1998): The Citation System: Citation networks as repeatedly focusing on difference, continuous re-evaluation, and as persistent knowledge accumulation. In: Scientometrics, Vol 43:1, 77-85.
Fujigaki, Yuko; Nagata, Akiya (1998): Concept evolution in science and technology policy: the process of change in relationships among university, industry and government. In: Science and Public Policy, 387-395.
Gilbert, Nigel, G. (1977): References as Persuasion. In: Social Studies of Science, Vol 7, 113-122.
Griffith, B.C.; Small, H.G.; Stonehill, Judith, A.; Dey, Sandra (1974): The structure of scientific literatures. II: Toward a macro and microstructure for science. Science Studies, 4, 339-365.
Ingwersen, Peter (1998): The Calculation of Web Impact Factors. In: Journal of Documentation, Vol 54:2, 236-243.
Ingwersen, Peter; Christensen, Finn Hjortgaard (1997): Data Set Isolation for Bibliometric Online Analyses of Research Publications: Fundamental Methodological Issues. In: Journal of the American Society for Information Science. Vol 48:3, 205-217.
Jones, Steve (1999): Doing Internet Research. Critical Issues and Methods for Examining the Net. Sage.
Jones, Steve (1999): Studying the Net: Intricacies and Issues. In: Jones, Steve (1999): Doing Internet Research. Critical Issues and Methods for Examining the Net. Sage, 1-29.
Lenoir, Timothy (1979): Quantitative Foundations for the Sociology of Science: On linking Blockmodeling with Co-Citation Analysis. In: Social Studies of Science, Vol 9, 455-480.
Leydesdorff, Loet (1997): Why Words and Co-words Cannot Map the development of the Sciences. In: Journal of the American Society for Information Science. Vol 48:5, 418-427.
Leydesdorff, Loet; Etzkowitz, Henry (1998): Triple Helix of innovation: introduction. In: Science and Public Policy, Vol 25:6.
Price, D.J. de Solla (1963): Little Science, Big Science. New York: Columbia University Press.
Rip, A.; Courtial, J.-P. (1984): Co-word Maps of Biotechnology: An example of cognitive scientometrics. In: Scientometrics Vol 6:6, 381-400.
Small, H.G.; Griffith, B.C. (1974): The structure of scientific literatures. I: Identifying and graphing specialties. In: Science Studies, 4, 17-40.
Small, H.G. (1973): Co-citation in the scientific literature: A new measure of the relationship between two documents. In: Journal of the American Society for Information Science, 24, 265-269.
White, H.D.; Griffith, B.C. (1981): Author cocitation: A literature measure of intellectual structure. In: Journal of the American Society for Information Science, 32, 163-171.
Woodruff, Allison; Aoki, Paul, M.; Brewer, Eric; Gauthier, Paul; Rowe, Lawrence A. (1996): An Investigation of Documents from the World Wide Web. Fifth International World Wide Web Conference, May 6-10, 1996, Paris, France. http://www5conf.imria.fr/fich_html/papers/P7/Overview.html
Wormell, Irene (1998): Informetrics: an emerging subdiscipline in information science. In: Asian Libraries, Vol 7:10, 257-268.
Wormell, Irene (1998a): Informetric Analysis of the international impact of scientific journals: How 'international' are the international journals? In: Journal of Documentation, Vol 54:5, 584-605.