Project leader
George Kampis, PhD, DSc


University Press Ltd.
Team leader: George Kampis, PhD, DSc

György Fábri, PhD, CSc

László Gulyás, PhD

Sándor Soós, PhD

Balázs Bálint, MSc

Zalán Szakolczi, BSc

Zoltán Szászi, BSc

HCCI Research Institute of Economics and Enterprises
Team leader: István János Tóth, PhD

Ágnes Czibik, MSc
Ágnes Makó, MSc
Tamás Uhrin, MSc

Zoltán Várhalmi, MSc

Glia Computer Consulting Ltd.
Team leader: Attila Bencsik, MSc

Rita Ádám, MSc

Henriett Bagi, BSc

István Gráf, MSc

Computer and Automation Research Institute of the Hungarian Academy of Sciences
Team leader: András Benczúr, PhD

Eötvös Loránd University
Team leader: Tamás Vicsek, member of the Hungarian Academy of Sciences

Dániel Ábel, MSc
András Barta, PhD

Krisztina Botos, MSc

Illés Farkas, PhD
Máté Nagy, MSc
Péter Pollner, PhD
Katharina A. Zweig, PhD

University of Szeged
Team leader: János Csirik, PhD, DSc

Gábor Berend, MSc

Richárd Farkas, PhD

Márk Jelasity, PhD

István Nagy


Use Cases



Tracking and analyzing complex trends in the R&D sector (UnivPress)


The use case addressing science policy support is an integrative experiment that combines state-of-the-art text mining, scientometrics (bibliometrics) and cybermetrics. The main aim is to detect and analyze the dynamics of the relationships between science policy, R&D developments and research financing. TEXTREND is intended to provide effective tools for the simultaneous analysis of developments in scientific communication (e.g. research fronts) and in R&D policy (with special emphasis on trends in research funding). The system is designed to mine the potentially latent relations between these aspects, supporting the decision-making process with respect to the R&D sector.

The application utilizes the concept of "domain analysis", which combines knowledge discovery techniques from both structured and unstructured information mining. The former is typically instantiated by the mining of scholarly databases (ISI WoS, Scopus, etc.) or funding databases (e.g. CORDIS), while the latter includes the analysis of full-text corpora harvested from the scholarly web. The use case developed for TEXTREND integrates these methods in a complementary manner, to gain a multi-faceted yet well-integrated picture informed by many sources and methods (social network analysis, topic modelling and tracking, bibliometrics, etc.). The prototypical components of this approach, ranging from structured to highly unstructured data processing, are listed in the table below.

Dimension | Basic method | Variable | Interpretation
Analysis of references | Co-reference | Author (ACA), source | Intellectual background
Inter-citation analysis | Citation network | Author, source, affiliation | Organization of the field
Authorship patterns | Co-authorship | Author, affiliation | Social network analysis
Keywords and classifications | Co-word analysis | Categories of keywords | Research fronts
Analysis of abstracts and full text | Topic modelling and tracking | Extracted characteristic concepts | Topical trends
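
As a minimal sketch of the co-word and co-authorship rows of the table, the counting step behind both can be written as a pairwise co-occurrence count. This is an illustration only, not the TEXTREND implementation: the function name co_occurrence and the sample keyword lists are hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_occurrence(records):
    """Count how often each unordered pair of items appears in the same record."""
    pairs = Counter()
    for items in records:
        # Deduplicate and sort so that (a, b) and (b, a) map to the same key.
        for a, b in combinations(sorted(set(items)), 2):
            pairs[(a, b)] += 1
    return pairs


# Hypothetical keyword lists, one per paper (illustrative data only).
papers = [
    ["plasticity", "phenotype", "evolution"],
    ["phenotype", "plasticity"],
    ["evolution", "genetics"],
]

co_words = co_occurrence(papers)
for pair, count in co_words.most_common(3):
    print(pair, count)
```

Passing author lists instead of keyword lists yields co-authorship counts in the same way; the resulting pair counts can then serve as weighted edges of a network.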

The toolkit allows the user to conduct cross-sectional, comparative and longitudinal (dynamics-centered) analyses. Emphasis is placed on informative visualization: the output is intended to be readily interpretable and to convey a rich set of information to the analyst ("visual analytics"). The types of visualization are demonstrated below, grouped by the technical method of analysis. (The particular figures come from ongoing research on a currently popular topic in supraindividual biology, namely phenotypic plasticity.)

Descriptive statistics
Dimension reduction (PCA)
Dimension reduction (MDS)
Cluster analysis
Analysis of networks
Latent semantic analysis
Formal Concept Analysis


Tracking market trends and processes based on text and web mining (GVI)


The main goal of the use case is to make reviewing and analyzing economic and social topics more effective for analysts, policy makers and other stakeholders. The application is designed to support decision making in economic and public policy, in both the governmental and the private sector.

The use case sets up a framework in which the on-line content of popular and scholarly journals and of think tanks is harvested and filtered, yielding documents relevant to the topic under study. These corpora are then subjected to information extraction with state-of-the-art natural language processing techniques, including named entity recognition and keyword extraction. Combined with a diverse set of metadata, the results allow the interrelations and dynamics of topics to be tracked and analyzed.
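
The keyword extraction step can be illustrated with a small TF-IDF ranking. The source does not say which extraction technique is used, so TF-IDF is a stand-in chosen for simplicity; the stopword list, the function names and the sample sentences are all hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not part of the project).
STOPWORDS = {"the", "a", "to", "and"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def top_keywords(docs, k=2):
    """Return the k highest-scoring TF-IDF terms for each document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    results = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {t: (c / len(tokens)) * math.log(n / df[t]) for t, c in tf.items()}
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:k])
    return results


# Hypothetical harvested sentences (illustrative data only).
docs = [
    "The ministry awarded the tender to a local company.",
    "The court examined the tender and the bribery charge.",
    "The company denied the bribery charge.",
]
keywords = top_keywords(docs)
print(keywords)
```

Terms that occur in every document receive a score of zero, so the ranking favours words that distinguish one article from the rest of the corpus.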

The use case supports the joint research conducted by the use case owner, GVI, and the Corvinus University of Budapest on corruption as reported in the national media. The figures below show some of the results and illustrate the text mining approach.

Distribution of articles and cases by source

Distribution of articles by month of publication, 2006-2007 (N = 737)

Distribution of cases under study by types of institutions involved, 2006-2007.

Distribution of cases by case typology, 2006-2007.


Trend and sentiment analysis of the blogosphere based on text mining and social network analysis (Glia)


In recent years, developments in on-line communication have fundamentally reshaped
– the public media, and the use of the public media,
– the habits in the communities related to the public media,
– the use of information sources,
– the time of reaction of the media and
– the patterns and conventions in information processing.
These developments are, by now, fundamental factors in both the business and the governmental sector: traditional channels of PR communication have become much less effective, while on-line information flows significantly shorten the reaction time available to actors in the market. At the same time, traffic at online community platforms, and their indirect effect on public opinion, is increasing dramatically. As several examples show, the content of on-line platforms often finds its way into the official media, which makes it necessary for stakeholders to monitor these discussions. Traditional means, however, are insufficient to cope with such volumes of information: the goal of the use case is to provide a service solution for this problem.

Beyond searching and navigation, the application is designed to support the representation, analysis and contrasting of opinions that are present in the blogosphere. For a given topic of interest, the application is to be capable of structuring and classifying the related textual information. One of the main goals is to implement sentiment analysis in blogs, i.e. the evaluation of opinions in terms of attitudes (positive, negative, neutral) with respect to some topic, as well as to provide customizable representations of the relationships between the queried contents.
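
The three-way attitude classification described above can be sketched with a simple word-list approach. This is a deliberately naive illustration under stated assumptions: the source does not describe how sentiment is computed, and the word lists, the function name classify_attitude and the sample posts are hypothetical.

```python
import re

# Tiny illustrative sentiment lexicons (assumptions, not project resources).
POSITIVE = {"good", "great", "excellent", "love", "reliable"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "broken"}


def classify_attitude(post, topic):
    """Classify the attitude of a post towards a topic.

    Only sentences that mention the topic are scored. Returns 'positive',
    'negative' or 'neutral', or None if the topic is never mentioned.
    """
    score = 0
    mentioned = False
    for sentence in re.split(r"[.!?]+", post.lower()):
        words = re.findall(r"[a-z']+", sentence)
        if topic.lower() not in words:
            continue
        mentioned = True
        score += sum(w in POSITIVE for w in words)
        score -= sum(w in NEGATIVE for w in words)
    if not mentioned:
        return None
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify_attitude("The new phone is great. The weather is terrible.", "phone"))
```

Restricting the score to sentences that mention the topic keeps unrelated opinions in the same post from affecting the result; a production system would also need to handle negation, multi-word topics and languages other than English.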

The application is used in three areas: (1) an on-line service capable of searching blogs and highlighting evaluations and opinions with respect to a queried topic; (2) an expert service that meets custom requirements and provides analyses based on the software implementation (analysis of the discourse about political questions, trends in public opinion on particular products, etc.); (3) a blog marketing service (for blog service providers), including the detection of target groups interested in a particular topic and the utilization of the business opportunities created by the interest in those topics.

The growth rate of the size of the blogosphere

Monitoring online community platforms
Detailed search engine interface
Blog visualization technologies 1.
Blog visualization technologies 2.