  • Contextual-Diffusion Method for Enriching the TF-IDF Matrix to Enhance Semantic Coherence of Topic Models in News Text Corpora

    The article addresses a significant limitation of the classic TF-IDF method in the context of topic modeling for specialized text corpora. While effective at creating structurally distinct document clusters, TF-IDF often fails to produce semantically coherent topics. This shortcoming stems from its reliance on the bag-of-words model, which ignores semantic relationships between terms, leading to orthogonal and sparse vector representations. This issue is particularly acute in narrow domains where synonyms and semantically related terms are prevalent. To overcome this, the authors propose a novel approach: a contextual-diffusion method for enriching the TF-IDF matrix.
    The core of the proposed method involves an iterative procedure of contextual smoothing based on a directed graph of semantic proximity, built using an asymmetric measure of term co-occurrence. This process effectively redistributes term weights not only to the words present in a document but also to their semantic neighbors, thereby capturing the contextual "halo" of concepts.
    The method was tested on a corpus of news texts from the highly specialized field of atomic energy. A comparative analysis was conducted using a set of clustering and semantic metrics, such as the silhouette coefficient and topic coherence. The results demonstrate that the new approach, while slightly reducing traditional metrics of structural clarity, drastically enhances the thematic coherence and diversity of the extracted topics. This enables a shift from mere statistical clustering towards the identification of semantically integral and interpretable themes, which is crucial for tasks involving the monitoring and analysis of large textual data in specialized domains.

    Keywords: topic modeling, latent Dirichlet allocation, TF-IDF, contextual diffusion, semantic proximity, co-occurrence, text vectorization, bag-of-words model, topic coherence, natural language processing, silhouette coefficient, text data analysis
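
The smoothing procedure described in the abstract above can be illustrated with a minimal sketch. The abstract does not give the authors' co-occurrence measure, update rule, or parameter values, so everything here is an assumption: the row-normalized document co-occurrence matrix stands in for the asymmetric proximity measure, the update X_{t+1} = (1 - beta) * X_0 + beta * X_t P is one common diffusion scheme, and the function names, `beta`, `n_iter`, and the toy matrix are hypothetical.

```python
import numpy as np

def cooccurrence_transition(X):
    """Row-stochastic term-to-term matrix built from document co-occurrence.

    P[i, j] is the share of term i's co-occurrences that involve term j,
    so in general P[i, j] != P[j, i] (the graph is directed).
    This is an assumed measure, not necessarily the one used in the article.
    """
    B = (np.asarray(X) > 0).astype(float)  # binary document-term incidence
    C = B.T @ B                            # C[i, j] = documents containing both terms
    np.fill_diagonal(C, 0.0)               # ignore self co-occurrence
    isolated = np.flatnonzero(C.sum(axis=1) == 0)
    C[isolated, isolated] = 1.0            # terms without neighbours keep their own weight
    return C / C.sum(axis=1, keepdims=True)

def diffuse(X, P, beta=0.3, n_iter=5):
    """Iterate X_{t+1} = (1 - beta) * X_0 + beta * X_t @ P (assumed update rule)."""
    X0 = np.asarray(X, dtype=float)
    Xt = X0.copy()
    for _ in range(n_iter):
        Xt = (1.0 - beta) * X0 + beta * (Xt @ P)
    return Xt

# Toy weights standing in for a TF-IDF matrix (rows: documents, columns: terms).
terms = ["reactor", "nuclear", "atomic", "weather"]
X = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])

P = cooccurrence_transition(X)
enriched = diffuse(X, P)
print(np.round(enriched, 3))
```

In this toy example the first document never contains "atomic", yet it receives some weight for that term through the shared neighbour "reactor", while the unrelated term "weather" stays at zero. Because each row of `P` sums to one, the total weight of every document is preserved.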

  • U-shaped distribution of topic intensity in the latent Dirichlet allocation model: distribution density function and parameter identification method

    The article describes and mathematically justifies the U-shaped distribution of topic shares that arises in the latent Dirichlet allocation model with symmetric hyperparameters. It is shown that the bimodal shape is due to the reduction of the Dirichlet vector to a beta distribution, which makes traditional unimodal approximations incorrect. A composite probability model is proposed that combines beta, gamma, and Poisson components, as well as a covariate accounting for semantic connectivity. The model parameters are determined by the differential evolution method using a criterion that includes the Wasserstein distance and the Jensen–Shannon and Kullback–Leibler divergences. Using a corpus of texts from the information field of the Rosatom State Corporation, the authors establish that the new model is more accurate than lognormal, Pareto, exponential, and normal approximations, allowing for reliable characterization of thematic flows and supporting decisions in large text data monitoring tasks.

    Keywords: system analysis, latent Dirichlet allocation, topic modeling, Dirichlet distribution, topic signal intensity, beta distribution, gamma distribution, Poisson process, Jensen–Shannon divergence, Wasserstein distance, Kullback–Leibler divergence
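
The reduction of the Dirichlet vector to a beta distribution mentioned in the abstract above can be checked numerically. A standard property is that each component of a symmetric Dirichlet vector with K components and hyperparameter alpha follows Beta(alpha, (K - 1) * alpha), which is U-shaped when both parameters are below 1. The sketch below covers only this property, not the authors' composite model; the values of `K`, `alpha`, and `n` are assumptions chosen for illustration, since the abstract does not state them.

```python
import numpy as np

K = 3        # number of topics (assumed)
alpha = 0.2  # symmetric Dirichlet hyperparameter (assumed)
n = 20000    # number of simulated documents (assumed)

rng = np.random.default_rng(0)
# Share of the first topic in each simulated document.
theta = rng.dirichlet([alpha] * K, size=n)[:, 0]

# Moments of the Beta(a, b) marginal implied by the symmetric Dirichlet prior.
a, b = alpha, (K - 1) * alpha
beta_mean = a / (a + b)
beta_var = a * b / ((a + b) ** 2 * (a + b + 1))

# Ten equal-width bins on [0, 1]; a U shape puts most mass in the outer bins.
hist, _ = np.histogram(theta, bins=10, range=(0.0, 1.0))

print("sample mean / beta mean:", round(theta.mean(), 4), round(beta_mean, 4))
print("sample var  / beta var: ", round(theta.var(), 4), round(beta_var, 4))
print("bin counts:", hist)
```

With these settings the outermost bins hold far more documents than the central ones, which is why a unimodal fit such as a normal or lognormal curve would misrepresent the topic shares.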