The article addresses a significant limitation of the classic TF-IDF method in the context of topic modeling for specialized text corpora. While effective at creating structurally distinct document clusters, TF-IDF often fails to produce semantically coherent topics. This shortcoming stems from its reliance on the bag-of-words model, which ignores semantic relationships between terms, leading to orthogonal and sparse vector representations. This issue is particularly acute in narrow domains where synonyms and semantically related terms are prevalent. To overcome this, the authors propose a novel approach: a contextual-diffusion method for enriching the TF-IDF matrix.
The core of the proposed method is an iterative procedure of contextual smoothing over a directed graph of semantic proximity, built from an asymmetric measure of term co-occurrence. This procedure assigns weight not only to the words actually present in a document but also to their semantic neighbors, thereby capturing the contextual "halo" of each concept.
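The abstract does not give the exact update rule, so the following is only an illustrative sketch of what such a contextual-diffusion step could look like: an iterative interpolation between the original TF-IDF matrix and its propagation over a row-normalized term-proximity graph. The function name, the hyperparameters alpha and n_iter, and the row-normalization choice are assumptions for illustration, not details taken from the article.

```python
import numpy as np

def contextual_diffusion(tfidf, cooc, alpha=0.3, n_iter=3):
    """Smooth a documents-by-terms TF-IDF matrix over a semantic-proximity graph.

    tfidf : (n_docs, n_terms) array of TF-IDF weights
    cooc  : (n_terms, n_terms) asymmetric co-occurrence weights (directed graph)
    alpha, n_iter : assumed hyperparameters, not specified in the article
    """
    # Row-normalize the co-occurrence matrix so each row gives transition-like
    # weights from a term to its semantic neighbors.
    row_sums = cooc.sum(axis=1, keepdims=True)
    S = np.divide(cooc, row_sums,
                  out=np.zeros_like(cooc, dtype=float),
                  where=row_sums > 0)

    W = tfidf.astype(float)
    for _ in range(n_iter):
        # Keep a share of the original TF-IDF signal and spread the rest onto
        # semantic neighbors, so a document gains non-zero weight on related
        # terms it never contains explicitly.
        W = (1 - alpha) * tfidf + alpha * (W @ S)
    return W
```

Under this reading, alpha controls how far weight diffuses into the semantic neighborhood, while keeping the (1 - alpha) share anchored to the original bag-of-words evidence.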
The method was tested on a corpus of news texts from the highly specialized field of atomic energy. A comparative analysis was conducted using a set of clustering and semantic metrics, such as the silhouette coefficient and topic coherence. The results show that the new approach, while slightly reducing traditional measures of structural separation, substantially improves the coherence and diversity of the extracted topics. This enables a shift from purely statistical clustering towards the identification of semantically cohesive and interpretable themes, which is crucial for monitoring and analyzing large volumes of text in specialized domains.
Keywords: topic modeling, latent Dirichlet allocation, TF-IDF, contextual diffusion, semantic proximity, co-occurrence, text vectorization, bag-of-words model, topic coherence, natural language processing, silhouette coefficient, text data analysis