Hybrid text mining models for investigative keyword expansion on child sexual abuse in the dark web

AI Summary

Dark web anonymity and fragmentation hinder CSAM detection; the authors collected 2,414 Torch-indexed pages for empirical analysis.
Eigenvector Centrality outperformed TF-IDF and Word2Vec in precision and contextual relevance by identifying structurally central terms in co-occurrence networks.
A hybrid model combining Eigenvector Centrality with Word2Vec yielded the best keyword expansion, improving automated detection and the efficiency of scalable investigations.

PLoS One. 2026 May 8;21(5):e0344470. doi: 10.1371/journal.pone.0344470. eCollection 2026.

ABSTRACT

The distribution of child sexual abuse materials (CSAM) via the dark web continues to hinder digital investigations because of the network’s inherent anonymity and fragmentation. This work presents a comparative analysis of text mining techniques for extracting investigative keywords from CSAM-related content on the dark web and aims to establish a foundation for scalable, expandable keyword-based detection. Using a custom crawler, we collected data from 2,414 dark web pages indexed by the Torch search engine. On this dataset, three methods (TF-IDF, Eigenvector Centrality, and Word2Vec) were applied to extract CSAM-related keywords, and their effectiveness was evaluated through dark web search experiments that measured retrieval performance for CSAM-related sites. Among the individual techniques, Eigenvector Centrality, a graph-based keyword ranking algorithm, showed the highest precision and contextual relevance by identifying structurally central terms within co-occurrence networks. Building on this, we developed hybrid models that combined Eigenvector Centrality with either TF-IDF or Word2Vec. In particular, the model integrating Eigenvector Centrality with Word2Vec-based semantic similarity proved most effective in expanding investigative clues and retrieving highly relevant keywords. This work differs from prior studies in that it uses empirically collected, domain-specific dark web data to demonstrate a multi-method approach that not only improves keyword accuracy but also enables the dynamic expansion of early-stage crime indicators. The proposed methodology offers practical value for automating the detection of illicit content and improving the operational efficiency of cyber investigations.
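
The idea of ranking terms by their structural centrality in a co-occurrence network can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state how co-occurrence was defined, so the sketch assumes document-level co-occurrence, uses a deliberately neutral toy corpus, and introduces the hypothetical helper names `cooccurrence_weights` and `eigenvector_centrality`.

```python
import math
from collections import defaultdict
from itertools import combinations


def cooccurrence_weights(docs):
    """Edge weight = number of documents in which two terms appear together.

    Assumption: document-level co-occurrence; the source does not specify
    the window or unit the authors used.
    """
    weights = defaultdict(float)
    for tokens in docs:
        for a, b in combinations(sorted(set(tokens)), 2):
            weights[(a, b)] += 1.0
    return weights


def eigenvector_centrality(weights, iterations=500, tol=1e-10):
    """Power iteration on the weighted, undirected co-occurrence graph."""
    neighbors = defaultdict(dict)
    for (a, b), w in weights.items():
        neighbors[a][b] = w
        neighbors[b][a] = w

    scores = {node: 1.0 for node in neighbors}
    for _ in range(iterations):
        # Iterating with (A + I) instead of A keeps the same eigenvectors
        # but avoids oscillation on bipartite-like graphs.
        updated = {
            node: scores[node] + sum(w * scores[m] for m, w in adj.items())
            for node, adj in neighbors.items()
        }
        norm = math.sqrt(sum(v * v for v in updated.values()))
        updated = {node: v / norm for node, v in updated.items()}
        delta = max(abs(updated[n] - scores[n]) for n in updated)
        scores = updated
        if delta < tol:
            break
    return scores


# Neutral toy corpus (hypothetical tokens, already tokenized).
docs = [
    ["forum", "market", "link"],
    ["forum", "market", "invite"],
    ["forum", "mirror", "link"],
    ["market", "forum", "escrow"],
]

scores = eigenvector_centrality(cooccurrence_weights(docs))
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[:2])
```

A term scores highly here when it co-occurs with other well-connected terms, not merely when it is frequent, which is one plausible reading of why a centrality ranking can surface more contextually relevant keywords than raw term weighting.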

PMID:42102029 | DOI:10.1371/journal.pone.0344470
