Popular datasets for multi-label text classification

There are several popular datasets available for multi-label text classification. Here are a few examples:

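  1. Reuters-21578: This dataset contains 21,578 newswire articles published by Reuters in 1987. Each article is tagged with zero or more labels from a predefined set of topics, so many documents carry a single label and some carry several. It is one of the oldest and most widely used benchmarks for multi-label classification research.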

  2. Reuters Corpus Volume I (RCV1): Like Reuters-21578, RCV1 consists of news articles from Reuters, but it is much larger. It contains over 800,000 documents, each typically labeled with several topic codes drawn from a category hierarchy.

  3. Amazon Reviews: This dataset includes product reviews from the Amazon e-commerce platform. Each review is tied to a product that can belong to several categories, such as electronics, books, or clothing, so the category labels can be used for multi-label product categorization. The reviews also carry star ratings, which are commonly used for sentiment analysis.

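  4. Yahoo News Classification: This dataset contains news articles from the Yahoo News website, where each article is assigned one or more labels from a set of categories such as sports, politics, and entertainment. Check the labeling scheme of the specific release before using it: some Yahoo-derived benchmarks, such as the Yahoo! Answers topic classification dataset, are single-label.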

  5. Stack Overflow: Stack Overflow is a popular question-and-answer website for programming-related topics. The dataset includes questions asked on the site, where each question carries up to five tags covering programming languages, frameworks, libraries, and tools. It can be used for multi-label tag prediction in the programming domain.

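  6. DBpedia: DBpedia is a dataset of structured information extracted from Wikipedia, where each entry is classified using the DBpedia ontology. It covers a wide range of topics. Note that the commonly used 14-class DBpedia benchmark assigns exactly one class per document, so a multi-label setup requires a variant that keeps multiple ontology classes or categories per entry.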

These datasets provide a good starting point for multi-label text classification research and are widely used in the NLP community. Most can be downloaded freely for experimentation and model development, though some (RCV1, for example) require accepting a license agreement before the raw text can be obtained.
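
Whichever dataset is chosen, the label sets usually need to be converted into a binary indicator matrix (one row per document, one column per label) before a model can be trained. The following is a minimal sketch of that step in plain Python. The sample questions and tags are made up for illustration in the style of Stack Overflow data and are not taken from any real dataset, and the helper names (build_label_index, to_indicator_matrix, label_cardinality) are hypothetical.

```python
# Toy, invented examples in the style of tagged questions (not real dataset rows).
samples = [
    ("How do I merge two dictionaries?", {"python", "dictionary"}),
    ("Why is my coroutine never awaited?", {"python", "asyncio"}),
    ("How to center a div?", {"css", "html"}),
    ("Parse JSON in the browser", {"javascript", "json"}),
]


def build_label_index(samples):
    """Map each distinct label to a column index, in sorted order."""
    labels = sorted({label for _, tags in samples for label in tags})
    return {label: i for i, label in enumerate(labels)}


def to_indicator_matrix(samples, label_index):
    """Return one 0/1 row per document, with a 1 for every label it carries."""
    matrix = []
    for _, tags in samples:
        row = [0] * len(label_index)
        for label in tags:
            row[label_index[label]] = 1
        matrix.append(row)
    return matrix


def label_cardinality(matrix):
    """Average number of labels per document, a common dataset statistic."""
    return sum(sum(row) for row in matrix) / len(matrix)


label_index = build_label_index(samples)
Y = to_indicator_matrix(samples, label_index)

print(label_index)           # label -> column index
print(Y)                     # binary indicator matrix
print(label_cardinality(Y))  # average labels per document
```

In practice, a library utility such as scikit-learn's MultiLabelBinarizer performs the same encoding. The label cardinality is worth checking early: a value close to 1 suggests the dataset is effectively single-label.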