Tag cct
You're referring to CC-News, the news dataset published by Common Crawl!
CC-News is a large-scale dataset of news articles, crawled from news sites around the world and published by the Common Crawl organization. It is separate from the main Common Crawl corpus of general web pages. It's a popular dataset for natural language processing (NLP) and information retrieval tasks, such as text classification, sentiment analysis, and topic modeling.
Here are some key features of the CC-News dataset:
- Scale: CC-News is much smaller than the main Common Crawl corpus, which spans billions of web pages. It has been collected continuously since 2016 and keeps growing as new crawl files are published.
- Format: The data is distributed as gzip-compressed WARC (Web ARChive) files, not JSON. Each response record holds the raw HTTP response for a single crawled page, so article text has to be extracted from the HTML.
- Content: The dataset focuses on news articles from news sites around the world, rather than general web content such as blogs or forums.
- Language: The dataset is multilingual. English-language content is well represented, but many other languages are included.
- Timestamps: Each record includes a WARC-Date header indicating when the page was crawled.
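To make the record layout concrete, here is a minimal sketch of reading one record and its crawl timestamp. CC-News files are WARC files, where each record is a block of `Name: value` headers followed by a payload of `Content-Length` bytes. The sketch uses only the standard library and a hand-written sample record; the URL and HTML in `SAMPLE` are made up for illustration, and `parse_warc_record` is a hypothetical helper, not part of any Common Crawl tooling. It also assumes an uncompressed stream, whereas the published files are gzip-compressed.

```python
import io
from datetime import datetime, timezone


def parse_warc_record(stream):
    """Read one uncompressed WARC record from a binary stream.

    Returns (headers, payload), or None when the stream is exhausted.
    """
    # Skip the blank lines that separate consecutive records.
    line = stream.readline()
    while line in (b"\r\n", b"\n"):
        line = stream.readline()
    if not line:
        return None
    if not line.startswith(b"WARC/"):
        raise ValueError("not a WARC record: %r" % line)

    # Header lines run until the first empty line.
    headers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            break
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip()] = value.strip()

    # The payload is exactly Content-Length bytes long.
    payload = stream.read(int(headers["Content-Length"]))
    return headers, payload


def crawl_time(headers):
    """Convert the WARC-Date header (UTC, ISO 8601) to a datetime."""
    parsed = datetime.strptime(headers["WARC-Date"], "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


# Made-up sample record, for illustration only.
BODY = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html><title>Example story</title></html>"
)
SAMPLE = b"".join([
    b"WARC/1.0\r\n",
    b"WARC-Type: response\r\n",
    b"WARC-Date: 2020-01-15T08:30:00Z\r\n",
    b"WARC-Target-URI: https://news.example.com/story\r\n",
    b"Content-Length: " + str(len(BODY)).encode("ascii") + b"\r\n",
    b"\r\n",
    BODY,
    b"\r\n\r\n",
])

headers, payload = parse_warc_record(io.BytesIO(SAMPLE))
print(headers["WARC-Target-URI"], crawl_time(headers).isoformat())
```

For real files, an existing WARC library such as warcio is likely a better choice than a hand-rolled parser, since it handles gzip compression and the edge cases of the format.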
The CC-News dataset is widely used in research and industry applications, including:
- Information retrieval: The dataset is used to develop and evaluate information retrieval systems, such as search engines and recommender systems.
- Natural language processing: The dataset is used to train and evaluate NLP models, such as language models, sentiment analysis models, and topic models.
- Data mining: The dataset is used to extract insights and patterns from news coverage, such as identifying trends and sentiment shifts over time.
If you're interested in using the CC-News dataset, you can download it over HTTPS from data.commoncrawl.org or read it directly from Common Crawl's public S3 bucket on AWS, under s3://commoncrawl/crawl-data/CC-NEWS/.
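As a rough sketch of how download locations could be built: CC-News files are served over HTTPS from data.commoncrawl.org under the `crawl-data/CC-NEWS/` prefix. The per-month `warc.paths.gz` listing and the `CC-NEWS-<timestamp>-<serial>.warc.gz` file naming used below are assumptions about the bucket layout that should be checked against Common Crawl's documentation. The function names and the example path are hypothetical, and the code performs no network access.

```python
import re
from datetime import datetime, timezone

BASE_URL = "https://data.commoncrawl.org/"

# Assumed file naming: CC-NEWS-<YYYYMMDDhhmmss>-<serial>.warc.gz
_NAME_PATTERN = re.compile(r"CC-NEWS-(\d{14})-(\d+)\.warc\.gz$")


def monthly_listing_url(year, month):
    """URL of the (assumed) per-month listing of WARC file paths."""
    return f"{BASE_URL}crawl-data/CC-NEWS/{year:04d}/{month:02d}/warc.paths.gz"


def warc_url(path):
    """Turn a bucket-relative path into an HTTPS download URL."""
    return BASE_URL + path.lstrip("/")


def file_timestamp(path):
    """Extract the UTC timestamp embedded in a CC-News file name."""
    match = _NAME_PATTERN.search(path)
    if match is None:
        raise ValueError(f"unexpected CC-News file name: {path}")
    parsed = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


# Hypothetical path, for illustration only.
example = "crawl-data/CC-NEWS/2020/01/CC-NEWS-20200115083000-00042.warc.gz"
print(monthly_listing_url(2020, 1))
print(warc_url(example))
print(file_timestamp(example).isoformat())
```

Keeping URL construction separate from downloading makes it easy to select files by date range before fetching anything, which matters because each file is large.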