diff --git a/data/datasets/medium_articles_posts/README.md b/data/datasets/medium_articles_posts/README.md
new file mode 100644
index 0000000000..1b915cc296
--- /dev/null
+++ b/data/datasets/medium_articles_posts/README.md
@@ -0,0 +1,45 @@
+# Medium Articles Posts Dataset
+
+## Description
+
+The Medium Articles Posts dataset contains a collection of articles published on
+the Medium platform. Each article entry includes the article's title, main text,
+URL, authors' names, publication timestamp, and tags or categories.
+
+## Dataset Info
+
+The dataset consists of the following features:
+
+- **title**: _(string)_ The title of the Medium article.
+- **text**: _(string)_ The main content or text of the Medium article.
+- **url**: _(string)_ The URL or link to the Medium article.
+- **authors**: _(string)_ The authors or contributors of the Medium article.
+- **timestamp**: _(string)_ The timestamp or date when the Medium article was
+  published.
+- **tags**: _(string)_ Tags or categories associated with the Medium article.
+
+## Dataset Size
+
+- **Total Dataset Size**: 1,044,746,687 bytes (approximately 1 GB)
+
+## Splits
+
+The dataset contains a single split:
+
+- **Train**:
+  - Number of examples: 192,368
+  - Size: 1,044,746,687 bytes (approximately 1 GB)
+
+## Download Size
+
+- **Compressed Download Size**: 601,519,297 bytes (approximately 600 MB)
+
+### Usage example
+
+```python
+from datasets import load_dataset
+
+# Load the dataset
+dataset = load_dataset("Falah/medium_articles_posts")
+```
diff --git a/data/datasets/medium_articles_posts/__init__.py b/data/datasets/medium_articles_posts/__init__.py
new file mode 100644
index 0000000000..e69de29bb2
diff --git a/data/datasets/medium_articles_posts/load_dataset.py b/data/datasets/medium_articles_posts/load_dataset.py
new file mode 100644
index 0000000000..d8b750a3b8
--- /dev/null
+++ b/data/datasets/medium_articles_posts/load_dataset.py
@@ -0,0 +1,4 @@
+from datasets import load_dataset
+
+# Load the dataset
+dataset = load_dataset("Falah/medium_articles_posts")
diff --git a/data/datasets/medium_articles_posts/requirements.txt b/data/datasets/medium_articles_posts/requirements.txt
new file mode 100644
index 0000000000..7883858ca7
--- /dev/null
+++ b/data/datasets/medium_articles_posts/requirements.txt
@@ -0,0 +1,2 @@
+datasets==2.9.0
+
diff --git a/data/datasets/research_papers_dataset/ReadME.md b/data/datasets/research_papers_dataset/ReadME.md
new file mode 100644
index 0000000000..f927f272f0
--- /dev/null
+++ b/data/datasets/research_papers_dataset/ReadME.md
@@ -0,0 +1,137 @@
+---
+dataset_info:
+  features:
+    - name: title
+      dtype: string
+    - name: abstract
+      dtype: string
+  splits:
+    - name: train
+      num_bytes: 2363569633
+      num_examples: 2311491
+  download_size: 1423881564
+  dataset_size: 2363569633
+---
+
+## Research Paper Dataset 2023
+
+[View this dataset on the Hugging Face Hub](https://huggingface.co/datasets/Falah/research_paper2023)
+
+### Dataset Information:
+
+The "Research Paper Dataset 2023" contains information related to research
+papers. It includes the following features:
+
+- Title (dtype: string): The title of the research paper.
+- Abstract (dtype: string): The abstract of the research paper.
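+
+For a quick look at what these fields contain, the snippet below reads a single
+record (a minimal sketch; `streaming=True` is used here only so one record can
+be inspected without downloading the full ~2.4 GB split, and the field names
+follow the features listed above):
+
+```python
+from datasets import load_dataset
+
+# Stream the train split so a record can be read without a full download
+dataset = load_dataset("Falah/research_paper2023", split="train", streaming=True)
+
+sample = next(iter(dataset))
+print(sample["title"])     # the paper's title string
+print(sample["abstract"])  # the paper's abstract string
+```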
+
+### Dataset Splits:
+
+The dataset is divided into one split:
+
+- Train Split:
+  - Name: train
+  - Number of Bytes: 2,363,569,633
+  - Number of Examples: 2,311,491
+
+### Download Information:
+
+- Download Size: 1,423,881,564 bytes
+- Dataset Size: 2,363,569,633 bytes
+
+### Dataset Citation:
+
+If you use this dataset in your research or project, please cite it as follows:
+
+```bibtex
+@dataset{research_paper_dataset_2023,
+  author    = {Falah.G.Salieh},
+  title     = {Research Paper Dataset 2023},
+  year      = {2023},
+  publisher = {Hugging Face},
+  version   = {1.0},
+  location  = {Online},
+  url       = {https://huggingface.co/datasets/Falah/research_paper2023}
+}
+```
+
+### Apache License:
+
+The "Research Paper Dataset 2023" is distributed under the Apache License 2.0.
+You can find a copy of the license in the LICENSE file of the dataset
+repository. Please review and comply with the license terms before downloading
+and using the dataset.
+
+### Example Usage:
+
+To load the "Research Paper Dataset 2023" using the Hugging Face Datasets
+library in Python, you can use the following code:
+
+```python
+from datasets import load_dataset
+
+dataset = load_dataset("Falah/research_paper2023")
+```
+
+### Application of "Research Paper Dataset 2023" for NLP Text Classification and Chatbot Models
+
+The "Research Paper Dataset 2023" can be a valuable resource for various Natural
+Language Processing (NLP) tasks, including text classification and generating
+book titles in the context of chatbot models. Here are some ways this dataset
+can be utilized for these applications:
+
+1. **Text Classification**: The dataset's features, such as the title and
+   abstract of research papers, can be used to train a text classification
+   model. By assigning appropriate labels to the research papers based on their
+   topics or fields of study, the model can learn to classify new research
+   papers into different categories, for example predicting whether a paper
+   belongs to computer science, biology, or physics. This model can then be
+   adapted for other applications that require categorizing text; a minimal
+   sketch of this approach appears after the list of potential benefits below.
+
+2. **Book Title Generation for Chatbot Models**: By utilizing the research paper
+   titles in the dataset, a natural language generation model, such as a
+   sequence-to-sequence or transformer-based model, can be trained to generate
+   book titles. Fine-tuning on the paper titles lets the model learn the
+   patterns and structure of relevant, meaningful titles, which is useful for
+   chatbot models that recommend books on specific research topics or areas of
+   interest.
+
+### Potential Benefits:
+
+- Improved Chatbot Recommendations: With the ability to generate book titles
+  related to specific research topics, chatbot models can provide more relevant
+  and personalized book recommendations to users.
+- Enhanced User Engagement: By incorporating the text classification model, the
+  chatbot can better understand user queries and respond more accurately,
+  leading to a more engaging user experience.
+- Knowledge Discovery: Researchers and students can use the text classification
+  model to efficiently categorize large collections of research papers, enabling
+  quicker access to relevant information in specific domains.
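+
+As a rough illustration of the text-classification idea above, the sketch below
+trains a TF-IDF plus logistic-regression baseline with scikit-learn. The titles
+and field labels are hypothetical stand-ins: the dataset itself ships only
+titles and abstracts, so real labels would have to be curated separately (for
+example, from arXiv categories):
+
+```python
+from sklearn.feature_extraction.text import TfidfVectorizer
+from sklearn.linear_model import LogisticRegression
+from sklearn.pipeline import make_pipeline
+
+# Hypothetical (title, field) pairs standing in for curated training data
+titles = [
+    "Deep residual learning for image recognition",
+    "Attention is all you need",
+    "CRISPR-Cas9 gene editing in human embryos",
+    "Protein folding prediction with deep learning",
+    "Quantum entanglement in superconducting circuits",
+    "Gravitational waves from binary black hole mergers",
+]
+labels = ["cs", "cs", "bio", "bio", "physics", "physics"]
+
+# TF-IDF features feeding a linear classifier over the field labels
+classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
+classifier.fit(titles, labels)
+
+# Predict the field of an unseen title; with this toy training set the
+# prediction should lean toward "cs"
+print(classifier.predict(["Attention mechanisms for machine translation"]))
+```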
+
+### Considerations:
+
+- Data Preprocessing: Before training the NLP models, appropriate data
+  preprocessing steps may be required, such as text cleaning, tokenization, and
+  encoding, to prepare the dataset for model input.
+- Model Selection and Fine-Tuning: Choosing the right NLP model architecture and
+  hyperparameters, and fine-tuning the model on the specific tasks, can
+  significantly affect the model's performance and generalization ability.
+- Ethical Use: Ensure that the generated book titles and text classification
+  predictions are used responsibly and ethically, respecting copyright and
+  intellectual property rights.
+
+### Conclusion:
+
+The "Research Paper Dataset 2023" holds great potential for enhancing NLP text
+classification models and chatbot systems. By leveraging its titles and
+abstracts, NLP applications can help researchers, students, and readers find
+relevant research papers and generate meaningful book titles for their
+interests, leading to more efficient information retrieval and a better user
+experience in academic literature exploration.
diff --git a/data/datasets/research_papers_dataset/__init__.py b/data/datasets/research_papers_dataset/__init__.py
new file mode 100644
index 0000000000..e69de29bb2
diff --git a/data/datasets/research_papers_dataset/load_dataset.py b/data/datasets/research_papers_dataset/load_dataset.py
new file mode 100644
index 0000000000..4602f0d253
--- /dev/null
+++ b/data/datasets/research_papers_dataset/load_dataset.py
@@ -0,0 +1,3 @@
+from datasets import load_dataset
+
+dataset = load_dataset("Falah/research_paper2023")
diff --git a/data/datasets/research_papers_dataset/requirements.txt b/data/datasets/research_papers_dataset/requirements.txt
new file mode 100644
index 0000000000..7883858ca7
--- /dev/null
+++ b/data/datasets/research_papers_dataset/requirements.txt
@@ -0,0 +1,2 @@
+datasets==2.9.0
+
diff --git a/data/datasets/sentiments-dataset-381-classes/README.md b/data/datasets/sentiments-dataset-381-classes/README.md
new file mode 100644
index 0000000000..23a5526354
--- /dev/null
+++ b/data/datasets/sentiments-dataset-381-classes/README.md
@@ -0,0 +1,361 @@
+---
+dataset_info:
+  features:
+    - name: text
+      dtype: string
+    - name: sentiment
+      dtype: string
+  splits:
+    - name: train
+      num_bytes: 104602
+      num_examples: 1061
+  download_size: 48213
+  dataset_size: 104602
+license: apache-2.0
+task_categories:
+  - text-classification
+language:
+  - en
+pretty_name: sentiments-dataset-381-classes
+size_categories:
+  - 1K<n<10K