diff --git a/docs/hub/_toctree.yml b/docs/hub/_toctree.yml index 378d272fd..8a3b87e89 100644 --- a/docs/hub/_toctree.yml +++ b/docs/hub/_toctree.yml @@ -175,6 +175,8 @@ sections: - local: datasets-argilla title: Argilla + - local: datasets-daft + title: Daft - local: datasets-dask title: Dask - local: datasets-usage diff --git a/docs/hub/datasets-adding.md b/docs/hub/datasets-adding.md index 7bf34b2a4..19ccef83f 100644 --- a/docs/hub/datasets-adding.md +++ b/docs/hub/datasets-adding.md @@ -67,7 +67,7 @@ The rich features set in the `huggingface_hub` library allows you to manage repo ## Using other libraries -Some libraries like [🤗 Datasets](/docs/datasets/index), [Pandas](https://pandas.pydata.org/), [Polars](https://pola.rs), [Dask](https://www.dask.org/) or [DuckDB](https://duckdb.org/) can upload files to the Hub. +Some libraries like [🤗 Datasets](/docs/datasets/index), [Pandas](https://pandas.pydata.org/), [Polars](https://pola.rs), [Dask](https://www.dask.org/), [DuckDB](https://duckdb.org/), or [Daft](https://daft.ai/) can upload files to the Hub. See the list of [Libraries supported by the Datasets Hub](./datasets-libraries) for more information. ## Using Git diff --git a/docs/hub/datasets-daft.md b/docs/hub/datasets-daft.md new file mode 100644 index 000000000..cc92fc649 --- /dev/null +++ b/docs/hub/datasets-daft.md @@ -0,0 +1,79 @@ +# Daft + +[Daft](https://daft.ai/) is a high-performance data engine providing simple and reliable data processing for any modality and scale. Daft has native support for reading from and writing to Hugging Face datasets. + +
+ +
+ + +## Getting Started + +To get started, pip install `daft` with the `huggingface` feature: + +```bash +pip install 'daft[huggingface]' +``` + +## Read + +Daft is able to read datasets directly from the Hugging Face Hub using the [`daft.read_huggingface()`](https://docs.daft.ai/en/stable/api/io/#daft.read_huggingface) function or via the `hf://datasets/` protocol. + +### Reading an Entire Dataset + +Using [`daft.read_huggingface()`](https://docs.daft.ai/en/stable/api/io/#daft.read_huggingface), you can easily load a dataset. + + +```python +import daft + +df = daft.read_huggingface("username/dataset_name") +``` + +This will read the entire dataset into a DataFrame. + +### Reading Specific Files + +Not only can you read entire datasets, but you can also read individual files from a dataset repository. Using a read function that takes in a path (such as [`daft.read_parquet()`](https://docs.daft.ai/en/stable/api/io/#daft.read_parquet), [`daft.read_csv()`](https://docs.daft.ai/en/stable/api/io/#daft.read_csv), or [`daft.read_json()`](https://docs.daft.ai/en/stable/api/io/#daft.read_json)), specify a Hugging Face dataset path via the `hf://datasets/` prefix: + +```python +import daft + +# read a specific Parquet file +df = daft.read_parquet("hf://datasets/username/dataset_name/file_name.parquet") + +# or a csv file +df = daft.read_csv("hf://datasets/username/dataset_name/file_name.csv") + +# or a set of Parquet files using a glob pattern +df = daft.read_parquet("hf://datasets/username/dataset_name/**/*.parquet") +``` + +## Write + +Daft is able to write Parquet files to a Hugging Face dataset repository using [`daft.DataFrame.write_huggingface`](https://docs.daft.ai/en/stable/api/dataframe/#daft.DataFrame.write_deltalake). Daft supports [Content-Defined Chunking](https://huggingface.co/blog/parquet-cdc) and [Xet](https://huggingface.co/blog/xet-on-the-hub) for faster, deduplicated writes. + +Basic usage: + +```python +import daft + +df: daft.DataFrame = ... 
+ +df.write_huggingface("username/dataset_name") +``` + +See the [`DataFrame.write_huggingface`](https://docs.daft.ai/en/stable/api/dataframe/#daft.DataFrame.write_deltalake) API page for more info. + +## Authentication + +The `token` parameter in [`daft.io.HuggingFaceConfig`](https://docs.daft.ai/en/stable/api/config/#daft.io.HuggingFaceConfig) can be used to specify a Hugging Face access token for requests that require authentication (e.g. reading private dataset repositories or writing to a dataset repository). + +Example of loading a dataset with a specified token: + +```python +from daft.io import IOConfig, HuggingFaceConfig + +io_config = IOConfig(hf=HuggingFaceConfig(token="your_token")) +df = daft.read_parquet("hf://datasets/username/dataset_name", io_config=io_config) +``` diff --git a/docs/hub/datasets-libraries.md b/docs/hub/datasets-libraries.md index aa39757d1..4d05bef93 100644 --- a/docs/hub/datasets-libraries.md +++ b/docs/hub/datasets-libraries.md @@ -9,6 +9,7 @@ The table below summarizes the supported libraries and their level of integratio | Library | Description | Download from Hub | Push to Hub | | ----------------------------------- | ------------------------------------------------------------------------------------------------------------------------------ | ----------------- | ----------- | | [Argilla](./datasets-argilla) | Collaboration tool for AI engineers and domain experts that value high quality data. | ✅ | ✅ | +| [Daft](./datasets-daft) | Data engine for large scale, multimodal data processing with a Python-native interface. | ✅ | ✅ | | [Dask](./datasets-dask) | Parallel and distributed computing library that scales the existing Python and PyData ecosystem. | ✅ | ✅ | | [Datasets](./datasets-usage) | 🤗 Datasets is a library for accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP). | ✅ | ✅ | | [Distilabel](./datasets-distilabel) | The framework for synthetic data generation and AI feedback. 
| ✅ | ✅ | @@ -87,7 +88,7 @@ Examples of this kind of integration: #### Rely on an existing libraries integration with the Hub -Polars, Pandas, Dask, Spark and DuckDB all can write to a Hugging Face Hub repository. See [datasets libraries](https://huggingface.co/docs/hub/datasets-libraries) for more details. +Polars, Pandas, Dask, Spark, DuckDB, and Daft can all write to a Hugging Face Hub repository. See [datasets libraries](https://huggingface.co/docs/hub/datasets-libraries) for more details. If you are already using one of these libraries in your code, adding the ability to push to the Hub is straightforward. For example, if you have a synthetic data generation library that can return a Pandas DataFrame, here is the code you would need to write to the Hub: diff --git a/docs/inference-providers/_toctree.yml b/docs/inference-providers/_toctree.yml index 3feaac75f..29369a606 100644 --- a/docs/inference-providers/_toctree.yml +++ b/docs/inference-providers/_toctree.yml @@ -6,42 +6,9 @@ title: Pricing and Billing - local: hub-integration title: Hub integration - - local: register-as-a-provider - title: Register as an Inference Provider - local: security title: Security -- title: Providers - sections: - - local: providers/cerebras - title: Cerebras - - local: providers/cohere - title: Cohere - - local: providers/fal-ai - title: Fal AI - - local: providers/featherless-ai - title: Featherless AI - - local: providers/fireworks-ai - title: Fireworks - - local: providers/groq - title: Groq - - local: providers/hyperbolic - title: Hyperbolic - - local: providers/hf-inference - title: HF Inference - - local: providers/nebius - title: Nebius - - local: providers/novita - title: Novita - - local: providers/nscale - title: Nscale - - local: providers/replicate - title: Replicate - - local: providers/sambanova - title: SambaNova - - local: providers/together - title: Together - - title: Guides sections: - local: guides/first-api-call @@ -57,25 +24,19 @@ - local: 
guides/image-editor title: Build an Image Editor - -- title: API Reference +- local: tasks/index + title: Inference Tasks sections: - - local: tasks/index - title: Index - - local: hub-api - title: Hub API - - title: Popular Tasks - sections: - - local: tasks/chat-completion - title: Chat Completion - - local: tasks/feature-extraction - title: Feature Extraction - - local: tasks/text-to-image - title: Text to Image - - local: tasks/text-to-video - title: Text to Video + - local: tasks/chat-completion + title: Chat Completion + - local: tasks/feature-extraction + title: Feature Extraction + - local: tasks/text-to-image + title: Text to Image + - local: tasks/text-to-video + title: Text to Video - title: Other Tasks - isExpanded: false + isExpanded: False sections: - local: tasks/audio-classification title: Audio Classification @@ -108,4 +69,41 @@ - local: tasks/translation title: Translation - local: tasks/zero-shot-classification - title: Zero Shot Classification \ No newline at end of file + title: Zero Shot Classification + +- title: Providers + sections: + - local: providers/cerebras + title: Cerebras + - local: providers/cohere + title: Cohere + - local: providers/fal-ai + title: Fal AI + - local: providers/featherless-ai + title: Featherless AI + - local: providers/fireworks-ai + title: Fireworks + - local: providers/groq + title: Groq + - local: providers/hyperbolic + title: Hyperbolic + - local: providers/hf-inference + title: HF Inference + - local: providers/nebius + title: Nebius + - local: providers/novita + title: Novita + - local: providers/nscale + title: Nscale + - local: providers/replicate + title: Replicate + - local: providers/sambanova + title: SambaNova + - local: providers/together + title: Together + +- local: hub-api + title: Hub API + +- local: register-as-a-provider + title: Register as an Inference Provider \ No newline at end of file diff --git a/docs/inference-providers/providers/featherless-ai.md 
b/docs/inference-providers/providers/featherless-ai.md index 35f31227d..2f89a71fb 100644 --- a/docs/inference-providers/providers/featherless-ai.md +++ b/docs/inference-providers/providers/featherless-ai.md @@ -52,7 +52,7 @@ Find out more about Chat Completion (LLM) [here](../tasks/chat-completion). @@ -72,6 +72,6 @@ Find out more about Text Generation [here](../tasks/text_generation). diff --git a/docs/inference-providers/providers/hf-inference.md b/docs/inference-providers/providers/hf-inference.md index 7fb0948e6..4a84137b1 100644 --- a/docs/inference-providers/providers/hf-inference.md +++ b/docs/inference-providers/providers/hf-inference.md @@ -57,16 +57,6 @@ Find out more about Automatic Speech Recognition [here](../tasks/automatic_speec /> -### Chat Completion (LLM) - -Find out more about Chat Completion (LLM) [here](../tasks/chat-completion). - - - - ### Feature Extraction Find out more about Feature Extraction [here](../tasks/feature_extraction). @@ -93,7 +83,7 @@ Find out more about Image Classification [here](../tasks/image_classification). @@ -103,7 +93,7 @@ Find out more about Image Segmentation [here](../tasks/image_segmentation). @@ -153,17 +143,7 @@ Find out more about Text Classification [here](../tasks/text_classification). - - -### Text Generation - -Find out more about Text Generation [here](../tasks/text_generation). - - @@ -183,7 +163,7 @@ Find out more about Token Classification [here](../tasks/token_classification). @@ -203,6 +183,6 @@ Find out more about Zero Shot Classification [here](../tasks/zero_shot_classific diff --git a/docs/inference-providers/providers/nebius.md b/docs/inference-providers/providers/nebius.md index 325312a54..7d7ab29cc 100644 --- a/docs/inference-providers/providers/nebius.md +++ b/docs/inference-providers/providers/nebius.md @@ -50,7 +50,7 @@ Find out more about Chat Completion (LLM) [here](../tasks/chat-completion). @@ -80,7 +80,7 @@ Find out more about Text Generation [here](../tasks/text_generation). 
diff --git a/docs/inference-providers/providers/replicate.md b/docs/inference-providers/providers/replicate.md index 6487d1fd7..a53f94ce9 100644 --- a/docs/inference-providers/providers/replicate.md +++ b/docs/inference-providers/providers/replicate.md @@ -70,6 +70,6 @@ Find out more about Text To Video [here](../tasks/text_to_video). diff --git a/docs/inference-providers/providers/together.md b/docs/inference-providers/providers/together.md index 617992c4f..2f24756d8 100644 --- a/docs/inference-providers/providers/together.md +++ b/docs/inference-providers/providers/together.md @@ -50,7 +50,7 @@ Find out more about Chat Completion (LLM) [here](../tasks/chat-completion). @@ -70,7 +70,7 @@ Find out more about Text Generation [here](../tasks/text_generation). diff --git a/docs/inference-providers/register-as-a-provider.md b/docs/inference-providers/register-as-a-provider.md index eab44d0bc..2f62c4676 100644 --- a/docs/inference-providers/register-as-a-provider.md +++ b/docs/inference-providers/register-as-a-provider.md @@ -2,12 +2,18 @@ --Want to be listed as an Inference Provider on the Hugging Face Hub? Let's get in touch! +Want to be listed as an Inference Provider on the Hugging Face Hub? Let's get in touch! Please reach out to us on social networks or [here on the Hub](https://huggingface.co/spaces/huggingface/HuggingDiscussions/discussions/49). + + +Note that Step 3 will require your organization to upgrade their Hub account to a [Team or Enterprise plan](https://huggingface.co/pricing). + + + This guide details the steps for registering as an inference provider on the Hub and provides implementation guidance. 1. **Implement standard task APIs** - Follow our task API schemas for compatibility (see [Prerequisites](#1-prerequisites)). @@ -131,7 +137,7 @@ First step is to use the Model Mapping API to register which HF models are suppo -To proceed with this step, we have to enable your account server-side. 
Make sure you have an organization on the Hub for your company, and upgrade it to a Team or Enterprise plan. +To proceed with this step, we have to enable your account server-side. Make sure you have an organization on the Hub for your company, and upgrade it to a [Team or Enterprise plan](https://huggingface.co/pricing). diff --git a/docs/inference-providers/tasks/chat-completion.md b/docs/inference-providers/tasks/chat-completion.md index f1fbb1654..154d6c18a 100644 --- a/docs/inference-providers/tasks/chat-completion.md +++ b/docs/inference-providers/tasks/chat-completion.md @@ -64,7 +64,7 @@ The API supports: diff --git a/docs/inference-providers/tasks/fill-mask.md b/docs/inference-providers/tasks/fill-mask.md index d527ce0df..33344e16b 100644 --- a/docs/inference-providers/tasks/fill-mask.md +++ b/docs/inference-providers/tasks/fill-mask.md @@ -24,7 +24,6 @@ For more details about the `fill-mask` task, check out its [dedicated page](http ### Recommended models -- [FacebookAI/xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base): A multilingual model trained on 100 languages. Explore all available models and find the one that suits you best [here](https://huggingface.co/models?inference=warm&pipeline_tag=fill-mask&sort=trending). 
diff --git a/docs/inference-providers/tasks/image-classification.md b/docs/inference-providers/tasks/image-classification.md index 3321890eb..01799686e 100644 --- a/docs/inference-providers/tasks/image-classification.md +++ b/docs/inference-providers/tasks/image-classification.md @@ -34,7 +34,7 @@ Explore all available models and find the one that suits you best [here](https:/ diff --git a/docs/inference-providers/tasks/image-segmentation.md b/docs/inference-providers/tasks/image-segmentation.md index 1ceca0e68..9dd683387 100644 --- a/docs/inference-providers/tasks/image-segmentation.md +++ b/docs/inference-providers/tasks/image-segmentation.md @@ -24,7 +24,6 @@ For more details about the `image-segmentation` task, check out its [dedicated p ### Recommended models -- [facebook/mask2former-swin-large-coco-panoptic](https://huggingface.co/facebook/mask2former-swin-large-coco-panoptic): Panoptic segmentation model trained on the COCO (common objects) dataset. Explore all available models and find the one that suits you best [here](https://huggingface.co/models?inference=warm&pipeline_tag=image-segmentation&sort=trending). @@ -33,7 +32,7 @@ Explore all available models and find the one that suits you best [here](https:/ diff --git a/docs/inference-providers/tasks/question-answering.md b/docs/inference-providers/tasks/question-answering.md index 2f1330014..220b9cfbf 100644 --- a/docs/inference-providers/tasks/question-answering.md +++ b/docs/inference-providers/tasks/question-answering.md @@ -25,7 +25,6 @@ For more details about the `question-answering` task, check out its [dedicated p ### Recommended models - [deepset/roberta-base-squad2](https://huggingface.co/deepset/roberta-base-squad2): A robust baseline model for most question answering domains. -- [distilbert/distilbert-base-cased-distilled-squad](https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad): Small yet robust model that can answer questions. 
- [google/tapas-base-finetuned-wtq](https://huggingface.co/google/tapas-base-finetuned-wtq): A special model that can answer questions from tables. Explore all available models and find the one that suits you best [here](https://huggingface.co/models?inference=warm&pipeline_tag=question-answering&sort=trending). diff --git a/docs/inference-providers/tasks/summarization.md b/docs/inference-providers/tasks/summarization.md index 6d3994406..6e0ff5ead 100644 --- a/docs/inference-providers/tasks/summarization.md +++ b/docs/inference-providers/tasks/summarization.md @@ -25,7 +25,6 @@ For more details about the `summarization` task, check out its [dedicated page]( ### Recommended models - [facebook/bart-large-cnn](https://huggingface.co/facebook/bart-large-cnn): A strong summarization model trained on English news articles. Excels at generating factual summaries. -- [Falconsai/medical_summarization](https://huggingface.co/Falconsai/medical_summarization): A summarization model trained on medical articles. Explore all available models and find the one that suits you best [here](https://huggingface.co/models?inference=warm&pipeline_tag=summarization&sort=trending). diff --git a/docs/inference-providers/tasks/text-classification.md b/docs/inference-providers/tasks/text-classification.md index 1202d5f48..6ecaefc66 100644 --- a/docs/inference-providers/tasks/text-classification.md +++ b/docs/inference-providers/tasks/text-classification.md @@ -25,7 +25,6 @@ For more details about the `text-classification` task, check out its [dedicated ### Recommended models - [distilbert/distilbert-base-uncased-finetuned-sst-2-english](https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english): A robust model trained for sentiment analysis. -- [ProsusAI/finbert](https://huggingface.co/ProsusAI/finbert): A sentiment analysis model specialized in financial sentiment. 
Explore all available models and find the one that suits you best [here](https://huggingface.co/models?inference=warm&pipeline_tag=text-classification&sort=trending). @@ -34,7 +33,7 @@ Explore all available models and find the one that suits you best [here](https:/ diff --git a/docs/inference-providers/tasks/text-generation.md b/docs/inference-providers/tasks/text-generation.md index 9f9747d59..e45ad84fc 100644 --- a/docs/inference-providers/tasks/text-generation.md +++ b/docs/inference-providers/tasks/text-generation.md @@ -42,7 +42,7 @@ Explore all available models and find the one that suits you best [here](https:/ diff --git a/docs/inference-providers/tasks/text-to-video.md b/docs/inference-providers/tasks/text-to-video.md index 9d8625958..c80c0324b 100644 --- a/docs/inference-providers/tasks/text-to-video.md +++ b/docs/inference-providers/tasks/text-to-video.md @@ -35,7 +35,7 @@ Explore all available models and find the one that suits you best [here](https:/ diff --git a/docs/inference-providers/tasks/token-classification.md b/docs/inference-providers/tasks/token-classification.md index dc5df09f2..d0b614942 100644 --- a/docs/inference-providers/tasks/token-classification.md +++ b/docs/inference-providers/tasks/token-classification.md @@ -24,9 +24,6 @@ For more details about the `token-classification` task, check out its [dedicated ### Recommended models -- [dslim/bert-base-NER](https://huggingface.co/dslim/bert-base-NER): A robust performance model to identify people, locations, organizations and names of miscellaneous entities. -- [FacebookAI/xlm-roberta-large-finetuned-conll03-english](https://huggingface.co/FacebookAI/xlm-roberta-large-finetuned-conll03-english): A strong model to identify people, locations, organizations and names in multiple languages. -- [blaze999/Medical-NER](https://huggingface.co/blaze999/Medical-NER): A token classification model specialized on medical entity recognition. 
Explore all available models and find the one that suits you best [here](https://huggingface.co/models?inference=warm&pipeline_tag=token-classification&sort=trending). @@ -35,7 +32,7 @@ Explore all available models and find the one that suits you best [here](https:/ diff --git a/docs/inference-providers/tasks/zero-shot-classification.md b/docs/inference-providers/tasks/zero-shot-classification.md index 1c57edfb9..c99069be0 100644 --- a/docs/inference-providers/tasks/zero-shot-classification.md +++ b/docs/inference-providers/tasks/zero-shot-classification.md @@ -24,7 +24,6 @@ For more details about the `zero-shot-classification` task, check out its [dedic ### Recommended models -- [facebook/bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli): Powerful zero-shot text classification model. Explore all available models and find the one that suits you best [here](https://huggingface.co/models?inference=warm&pipeline_tag=zero-shot-classification&sort=trending). @@ -33,7 +32,7 @@ Explore all available models and find the one that suits you best [here](https:/ diff --git a/scripts/inference-providers/package.json b/scripts/inference-providers/package.json index 4bf51ece9..23656ab28 100644 --- a/scripts/inference-providers/package.json +++ b/scripts/inference-providers/package.json @@ -15,7 +15,7 @@ "license": "ISC", "dependencies": { "@huggingface/inference": "^4.7.1", - "@huggingface/tasks": "^0.19.37", + "@huggingface/tasks": "^0.19.43", "@types/node": "^22.5.0", "handlebars": "^4.7.8", "node": "^20.17.0", diff --git a/scripts/inference-providers/pnpm-lock.yaml b/scripts/inference-providers/pnpm-lock.yaml index 0fe303ebe..bec99cb29 100644 --- a/scripts/inference-providers/pnpm-lock.yaml +++ b/scripts/inference-providers/pnpm-lock.yaml @@ -12,8 +12,8 @@ importers: specifier: ^4.7.1 version: 4.7.1 '@huggingface/tasks': - specifier: ^0.19.37 - version: 0.19.37 + specifier: ^0.19.43 + version: 0.19.43 '@types/node': specifier: ^22.5.0 version: 22.5.0 @@ -197,8 
+197,8 @@ packages: resolution: {integrity: sha512-yUZLld4lrM9iFxHCwFQ7D1HW2MWMwSbeB7WzWqFYDWK+rEb+WldkLdAJxUPOmgICMHZLzZGVcVjFh3w/YGubng==} engines: {node: '>=18'} - '@huggingface/tasks@0.19.37': - resolution: {integrity: sha512-Te1VB1tB1HoLfTGluCwy8sLO90YV+uNOAFktQ1h7jKas4TlHT/7SlfwFaDJFTV8lN7qCw2nDB+7PRkKzwIb/hg==} + '@huggingface/tasks@0.19.43': + resolution: {integrity: sha512-ANO23K3ugclBl6VLwdt+7MxBkRkKEE17USUSqprHb29UB5ISigH+0AJcEuDA064uzn0hqYrG/nOcv1yARRt8bw==} '@jridgewell/resolve-uri@3.1.2': resolution: {integrity: sha512-bRISgCIjP20/tbWSPWMEi54QVPRZExkuD9lJL+UIxUKtwVJA8wW1Trb1jMs1RFXo1CBTNZ/5hpC9QvmKWdopKw==} @@ -418,11 +418,11 @@ snapshots: '@huggingface/inference@4.7.1': dependencies: '@huggingface/jinja': 0.5.1 - '@huggingface/tasks': 0.19.37 + '@huggingface/tasks': 0.19.43 '@huggingface/jinja@0.5.1': {} - '@huggingface/tasks@0.19.37': {} + '@huggingface/tasks@0.19.43': {} '@jridgewell/resolve-uri@3.1.2': {}