
Collaborator

@willmj willmj commented Dec 4, 2024

Description of the change

Adds changes that enable loading Arrow dataset files with datasets.load_dataset.
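As a rough illustration of what loading by file type can look like, the sketch below picks a Hugging Face `datasets` builder name from the file extension. The helper `builder_for_path` and the mapping table are hypothetical and are not taken from this PR's diff; the only library facts assumed are that `datasets.load_dataset` accepts the packaged builder names "json", "parquet", and "arrow" together with a `data_files` argument.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a `datasets` builder name.
# The mapping actually used by this PR may differ.
_BUILDER_BY_EXTENSION = {
    ".json": "json",
    ".jsonl": "json",
    ".parquet": "parquet",
    ".arrow": "arrow",
}


def builder_for_path(data_path: str) -> str:
    """Return the builder name to pass as the first argument of load_dataset."""
    ext = Path(data_path).suffix.lower()
    try:
        return _BUILDER_BY_EXTENSION[ext]
    except KeyError:
        raise ValueError(f"Unsupported data file extension: {ext!r}") from None


# Usage (needs the `datasets` package, so it is left commented out here):
# import datasets
# path = "tests/artifacts/testdata/twitter_complaints_small.arrow"
# ds = datasets.load_dataset(builder_for_path(path), data_files=path)
```

Raising on an unknown extension, rather than falling back to a default builder, is a design choice of this sketch so that a mistyped path fails early.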

The datasets were converted from JSON using the following script:

import pyarrow as pa
import pyarrow.json as pajson

# Path to the input file; pyarrow.json.read_json expects line-delimited JSON.
json_file = "<file>"
table = pajson.read_json(json_file)

# Write the table in the Arrow IPC file format.
output_file = "<output file>"
with pa.OSFile(output_file, 'wb') as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write(table)

The following script can be used to inspect the newly added datasets:

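import pyarrow as pa
import pyarrow.ipc as ipc

arrow_file = "tests/artifacts/testdata/twitter_complaints_small.arrow"

# Read the whole Arrow IPC file into an in-memory table.
with pa.OSFile(arrow_file, 'rb') as source:
    reader = ipc.open_file(source)
    table = reader.read_all()

# Print the schema and a preview of the columns.
print(table)

# Convert to a pandas DataFrame for a row-oriented view (requires pandas).
df = table.to_pandas()
print(df)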

pretokenized:
[image]

input_output:
[image]

small:
[image]

Related issue number

Following PR #401

How to verify the PR

Run

python tuning/sft_trainer.py \
--model_name_or_path Maykeye/TinyLLama-v0 \
--training_data_path tests/artifacts/testdata/twitter_complaints_input_output.arrow \
--output_dir outputs/full-tuning \
--num_train_epochs 5 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 1 \
--learning_rate 1e-5 \
--use_flash_attn false \
--torch_dtype "float32"

tox -e py

For the end-to-end build and results on llama3 8b, see the Travis CI build (click on the jobs and look through the logs for training, inference, and response).

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass


github-actions bot commented Dec 4, 2024

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@github-actions github-actions bot added the test label Dec 4, 2024
Signed-off-by: Will Johnson <[email protected]>
@willmj willmj marked this pull request as ready for review December 6, 2024 18:18
@dushyantbehl
Collaborator

LGTM thanks @willmj

Signed-off-by: Ashok Pon Kumar <[email protected]>
@dushyantbehl
Collaborator

Looks good.

Signed-off-by: Ashok Pon Kumar <[email protected]>
@dushyantbehl
Collaborator

All good, okay to merge. @ashokponkumar thanks

@ashokponkumar ashokponkumar enabled auto-merge (squash) December 7, 2024 09:11
@ashokponkumar ashokponkumar merged commit e6f7a22 into foundation-model-stack:main Dec 7, 2024
8 checks passed