Releases · strangetom/ingredient-parser

24 Oct 18:24

strangetom

0.1.0-beta6

595556a

0.1.0-beta6 Pre-release

Support parsing of preparation steps from ingredients e.g. finely chopped, diced
- These are now returned in the ParsedIngredient.preparation field; previously they were returned in the comment field
Removal of StrangerFoods dataset from model training due to lack of PREP labels
Addition of a BBC Food dataset in the model training
- 10,000 additional ingredient sentences from the archive of 10599 recipes found at https://archive.org/details/recipes-en-201706
Miscellaneous bug fixes to the preprocessing steps to resolve reported issues
- Handling of fractions with the format: 1 and 1/2
- Handling of amounts followed by 'x' e.g. 1x can
- Handling of ranges where the units were duplicated: 100g - 200g
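The three preprocessing fixes listed above can be pictured as small regular-expression rewrites. The following is only a sketch: the function names and patterns are hypothetical and chosen for illustration, and the library's actual preprocessing may work differently.

```python
import re

# Hypothetical helpers; names and patterns are assumptions, not the library's code.


def join_and_fraction(sentence: str) -> str:
    """Rewrite '1 and 1/2' as '1 1/2' so it reads as one mixed number."""
    return re.sub(r"(\d+)\s+and\s+(\d+/\d+)", r"\1 \2", sentence)


def strip_multiplier_x(sentence: str) -> str:
    """Rewrite '1x can' (or '2 x tins') as '1 can' ('2 tins')."""
    return re.sub(r"(\d+)\s*x\s+", r"\1 ", sentence)


def dedupe_range_units(sentence: str) -> str:
    """Rewrite '100g - 200g' as '100-200g' when both ends share the same unit."""
    return re.sub(r"(\d+)\s*([a-zA-Z]+)\s*-\s*(\d+)\s*\2\b", r"\1-\3\2", sentence)


print(join_and_fraction("1 and 1/2 cups flour"))
print(strip_multiplier_x("1x can tomatoes"))
print(dedupe_range_units("100g - 200g butter"))
```

The range pattern uses a backreference (`\2`) so that only ranges with an identical unit on both sides are collapsed.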

16 Sep 19:57

strangetom

0.1.0-beta5

ab0cd83

0.1.0-beta5 Pre-release

Support the extraction of multiple amounts from the input sentence.
Change output dataclass to put confidence values with each field.
- The name, comment, other fields are output as an IngredientText object containing the text and confidence
- The amounts are output as an IngredientAmount object containing the quantity, unit, confidence and flags for whether the amount is approximate or for a singular item of the ingredient.
Rewrite post-processing functionality to make it more maintainable and extensible in the future.
Add a model card, which provides information about the data used to train and evaluate the model, the purpose of the model and its limitations.
Increase l1 regularisation during model training.
- This reduces model size by a factor of ~4.
- This should improve performance on sentences not seen before by forcing the model to rely less on labelling specific words.
Improve the model guide in the documentation.
Add a simple webapp that can be used to view the output of the parser in a more human-readable way.

Example of the output at this release

>>> parse_ingredient("50ml/2fl oz/3½tbsp lavender honey (or other runny honey if unavailable)")
ParsedIngredient(
    name=IngredientText(
        text='lavender honey',
        confidence=0.998829),
    amount=[
        IngredientAmount(
            quantity='50',
            unit='ml',
            confidence=0.999189,
            APPROXIMATE=False,
            SINGULAR=False),
        IngredientAmount(
            quantity='2',
            unit='fl oz',
            confidence=0.980392,
            APPROXIMATE=False,
            SINGULAR=False),
        IngredientAmount(
            quantity='3.5',
            unit='tbsps',
            confidence=0.990711,
            APPROXIMATE=False,
            SINGULAR=False)
    ],
    comment=IngredientText(
        text='(or other runny honey if unavailable)',
        confidence=0.973682
    ),
    other=None,
    sentence='50ml/2fl oz/3½tbsp lavender honey (or other runny honey if unavailable)'
)
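
The shape of this output can be approximated with plain dataclasses. The definitions below are reconstructed only from the printed example above; the field types, the default values for the flags, and the helper `units_of` are assumptions, so the library's real classes may differ.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class IngredientText:
    text: str
    confidence: float


@dataclass
class IngredientAmount:
    quantity: str
    unit: str
    confidence: float
    APPROXIMATE: bool = False  # default is an assumption
    SINGULAR: bool = False  # default is an assumption


@dataclass
class ParsedIngredient:
    name: IngredientText | None
    amount: list[IngredientAmount]
    comment: IngredientText | None
    other: IngredientText | None
    sentence: str


def units_of(parsed: ParsedIngredient) -> list[str]:
    """Hypothetical helper: list the unit of every extracted amount."""
    return [amount.unit for amount in parsed.amount]


example = ParsedIngredient(
    name=IngredientText("lavender honey", 0.998829),
    amount=[
        IngredientAmount("50", "ml", 0.999189),
        IngredientAmount("2", "fl oz", 0.980392),
    ],
    comment=None,
    other=None,
    sentence="50ml/2fl oz lavender honey",
)
print(units_of(example))
```

Keeping a confidence value next to each field lets a caller decide per field whether to trust the parse, rather than accepting or rejecting the whole sentence.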

16 Aug 19:27

strangetom

0.1.0-beta4-hotfix

4174836

0.1.0-beta4 Pre-release

Include new source of training data: cookstr.
- 10,000 additional ingredient sentences from the archive of 7918 recipes (~40,000 total ingredient sentences) found at https://archive.org/details/recipes-en-201706 are now used in the training of the model.
The parse_ingredient function now returns a ParsedIngredient dataclass instead of a dict.
- Remove dependency on typing_extensions as a result of this
A model card is now provided that describes how the model was trained, how it performs, how it is intended to be used, and its limitations.
- The model card is distributed with the package and there is a function show_model_card() that will open the model card in the default application for markdown files.
Improvements to the ingredient sentence preprocessing:
- Expand the list of units
- Tweak the tokenizer to handle more punctuation
- Fix various bugs with the cleaning steps

As a result of these updates the model performance has improved to:

Sentence-level results:
    Total: 12030
    Correct: 10776
    Incorrect: 1254
    -> 89.58% correct

Word-level results:
    Total: 75146
    Correct: 72329
    Incorrect: 2817
    -> 96.25% correct
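
The two metrics differ in what counts as a unit: a sentence is correct only if every token in it is labelled correctly, while word-level accuracy counts tokens individually. A minimal sketch of that calculation follows; the function name `score` and the label names in the example are illustrative, not taken from the library's evaluation code.

```python
def score(true_labels, predicted_labels):
    """Return (sentence_accuracy, word_accuracy) for parallel lists of label sequences."""
    sentences_correct = 0
    words_correct = 0
    words_total = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        matches = sum(t == p for t, p in zip(truth, predicted))
        words_correct += matches
        words_total += len(truth)
        # A sentence only counts as correct if every token label matches.
        if matches == len(truth):
            sentences_correct += 1
    return sentences_correct / len(true_labels), words_correct / words_total


# Illustrative labels only.
truth = [["QTY", "UNIT", "NAME"], ["QTY", "NAME"]]
predicted = [["QTY", "UNIT", "NAME"], ["QTY", "COMMENT"]]
print(score(truth, predicted))

# The percentages reported above follow directly from the counts.
print(round(100 * 10776 / 12030, 2))
print(round(100 * 72329 / 75146, 2))
```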

18 Jul 19:08

strangetom

0.1.0-beta3

b26fb00

0.1.0-beta3 Pre-release

Correct minimum Python version to 3.10 due to use of type hints introduced in 3.10.

18 Jul 18:16

strangetom

0.1.0-beta2

bd5aba6

0.1.0-beta2 Pre-release

Add new feature that indicates if a token is ambiguous, for example "clove" could be a unit or a name.
Add preprocessing step to remove trailing periods from certain units e.g. tsp. becomes tsp
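
Both changes in this release can be sketched in a few lines. The unit list, the ambiguous-token set and the function names below are assumptions made for illustration; only "tsp." and "clove" come from the notes above.

```python
import re

# Illustrative subsets; the library's real lists are not given in these notes.
UNITS_WITH_PERIOD = ("tsp", "tbsp", "oz", "lb")
AMBIGUOUS_TOKENS = {"clove"}

_UNIT_PERIOD_RE = re.compile(
    r"\b(" + "|".join(UNITS_WITH_PERIOD) + r")\.", re.IGNORECASE
)


def strip_unit_periods(sentence: str) -> str:
    """Remove a trailing period from known unit abbreviations, e.g. 'tsp.' -> 'tsp'."""
    return _UNIT_PERIOD_RE.sub(r"\1", sentence)


def is_ambiguous(token: str) -> bool:
    """Flag tokens that could be either a unit or an ingredient name."""
    return token.lower() in AMBIGUOUS_TOKENS


print(strip_unit_periods("1 tsp. salt"))
print(is_ambiguous("clove"))
```

Restricting the pattern to known units leaves ordinary sentence-ending periods untouched.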

08 Apr 18:06

strangetom

0.1.0-beta1

a94e00e

0.1.0-beta1 Pre-release

Change the features extracted from an ingredient sentence
- Replace the word with the stem of the word
- Add a feature indicating whether the token follows "plus"
- Change features combining current and next/previous part of speech to just use the next/previous part of speech
Improve handling of plural units
- Units are made singular before passing to CRF model. The repluralisation of units is based on whether they were made singular in the first place or not.
Add test cases for the parse_ingredient function
- Not all test cases pass yet - failures will be future improvements (hopefully)
Better align behaviour of regex parser with CRF-based parser.
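
The plural-unit handling described above amounts to remembering which tokens were singularised so that only those are pluralised again afterwards. A minimal sketch, assuming a small lookup table; the table and function names are hypothetical, not the library's.

```python
# Illustrative lookup table; the library's unit list is larger and not shown here.
PLURAL_TO_SINGULAR = {"cups": "cup", "cloves": "clove", "tablespoons": "tablespoon"}
SINGULAR_TO_PLURAL = {s: p for p, s in PLURAL_TO_SINGULAR.items()}


def singularise(tokens):
    """Return the singularised tokens plus the indices that were changed."""
    singular, changed = [], []
    for index, token in enumerate(tokens):
        if token in PLURAL_TO_SINGULAR:
            singular.append(PLURAL_TO_SINGULAR[token])
            changed.append(index)
        else:
            singular.append(token)
    return singular, changed


def repluralise(tokens, changed):
    """Restore the plural form only for tokens that were singularised earlier."""
    return [
        SINGULAR_TO_PLURAL[token] if index in changed else token
        for index, token in enumerate(tokens)
    ]


tokens = ["2", "cups", "flour"]
singular, changed = singularise(tokens)
print(singular, changed)
print(repluralise(singular, changed))
```

Tracking indices rather than inspecting the quantity means a unit that was already singular in the input stays singular in the output.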

22 Dec 11:15

strangetom

0.1.0-alpha4

604bf9f

0.1.0-alpha4 Pre-release

Minor fixes to documentation
Apply re-pluralization to regex parser

02 Oct 18:09

strangetom

0.1.0-alpha3

9700680

0.1.0-alpha3 Pre-release

Incremental changes:

Fix re-pluralisation of units not actually working in 0.1.0-alpha2.
Configure development tools in pyproject.toml.
Fixes to documentation.
Fixes to NYT data.
Additional sentence features:
- is_stop_word
- is_after_comma
Only create features that are possible for the token e.g. there is no prev_word for the first token, so don't create the feature at all instead of using an empty string.
Refactor code for easier maintenance and flake8 compliance.
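
The feature changes in this release can be sketched as a function that builds a feature dictionary per token and simply omits keys that cannot exist. The stop-word list, the `next_word` key and the reading of is_after_comma as "any earlier token is a comma" are assumptions for illustration only.

```python
# Illustrative stop-word list; the library's list is not given in these notes.
STOP_WORDS = {"of", "the", "and", "or", "a"}


def token_features(tokens, index):
    """Build features for tokens[index], omitting features that cannot exist."""
    token = tokens[index]
    features = {
        "word": token.lower(),
        "is_stop_word": token.lower() in STOP_WORDS,
        # Assumption: true when any earlier token in the sentence is a comma.
        "is_after_comma": "," in tokens[:index],
    }
    # No prev_word for the first token: leave the key out rather than use "".
    if index > 0:
        features["prev_word"] = tokens[index - 1].lower()
    # Likewise, no next_word for the last token.
    if index < len(tokens) - 1:
        features["next_word"] = tokens[index + 1].lower()
    return features


tokens = ["1", "onion", ",", "chopped"]
print(token_features(tokens, 0))
print(token_features(tokens, 3))
```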

12 Sep 18:41

strangetom

0.1.0-alpha2

f5484b9

0.1.0-alpha2 Pre-release

Incremental changes:

Improved documentation
- Automatically extract code and version from source files.
Added regular expression based parser
- This provides an alternative to the CRF-based parser, but is more limited
Improvements to labelling of New York Times dataset
- Label size modifiers for unit as part of the unit e.g. large clove, small bunch
- Consistent labelling of "juice of..." variants
- Consistent labelling of "chopped"
- Consistent labelling of "package"
- Reduce the number of tokens labelled as OTHER because they were missing from the label
Fixes and improvements to pre-processing input sentences
- Expand list of units to be singularised
- Fix the preprocessing incorrectly handling words with different cases
- Improve matching and replacement of string numbers e.g. one -> 1
- Fix Unicode fractions not being replaced
Improvements to post-processing the model output
- Pluralise units if the quantity is not singular
Start adding tests to PreProcessor class methods
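
Two of the pre-processing steps above, replacing string numbers and replacing Unicode fractions, can be sketched with the standard library alone. The number table and function names are hypothetical, and the library's own implementation may differ.

```python
import re
import unicodedata

# Illustrative subset of string numbers.
STRING_NUMBERS = {"one": "1", "two": "2", "three": "3"}

_NUMBER_RE = re.compile(r"\b(" + "|".join(STRING_NUMBERS) + r")\b", re.IGNORECASE)


def replace_string_numbers(sentence: str) -> str:
    """Replace whole-word string numbers regardless of case, e.g. 'One' -> '1'."""
    return _NUMBER_RE.sub(lambda m: STRING_NUMBERS[m.group(1).lower()], sentence)


def replace_unicode_fractions(sentence: str) -> str:
    """Expand characters such as '½' to ' 1/2', then tidy up whitespace."""

    def expand(char: str) -> str:
        if unicodedata.decomposition(char).startswith("<fraction>"):
            # NFKC turns '½' into '1', FRACTION SLASH (U+2044), '2'.
            return " " + unicodedata.normalize("NFKC", char).replace("\u2044", "/")
        return char

    expanded = "".join(expand(char) for char in sentence)
    return re.sub(r"\s+", " ", expanded).strip()


print(replace_string_numbers("One onion and two cloves"))
print(replace_unicode_fractions("3½tbsp honey"))
```

The case-insensitive match reflects the fix for words with different cases, and the word boundaries prevent matches inside longer words such as "someone".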

06 Sep 19:07

strangetom

0.1.0-alpha1

bac57c4

0.1.0-alpha1 Pre-release

Initial release of package.

There are probably a bunch of errors to fix and improvements to make since this is my first attempt at building a Python package.
