Releases: strangetom/ingredient-parser

0.1.0-beta6

24 Oct 18:24

0.1.0-beta6 Pre-release
  • Support parsing of preparation steps from ingredients e.g. finely chopped, diced
    • These are now returned in the ParsedIngredient.preparation field instead of the comment field
  • Removal of StrangerFoods dataset from model training due to lack of PREP labels
  • Addition of a BBC Food dataset in the model training
  • Miscellaneous bug fixes to the preprocessing steps to resolve reported issues
    • Handling of fractions with the format: 1 and 1/2
    • Handling of amounts followed by 'x' e.g. 1x can
    • Handling of ranges where the units were duplicated: 100g - 200g
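
The three preprocessing cases listed above can be illustrated with a few regular expressions. This is a minimal sketch, not the package's actual preprocessing code: the function name `normalise_amounts` and the exact output forms (for example `100-200g`) are assumptions made for illustration.

```python
import re


def normalise_amounts(sentence: str) -> str:
    """Hypothetical helper showing the three cases; not the package's code."""
    # "1 and 1/2" -> "1 1/2"
    sentence = re.sub(r"(\d+)\s+and\s+(\d+/\d+)", r"\1 \2", sentence)
    # "1x can" -> "1 can" (an amount followed by 'x')
    sentence = re.sub(r"(\d+)\s*x\s+", r"\1 ", sentence)
    # "100g - 200g" -> "100-200g" (same unit on both ends of the range)
    sentence = re.sub(
        r"(\d+)\s*([a-zA-Z]+)\s*-\s*(\d+)\s*\2\b", r"\1-\3\2", sentence
    )
    return sentence


print(normalise_amounts("1 and 1/2 cups flour"))
print(normalise_amounts("1x can tomatoes"))
print(normalise_amounts("100g - 200g sugar"))
```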

0.1.0-beta5

16 Sep 19:57

0.1.0-beta5 Pre-release
  • Support the extraction of multiple amounts from the input sentence.
  • Change output dataclass to put confidence values with each field.
    • The name, comment, other fields are output as an IngredientText object containing the text and confidence
    • The amounts are output as an IngredientAmount object containing the quantity, unit, confidence and flags for whether the amount is approximate or for a singular item of the ingredient.
  • Rewrite post-processing functionality to make it more maintainable and extensible in the future.
  • Add a model card, which provides information about the data used to train and evaluate the model, the purpose of the model and its limitations.
  • Increase l1 regularisation during model training.
    • This reduces model size by a factor of ~4.
    • This should improve performance on sentences not seen before by forcing the model to rely less on labelling specific words.
  • Improve the model guide in the documentation.
  • Add a simple webapp that can be used to view the output of the parser in a more human-readable way.

Example of the output at this release

>>> parse_ingredient("50ml/2fl oz/3½tbsp lavender honey (or other runny honey if unavailable)")
ParsedIngredient(
    name=IngredientText(
        text='lavender honey',
        confidence=0.998829),
    amount=[
        IngredientAmount(
            quantity='50',
            unit='ml',
            confidence=0.999189,
            APPROXIMATE=False,
            SINGULAR=False),
        IngredientAmount(
            quantity='2',
            unit='fl oz',
            confidence=0.980392,
            APPROXIMATE=False,
            SINGULAR=False),
        IngredientAmount(
            quantity='3.5',
            unit='tbsps',
            confidence=0.990711,
            APPROXIMATE=False,
            SINGULAR=False)
    ],
    comment=IngredientText(
        text='(or other runny honey if unavailable)',
        confidence=0.973682
    ),
    other=None,
    sentence='50ml/2fl oz/3½tbsp lavender honey (or other runny honey if unavailable)'
)
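
For readers consuming this output, the shape of the result can be approximated with plain dataclasses. The classes below are stand-ins reconstructed from the printed output above, with field types inferred rather than taken from the package source; `format_amounts` is a hypothetical helper added only to show how the list of amounts might be used.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IngredientText:
    text: str
    confidence: float


@dataclass
class IngredientAmount:
    quantity: str
    unit: str
    confidence: float
    APPROXIMATE: bool = False
    SINGULAR: bool = False


@dataclass
class ParsedIngredient:
    # Field types are assumptions inferred from the example output.
    name: Optional[IngredientText]
    amount: List[IngredientAmount]
    comment: Optional[IngredientText]
    other: Optional[IngredientText]
    sentence: str


def format_amounts(parsed: ParsedIngredient) -> str:
    """Hypothetical helper: join every extracted amount into one string."""
    return " / ".join(f"{a.quantity} {a.unit}" for a in parsed.amount)


example = ParsedIngredient(
    name=IngredientText("lavender honey", 0.998829),
    amount=[
        IngredientAmount("50", "ml", 0.999189),
        IngredientAmount("2", "fl oz", 0.980392),
        IngredientAmount("3.5", "tbsps", 0.990711),
    ],
    comment=IngredientText("(or other runny honey if unavailable)", 0.973682),
    other=None,
    sentence="50ml/2fl oz/3½tbsp lavender honey (or other runny honey if unavailable)",
)
print(format_amounts(example))
```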

0.1.0-beta4

16 Aug 19:27

0.1.0-beta4 Pre-release
  • Include new source of training data: cookstr.
  • The parse_ingredient function now returns a ParsedIngredient dataclass instead of a dict.
    • Remove dependency on typing_extensions as a result of this
  • A model card is now provided that gives details about how the model was trained, how it performs, how it is intended to be used, and its limitations.
    • The model card is distributed with the package and there is a function show_model_card() that will open the model card in the default application for markdown files.
  • Improvements to the ingredient sentence preprocessing:
    • Expand the list of units
    • Tweak the tokenizer to handle more punctuation
    • Fix various bugs with the cleaning steps

As a result of these updates the model performance has improved to:

Sentence-level results:
    Total: 12030
    Correct: 10776
    Incorrect: 1254
    -> 89.58% correct

Word-level results:
    Total: 75146
    Correct: 72329
    Incorrect: 2817
    -> 96.25% correct
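
The percentages follow directly from the counts above. A short check, using a hypothetical helper `percent_correct`, confirms that the correct and incorrect counts add up to the totals and reproduces the reported figures:

```python
def percent_correct(correct: int, total: int) -> float:
    """Share of correct items as a percentage, rounded to two decimals."""
    return round(100 * correct / total, 2)


# Counts copied from the results above.
sentence_total, sentence_correct, sentence_incorrect = 12030, 10776, 1254
word_total, word_correct, word_incorrect = 75146, 72329, 2817

assert sentence_correct + sentence_incorrect == sentence_total
assert word_correct + word_incorrect == word_total

print(percent_correct(sentence_correct, sentence_total))  # 89.58
print(percent_correct(word_correct, word_total))  # 96.25
```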

0.1.0-beta3

18 Jul 19:08

0.1.0-beta3 Pre-release

Correct the minimum Python version to 3.10, due to the use of type hints introduced in 3.10.

0.1.0-beta2

18 Jul 18:16

0.1.0-beta2 Pre-release
  • Add new feature that indicates if a token is ambiguous, for example "clove" could be a unit or a name.
  • Add preprocessing step to remove trailing periods from certain units e.g. tsp. becomes tsp
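
As an illustration of the trailing-period step, the sketch below strips the period from a small set of abbreviated units. The unit list and the function name `strip_unit_periods` are assumptions; the package's own list of affected units is not given in these notes.

```python
import re

# Illustrative unit list (an assumption, not the package's actual list).
UNITS_WITH_PERIOD = ("tsp", "tbsp", "oz", "lb")

# Match a listed unit followed by a period, then whitespace or end of string.
UNIT_PERIOD_PATTERN = re.compile(
    r"\b(" + "|".join(UNITS_WITH_PERIOD) + r")\.(?=\s|$)", re.IGNORECASE
)


def strip_unit_periods(sentence: str) -> str:
    """Remove the trailing period from known unit abbreviations only."""
    return UNIT_PERIOD_PATTERN.sub(r"\1", sentence)


print(strip_unit_periods("1 tsp. salt"))
print(strip_unit_periods("2 tbsp. sugar."))
```

Periods after words that are not in the unit list, such as the full stop after "sugar", are left untouched.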

0.1.0-beta1

08 Apr 18:06

0.1.0-beta1 Pre-release
  • Change the features extracted from an ingredient sentence
    • Replace the word with the stem of the word
    • Add a feature indicating whether the token follows "plus"
    • Change features combining current and next/previous part of speech to just use the next/previous part of speech
  • Improve handling of plural units
    • Units are made singular before being passed to the CRF model. Units are re-pluralised afterwards only if they were made singular in the first place.
  • Add test cases for the parse_ingredient function
    • Not all test cases pass yet - failures will be future improvements (hopefully)
  • Better align behaviour of regex parser with CRF-based parser.

0.1.0-alpha4

22 Dec 11:15

0.1.0-alpha4 Pre-release
  • Minor fixes to documentation
  • Apply re-pluralization to regex parser

0.1.0-alpha3

02 Oct 18:09

0.1.0-alpha3 Pre-release

Incremental changes:

  • Fix re-pluralisation of units not actually working in 0.1.0-alpha2.
  • Configure development tools in pyproject.toml.
  • Fixes to documentation.
  • Fixes to NYT data.
  • Additional sentence features:
    • is_stop_word
    • is_after_comma
  • Only create features that are possible for the token e.g. there is no prev_word for the first token, so don't create the feature at all instead of using an empty string.
  • Refactor code for easier maintenance and flake8 compliance.
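
The change to feature creation can be shown with a small sketch: features that cannot exist for a token are left out of the dictionary instead of being set to an empty string. The function name `token_features` and the `word` and `next_word` keys are assumptions; only `prev_word` is named in these notes.

```python
from typing import Dict, List


def token_features(tokens: List[str], index: int) -> Dict[str, str]:
    """Build features for one token, omitting those that cannot exist."""
    features = {"word": tokens[index]}
    # The first token has no previous word, so the key is not created at all.
    if index > 0:
        features["prev_word"] = tokens[index - 1]
    # Likewise, the last token has no next word.
    if index < len(tokens) - 1:
        features["next_word"] = tokens[index + 1]
    return features


tokens = ["2", "cups", "flour"]
print(token_features(tokens, 0))
print(token_features(tokens, 2))
```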

0.1.0-alpha2

12 Sep 18:41

0.1.0-alpha2 Pre-release

Incremental changes:

  • Improved documentation
    • Automatically extract code and version from source files.
  • Added regular expression based parser
    • This provides an alternative to the CRF-based parser, but is more limited
  • Improvements to labelling of New York Times dataset
    • Label size modifiers for unit as part of the unit e.g. large clove, small bunch
    • Consistent labelling of "juice of..." variants
    • Consistent labelling of "chopped"
    • Consistent labelling of "package"
    • Reduce number of tokens labelled as OTHER because they were missing from the label
  • Fixes and improvements to pre-processing input sentences
    • Expand list of units to be singularised
    • Fix the preprocessing incorrectly handling words with different cases
    • Improve matching and replacement of string numbers e.g. one -> 1
    • Fix Unicode fractions not being replaced
  • Improvements to post-processing the model output
    • Pluralise units if the quantity is not singular
  • Start adding tests to PreProcessor class methods
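
Two of the pre-processing steps listed above, replacing string numbers and replacing Unicode fractions, can be sketched as follows. Both functions are hypothetical illustrations; the number table is deliberately partial, and inserting a space between a whole number and a fraction is an assumption about the desired output.

```python
import re
import unicodedata

# Partial, illustrative table (an assumption, not the package's list).
STRING_NUMBERS = {"one": "1", "two": "2", "three": "3"}
STRING_NUMBER_PATTERN = re.compile(
    r"\b(" + "|".join(STRING_NUMBERS) + r")\b", re.IGNORECASE
)


def replace_string_numbers(sentence: str) -> str:
    """Replace whole-word number names with digits, e.g. one -> 1."""
    return STRING_NUMBER_PATTERN.sub(
        lambda m: STRING_NUMBERS[m.group(1).lower()], sentence
    )


def replace_unicode_fractions(sentence: str) -> str:
    """Replace characters such as ½ with an ASCII fraction such as 1/2."""
    out = []
    for ch in sentence:
        if unicodedata.decomposition(ch).startswith("<fraction>"):
            # NFKC turns ½ into 1, FRACTION SLASH (U+2044), 2.
            ascii_fraction = unicodedata.normalize("NFKC", ch).replace("\u2044", "/")
            if out and out[-1].isdigit():
                out.append(" ")  # keep "1½" readable as "1 1/2"
            out.append(ascii_fraction)
        else:
            out.append(ch)
    return "".join(out)


print(replace_string_numbers("one onion"))
print(replace_unicode_fractions("1½ cups"))
```

The word boundaries in the pattern prevent "one" from matching inside "onion".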

0.1.0-alpha1

06 Sep 19:07

0.1.0-alpha1 Pre-release

Initial release of package.

There are probably a bunch of errors to fix and improvements to make since this is my first attempt at building a Python package.