Include new token features, which help improve performance:
- Word shape (e.g. cheese -> xxxxxx; Cheese -> Xxxxxx)
- N-gram (n=3, 4, 5) prefixes and suffixes of tokens
Add 15,000 new sentences from AllRecipes to the training data. This dataset includes many branded ingredients, which the existing datasets were quite light on.
Tweaks to the model hyperparameters have yielded a model that is ~25% smaller, but with better performance than the previous model.
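
To illustrate the word shape and n-gram prefix/suffix features listed above, here is a minimal sketch. The function names, the feature keys and the choice to skip affixes longer than the token are assumptions for illustration, not the package's actual implementation.

```python
import re


def word_shape(token: str) -> str:
    """Replace lowercase letters with 'x' and uppercase letters with 'X'.

    Non-letter characters are left unchanged (an assumption).
    """
    shape = re.sub(r"[a-z]", "x", token)
    return re.sub(r"[A-Z]", "X", shape)


def affix_features(token: str, sizes=(3, 4, 5)) -> dict:
    """Return n-gram prefixes and suffixes of the token for each size n.

    Hypothetical feature keys; sizes longer than the token are skipped.
    """
    features = {}
    for n in sizes:
        if len(token) >= n:
            features[f"prefix_{n}"] = token[:n]
            features[f"suffix_{n}"] = token[-n:]
    return features


print(word_shape("Cheese"))
print(affix_features("cheese"))
```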

Processing

Change processing of numbers written as words (e.g. 'one', 'two'). If the token is labelled as QTY, then the number will be converted to a digit (i.e. 'one' -> 1) or collapsed into a range (i.e. 'one or two' -> 1-2); otherwise the token is left unchanged.
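
The behaviour described above could be sketched roughly as follows. The function name, the lookup table and the token/label list inputs are hypothetical; the real post-processing may differ.

```python
# Hypothetical lookup table covering a few written numbers.
NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11, "twelve": 12,
}


def replace_number_words(tokens, labels):
    """Convert QTY-labelled number words to digits; leave other tokens as-is."""

    def is_qty_number(i):
        return labels[i] == "QTY" and tokens[i].lower() in NUMBER_WORDS

    out = []
    i = 0
    while i < len(tokens):
        if not is_qty_number(i):
            out.append(tokens[i])
            i += 1
            continue
        low = NUMBER_WORDS[tokens[i].lower()]
        # Collapse "<number> or <number>" into a range such as "1-2".
        if (
            i + 2 < len(tokens)
            and tokens[i + 1].lower() == "or"
            and is_qty_number(i + 2)
        ):
            high = NUMBER_WORDS[tokens[i + 2].lower()]
            out.append(f"{low}-{high}")
            i += 3
        else:
            out.append(str(low))
            i += 1
    return out


print(replace_number_words(["one", "or", "two", "eggs"], ["QTY", "QTY", "QTY", "NAME"]))
```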

10 Aug 20:34

strangetom

1.0.1

f5c73ca

1.0.1

Warning

This version requires NLTK >=3.8.2

NLTK 3.8.2 changes the file format (from pickle to json) of the weights used by the part of speech tagger used in this project, to address some security concerns. This patch updates the NLTK resource checks performed when ingredient-parser is imported to check for the new json files, and downloads them if they are not present.

17 Jun 15:44

strangetom

1.0.0

b058395

1.0.0

1.0

General

Improve performance when tagging multiple sentences. For large numbers of sentences (>1000), the performance improvement is ~100x.

Processing

Extend support for composite amounts of the form 1 cup plus 1 tablespoon or 1 cup minus 1 tablespoon. Previously, the phrase plus/minus 1 tablespoon would be returned in the comment. Now the whole phrase is captured as a CompositeAmount object.
Fix cases where the incorrect pint.Unit would be returned, caused by pint interpreting the unit as something else e.g. "pinch" -> "pico-inch".
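
As a rough picture of what capturing "1 cup plus 1 tablespoon" as a single object allows, here is a minimal sketch. The class names, fields and conversion table are hypothetical and are not the package's CompositeAmount API; the conversion factors assume US customary measures.

```python
from dataclasses import dataclass

# Assumed conversion factors to US tablespoons, for illustration only.
TABLESPOONS_PER_UNIT = {"cup": 16.0, "tablespoon": 1.0}


@dataclass
class SketchAmount:
    quantity: float
    unit: str


@dataclass
class SketchCompositeAmount:
    amounts: list
    subtractive: bool = False  # True for "1 cup minus 1 tablespoon"

    def total_tablespoons(self) -> float:
        """Combine the parts, adding or subtracting all parts after the first."""
        first, *rest = self.amounts
        total = first.quantity * TABLESPOONS_PER_UNIT[first.unit]
        sign = -1.0 if self.subtractive else 1.0
        for amount in rest:
            total += sign * amount.quantity * TABLESPOONS_PER_UNIT[amount.unit]
        return total


parts = [SketchAmount(1, "cup"), SketchAmount(1, "tablespoon")]
print(SketchCompositeAmount(parts).total_tablespoons())
print(SketchCompositeAmount(parts, subtractive=True).total_tablespoons())
```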

27 May 16:43

strangetom

0.1.0-beta11

3a16425

0.1.0-beta11 Pre-release

Pre-release

General

Refactor package structure to make it more suitable for expansion to other languages.

Note: There aren't any plans to support other languages yet.

Model

Reduce duplication in training data
Introduce PURPOSE label for tokens that describe the purpose of the ingredient, such as for the dressing and for garnish.
Replace quantities with "!num" when determining the features for tokens so that the model doesn't need to learn all possible values quantities can take. This results in a small reduction in model size.
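
The "!num" substitution can be pictured with a small sketch like the one below. The regular expression and function name are assumptions; the package's own definition of a quantity may be broader or narrower.

```python
import re

# Assumed definition of a quantity: integers, decimals, simple fractions
# and ranges, e.g. 2, 0.5, 1/2, 1-2.
NUMERIC = re.compile(r"\d+(?:[./-]\d+)*")


def feature_token(token: str) -> str:
    """Return '!num' for numeric tokens so features do not depend on the value."""
    return "!num" if NUMERIC.fullmatch(token) else token


print([feature_token(t) for t in ["250", "g", "flour"]])
```

With this substitution, 250 and 300 produce identical features, so the model only has to learn how a quantity behaves in context rather than every individual value.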

Processing

Various bug fixes to post-processing of tokens with labels NAME, COMMENT, PREP, PURPOSE, SIZE to correct punctuation and confidence calculations.
Modification of tokeniser to split full stops from the end of tokens. This helps the model avoid treating "token." and "token" as different cases to learn.
Add fallback functionality to parse_ingredient for cases where none of the tokens are labelled as NAME. This will select as the name the token with the highest confidence of being labelled NAME, even though a different label has a higher confidence for that token. This can be disabled by setting expect_name_in_output=False in parse_ingredient.
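
The NAME fallback described above might look something like this sketch. The function and the per-token score dictionaries are hypothetical; only the expect_name_in_output parameter name is taken from the release notes.

```python
def pick_name_tokens(tokens, marginals, expect_name_in_output=True):
    """Return tokens labelled NAME, falling back to the most NAME-like token.

    `marginals` is a hypothetical list of {label: confidence} dicts, one per token.
    """
    best_labels = [max(scores, key=scores.get) for scores in marginals]
    named = [tok for tok, label in zip(tokens, best_labels) if label == "NAME"]
    if named or not expect_name_in_output:
        return named
    # Fallback: no token won the NAME label, so take the token whose NAME
    # confidence is highest, even though another label scored higher for it.
    best_index = max(
        range(len(tokens)), key=lambda i: marginals[i].get("NAME", 0.0)
    )
    return [tokens[best_index]]


tokens = ["2", "tbsp", "chopped"]
marginals = [
    {"QTY": 0.9, "NAME": 0.05},
    {"UNIT": 0.8, "NAME": 0.1},
    {"PREP": 0.6, "NAME": 0.4},
]
print(pick_name_tokens(tokens, marginals))
```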

12 Apr 16:57

strangetom

0.1.0-beta10

a22da60

0.1.0-beta10 Pre-release

Pre-release

Bugfix

Fix incorrect Python version specifier in the package, which was preventing pip on Python 3.12 from downloading the latest version.

06 Apr 15:09

strangetom

0.1.0-beta9

3160198

0.1.0-beta9 Pre-release

Pre-release

General

Add GitHub Actions to run tests (#7, @boxydog)
Add pre-commit for use with development (#10, @boxydog)

Model

Add additional model performance metrics.
Add model hyper-parameter tuning functionality with python train.py gridsearch to iterate over specified training algorithms and hyper-parameters.
Add --detailed argument to output detailed information about model performance on test data. (#9, @boxydog)
Change model labels to label all punctuation as PUNC. This resolves some of the ambiguity in token labelling.
Introduce SIZE label for tokens that modify the size of the ingredient. Note that this only applies to size modifiers of the ingredient. Size modifiers of the unit will remain part of the unit e.g. large clove.

Processing

Integration of pint library for units
- By default, units in IngredientAmount object will be returned as pint.Unit objects (where possible). This enables the easy conversion of amounts between different units. This can be disabled by setting string_units=True in the parse_ingredient function calls.
- For units that have US customary and Imperial versions with the same name (e.g. cup), setting imperial_units=True in the parse_ingredient function calls will return the imperial version. The default is US customary.
- This only applies to units in pint's unit registry (basically all common, standardised units). If the unit can't be found, then the string is returned as previously.
Additions to IngredientAmount object:
- New quantity_max field for handling the upper limit of ranges. If the quantity is not a range, this will default to the same value as the quantity field.
- Flags for RANGE and MULTIPLIER
  - RANGE is set to True for quantity ranges e.g. 1-2
  - MULTIPLIER is set to True for quantities like 1x
- Conversion of quantity field to float where possible
PreProcessor improvements
- Be less aggressive about replacing written numbers (e.g. one) with the digit version. For example, in sentences like 1 tsp Chinese five-spice, five-spice is now kept as written instead of being replaced by two tokens: 5 spice.
- Improve handling of ranges that duplicate the units e.g. 1 pound to 2 pound is now returned as 1-2 pound
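
The handling of ranges with duplicated units can be illustrated with a regular expression sketch. The pattern and function name are assumptions made for this example and only cover the simple "N unit to M unit" form with an identical unit on both sides.

```python
import re

# Matches "<low> <unit> to <high> <unit>" where both units are identical.
DUPLICATE_UNIT_RANGE = re.compile(
    r"(?P<low>\d+(?:\.\d+)?)\s+(?P<unit>[a-zA-Z]+)\s+to\s+"
    r"(?P<high>\d+(?:\.\d+)?)\s+(?P=unit)\b"
)


def collapse_duplicate_unit_range(sentence: str) -> str:
    """Rewrite '1 pound to 2 pound' as '1-2 pound'; leave other text unchanged."""
    return DUPLICATE_UNIT_RANGE.sub(r"\g<low>-\g<high> \g<unit>", sentence)


print(collapse_duplicate_unit_range("1 pound to 2 pound beef"))
```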

Contributors

boxydog

27 Jan 11:16

strangetom

0.1.0-beta8

6f5f230

0.1.0-beta8 Pre-release

Pre-release

General

Support Python 3.12

Model

Include more training data, expanding the Cookstr and BBC data by 5,000 additional sentences each
Change how the training data is stored. An SQLite database is now used to store the sentences and their tokens and labels. This fixes a long-standing bug where tokens in the training data would be assigned the wrong label. CSV exports are still available.
Discard any sentences containing the OTHER label prior to training the model, so a parsed ingredient sentence can never contain anything labelled OTHER.

Processing

Remove the other field from the ParsedIngredient object returned by the parse_ingredient function.
Add a text field to IngredientAmount. This is auto-generated when the object is created and provides a human-readable string for the amount e.g. "100 g"
Allow SINGULAR flag to be set if the amount it's being applied to is in brackets
Where a sentence has multiple related amounts e.g. 14 ounce (400 g), any flags set for one of the related amounts are applied to all the related amounts
Rewrite the tokeniser so it doesn't require all handled characters to be explicitly stated
Add an option to parse_ingredient to discard isolated stop words that appear in the name, comment and preparation fields.
IngredientAmount.amount elements are now ordered to match the order in which they appear in the sentence.
Initial support for composite ingredient amounts e.g. 1 lb 2 oz is now considered to be a single CompositeIngredientAmount instead of two separate IngredientAmount objects.
- Further work required to handle other cases such as 1 tablespoon plus 1 teaspoon.
- This solution may change as it develops
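
The auto-generated text field mentioned in the Processing notes above can be pictured with a dataclass sketch. The class name and field types are hypothetical and this is not the package's IngredientAmount definition.

```python
from dataclasses import dataclass, field


@dataclass
class SketchIngredientAmount:
    quantity: str
    unit: str
    # Not passed by the caller; filled in when the object is created.
    text: str = field(init=False)

    def __post_init__(self):
        # Build a human-readable string such as "100 g".
        self.text = f"{self.quantity} {self.unit}".strip()


print(SketchIngredientAmount("100", "g").text)
```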

21 Nov 20:44

strangetom

0.1.0-beta7

d5e2b59

0.1.0-beta7 Pre-release

Pre-release

Automatically download required NLTK resources if they're not found when importing
Require python version <3.12 because python-crfsuite does not yet support 3.12
Various minor tweaks and fixes.

Releases: strangetom/ingredient-parser