Releases: strangetom/ingredient-parser
0.1.0-beta6
- Support parsing of preparation steps from ingredients e.g. finely chopped, diced
- These are returned in the ParsedIngredient.preparation field instead of the comment field as previously
- Removal of StrangerFoods dataset from model training due to lack of PREP labels
- Addition of a BBC Food dataset in the model training
- 10,000 additional ingredient sentences from the archive of 10599 recipes found at https://archive.org/details/recipes-en-201706
- Miscellaneous bug fixes to the preprocessing steps to resolve reported issues
- Handling of fractions with the format: 1 and 1/2
- Handling of amounts followed by 'x' e.g. 1x can
- Handling of ranges where the units were duplicated: 100g - 200g
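The three amount formats listed above can be illustrated with a small regex-based clean-up. This is a minimal sketch only: the function name `normalise_amounts` and the patterns are assumptions made for illustration, not the package's actual preprocessing code.

```python
import re


def normalise_amounts(sentence: str) -> str:
    """Illustrative clean-up of three amount formats (hypothetical helper)."""
    # "1 and 1/2" -> "1 1/2"
    sentence = re.sub(r"(\d+)\s+and\s+(\d+/\d+)", r"\1 \2", sentence)
    # "1x can" -> "1 can"
    sentence = re.sub(r"\b(\d+)\s*x\s+", r"\1 ", sentence)
    # "100g - 200g" -> "100-200g" (only when both ends use the same unit)
    sentence = re.sub(
        r"(\d+)\s*([a-zA-Z]+)\s*-\s*(\d+)\s*\2\b", r"\1-\3\2", sentence
    )
    return sentence


print(normalise_amounts("1 and 1/2 cups flour"))
print(normalise_amounts("1x can chopped tomatoes"))
print(normalise_amounts("100g - 200g sugar"))
```

Applying the substitutions before tokenisation means the model only ever sees one spelling of each amount.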
0.1.0-beta5
- Support the extraction of multiple amounts from the input sentence.
- Change output dataclass to put confidence values with each field.
- The name, comment, other fields are output as an IngredientText object containing the text and confidence
- The amounts are output as an IngredientAmount object containing the quantity, unit, confidence and flags for whether the amount is approximate or for a singular item of the ingredient.
- Rewrite post-processing functionality to make it more maintainable and extensible in the future.
- Add a model card, which provides information about the data used to train and evaluate the model, the purpose of the model and its limitations.
- Increase l1 regularisation during model training.
- This reduces model size by a factor of ~4.
- This should improve performance on sentences not seen before by forcing the model to rely less on labelling specific words.
- Improve the model guide in the documentation.
- Add a simple webapp that can be used to view the output of the parser in a more human-readable way.
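The output structure described in this release can be sketched as plain dataclasses. The class and field names follow the release notes and the example output; the field types and defaults are assumptions, and the package's real definitions may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IngredientText:
    text: str
    confidence: float


@dataclass
class IngredientAmount:
    quantity: str
    unit: str
    confidence: float
    APPROXIMATE: bool = False  # assumed default
    SINGULAR: bool = False  # assumed default


@dataclass
class ParsedIngredient:
    name: Optional[IngredientText]
    amount: List[IngredientAmount]
    comment: Optional[IngredientText]
    other: Optional[IngredientText]
    sentence: str


example = ParsedIngredient(
    name=IngredientText(text="lavender honey", confidence=0.998829),
    amount=[IngredientAmount(quantity="50", unit="ml", confidence=0.999189)],
    comment=None,
    other=None,
    sentence="50ml lavender honey",
)
print(example.amount[0].unit)
```

Keeping a confidence value on each field lets a caller decide per field whether to trust the parse, rather than per sentence.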
Example of the output at this release
>>> parse_ingredient("50ml/2fl oz/3½tbsp lavender honey (or other runny honey if unavailable)")
ParsedIngredient(
    name=IngredientText(
        text='lavender honey',
        confidence=0.998829),
    amount=[
        IngredientAmount(
            quantity='50',
            unit='ml',
            confidence=0.999189,
            APPROXIMATE=False,
            SINGULAR=False),
        IngredientAmount(
            quantity='2',
            unit='fl oz',
            confidence=0.980392,
            APPROXIMATE=False,
            SINGULAR=False),
        IngredientAmount(
            quantity='3.5',
            unit='tbsps',
            confidence=0.990711,
            APPROXIMATE=False,
            SINGULAR=False)
    ],
    comment=IngredientText(
        text='(or other runny honey if unavailable)',
        confidence=0.973682
    ),
    other=None,
    sentence='50ml/2fl oz/3½tbsp lavender honey (or other runny honey if unavailable)'
)
0.1.0-beta4
- Include new source of training data: cookstr.
- 10,000 additional ingredient sentences from the archive of 7918 recipes (~40,000 total ingredient sentences) found at https://archive.org/details/recipes-en-201706 are now used in the training of the model.
- The parse_ingredient function now returns a ParsedIngredient dataclass instead of a dict.
- Remove dependency on typing_extensions as a result of this
- A model card is now provided that gives details about how the model was trained, performs, is intended to be used, and limitations.
- The model card is distributed with the package and there is a function show_model_card() that will open the model card in the default application for markdown files.
- Improvements to the ingredient sentence preprocessing:
- Expand the list of units
- Tweak the tokenizer to handle more punctuation
- Fix various bugs with the cleaning steps
As a result of these updates the model performance has improved to:
Sentence-level results:
Total: 12030
Correct: 10776
Incorrect: 1254
-> 89.58% correct
Word-level results:
Total: 75146
Correct: 72329
Incorrect: 2817
-> 96.25% correct
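The percentages above follow directly from the correct and total counts. A quick check of the arithmetic, using a hypothetical helper `percent_correct` that is not part of the package:

```python
def percent_correct(correct: int, total: int) -> float:
    """Percentage of correct items, rounded to two decimal places."""
    return round(100 * correct / total, 2)


sentence_accuracy = percent_correct(10776, 12030)
word_accuracy = percent_correct(72329, 75146)

print(f"Sentence-level: {sentence_accuracy}% correct")
print(f"Word-level: {word_accuracy}% correct")
```

The sentence-level figure is lower because a sentence only counts as correct when every token in it is labelled correctly.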
0.1.0-beta3
Correct minimum Python version to 3.10 due to use of type hints introduced in 3.10.
0.1.0-beta2
- Add new feature that indicates if a token is ambiguous, for example "clove" could be a unit or a name.
- Add preprocessing step to remove trailing periods from certain units e.g. tsp. becomes tsp
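Removing a trailing period from a unit can be sketched with a single regular expression. The unit list and the function name `strip_unit_periods` below are assumptions for illustration; the package's own list of affected units is not given in these notes.

```python
import re

# Hypothetical subset of units that are often written with a trailing period.
UNITS_WITH_PERIODS = ("tsp", "tbsp", "oz", "lb")

_UNIT_PERIOD = re.compile(
    r"\b(" + "|".join(UNITS_WITH_PERIODS) + r")\.(?=\s|$)", re.IGNORECASE
)


def strip_unit_periods(sentence: str) -> str:
    """Drop the period after a known unit, leaving other periods untouched."""
    return _UNIT_PERIOD.sub(r"\1", sentence)


print(strip_unit_periods("1 tsp. salt"))
print(strip_unit_periods("2 tbsp. sugar, sifted."))
```

Restricting the substitution to a known unit list avoids stripping sentence-final periods or decimal points.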
0.1.0-beta1
- Change the features extracted from an ingredient sentence
- Replace the word with the stem of the word
- Add feature for follows "plus"
- Change features combining current and next/previous part of speech to just use the next/previous part of speech
- Improve handling of plural units
- Units are made singular before passing to CRF model. The repluralisation of units is based on whether they were made singular in the first place or not.
- Add test cases for the parse_ingredient function
- Not all test cases pass yet - failures will be future improvements (hopefully)
- Better align behaviour of regex parser with CRF-based parser.
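The plural-unit handling described above (make units singular before the CRF model, then restore the plural only where one was removed) can be sketched as follows. The mapping and both function names are hypothetical and stand in for whatever the package uses internally.

```python
# Hypothetical subset of plural -> singular unit mappings.
PLURAL_TO_SINGULAR = {"cups": "cup", "cloves": "clove", "tablespoons": "tablespoon"}
SINGULAR_TO_PLURAL = {v: k for k, v in PLURAL_TO_SINGULAR.items()}


def singularise(tokens):
    """Return singularised tokens plus the indices that were changed."""
    singular, changed = [], []
    for i, token in enumerate(tokens):
        if token in PLURAL_TO_SINGULAR:
            singular.append(PLURAL_TO_SINGULAR[token])
            changed.append(i)
        else:
            singular.append(token)
    return singular, changed


def repluralise(tokens, changed):
    """Restore plurals only at the indices that were singularised earlier."""
    return [
        SINGULAR_TO_PLURAL[token] if i in changed else token
        for i, token in enumerate(tokens)
    ]


tokens, changed = singularise(["2", "cups", "flour"])
print(tokens, changed)
print(repluralise(tokens, changed))
```

Tracking the changed indices means a unit that was singular in the input, such as "1 cup", is never pluralised by mistake.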
0.1.0-alpha4
- Minor fixes to documentation
- Apply re-pluralization to regex parser
0.1.0-alpha3
Incremental changes:
- Fix re-pluralisation of units not actually working in 0.1.0-alpha2.
- Configure development tools in pyproject.toml.
- Fixes to documentation.
- Fixes to NYT data.
- Additional sentence features:
- is_stop_word
- is_after_comma
- Only create features that are possible for the token e.g. there is no prev_word for the first token, so don't create the feature at all instead of using an empty string.
- Refactor code for easier maintenance and flake8 compliance.
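The idea of only creating features that are possible for a token can be sketched as a feature dictionary that omits keys rather than filling them with empty strings. The function name, the stop-word list and the exact definition of is_after_comma used here are assumptions, not the package's implementation.

```python
# Hypothetical stop-word list for illustration only.
STOP_WORDS = {"of", "the", "and", "or", "a"}


def token_features(tokens, index):
    """Build a feature dict for one token, omitting impossible features."""
    token = tokens[index]
    features = {
        "word": token,
        "is_stop_word": token.lower() in STOP_WORDS,
        # Assumed definition: a comma appears somewhere before this token.
        "is_after_comma": "," in tokens[:index],
    }
    if index > 0:
        # The first token has no previous word, so no key is created for it.
        features["prev_word"] = tokens[index - 1]
    if index < len(tokens) - 1:
        # The last token has no next word.
        features["next_word"] = tokens[index + 1]
    return features


tokens = ["1", "onion", ",", "chopped"]
print(token_features(tokens, 0))
print(token_features(tokens, 3))
```

Omitting the key keeps the model from learning a weight for a meaningless empty-string value.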
0.1.0-alpha2
Incremental changes:
- Improved documentation
- Automatically extract code and version from source files.
- Added regular expression based parser
- This provides an alternative to the CRF-based parser, but is more limited
- Improvements to labelling of New York Times dataset
- Label size modifiers for unit as part of the unit e.g. large clove, small bunch
- Consistent labelling of "juice of..." variants
- Consistent labelling of "chopped"
- Consistent labelling of "package"
- Reduce number of tokens labelled as OTHER because they were missing from the label
- Fixes and improvements to pre-processing input sentences
- Expand list of units to be singularised
- Fix the preprocessing incorrectly handling words with different cases
- Improve matching and replacement of string numbers e.g. one -> 1
- Fix unicode fraction replacement not replacing
- Improvements to post-processing the model output
- Pluralise units if the quantity is not singular
- Start adding tests to PreProcessor class methods
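Two of the pre-processing fixes above, case-insensitive replacement of string numbers and replacement of unicode fractions, can be sketched as below. The mappings and function names are hypothetical and cover only a few values.

```python
import re

# Hypothetical subsets; a real implementation would cover more values.
STRING_NUMBERS = {"one": "1", "two": "2", "three": "3"}
UNICODE_FRACTIONS = {"½": "1/2", "¼": "1/4", "¾": "3/4", "⅓": "1/3", "⅔": "2/3"}

_STRING_NUMBER = re.compile(
    r"\b(" + "|".join(STRING_NUMBERS) + r")\b", re.IGNORECASE
)


def replace_string_numbers(sentence: str) -> str:
    """Replace whole-word string numbers regardless of case, e.g. One -> 1."""
    return _STRING_NUMBER.sub(lambda m: STRING_NUMBERS[m.group(1).lower()], sentence)


def replace_unicode_fractions(sentence: str) -> str:
    """Replace unicode fraction characters with spaced ASCII fractions."""
    for char, text in UNICODE_FRACTIONS.items():
        sentence = sentence.replace(char, " " + text)
    # Collapse any doubled or leading whitespace introduced above.
    return " ".join(sentence.split())


print(replace_string_numbers("One large onion"))
print(replace_unicode_fractions("1½ cups milk"))
```

The word boundaries in the pattern stop words such as "someone" from being altered.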
0.1.0-alpha1
Initial release of package.
There are probably a bunch of errors to fix and improvements to make since this is my first attempt at building a Python package.