
Commit d0c9dba

jeremymanning and claude committed
Comprehensive documentation enhancement for v0.3.0
Major documentation overhaul including:

## Enhanced Documentation Infrastructure
- Updated package docstring with comprehensive examples and v0.3.0 info
- Enhanced function docstrings across all modules
- Modernized configuration with sentence-transformers models
- Updated installation guide with Python 3.9+ requirements

## Comprehensive Tutorial System
- Enhanced wrangling_basics.ipynb with modern sentence-transformers examples
- Created core.ipynb covering the configuration system
- Created decorators1.ipynb and decorators2.ipynb for decorator patterns
- Created util.ipynb covering utility functions
- Enhanced io.ipynb for file operations
- Created real_world_examples.ipynb with customer feedback analysis
- Started interpolation_and_imputation.ipynb for missing data handling

## User Experience Improvements
- Organized tutorials into logical sections (Getting Started, Core Concepts, Advanced Applications)
- Created comprehensive migration guide for the v0.2 → v0.3 transition
- Added migration guide to main documentation navigation
- Fixed notebook formatting and validation issues

## Modern Examples Throughout
- Multiple sentence-transformers model examples with use-case guidance
- Practical applications including similarity search and clustering
- Real-world case studies with visualization and analysis
- Updated all examples to use v0.3.0 patterns

The documentation now provides comprehensive guidance for both new users and those migrating from v0.2, with practical examples showcasing the full capabilities of the modernized data-wrangler.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
1 parent 4ac047e commit d0c9dba

18 files changed (+1189, -97 lines)

datawrangler/__init__.py

Lines changed: 35 additions & 1 deletion

@@ -1,4 +1,38 @@
-"""Top-level package for datawrangler."""
+"""
+Data Wrangler: Transform messy data into clean pandas DataFrames
+
+Data Wrangler is a Python package that automatically transforms various data types
+(arrays, text, files, URLs, etc.) into clean, consistent pandas DataFrame format.
+It specializes in text data processing using modern NLP models.
+
+Key Features:
+- Automatic data type detection and conversion
+- Text embedding using sentence-transformers and sklearn models
+- Function decorators for seamless DataFrame integration
+- Support for files, URLs, and mixed data types
+- Configurable processing pipeline
+
+Basic Usage:
+    >>> import datawrangler as dw
+    >>> df = dw.wrangle(your_data)
+
+    # With text data using sentence-transformers
+    >>> text_df = dw.wrangle(["Hello world", "Another text"],
+    ...                      text_kwargs={'model': 'all-MiniLM-L6-v2'})
+
+    # Using the @funnel decorator
+    >>> @dw.funnel
+    ... def your_function(df):
+    ...     return df.mean()
+
+Requirements:
+- Python 3.9+
+- Optional: Install with [hf] extras for sentence-transformers support
+
+    pip install "pydata-wrangler[hf]"
+
+Version: 0.3.0+ (NumPy 2.0+ and pandas 2.0+ compatible)
+"""
 
 __author__ = """Contextual Dynamics Lab"""
 __email__ = '[email protected]'
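The `@dw.funnel` decorator mentioned in the new package docstring can be illustrated with a minimal re-implementation. This is a hypothetical sketch (`funnel_sketch` and `column_means` are made-up names, and the coercion is far simpler than what data-wrangler actually does) showing the general pattern: wrap a function so its input is coerced to a DataFrame before the call.

```python
import functools

import pandas as pd


def funnel_sketch(fn):
    """Hypothetical stand-in for dw.funnel: coerce the first argument
    to a pandas DataFrame before calling the wrapped function."""
    @functools.wraps(fn)
    def wrapper(data, *args, **kwargs):
        if not isinstance(data, pd.DataFrame):
            # Simplistic coercion; the real dw.wrangle handles text,
            # files, URLs, and mixed/nested lists as well
            data = pd.DataFrame(data)
        return fn(data, *args, **kwargs)
    return wrapper


@funnel_sketch
def column_means(df):
    # The wrapped function can assume it always receives a DataFrame
    return df.mean()


# Works on a plain list of rows, not just DataFrames
means = column_means([[1.0, 2.0], [3.0, 4.0]])
```

The design point is that downstream analysis functions stay DataFrame-only, while callers may pass whatever raw data they have.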

datawrangler/core/config.ini

Lines changed: 15 additions & 28 deletions

@@ -71,38 +71,25 @@ n_components = 50
 [TruncatedSVD]
 n_components = 50
 
-[BytePairEmbeddings]
-__model = 'en'
+# Sentence-Transformers Models (Modern NLP)
+[SentenceTransformer]
+__model = 'all-MiniLM-L6-v2'
 
-[ELMoEmbeddings]
-__model = 'original'
+[all-MiniLM-L6-v2]
+# Fast, general-purpose sentence embeddings
+# Good for: similarity search, clustering, information retrieval
 
-[FlairEmbeddings]
-__model = 'mix-forward'
+[all-mpnet-base-v2]
+# High-quality sentence embeddings
+# Good for: semantic similarity, paraphrase detection
 
-[PooledFlairEmbeddings]
-__model = 'mix-forward'
+[paraphrase-MiniLM-L6-v2]
+# Optimized for paraphrase detection
+# Good for: duplicate detection, content deduplication
 
-[TransformerWordEmbeddings]
-__model = 'en'
-
-[WordEmbeddings]
-__model = 'en'
-
-[StackedEmbeddings]
-__model = [embeddings.WordEmbeddings('glove'), embeddings.FlairEmbeddings('mix-forward'), embeddings.FlairEmbeddings('mix-backward')]
-
-[DocumentPoolEmbeddings]
-__model = [embeddings.WordEmbeddings('glove')]
-
-[DocumentRNNEmbeddings]
-__model = [embeddings.WordEmbeddings('glove')]
-
-[TransformerDocumentEmbeddings]
-__model = 'bert-base-uncased'
-
-[SentenceTransformerDocumentEmbeddings]
-__model = 'stsb-mpnet-base-v2'
+[all-distilroberta-v1]
+# Balanced performance and speed
+# Good for: general text understanding tasks
 
 [impute]
 model = 'IterativeImputer'
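Sections like these can be read with Python's standard `configparser`. A minimal sketch, using a fragment that mirrors the new config sections from the diff above (how data-wrangler itself parses this file is not shown here, so the quote-stripping step is an assumption about the stored format):

```python
import configparser

# A fragment mirroring the updated config.ini sections shown above
CONFIG_TEXT = """
[SentenceTransformer]
__model = 'all-MiniLM-L6-v2'

[impute]
model = 'IterativeImputer'
"""

config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)

# Values are stored as quoted strings in the file, so strip the quotes
default_model = config["SentenceTransformer"]["__model"].strip("'")
impute_model = config["impute"]["model"].strip("'")
```

This makes the default sentence-transformers model a one-line config change rather than a code change.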

datawrangler/zoo/format.py

Lines changed: 24 additions & 16 deletions

@@ -16,30 +16,38 @@
 
 def wrangle(x, return_dtype=False, **kwargs):
     """
-    Turn messy data into clean data
+    Turn messy data into clean pandas DataFrames
+
+    Automatically detects and converts various data types into consistent DataFrame format.
+    Specializes in text processing using modern NLP models and handles mixed data types.
 
     Parameters
     ----------
-    :param x: data in any format (text, numpy arrays, pandas dataframes, or a mixed list (or nested lists) of those
-        types). The following datatypes are supported:
+    :param x: data in any format. Supported datatypes:
       - Numpy Arrays, array-like objects, or paths to files that store array-like objects
-      - Pandas DataFrames, dataframe-like objects, or paths to files that store dataframe-like objects
-      - Images, or paths to files that store images
-      - Text, or paths to plain text files
-      - Mixed lists of the above
-    :param return_dtype: if True, also return the auto-detected datatype(s) of each dataset you wrangle
-    :param kwargs: used to control how data are wrangled (e.g., if you don't want to use the default options for each
-        data type):
-      - array_kwargs: passed to the datawrangler.zoo.array.wrangle_array function to control how arrays are handled
-      - dataframe_kwargs: passed to the datawrangler.zoo.dataframe.wrangle_dataframe function to control how
-        dataframes are handled
-      - image_kwargs: passed to the datawrangler.zoo.image.wrangle_image function to control how images are handled
-      - text_kwargs: passed to the datawrangler.zoo.text.wrangle_text function to control how text data are handled
-      any other keyword arguments are passed to *all* of the wrangle functions.
+      - Pandas DataFrames, dataframe-like objects, or paths to files that store dataframe-like objects
+      - Text strings, lists of strings, or paths to plain text files
+      - Mixed lists or nested lists of the above types
+    :param return_dtype: if True, also return the auto-detected datatype(s) of each dataset. Default: False
+    :param kwargs: control how data are wrangled:
+      - array_kwargs: passed to wrangle_array function to control how arrays are handled
+      - dataframe_kwargs: passed to wrangle_dataframe function to control how dataframes are handled
+      - text_kwargs: passed to wrangle_text function to control how text data are handled
+        Common text_kwargs options:
+        - {'model': 'all-MiniLM-L6-v2'} for sentence-transformers
+        - {'model': ['CountVectorizer', 'LatentDirichletAllocation']} for sklearn pipeline
+      Any other keyword arguments are passed to all wrangle functions.
 
     Returns
     -------
     :return: a DataFrame, or a list of DataFrames, containing the wrangled data
+
+    Examples
+    --------
+    >>> import datawrangler as dw
+    >>> df = dw.wrangle([1, 2, 3])  # Convert array to DataFrame
+    >>> text_df = dw.wrangle(["Hello", "World"], text_kwargs={'model': 'all-MiniLM-L6-v2'})
+    >>> mixed_df, dtypes = dw.wrangle([df, text_df], return_dtype=True)
     """
 
     deep_kwargs = {}
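The auto-detection behavior documented in the revised docstring can be sketched with a simplified dispatcher. All names below (`detect_dtype`, `wrangle_sketch`) are hypothetical; the real `wrangle` also handles images, files, URLs, and nested lists, and delegates to per-type wrangle functions.

```python
import numpy as np
import pandas as pd


def detect_dtype(x):
    """Simplified version of wrangle's type detection (illustrative only)."""
    if isinstance(x, pd.DataFrame):
        return "dataframe"
    if isinstance(x, str) or (
        isinstance(x, list) and x and all(isinstance(i, str) for i in x)
    ):
        return "text"
    if isinstance(x, (list, tuple, np.ndarray)):
        return "array"
    raise TypeError(f"unsupported type: {type(x)}")


def wrangle_sketch(x, return_dtype=False):
    """Coerce x to a DataFrame, mimicking wrangle's return_dtype flag."""
    dtype = detect_dtype(x)
    df = x if dtype == "dataframe" else pd.DataFrame(x)
    return (df, dtype) if return_dtype else df


df, dtype = wrangle_sketch([1, 2, 3], return_dtype=True)
```

The `return_dtype` flag mirrors the documented behavior: the caller gets both the wrangled DataFrame and the label that the detector assigned.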

datawrangler/zoo/text.py

Lines changed: 15 additions & 7 deletions

@@ -101,6 +101,8 @@ def robust_is_hugging_face_model(x):
     """
     Wrapper for is_hugging_face_model that also supports strings-- e.g., the string 'all-MiniLM-L6-v2' will be a valid
     hugging-face model when checked with this function, because it's a sentence-transformers model name.
+
+    Parameters
     ----------
     :param x: a to-be-tested model object or a string
 
@@ -114,19 +116,25 @@ def robust_is_hugging_face_model(x):
 
 def get_text_model(x):
     """
-    Given an valid scikit-learn or hugging-face model, or a string (e.g., 'LatentDirichletAllocation' or
-    'TransformerDocumentEmbeddings') matching the name of a valid scikit-learn or hugging-face model, return
-    a callable function or class constructor for the given model.
+    Given a valid scikit-learn or sentence-transformers model, or a string matching the name of a valid model,
+    return a callable function or class constructor for the given model.
 
     Parameters
     ----------
-    :param x: an object to turn into a valid scikit-learn or hugging-face model (e.g., an already-valid model or a
-        string)
+    :param x: an object to turn into a valid scikit-learn or sentence-transformers model. Can be:
+      - An already-valid model instance
+      - A string matching sklearn model names (e.g., 'LatentDirichletAllocation', 'CountVectorizer')
+      - A string matching sentence-transformers model names (e.g., 'all-MiniLM-L6-v2', 'all-mpnet-base-v2')
 
     Returns
     -------
-    :return: A valid scikit-learn or hugging-face model (or None if no model matching the given description can be
-        found)
+    :return: A valid scikit-learn or sentence-transformers model (or None if no model matching the given
+        description can be found)
+
+    Examples
+    --------
+    >>> get_text_model('LatentDirichletAllocation')  # sklearn model
+    >>> get_text_model('all-MiniLM-L6-v2')  # sentence-transformers model
     """
     if is_sklearn_model(x) or is_hugging_face_model(x):
         return x  # already a valid model
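The sklearn name-resolution path that `get_text_model` documents can be sketched with a small registry. This is a simplified analogue, not the package's actual lookup logic (`get_sklearn_model_sketch` is a made-up name; the real function also resolves sentence-transformers model names), followed by the `['CountVectorizer', 'LatentDirichletAllocation']` pipeline mentioned in the wrangle docstring:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer


def get_sklearn_model_sketch(name):
    """Resolve a model-name string against a small registry (illustrative;
    the real get_text_model searches sklearn and sentence-transformers)."""
    registry = {
        "CountVectorizer": CountVectorizer,
        "LatentDirichletAllocation": LatentDirichletAllocation,
    }
    return registry.get(name)  # None when nothing matches, like get_text_model


# Chain the resolved constructors into a tiny text-embedding pipeline
docs = ["the cat sat", "the dog ran", "cats and dogs"]
counts = get_sklearn_model_sketch("CountVectorizer")().fit_transform(docs)
topics = get_sklearn_model_sketch("LatentDirichletAllocation")(
    n_components=2, random_state=0
).fit_transform(counts)
```

Returning `None` for unknown names (rather than raising) matches the documented contract of `get_text_model`.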

docs/index.rst

Lines changed: 1 addition & 0 deletions

@@ -10,6 +10,7 @@ Wrangle your messy data into consistent well-organized formats!
 
    readme
    installation
+   migration_guide
    tutorials
    api
    contributing

docs/installation.rst

Lines changed: 28 additions & 0 deletions

@@ -4,16 +4,44 @@
 Installation
 ============
 
+Requirements
+------------
+
+- **Python 3.9+** (v0.3.0+ requires modern Python versions)
+- NumPy 2.0+ and pandas 2.0+ compatible
+- Optional: HuggingFace transformers for advanced text processing
 
 Stable release
 --------------
 
+**Basic Installation**
+
 To install datawrangler, run this command in your terminal:
 
 .. code-block:: console
 
     $ pip install pydata-wrangler
 
+This installs the core functionality including sklearn-based text processing.
+
+**Full Installation with ML Libraries**
+
+For advanced text processing with sentence-transformers models:
+
+.. code-block:: console
+
+    $ pip install "pydata-wrangler[hf]"
+
+This includes sentence-transformers, transformers, and related HuggingFace libraries.
+
+**Upgrade from Previous Versions**
+
+If upgrading from v0.2.x, ensure you have Python 3.9+:
+
+.. code-block:: console
+
+    $ pip install --upgrade "pydata-wrangler[hf]"
+
 This is the preferred method to install datawrangler, as it will always install the most recent stable release.
 
 If you don't have `pip`_ installed, this `Python installation guide`_ can guide
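Whether the optional `[hf]` extras from the install commands above are present can be checked at runtime without importing the heavy libraries. A stdlib-only sketch (the helper names are hypothetical; the package names checked are the ones the installation guide says each install provides):

```python
import importlib.util


def has_hf_extras():
    """True if sentence-transformers (installed by pydata-wrangler[hf])
    is importable; find_spec checks availability without importing it."""
    return importlib.util.find_spec("sentence_transformers") is not None


def has_core_deps():
    """True if the sklearn-based core text-processing stack is importable."""
    return all(
        importlib.util.find_spec(pkg) is not None
        for pkg in ("pandas", "sklearn")
    )
```

Code that wants sentence-transformers models when available, with an sklearn fallback otherwise, can branch on `has_hf_extras()` instead of wrapping imports in try/except.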
