---
title: Applications of data valuation
---

# Applications of data valuation

Data valuation methods hold promise for improving various aspects
of data engineering and machine learning workflows. When applied judiciously,
these methods can enhance data quality, model performance, and cost-effectiveness.

However, the results can be inconsistent. Values depend strongly on the
training procedure and on the performance metric used. For instance,
accuracy is a poor metric for imbalanced datasets, and this has a stark effect
on data values. Some models exhibit high variance in certain regimes,
and this again has a detrimental effect on values.
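
As a simple illustration of the metric issue: on a heavily imbalanced binary
problem, plain accuracy rewards a classifier that ignores the minority class,
so an accuracy-based utility barely changes when individual points are added
or removed, and their values collapse towards zero. A minimal sketch, where
the dataset and baseline model are illustrative placeholders rather than part
of any particular valuation method:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

rng = np.random.default_rng(0)

# A heavily imbalanced toy dataset: roughly 95% negatives, 5% positives.
y = rng.choice([0, 1], size=1000, p=[0.95, 0.05])
X = rng.normal(size=(1000, 2)) + y[:, None]

# A baseline that always predicts the majority class.
majority = DummyClassifier(strategy="most_frequent").fit(X, y)

# Accuracy looks excellent even though the minority class is ignored, so an
# accuracy-based utility barely reacts to adding or removing minority points.
print("accuracy:         ", accuracy_score(y, majority.predict(X)))
# Balanced accuracy exposes the problem and leads to more informative values.
print("balanced accuracy:", balanced_accuracy_score(y, majority.predict(X)))
```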

While still an evolving field with methods requiring careful use, data valuation can
be applied across a wide range of data engineering tasks. For a comprehensive
overview, along with concrete examples, please refer to the [Transferlab blog
post]({{ transferlab.website }}blog/data-valuation-applications/) on this topic.

## Data Engineering

Judicious use of data valuation techniques has the potential to enhance
data quality, model performance, and the cost-effectiveness of data workflows
in many applications. Some promising applications in data engineering include:

- Removing low-value data points can reduce noise and increase model performance
  (a minimal sketch of this idea is shown after the list). However, care is needed
  to avoid overfitting when iteratively retraining on pruned datasets.
- Pruning redundant samples enables more efficient training of large models.
  Value-based metrics can determine which data to discard for optimal efficiency gains.
- Computing value scores for unlabeled data points supports efficient active learning.
  High-value points can be prioritized for labeling to maximize gains in model performance.
- Analyzing high- and low-value data provides insights to guide targeted data collection
  and improve upstream data processes. Low-value points may reveal data issues to address.
- Data value metrics can also help identify irrelevant or duplicated data
  when evaluating offerings from data providers.
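
The following sketch illustrates the first two points: it drops the
lowest-valued fraction of the training set and retrains. The array `values`
is a placeholder for per-sample scores produced by any valuation method; the
dataset and model are likewise illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Placeholder per-sample values. In practice these would come from a data
# valuation method (e.g. Shapley-based or influence-based scores).
values = np.random.default_rng(0).normal(size=len(X_train))

# Keep everything except the lowest-valued 10% of the training data.
keep = values > np.quantile(values, 0.10)

full = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pruned = LogisticRegression(max_iter=1000).fit(X_train[keep], y_train[keep])

print("full dataset  :", full.score(X_test, y_test))
print("pruned dataset:", pruned.score(X_test, y_test))
```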

## Model development

Data valuation techniques can provide insights for model debugging and interpretation.
Some of the useful applications include:

- Interpretation and debugging: Analyzing the most or least valuable samples
  for a class can reveal cases where the model relies on confounding features
  instead of true signal. Investigating influential points for misclassified examples
  highlights limitations to address.
- Sensitivity/robustness analysis: Prior work shows that removing a small fraction
  of highly influential data can completely flip model conclusions.
  This reveals potential issues with the modeling approach, the data collection process,
  or the intrinsic difficulty of the problem, all of which require further inspection.
  Robust models require many points to be removed before conclusions meaningfully shift,
  whereas high sensitivity means that conclusions depend heavily on small subsets of the data,
  indicating deeper problems to resolve. A minimal check of this kind is sketched after the list.
- Monitoring changes in data values during training provides insights into
  model convergence and overfitting.
- Continual learning: to avoid catastrophic forgetting when training on new data,
  a subset of previously seen data is replayed. Data valuation can help select
  the most influential samples for this purpose.
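
As a sketch of such a sensitivity check: remove the few percent of points with
the highest scores and see whether a conclusion of interest, here the sign and
magnitude of a regression coefficient, survives. The `influence` scores below
are placeholders standing in for the output of an influence-function or
valuation method:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 1))
y = 0.3 * X[:, 0] + rng.normal(scale=2.0, size=n)

# Placeholder scores. In practice these would come from influence functions
# or a leave-one-out style valuation of each training point.
influence = np.abs(y - y.mean())

coef_full = LinearRegression().fit(X, y).coef_[0]

# Drop the 2% most influential points and refit.
keep = influence < np.quantile(influence, 0.98)
coef_pruned = LinearRegression().fit(X[keep], y[keep]).coef_[0]

# If the coefficient changes sign or shifts drastically, the conclusion rests
# on a handful of points and deserves closer scrutiny.
print("coefficient on full data  :", coef_full)
print("coefficient without top 2%:", coef_pruned)
```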

## Attacks

Data valuation techniques have applications in detecting data manipulation and contamination:

- Watermark removal: Points with low value on a correct validation set may be
  part of a watermarking mechanism. Removing them can strip a model of its fingerprints.
- Poisoning attacks: Influential points can be shifted to induce large changes
  in model estimators. However, the feasibility of such attacks is limited,
  and their value for adversarial training is unclear.

Overall, while data valuation techniques show promise for identifying anomalous
or manipulated data, more research is needed to develop robust methods suited
for security applications.

## Data markets

Additionally, one of the motivating applications for the whole field is that of
data markets: a marketplace in which data owners can sell their data to interested
parties. In this setting, data valuation can be a key component in determining the
price of data. Market pricing depends on the value added for buyers
(e.g. improved model performance) and on the costs and privacy concerns of sellers.

Game-theoretic valuation methods like Shapley values can help assign fair prices,
but have limitations around handling duplicates or adversarial data.
Model-free methods like LAVA [@just_lava_2023] and CRAIG are particularly well
suited for this setting; LAVA, for instance, uses the Wasserstein distance between
a vendor's data and the buyer's to determine the value of the former.
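
As a rough, feature-only illustration of the idea (not the LAVA method itself),
the distributional distance between a vendor's data and the buyer's can be
approximated with a sliced Wasserstein distance, which averages one-dimensional
transport costs over random projections. The datasets below are synthetic
placeholders:

```python
import numpy as np
from scipy.stats import wasserstein_distance


def sliced_wasserstein(a: np.ndarray, b: np.ndarray,
                       n_projections: int = 100, seed: int = 0) -> float:
    """Approximate the distance between two point clouds by averaging
    1-d Wasserstein distances over random projection directions."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_projections):
        direction = rng.normal(size=a.shape[1])
        direction /= np.linalg.norm(direction)
        total += wasserstein_distance(a @ direction, b @ direction)
    return total / n_projections


rng = np.random.default_rng(1)
buyer = rng.normal(size=(500, 10))                 # the buyer's reference data
vendor_close = rng.normal(size=(500, 10))          # similar distribution
vendor_far = rng.normal(loc=2.0, size=(500, 10))   # shifted distribution

print("similar vendor:", sliced_wasserstein(buyer, vendor_close))  # small
print("shifted vendor:", sliced_wasserstein(buyer, vendor_far))    # large
```

Under such a heuristic, a smaller distance would translate into a higher value
for the vendor's data.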

However, this is a complex problem which faces banal practical obstacles, such as
the fact that data owners may not wish to disclose their data for valuation.