Version: 1.0.0 (Preview) | Date: November 2025 | License: CC BY-NC 4.0 International | Author: Hongyi Shui | Target Release: Q3-Q4 2026
Problem: Existing ad fraud datasets are 8-13 years old with only click timestamps—no behavioral signals.
Solution: 3M+ sessions with behavioral biometrics (mouse dynamics, scroll patterns, touch events) at 100ms resolution.
What's Novel:
- First dataset combining behavioral features + fraud labels + adversarial samples at scale
- Physics-based ground truth (superhuman velocity = bot) with confidence scores
- Multi-tier labels enabling supervised, semi-supervised, and self-supervised research
Research Impact: Enables temporal models (LSTMs, Transformers), adversarial robustness testing, and industry-specific fraud analysis.
Interested? Email contact@adtruth.io | Target Release: Q3-Q4 2026
The ad fraud research community lacks datasets with behavioral biometrics. Existing public datasets (TalkingData 2017, FDMA 2012, Avazu 2014) contain only click logs and timestamps—no mouse movements, scroll patterns, or interaction dynamics. Meanwhile, behavioral biometrics datasets focus on user authentication, not ad fraud.
To our knowledge, AdTruth is the first large-scale behavioral ad fraud dataset, providing:
- Rich behavioral signals at 100ms temporal resolution (mouse velocity, scroll depth, touch dynamics)
- 200+ derived behavioral features per session
- Physics-based ground truth with confidence scores (not heuristic labels)
- CAPTCHA-verified labels via Cloudflare Turnstile
- Session-level sequences enabling temporal modeling
- Adversarial bot samples for robustness research
- Modern 2025-2026 production data (vs. 8-13-year-old alternatives)
AdTruth is a large-scale ad fraud detection dataset with 3M+ web sessions from SMB websites, featuring 200+ behavioral features at 100ms resolution. Unlike existing datasets (TalkingData, FDMA, Avazu) that contain only click timestamps, AdTruth captures mouse dynamics, scroll patterns, click timing, and touch events.
Key features:
- Multi-tier labels: Hard (physics-based + CAPTCHA), Soft (ensemble scores), Unlabeled
- 300K+ hard-labeled sessions with 90-100% confidence
- 50K+ adversarial bot samples across 6 evasion techniques
- 20 SMB industry categories over 18+ months
- Comparison with Existing Datasets
- Dataset Description
- Data Collection Methodology
- Ground Truth Labeling
- Data Dictionary
- Exploratory Statistics
- Known Limitations and Biases
- Ethical Considerations
- Citation and Contact
| Feature | TalkingData (2017) | FDMA 2012 | Avazu (2014) | AdTruth (2026) |
|---|---|---|---|---|
| Scale | 184M clicks | 8M clicks | 40M clicks | 3M+ sessions |
| Data Age | 8+ years | 13+ years | 11+ years | Current (2026) |
| Behavioral signals | No | No | No | Yes |
| Ad fraud context | Yes | Yes | Partial | Yes |
| Temporal resolution | 1 second | N/A | 1 hour | 100ms |
| Session sequences | No | No | No | Yes |
| Feature dimensions | 7 | 12 | 24 | 200+ |
| Adversarial samples | No | No | No | Yes |
| Multi-tier labels | No | No | No | Yes |
| Confidence scores | No | No | N/A | Yes |
| Industry metadata | No | No | No | Yes |
| CAPTCHA verification | No | No | No | Yes |
| Semi-supervised ready | No | No | No | Yes |
TalkingData (2017): The most-cited ad fraud dataset contains 184 million mobile ad click records but provides only 7 features: IP, app, device, OS, channel, click time, and attribution. No behavioral signals. Binary labels with unknown methodology. Data is now 8+ years old, and modern bots have evolved significantly since then.
FDMA 2012 BuzzCity: Publisher-level fraud detection with ~8 million clicks and 12 features. Heavily anonymized, missing values, mobile-only. 13+ years old. Labels derived from unknown heuristics.
Avazu (2014): Click-through rate prediction dataset. Not designed for fraud detection—contains no fraud labels.
While not ad fraud datasets, behavioral biometrics research provides relevant methodology:
- Balabit Mouse Dynamics (2016): Rich mouse movement data for user authentication. 10 users in controlled lab environment. Demonstrates behavioral signals can distinguish individuals.
- BeCAPTCHA-Mouse: Mouse dynamics for bot detection in CAPTCHA contexts. Smaller scale but validates behavioral approach.
- Can behavioral biometrics detect sophisticated bot farms? — First dataset with 200+ behavioral features AND fraud labels at scale
- How do adversarial bots adapt to detection? — Dataset includes 50K+ labeled evasion samples
- Do temporal models (LSTMs, Transformers) outperform static classifiers? — Full session sequences available
- How do fraud patterns differ across industries? — 20 SMB industry categories
- Can graph neural networks detect coordinated fraud? — Network/IP clustering features
- How effective is semi-supervised learning for fraud detection? — Multi-tier label structure with unlabeled samples
| Attribute | Value |
|---|---|
| Total Sessions | 3,000,000+ |
| Hard-Labeled Sessions | 300,000+ |
| Soft-Labeled Sessions | 1,800,000+ |
| Unlabeled Sessions | 900,000+ |
| Behavioral Features | 200+ derived dimensions |
| Temporal Resolution | 100ms sampling (10/second) |
| Time Period | 18+ months longitudinal |
| Unique Websites | 250+ SMB websites |
| Industry Categories | 20 |
| Geographic Coverage | 50+ countries |
| Adversarial Samples | 50,000+ labeled evasion attempts |
- Production traffic: Real visitor sessions from live SMB websites
- Adversarial traffic: Controlled bot traffic with documented evasion techniques
- Traffic mix: Paid search, paid social, organic, direct, referral
- Scale + Depth: 3M sessions with 200+ features each—no existing dataset combines both
- Temporal precision: 100ms resolution enables micro-pattern analysis
- Session sequences: Full user journeys, not isolated events
- Adversarial samples: 50K+ labeled evasion attempts for robustness research
- Multi-tier labels: Hard + soft labels enable supervised and semi-supervised research
- CAPTCHA verification: Cloudflare Turnstile provides additional ground truth
- Longitudinal coverage: 18+ months captures seasonal fraud patterns
The AdTruth SDK captures behavioral signals at 100ms resolution (10 samples per second):
| Signal Category | Features | Description |
|---|---|---|
| Velocity | avg, max, min, std, percentiles | Movement speed (px/s) |
| Acceleration | avg, max, jerk | Rate of velocity change |
| Direction | angle_changes, curvature | Path characteristics |
| Pauses | pause_count, pause_duration | Movement interruptions |
| Trajectory | path_efficiency, straightness | Movement quality |
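As a rough illustration of how velocity and trajectory features like those above could be derived from positions sampled every 100ms, here is a minimal sketch. The raw sample format (a list of `(x, y)` pixel positions, one per tick) and the function name `mouse_features` are assumptions made for illustration; the SDK's actual computation is not specified in this document.

```python
import math
import statistics

SAMPLE_INTERVAL_S = 0.1  # 100ms sampling, as stated above


def mouse_features(points):
    """Derive simple velocity and trajectory features from (x, y) samples.

    `points` is a hypothetical raw format: one (x, y) pixel position per
    100ms tick. Speeds are returned in px/s.
    """
    if len(points) < 2:
        raise ValueError("need at least two samples")

    # Distance covered between consecutive samples.
    steps = [math.dist(a, b) for a, b in zip(points, points[1:])]
    speeds = [d / SAMPLE_INTERVAL_S for d in steps]

    path_length = sum(steps)
    straight_line = math.dist(points[0], points[-1])

    return {
        "velocity_avg": statistics.fmean(speeds),
        "velocity_max": max(speeds),
        "velocity_std": statistics.pstdev(speeds),
        # 1.0 means a perfectly straight path; lower means more wandering.
        "path_efficiency": straight_line / path_length if path_length else 0.0,
    }


features = mouse_features([(0, 0), (30, 40), (60, 80), (60, 130)])
```

In this toy input every step covers 50 px in 100ms, so the speed is constant at about 500 px/s and the standard deviation is close to zero, which is the kind of uniformity a scripted cursor might produce.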
| Signal Category | Features | Description |
|---|---|---|
| Depth | max, avg, progression | Scroll extent |
| Velocity | avg, max, smoothness | Scroll speed |
| Patterns | reversals, pauses, acceleration | Scroll behavior |
| Timing | time_to_scroll, scroll_duration | Temporal patterns |
| Signal Category | Features | Description |
|---|---|---|
| Timing | time_to_first, intervals, rhythm | Click temporal patterns |
| Precision | target_accuracy, miss_rate | Click accuracy |
| Sequences | double_clicks, rage_clicks | Click patterns |
| Signal Category | Features | Description |
|---|---|---|
| Taps | pressure, duration, intervals | Tap characteristics |
| Swipes | velocity, direction, length | Swipe patterns |
| Gestures | pinch, rotate, multi-touch | Complex interactions |
| Signal Category | Features | Description |
|---|---|---|
| Temporal | session_duration, active_time, idle_time | Time patterns |
| Navigation | page_sequence, back_buttons, tab_switches | Journey patterns |
| Engagement | interaction_density, attention_score | Behavior quality |
| Feature | Description |
|---|---|
| `ip_reputation_score` | Fraud risk score (0.0-1.0) |
| `is_datacenter` | Hosting/datacenter IP flag |
| `is_proxy` | Proxy detection |
| `is_vpn` | VPN detection |
| `ip_cluster_id` | Network clustering for graph analysis |
| `ip_velocity` | Unique IPs per device fingerprint |
To enable robustness research, we include labeled adversarial bot traffic:
| Bot Type | Evasion Technique | Sample Count |
|---|---|---|
| `naive_headless` | Basic Selenium/Puppeteer | 10,000+ |
| `human_timing` | Injected realistic delays | 10,000+ |
| `mouse_replay` | Recorded human mouse paths | 8,000+ |
| `ml_generated` | GAN-generated behavioral patterns | 8,000+ |
| `residential_proxy` | Rotating residential IPs | 7,000+ |
| `fingerprint_spoof` | Canvas/WebGL manipulation | 7,000+ |
Each adversarial sample is labeled with:
- `is_adversarial`: true
- `evasion_technique`: taxonomy category
- `evasion_sophistication`: 1-5 scale
- `detection_difficulty`: estimated based on signal analysis
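One way these labels might be used for robustness research is a leave-one-technique-out split, where a detector is trained without ever seeing one evasion technique and is then evaluated on it. The sketch below assumes each sample is a dict and that `evasion_technique` holds the bot type names from the table above; both are assumptions, and the helper name is hypothetical.

```python
def leave_one_technique_out(samples, held_out):
    """Split adversarial samples so one evasion technique is unseen in training."""
    train_set = [s for s in samples if s["evasion_technique"] != held_out]
    eval_set = [s for s in samples if s["evasion_technique"] == held_out]
    if not eval_set:
        raise ValueError(f"no samples with technique {held_out!r}")
    return train_set, eval_set


# Toy records; real samples would also carry the behavioral features.
samples = [
    {"is_adversarial": True, "evasion_technique": "naive_headless"},
    {"is_adversarial": True, "evasion_technique": "mouse_replay"},
    {"is_adversarial": True, "evasion_technique": "ml_generated"},
]
train_set, eval_set = leave_one_technique_out(samples, "mouse_replay")
```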
AdTruth provides three tiers of labels to support diverse research approaches:
| Tier | Confidence | Method | Coverage | Use Case |
|---|---|---|---|---|
| Hard Labels | 90-100% | Physics impossibilities + CAPTCHA | 300K+ sessions | Supervised learning |
| Soft Labels | 50-90% | Ensemble scoring | 1.8M+ sessions | Semi-supervised, weak supervision |
| Unlabeled | N/A | N/A | 900K+ sessions | Self-supervised, unsupervised |
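A minimal sketch of how a consumer of the dataset might route sessions into these three tiers. It uses the `hard_label` and `soft_label_probability` field names from the data dictionary, but the convention that a missing or `None` value means "no label at this tier" is an assumption.

```python
from collections import Counter


def label_tier(session):
    """Return "hard", "soft", or "unlabeled" for one session record.

    Assumed convention: a missing or None `hard_label` means no hard
    label, and a missing `soft_label_probability` means no soft label.
    """
    if session.get("hard_label") is not None:
        return "hard"
    if session.get("soft_label_probability") is not None:
        return "soft"
    return "unlabeled"


# Toy records, one per tier.
sessions = [
    {"hard_label": "bot", "hard_label_confidence": 1.00},
    {"hard_label": None, "soft_label_probability": 0.73},
    {"hard_label": None},
]
tier_counts = Counter(label_tier(s) for s in sessions)
```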
Hard labels are assigned when behavioral impossibilities are detected:
| Impossibility Type | Detection Criteria | Confidence |
|---|---|---|
| `superhuman_velocity` | Mouse velocity > 5000 px/s | 0.95 |
| `instant_reaction` | First click < 200ms | 0.95 |
| `impossible_scroll` | 90% depth in < 100ms | 0.95 |
| `zero_interaction_ghost` | 10s+ with zero input | 0.90 |
| `rapid_fire_clicks` | > 100 clicks/second | 1.00 |
| `fast_exit_no_input` | < 3s with zero interaction | 1.00 |
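The rules in this table can be sketched as a small rule engine. The thresholds and confidence values are copied from the table; the flat input field names (such as `max_clicks_per_second` and `ms_to_90pct_scroll`) and the choice to return the most confident matching rule are assumptions, since the document does not say how overlapping rules are resolved.

```python
def physics_label(s):
    """Return (rule_name, confidence) for the strongest matching rule, or None.

    `s` is a hypothetical flat summary of one session.
    """
    no_input = s["input_events"] == 0
    first_click = s["time_to_first_click_ms"]
    scroll_90 = s["ms_to_90pct_scroll"]
    checks = [
        ("rapid_fire_clicks", 1.00, s["max_clicks_per_second"] > 100),
        ("fast_exit_no_input", 1.00, s["duration_ms"] < 3_000 and no_input),
        ("superhuman_velocity", 0.95, s["max_mouse_velocity_px_s"] > 5000),
        ("instant_reaction", 0.95, first_click is not None and first_click < 200),
        ("impossible_scroll", 0.95, scroll_90 is not None and scroll_90 < 100),
        ("zero_interaction_ghost", 0.90, s["duration_ms"] >= 10_000 and no_input),
    ]
    matches = [(name, conf) for name, conf, hit in checks if hit]
    # max() keeps the first of equally confident matches, so list order breaks ties.
    return max(matches, key=lambda m: m[1]) if matches else None


fast_bot = {
    "max_clicks_per_second": 12, "duration_ms": 4200, "input_events": 136,
    "max_mouse_velocity_px_s": 8500, "time_to_first_click_ms": 145,
    "ms_to_90pct_scroll": None,
}
ghost = {
    "max_clicks_per_second": 0, "duration_ms": 15_000, "input_events": 0,
    "max_mouse_velocity_px_s": 0, "time_to_first_click_ms": None,
    "ms_to_90pct_scroll": None,
}
fast_bot_label = physics_label(fast_bot)
ghost_label = physics_label(ghost)
```

Sessions that match no rule return `None` and would fall through to CAPTCHA verification or soft labeling.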
In addition to physics-based detection, we use Cloudflare Turnstile challenges on a subset of sessions:
- Challenge rate: Approximately 5% of sessions receive CAPTCHA challenges
- Pass = Human label: Sessions passing Turnstile receive a `human` hard label (0.95 confidence)
- Fail = Bot label: Sessions failing Turnstile receive a `bot` hard label (0.98 confidence)
- Skip/Exit: Sessions where users skip or exit before completing the challenge are excluded from hard labels to avoid false positives
This dual approach (physics impossibilities + CAPTCHA verification) provides complementary ground truth signals.
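The CAPTCHA labeling rules above amount to a small lookup, sketched below. The outcome strings (`"pass"`, `"fail"`, `"skip"`) and the function name are assumed for illustration; they are not Turnstile API values.

```python
from typing import Optional, Tuple

# Label and confidence per challenge outcome, as listed above.
# The outcome strings are assumed names, not Turnstile API values.
TURNSTILE_LABELS = {
    "pass": ("human", 0.95),
    "fail": ("bot", 0.98),
}


def captcha_hard_label(outcome: str) -> Optional[Tuple[str, float]]:
    """Return (label, confidence), or None when no hard label applies."""
    # Skipped or abandoned challenges are not in the mapping and yield None.
    return TURNSTILE_LABELS.get(outcome)


passed = captcha_hard_label("pass")
skipped = captcha_hard_label("skip")
```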
For sessions without hard labels, soft labels are derived from multiple signals:
{
"fraud_probability": 0.73,
"fraud_signals": {
"ip_reputation_score": 0.85,
"behavioral_anomaly_score": 0.68,
"device_fingerprint_score": 0.45,
"temporal_anomaly_score": 0.72
},
"label_confidence": 0.65
}

Each session can have multiple fraud signals:
fraud_taxonomy:
├── bot_automated
│ ├── headless_browser
│ ├── selenium_webdriver
│ └── api_scripted
├── bot_datacenter
│ ├── cloud_provider
│ └── hosting_service
├── bot_proxy
│ ├── residential_proxy
│ ├── vpn
│ └── tor
├── behavioral_anomaly
│ ├── mouse_anomaly
│ ├── scroll_anomaly
│ └── timing_anomaly
└── coordinated_fraud
├── click_farm
└── ip_rotation
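Because a session can carry several taxonomy tags at once, models typically need a multi-label encoding. The sketch below mirrors the tree above as a dictionary and encodes top-level tags as a multi-hot vector; the encoding scheme and helper name are illustrative assumptions, not part of the dataset format.

```python
# Top-level categories and their children, copied from the tree above.
FRAUD_TAXONOMY = {
    "bot_automated": ["headless_browser", "selenium_webdriver", "api_scripted"],
    "bot_datacenter": ["cloud_provider", "hosting_service"],
    "bot_proxy": ["residential_proxy", "vpn", "tor"],
    "behavioral_anomaly": ["mouse_anomaly", "scroll_anomaly", "timing_anomaly"],
    "coordinated_fraud": ["click_farm", "ip_rotation"],
}
TOP_LEVEL = list(FRAUD_TAXONOMY)  # dicts preserve insertion order


def encode_tags(tags):
    """Encode top-level taxonomy tags as a multi-hot list aligned with TOP_LEVEL."""
    unknown = set(tags) - set(TOP_LEVEL)
    if unknown:
        raise ValueError(f"unknown taxonomy tags: {sorted(unknown)}")
    return [1 if category in tags else 0 for category in TOP_LEVEL]


vector = encode_tags(["bot_automated", "behavioral_anomaly"])
```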
For researchers preferring fully-labeled data, we provide a curated subset:
- 300,000+ sessions with hard labels only
- Balanced distribution: ~40% bot, ~40% human, ~20% suspicious
- All sessions include full 200+ behavioral features
| Field | Type | Description |
|---|---|---|
| `session_id` | UUID | Unique session identifier |
| `visitor_id` | UUID | Cross-session visitor tracking |
| `website_id` | UUID | Website identifier (anonymized) |
| `timestamp` | TIMESTAMPTZ | Session start time (UTC) |
Example 1: Verified Human Session
{
"session_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"hard_label": "human",
"hard_label_confidence": 0.95,
"label_method": "captcha_pass",
"mouse": {
"velocity": {"avg": 450, "max": 1200, "std": 180},
"acceleration": {"avg": 85, "max": 340},
"direction": {"angle_changes": 47, "curvature": 0.32},
"pauses": {"count": 12, "avg_duration": 450},
"trajectory": {"efficiency": 0.78, "straightness": 0.65},
"sample_count": 284
},
"scroll": {
"depth": {"max": 0.85, "avg": 0.42},
"velocity": {"avg": 120, "max": 890},
"patterns": {"reversals": 3, "smooth_ratio": 0.89}
},
"clicks": {
"count": 4,
"intervals": [2340, 1890, 3200],
"time_to_first": 3400,
"precision_score": 0.92
},
"session": {
"duration": 47000,
"active_time": 38000,
"idle_time": 9000,
"pages_viewed": 3,
"interaction_density": 0.81
},
"ip_reputation_score": 0.12,
"ip_type": "residential"
}

Example 2: Confirmed Bot Session (Physics Impossibility)
{
"session_id": "b2c3d4e5-f6a7-8901-bcde-f23456789012",
"hard_label": "bot",
"hard_label_confidence": 1.00,
"label_method": "rapid_fire_clicks",
"fraud_taxonomy": ["bot_automated", "behavioral_anomaly"],
"mouse": {
"velocity": {"avg": 2850, "max": 8500, "std": 45},
"acceleration": {"avg": 1450, "max": 4100},
"direction": {"angle_changes": 3, "curvature": 0.98},
"pauses": {"count": 0, "avg_duration": 0},
"trajectory": {"efficiency": 0.99, "straightness": 0.99},
"sample_count": 89
},
"scroll": {
"depth": {"max": 0.95, "avg": 0.95},
"velocity": {"avg": 8500, "max": 12000},
"patterns": {"reversals": 0, "smooth_ratio": 0.02}
},
"clicks": {
"count": 47,
"intervals": [85, 82, 88, 84, 86],
"time_to_first": 145,
"precision_score": 1.00
},
"session": {
"duration": 4200,
"active_time": 4200,
"idle_time": 0,
"pages_viewed": 1,
"interaction_density": 0.99
},
"ip_reputation_score": 0.89,
"ip_type": "datacenter"
}

Key differences: The bot session shows superhuman mouse velocity (8500 px/s max), highly regular click intervals (~85ms), zero pauses, an instant scroll to 95% depth, and a datacenter IP. This example represents extreme automated behavior. In practice, bots exhibit varying sophistication levels (e.g., replay bots show realistic movement but unnatural timing).
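The difference in click regularity between the two examples can be quantified with the coefficient of variation (standard deviation divided by mean) of the click intervals. This is one possible regularity measure, chosen here for illustration; it is not claimed to be the dataset's own `rhythm` feature.

```python
import statistics


def interval_cv(intervals_ms):
    """Coefficient of variation (population std / mean) of click intervals."""
    mean = statistics.fmean(intervals_ms)
    return statistics.pstdev(intervals_ms) / mean


# Click intervals copied from the two example sessions above.
human_cv = interval_cv([2340, 1890, 3200])
bot_cv = interval_cv([85, 82, 88, 84, 86])
```

For these values the human session comes out at roughly 0.22 and the bot session at roughly 0.02, about an order of magnitude more regular.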
| Field | Type | Description |
|---|---|---|
| `hard_label` | ENUM | bot, human, suspicious, null |
| `hard_label_confidence` | FLOAT | 0.90-1.00 for hard labels |
| `soft_label_probability` | FLOAT | Ensemble fraud probability 0.0-1.0 |
| `fraud_signals` | JSON | Individual signal scores |
| `fraud_taxonomy` | ARRAY | Multi-label taxonomy tags |
| `is_adversarial` | BOOLEAN | Adversarial bot sample flag |
| `evasion_technique` | STRING | Evasion method if adversarial |
| Field | Type | Description |
|---|---|---|
| `traffic_category` | ENUM | Paid Search, Paid Social, Organic, Direct, Referral |
| `traffic_source` | STRING | google, facebook, bing, etc. |
| `campaign_id` | UUID | Campaign identifier (anonymized) |
| `ad_platform` | STRING | Advertising platform |
| Field | Type | Description |
|---|---|---|
| `ip_reputation_score` | FLOAT | 0.0-1.0 fraud risk |
| `ip_type` | ENUM | residential, datacenter, proxy, vpn |
| `ip_cluster_id` | INT | Network clustering group |
| `geo_country` | STRING | Country code |
| `geo_region` | STRING | Region/state |
| Metric | Value |
|---|---|
| Total sessions | 3,000,000+ |
| Hard-labeled sessions | 300,000+ |
| Soft-labeled sessions | 1,800,000+ |
| Behavioral feature dimensions | 200+ |
| Temporal resolution | 100ms |
| Adversarial samples | 50,000+ |
| Unique websites | 250+ |
| Industries | 20 categories |
| Time span | 18+ months |
| Geographic coverage | 50+ countries |
| Label | Hard-Labeled Subset (300K) | Full Dataset (Soft) |
|---|---|---|
| Bot (high confidence) | 40% (~120,000) | 12% |
| Human (verified) | 40% (~120,000) | 48% |
| Suspicious | 20% (~60,000) | 10% |
| Unlabeled | 0% | 30% |
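To make the full-dataset percentages concrete, the sketch below converts them to approximate session counts, using 3,000,000 as a lower bound for the total, and derives inverse-frequency class weights for the labeled classes. The weighting formula is a common heuristic offered as an assumption, not a recommendation from the dataset authors.

```python
TOTAL_SESSIONS = 3_000_000  # lower bound; the dataset reports "3,000,000+"
FULL_DATASET_PCT = {"bot": 12, "human": 48, "suspicious": 10, "unlabeled": 30}

# Approximate session counts per label in the full dataset.
counts = {
    label: TOTAL_SESSIONS * pct // 100
    for label, pct in FULL_DATASET_PCT.items()
}
labeled_total = TOTAL_SESSIONS - counts["unlabeled"]

# Inverse-frequency weights: n_labeled / (n_classes * class_count).
labeled_classes = ["bot", "human", "suspicious"]
class_weights = {
    c: labeled_total / (len(labeled_classes) * counts[c])
    for c in labeled_classes
}
```

The labeled total of 2,100,000 matches the sum of the 300,000+ hard-labeled and 1,800,000+ soft-labeled sessions reported elsewhere in this document.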
- Convenience sampling: SMB websites using AdTruth platform
- Geographic skew: US-concentrated (55%), with growing international coverage
- Industry skew: Over-representation of legal, e-commerce, healthcare
- Hard labels only for extreme cases: Borderline fraud may be unlabeled
- Adversarial samples are controlled: May not represent all wild evasion techniques
- Soft label accuracy: Ensemble scores are estimates, not ground truth
- JavaScript dependency: Sessions without JS execution are excluded
- Mobile behavior differences: Touch patterns have different distributions than desktop
- Privacy tool interference: Ad blockers may affect fingerprinting signals
| Signal | May Indicate Fraud | Alternative Explanation |
|---|---|---|
| Fast exit | Bot | Wrong page / slow load |
| Zero interaction | Bot | Reading without scrolling |
| Datacenter IP | Bot | Corporate VPN |
| High velocity | Bot | Gaming mouse / high DPI |
Anonymization applied:
- IP addresses: Hashed, with reputation scores preserved
- Website IDs: Random identifiers
- Timestamps: Generalized to hour
- URLs: Path patterns only, domains removed
- No PII collected or stored
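A minimal sketch of what two of these anonymization steps could look like. The document does not specify the hashing scheme, so the keyed HMAC-SHA-256 used here is an assumption, as are the function names and the example key.

```python
import hashlib
import hmac
from datetime import datetime, timezone


def hash_ip(ip: str, secret_key: bytes) -> str:
    """Keyed SHA-256 of an IP address (assumed scheme, not the documented one).

    A keyed hash means the small IPv4 space cannot simply be enumerated
    to reverse the mapping without knowing the key.
    """
    return hmac.new(secret_key, ip.encode("utf-8"), hashlib.sha256).hexdigest()


def generalize_to_hour(ts: datetime) -> datetime:
    """Convert a timezone-aware timestamp to UTC and truncate it to the hour."""
    return ts.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)


# 203.0.113.7 is from a range reserved for documentation; the key is a placeholder.
ip_token = hash_ip("203.0.113.7", b"example-key")
hour = generalize_to_hour(datetime(2026, 1, 15, 14, 37, 52, tzinfo=timezone.utc))
```

The same IP always maps to the same token under a given key, which keeps features such as IP clustering usable after anonymization.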
Informed consent:
- All websites using AdTruth SDK display privacy notices informing visitors of data collection
- Visitors can opt out via browser Do Not Track (DNT) signals
- Data collection complies with GDPR Article 6(1)(f) (legitimate interest for fraud prevention)
- Adversarial bots run only on consenting test websites
- No real advertiser budgets affected
- Techniques documented to improve industry defenses
Acknowledged risks:
- Dataset could train adversarial bots
- Thresholds could inform evasion techniques
Mitigations:
- Non-commercial license
- Detection thresholds updated independently
- Academic use agreement required
This dataset was collected as part of AdTruth's production fraud detection platform. While traditional IRB approval was not required for commercial service operations, we conducted an internal ethics review following ACM guidelines for data release. No personally identifiable information (PII) was collected, and all data was anonymized prior to dataset compilation.
Citation information will be available upon dataset release (Target: Q3-Q4 2026)
TBA - Zenodo DOI will be assigned upon publication
Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
| Milestone | Target Date |
|---|---|
| Documentation preview | November 2025 |
| Researcher feedback period | December 2025 - March 2026 |
| Beta access (selected partners) | April - June 2026 |
| Public release | Q3-Q4 2026 |
| Version | Target Date | Contents |
|---|---|---|
| v1.0.0 | Q3-Q4 2026 | Initial release (18 months of data) |
| v1.1.0 | Q1 2027 | +6 months longitudinal data |
| v1.2.0 | Q3 2027 | Additional adversarial samples |
- Author: Hongyi Shui
- Email: contact@adtruth.io
- Website: https://adtruth.io
- GitHub: https://github.com/papa-torb/adtruth
Researchers interested in early access for academic publications can request a representative sample for validation studies. We are particularly interested in collaborations exploring:
- Temporal deep learning models (LSTMs, Transformers) for sequence-based detection
- Graph neural networks for coordinated fraud detection
- Semi-supervised and weak supervision approaches
- Cross-industry generalization and transfer learning
What we offer: Pre-release sample dataset + co-authorship opportunities
What we need: Research expertise + publication commitment
To apply: Email contact@adtruth.io with subject "AdTruth Research Access" including your research focus and relevant publications.
This is a preview document for the AdTruth dataset. Target release: Q3-Q4 2026. For early access inquiries or research collaboration, please contact the author.