
footer:                         [.footer-style: #2F2F2F, alignment(right), line-height(1), text-scale(3.5), z-index(10000)]

[.hide-footer]


[.hide-footer]

fit

^ ...


[.hide-footer]

whoami

  • Ruben Berenguel (@berenguel)
  • PhD in Mathematics
  • Lead Data Engineer at Hybrid Theory
  • Preferred stack is Python, Go and Scala

fit

^ I have divided this presentation in 3 sections: first I will talk about adtech, cookies and the identity problem. Then I will explain how we can solve the problem using an identity graph, and finally how we can process this graph fast with Apache Spark


[.hide-footer]

Part 1: Set up
Adtech
What are cookies, really?
What is cookie mapping?
The identity problem

^ I will start by setting up the problem: how programmatic advertising uses cookies to create targeted ads, and the user-identity problem that derives from it


[.hide-footer]

[.build-lists: true]

left

Programmatic adtech

Find users satisfying some criteria

  • Visited pages of category ABC
  • Are interested in concept XYZ
  • Are likely to want to buy from our client RST

^ Note that the third bullet needs some kind of machine learning, or very smart humans. The end goal is that if you are going to see an ad, it had better be a relevant one


[.hide-footer]

right

To find them we need

Their browse and/or behaviour data 🖥 🛒

 

 


[.hide-footer]

right

To find them we need

Their browse and/or behaviour data 🖥 🛒

To deliver for our clients we need

A way to show them ads 📣

^ It all boils down to identifying specific users online. But what identifies a user online, so we can show an ad only to someone who will be interested in it? That's cookies 🍪🥠


left

[.hide-footer]

Cookies

Are used to help websites  

track events

and state

as users browse

^ Note that most cookies are going away in the coming years due to privacy concerns and regulations.


[.hide-footer] right

There are two kinds of cookies

First party (session, state…)

Third party (event tracking…)

^ Session/state would be your basket, whether you are logged in, etc. Event tracking ranges from advertising (or related) stuff to analytics (like Google)


[.footer: right]

fit

^ A user is browsing online


[.footer: right]

fit

^ A first party webserver is serving a webpage


[.footer: left]

fit

^ A third party webserver is serving a pixel on the page


[.footer: left]

fit

^ This server sets a cookie on the user/browser combination


[.footer: left]

fit

^ Stamp!


[.footer: left]

fit

^ When the user goes to the login area…


[.footer: left]

fit

^ The first party server keeps track of that state by setting a cookie on the user/browser


[.footer: left]

fit

^ Stamp!


[.footer: left]

fit

^ Cookies are associated to the domain that set them, and they are not accessible from others. So, the first party server knows nothing about the third party cookie (and conversely)


[.footer: left]

fit


[.hide-footer]

We get browse data from users on the web from data providers[^1]

 

^ So, in the end we get large amounts of batch data of what users do


[.hide-footer]

We get browse data from users on the web from data providers

We get browse data from users browsing our clients' websites[^2]

^ And data from what users do on our clients websites, which we handle with our third party 🍪


[.hide-footer]

left

How do we connect both data sources?

The identifiers we get from both sides are unrelated!

😱

^ Those wall sockets look scared


[.hide-footer]

right

Mapping servers

and

the mapping chain

^ For advertising to be effective, we need to connect these two data sources, what happens on our clients' websites and what happens around the world


[.footer: right]

fit

^ In cookie mapping, a user is browsing


[.footer: right]

fit

^ A pixel fires, a cookie is set


[.footer: right]

fit


[.footer: right]

fit


[.footer: right]

fit

^ The destination server (mapping server) redirects to another server


[.footer: right]

fit

^ that sets a cookie


[.footer: right]

fit


[.footer: right]

fit

^ and calls back to the initial server, reporting back the identifier that has been set in the cookie


[.footer: left]

fit

^ This can repeat any number of times (although fewer is better)


[.footer: left]

fit


[.footer: left]

fit


[.footer: left]

fit


[.footer: left]

fit

^ This is what the chain looks like then: a chain of identifiers (or cookies) that are tied to a user. This is as seen from the server initiating the redirections


left

[.hide-footer]

The identity problem

^ 🥸 Greetings, good man. Might I trouble you for a drink? Homer? Who is Homer?


fit

^ These are a few chains. A virtual id for each chain is added when the redirections start, to keep track of the callbacks. The identity problem appears when you try to keep the chains up to date as days pass: some cookies degrade fast, and a user may have several identifiers for each partner. Handling this ad hoc results in mapping issues


[.hide-footer]

[.build-lists: true]

Basic solution

  • Coalesce (merge on nulls) chains based on one id
  • Is not as complete as the graph approach because…
  • Requires one stable identifier

^ (Or a stable enough identifier). This solution can be applied to batches of chains without requiring any lookback. The coalescing either kills other identifiers (and requires a stable identifier) or results in an overwrite of identifiers
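As a rough illustration (plain Python with hypothetical field names, not our production code), coalescing chains on a stable identifier looks like this:

```python
def coalesce_chains(chains, stable_key):
    """Merge chain records sharing the same stable identifier,
    filling nulls (None) from whichever record has a value."""
    merged = {}
    for chain in chains:
        key = chain.get(stable_key)
        if key is None:
            continue  # without the stable id we cannot merge at all
        row = merged.setdefault(key, {})
        for partner, ident in chain.items():
            if row.get(partner) is None:
                row[partner] = ident  # coalesce: keep the first non-null
            # conflicting non-null ids are simply kept or overwritten:
            # this is exactly the weakness the graph approach fixes
    return merged
```

Note how a chain without the stable identifier is dropped on the floor: this is why the approach requires one stable id.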


[.hide-footer]

fit

^ What do we do in this situation? Either id 77 goes with circle or with gamma. Unless…


[.hide-footer]

Part 2: The identity graph
Rethink the problem as a graph
Connected components in big data

[.footer: right]

fit

^ Recall the table with chains


[.footer: right]

fit

^ Think of them as nodes in a graph


[.footer: right]

fit

^ Remove useless info


[.footer: right]

fit

^ Three connected components, three users


[.footer: right]

fit

^ Ignore useless sources that add no information


[.footer: right]

fit

^ This new information goes here


[.footer: right]

fit

^ And here


[.footer: right]

fit

^ With a coalescing solution, you would have 4 users, or in the best case the system would resolve user 42 to either user 2 or user gamma.


[.footer: right]

fit


[.footer: right]

fit

^ By looking for connected components you realise there are actually 2 users instead of 3. How do we find connected components with Spark?


[.hide-footer]

Enter GraphFrames


[.hide-footer]

left

Basic Spark graph framework: GraphX

It is message-propagation[^3], graph-parallel, low level

 

^ The Pregel "message passing model" is very handy in its flexibility. It also allows creating non-deterministic identity graphs, like the graph you could build to figure out cross-device identities (since cookies are set per browser)


[.hide-footer]

left

Basic Spark graph framework: GraphX

It is message-propagation (Pregel API) graph-parallel, low level

GraphFrames are to DataFrames as GraphX is to RDDs

^ But a higher level API is more convenient
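A minimal sketch of how that looks with GraphFrames (assuming a running SparkSession named `spark` and the graphframes package on the classpath; the `id`, `src` and `dst` column names are what the library expects, the data is a toy example):

```python
from graphframes import GraphFrame

# connectedComponents() checkpoints intermediate results
spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")

vertices = spark.createDataFrame(
    [("partner_1_1",), ("partner_2_a",), ("partner_3_b",)], ["id"])
edges = spark.createDataFrame(
    [("partner_1_1", "partner_2_a"), ("partner_2_a", "partner_3_b")],
    ["src", "dst"])

g = GraphFrame(vertices, edges)
components = g.connectedComponents()  # adds a `component` column per vertex
```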


[.hide-footer]

Alternatives considered…

Apache Giraph harder maintenance
Neo4J harder scalability
AWS Neptune too new

^ Except Giraph, most options available are graph databases and not graph computation engines. The difference matters for our problem: we want to find connected components, not query. Graph databases are optimised for querying (and offer custom languages for it, like Gremlin)


[.hide-footer]

right

Connected components in big data

The Large Star - Small Star algorithm

^ The algorithm converts each connected component into a star (a cartwheel). There are several alternative algorithms that improve on large star - small star, like union-find-shuffle and partition-aware connected components


[.footer: left]

fit

^ Start with a graph, directed or undirected


[.footer: left]

fit

^ Randomly assign a different integer to all nodes. In GraphFrames this is done by adding a monotonically-increasing id to each node. The next step is just a visual aid for humans; it is effectively the first step of large star


[.footer: left]

fit

^ We start with the Large Star step, which works on the local neighborhood of each node. To make it clearer, let's point from large to small first.


[.footer: left]

fit

^ The large star step is done per node, where we need to consider the immediate neighborhood. For example, let's check node 7


[.footer: left]

fit

^ It has two neighbors, 10 and 3. In this step, we connect all strictly larger neighbors (including self) to the minimum neighbor


[.footer: left]

fit

^ I.e. we connect 10 and 7 itself to 3


[.footer: left]

fit

^ This is done for all nodes. You can imagine water flowing down the slopes. 3 doesn't go to 1 because it's smaller than 9, for example.


[.footer: left]

fit

^ After the large star step, we come to the small star step.


[.footer: left]

fit

^ This is again a node-local algorithm. Let's focus on node 9


[.footer: left]

fit

^ and its neighbors.


[.footer: left]

fit

^ Here, we connect the strictly smaller neighbors (including self) to the minimal neighbor: we connect 3 and 9 to 1.


[.footer: left]

fit

^ And likewise for all other nodes and their neighbors (here there are no additional changes). All these node-local processes can easily be computed in a "SQL" way that Spark can parallelize


[.footer: left]

fit

^ Now we iterate, by applying Large Star again, which will link all neighbors of 3 (and 3) to 1.


[.footer: left]

fit

^ And we end up with a star, where all nodes are connected to the node with minimal id. We use this id as the connected component id. The algorithm is \mathcal{O}(\log^2 n) in the number of nodes, although in practice it is significantly faster, because convergence depends on the height (or diameter) of the worst component. It can suffer horrible last-reducer problems when there are very large components.
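The two steps above can be sketched on a single machine (plain Python with integer node ids, nothing Spark-specific; in Spark each step becomes a groupBy/join over the edge DataFrame):

```python
from collections import defaultdict

def large_star(edges):
    """Large star: connect every strictly larger neighbor of u to the
    minimum of u's neighborhood (including u itself)."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    out = set()
    for u, n in nbrs.items():
        m = min(n | {u})
        out |= {(v, m) for v in n if v > u}
    return out

def small_star(edges):
    """Small star: orient edges from larger to smaller, then connect the
    smaller neighbors (and u itself) to the minimal neighbor."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[max(u, v)].add(min(u, v))
    out = set()
    for u, n in nbrs.items():
        m = min(n)
        out |= {(v, m) for v in n | {u} if v != m}
    return out

def connected_components(edges):
    """Alternate large star / small star until the edge set stabilises.
    Returns a map from each non-root node to its component id (the
    minimum node id in its component)."""
    edges = set(map(tuple, edges))
    for _ in range(100):  # bound iterations for safety in this sketch
        new = small_star(large_star(edges))
        if new == edges:
            break
        edges = new
    return dict(edges)
```

Running it on the graph from the slides (1-9-3-7-10 plus a separate 2-5 pair) collapses everything into stars rooted at 1 and 2.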


[.hide-footer]

Input should be formatted as a DataFrame of edges

| src | dst | (…) |
| --- | --- | --- |
| partner_1_𝟷 | partner_2_ | 1617963647… |
| partner_1_2 | partner_3_ | 1617963647… |
| partner_2_𝛄 | partner_3_ | 1617963654… |

^ We can additionally pass any information related to an edge (generically called a label); the most useful is the timestamp of the event.


[.hide-footer]

Output layout

| Component Id | Partner / Cookie Id | Timestamp |
| --- | --- | --- |
| 10234 | partner_1_𝟷 | 1617963647 |
| 10234 | partner_2_ | 1617963647 |
| 5534 | partner_1_2 | 1617963654 |

[.hide-footer]

[.build-lists: true]

To map from Partner A to Partner B

  • Given an id Partner_A_X,
  • we find the connected component id for the node Partner_A_X,
  • we find all the nodes of the form Partner_B_* for the component above
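The lookup the bullets describe can be sketched in plain Python (toy ids; in production this is a pair of DataFrame joins against the output table):

```python
def map_partner(rows, source_id, target_partner):
    """rows: (component_id, node_id) pairs, as in the output layout.
    Returns all ids of target_partner in source_id's component."""
    component_of = {node: cid for cid, node in rows}
    cid = component_of.get(source_id)
    if cid is None:
        return []  # id never seen in the graph
    return [node for c, node in rows
            if c == cid and node.startswith(target_partner + "_")]
```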

[.hide-footer]

[.build-lists: true]

Impact of moving from an ad hoc process to a graph process

  • Partner integration: from 2 months to 1 week
  • Users mapped uplift: around 20%
  • Mapping "quality": competitive (within 5%) with industry leaders

[.hide-footer]

Part 3: Speed up and improvements
Data cleanup
Cheap refresh
Machine tuning
Potential improvement

[.hide-footer]

right

Data cleanup

^ There are several steps required as part of data cleaning for a graph computation like this one.


[.hide-footer]

Invalid identifiers

 


[.hide-footer]

Invalid identifiers

Like na or 0 or xyz

(or fraudulent calls to a mapping server)

^ You can analyze your graph data before doing anything and remove the most glaring invalid identifiers, but as your graph grows you'll find more and more edge cases to clean. Luckily, cleaning a graph is easy: you just destroy a component


[.hide-footer]

Node pruning

 


[.hide-footer]

Node pruning

To prevent huge components

In the cookie case, by expiring cookies not seen in N days

^ Any node you haven't seen in N days is basically useless in advertising (for some value of N), and we leverage this here to prevent having large components
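A sketch of this expiry (plain Python, epoch-second timestamps; the age threshold is the N from the slide):

```python
def prune_stale_edges(edges, now_ts, max_age_days):
    """Drop edges whose cookie was last seen more than max_age_days ago.
    edges: (src, dst, last_seen_epoch_seconds) triples."""
    cutoff = now_ts - max_age_days * 86400  # seconds in a day
    return [(s, d, ts) for s, d, ts in edges if ts >= cutoff]
```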


[.hide-footer]

Component destruction

 

^ 🤘


[.hide-footer]

Component destruction

To limit component size artificially

If the data is fully clean we can assume no user has more than M identifiers

^ Welcome to the connected components, we've got fun and games. Destroying a component is the last resort, and only to be done for very large components, and sparingly


[.hide-footer]

What is the fastest way to build a 2-billion-node graph daily?

 

🤔


[.hide-footer]

What is the fastest way to build a 2-billion-node graph daily?

Not doing it

^ 🥁


[.hide-footer]

right

The easy way


[.footer: left]

fit

^ We have an existing graph. We can assume it exists in some form. We have the chain data from a batch, maybe daily, maybe a few weeks or hours depending on your problem. We run the connected components algorithm on it


[.footer: left]

fit

^ And now we have two sets of stars, the existing ones and the new ones. But not all of them are alike


[.footer: left]

fit

^ Some have nodes in common between existing and new, some do not


[.footer: left]

fit


[.footer: left]

fit

^ We process them separately: those that have no nodes in common are clean, the others are tainted


[.footer: left]

fit

^ The clean ones are good to go, but for the tainted ones, we repeat the process of running large star - small star, with these new edges


[.footer: left]

fit


[.footer: left]

fit

^ And we end up with a (very large) consolidated graph
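The clean/tainted split above can be sketched as follows (plain Python; components as id → node-set maps, which is what the star output gives you):

```python
def split_components(existing, new):
    """Split freshly computed components into 'clean' (no node in common
    with the existing graph) and 'tainted' (at least one shared node).
    Only the tainted ones need another large star - small star pass."""
    existing_nodes = set().union(*existing.values()) if existing else set()
    clean, tainted = {}, {}
    for cid, nodes in new.items():
        bucket = tainted if nodes & existing_nodes else clean
        bucket[cid] = nodes
    return clean, tainted
```

The clean components are appended to the graph as-is; the tainted ones get merged with the existing stars they touch and re-run, which is a much smaller problem than rebuilding the whole graph.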


[.hide-footer]

left

Machine tuning

for large graphs

^ In Apache Spark of course


[.hide-footer]

[.build-lists: true]

left

Go large and tune up

  • the process is memory hungry
  • the process is shuffle hungry

[.hide-footer]

[.build-lists: false]

left

Go large and tune up

  • the process is memory hungry
  • the process is shuffle hungry

better to have a few large machines

^ This is a rule of thumb: start with large machines, see how it behaves and what kind of query plans appear and then tune from there


[.hide-footer]

[.build-lists: false]

left

Go large and tune up

  • the process is memory hungry
  • the process is shuffle hungry

better to have a few large machines

and give executors more memory than you'd think


[.hide-footer]

right

Impact of Adaptive Query Execution (AQE)

AQE uses runtime statistics to help the Cost Based Optimizer (CBO) and speed up Spark

 

^ The CBO tries to re-arrange queries depending on cost statistics, but needs updated information on all the tables. AQE keeps these statistics up to date as the computation flows, feeding the CBO with fresh data


[.hide-footer]

right

Impact of Adaptive Query Execution (AQE)

AQE uses runtime statistics to help the Cost Based Optimizer (CBO) and speed up Spark

Using Spark 3.x with AQE active has a 30-40% speed up

^ This just requires setting a flag in your SparkConf (spark.sql.adaptive.enabled=true)
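For reference, the relevant properties (Spark 3.x names; the second flag is optional but often helps with shuffle-heavy jobs like this one):

```
# spark-defaults.conf
spark.sql.adaptive.enabled                      true
spark.sql.adaptive.coalescePartitions.enabled   true
```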


[.hide-footer]

[.build-lists: true]

Further improvements

  • Easy: Move storage to Delta Lake
  • Hard: implement union-find-shuffle instead of large star - small star

^ With Delta Lake we'd have the additional advantage of Z-ordering when executing certain joins, but it should be a small win. UFS is supposed to be significantly faster than large star - small star, but implementing something like this requires a good reason. And here comes the end!


[.hide-footer]

Thanks!


[.hide-footer]

right fit

Get the slides from my github:

github.com/rberenguel/

The repository is

identity-graphs



[.hide-footer]

References
Connected Components in MapReduce and Beyond (ACM)
Connected Components in MapReduce and Beyond (slides)
Partition Aware Connected Component Computation in Distributed Systems
Building Graphs at a Large Scale: Union Find Shuffle
Adaptive Query Execution: Speeding up SparkSQL at runtime
Pregel: A System for Large-Scale Graph Processing
GraphX
GraphFrames
Apache Giraph
Neo4J
AWS Neptune
Databricks' Delta Lake: high on ACID

[.hide-footer]

Related talks
Massive-Scale Entity Resolution Using the Power of Apache Spark and Graph
Maps and Meaning: Graph-based Entity Resolution in Apache Spark & GraphX
Building Identity Graph at Scale for Programmatic Media Buying Using Apache Spark and Delta Lake
Building Identity Graphs over Heterogeneous Data
Optimize the Large Scale Graph Applications by using Apache Spark with 4-5x Performance Improvements
GraphFrames: Graph Queries In Spark SQL
Using GraphX/Pregel on Browsing History to Discover Purchase Intent

[.hide-footer]

Reference Image attribution
Graphs Ruben Berenguel 😎 (Generative art with p5js)
Bulb Alessandro Bianchi (Unsplash)
Bubbles Marko Blažević (Unsplash)
Chair Volodymyr Tokar (Unsplash)
Cookie Dex Ezekiel (Unsplash)
Loupe Agence Olloweb (Unsplash)
Map Timo Wielink (Unsplash)
Mask Adnan Khan (Unsplash)
Newspaper Rishabh Sharma (Unsplash)
Party Adi Goldstein (Unsplash)
Socket Kelly Sikkema (Unsplash)
Spray JESHOOTS.COM (Unsplash)
Tuning Gustavo Campos (Unsplash)
Web Shannon Potter (Unsplash)

[.hide-footer]

Resources
Unicode table

[.hide-footer]

EOF

[^1]: Event logs with cookies provided in batch by data providers

[^2]: Event logs with cookies generated from our servers, via our pixels

[^3]: Like the Pregel API