
footer:                         [.footer-style: #2F2F2F, alignment(right), line-height(1), text-scale(3.5), z-index(10000)]

[.hide-footer]


[.hide-footer]

fit

^ ...


[.hide-footer]

whoami

  • Ruben Berenguel (@berenguel)
  • PhD in Mathematics
  • Lead Data Engineer at Hybrid Theory
  • Preferred stack is Python, Go and Scala

fit

^ I have divided this presentation in 3 sections: first I will talk about adtech, cookies and the identity problem. Then I will explain how we can solve the problem using an identity graph, and finally how we can process this graph fast with Apache Spark


[.hide-footer]

Part 1: Set up
Adtech
What are cookies, really?
What is cookie mapping?
The identity problem

^ I will start by setting up the problem: how programmatic advertising uses cookies to create targeted ads, and the user-identity problem that derives from it


[.hide-footer]

[.build-lists: true]

left

Programmatic adtech

Find users satisfying some criteria

  • Visited pages of category ABC
  • Are interested in concept XYZ
  • Are likely to want to buy from our client RST

^ Note that the third bullet needs some kind of machine learning, or very smart humans. The end goal is that if you are going to see an ad, it had better be a relevant one


[.hide-footer]

right

To find them we need

Their browse and/or behaviour data 🖥 🛒

 

 


[.hide-footer]

right

To find them we need

Their browse and/or behaviour data 🖥 🛒

To deliver for our clients we need

A way to show them ads 📣

^ It all boils down to identifying specific users online. But what identifies a user online, so we can show an ad only to someone who will be interested in it? That's cookies 🍪🥠


left

[.hide-footer]

Cookies

Are used to help websites  

track events

and state

as users browse

^ Note that most cookies are going away in the coming years due to privacy concerns and regulations.


[.hide-footer] right

There are two kinds of cookies

First party (session, state…)

Third party (event tracking…)

^ Session/state would be your basket, whether you are logged in, etc. Event tracking ranges from advertising (or related) stuff to analytics (like Google)


[.footer: right]

fit

^ A user is browsing online


[.footer: right]

fit

^ A first party webserver is serving a webpage


[.footer: left]

fit

^ A third party webserver is serving a pixel on the page


[.footer: left]

fit

^ This server sets a cookie on the user/browser combination


[.footer: left]

fit

^ Stamp!


[.footer: left]

fit

^ When the user goes to the login area…


[.footer: left]

fit

^ The first party server keeps track of that state by setting a cookie on the user/browser


[.footer: left]

fit

^ Stamp!


[.footer: left]

fit

^ Cookies are associated to the domain that set them, and they are not accessible from others. So, the first party server knows nothing about the third party cookie (and conversely)


[.footer: left]

fit


[.hide-footer]

We get browse data from users on the web from data providers[^1]

 

^ So, in the end we get large amounts of batch data of what users do


[.hide-footer]

We get browse data from users on the web from data providers

We get browse data from users browsing our clients' websites[^2]

^ And data from what users do on our clients websites, which we handle with our third party 🍪


[.hide-footer]

left

How do we connect both data sources?

The identifiers we get from both sides are unrelated!

😱

^ Those wall sockets look scared


[.hide-footer]

right

Mapping servers

and

the mapping chain

^ For advertising to be effective, we need to connect these two data sources, what happens on our clients' websites and what happens around the world


[.footer: right]

fit

^ In cookie mapping, a user is browsing


[.footer: right]

fit

^ A pixel fires, a cookie is set


[.footer: right]

fit


[.footer: right]

fit


[.footer: right]

fit

^ The destination server (mapping server) redirects to another server


[.footer: right]

fit

^ that sets a cookie


[.footer: right]

fit


[.footer: right]

fit

^ and calls back to the initial server, reporting back the identifier that has been set in the cookie


[.footer: left]

fit

^ This can repeat any number of times (although fewer is better)


[.footer: left]

fit


[.footer: left]

fit


[.footer: left]

fit


[.footer: left]

fit

^ This is what the chain looks like then: a chain of identifiers (or cookies) that are tied to a user. This is as seen from the server initiating the redirections


left

[.hide-footer]

The identity problem

^ 🥸 Greetings, good man. Might I trouble you for a drink? Homer? Who is Homer?


fit

^ These are a few chains. A virtual id for each chain is added when the redirections start, to keep track of the callbacks. The identity problem appears when you try to keep the chains up to date as days pass: some cookies degrade fast, and a user may have several identifiers for each partner. Handling this ad hoc results in mapping issues


[.hide-footer]

[.build-lists: true]

Basic solution

  • Coalesce (merge on nulls) chains based on one id
  • Is not as complete as the graph approach because…
  • Requires one stable identifier

^ (Or a stable enough identifier). This solution can be applied to batches of chains without requiring any lookback. The coalescing either kills other identifiers (and requires a stable identifier) or results in an overwrite of identifiers
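As a rough illustration (plain Python with hypothetical field names, not our production code), coalescing chains on a stable identifier looks like this:

```python
def coalesce_chains(chains, stable_key):
    """Merge chain records sharing the same stable identifier,
    filling nulls (None) from whichever record has a value."""
    merged = {}
    for chain in chains:
        key = chain.get(stable_key)
        if key is None:
            continue  # without the stable id we cannot merge at all
        row = merged.setdefault(key, {})
        for partner, ident in chain.items():
            if row.get(partner) is None:
                row[partner] = ident  # coalesce: keep the first non-null
            # conflicting non-null ids are simply kept or overwritten:
            # this is exactly the weakness the graph approach fixes
    return merged
```

Note how a chain without the stable identifier is dropped on the floor: this is why the approach requires one stable id.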


[.hide-footer]

fit

^ What do we do in this situation? Either id 77 goes with circle or with gamma. Unless…


[.hide-footer]

Part 2: The identity graph
Rethink the problem as a graph
Connected components in big data

[.footer: right]

fit

^ Recall the table with chains


[.footer: right]

fit

^ Think of them as nodes in a graph


[.footer: right]

fit

^ Remove useless info


[.footer: right]

fit

^ Three connected components, three users


[.footer: right]

fit

^ Ignore useless sources that add no information


[.footer: right]

fit

^ This new information goes here


[.footer: right]

fit

^ And here


[.footer: right]

fit

^ With a coalescing solution, you would have 4 users, or in the best case the system would resolve user 42 to either user 2 or user gamma.


[.footer: right]

fit


[.footer: right]

fit

^ By looking for connected components you realise there are actually 2 users instead of 3. How do we find connected components with Spark?


[.hide-footer]

Enter GraphFrames


[.hide-footer]

left

Basic Spark graph framework: GraphX

It is message-propagation[^3], graph-parallel, low level

 

^ The Pregel "message passing model" is very handy in its flexibility. It also allows creating non-deterministic identity graphs, like the graph you could build to figure out cross-device identities (since cookies are set per browser)


[.hide-footer]

left

Basic Spark graph framework: GraphX

It is message-propagation (Pregel API) graph-parallel, low level

GraphFrames are to DataFrames as GraphX is to RDDs

^ But a higher level API is more convenient
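A minimal sketch of how that looks with GraphFrames (assuming a running SparkSession named `spark` and the graphframes package on the classpath; the `id`, `src` and `dst` column names are what the library expects, the data is a toy example):

```python
from graphframes import GraphFrame

# connectedComponents() checkpoints intermediate results
spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")

vertices = spark.createDataFrame(
    [("partner_1_1",), ("partner_2_a",), ("partner_3_b",)], ["id"])
edges = spark.createDataFrame(
    [("partner_1_1", "partner_2_a"), ("partner_2_a", "partner_3_b")],
    ["src", "dst"])

g = GraphFrame(vertices, edges)
components = g.connectedComponents()  # adds a `component` column per vertex
```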


[.hide-footer]

Alternatives considered…

Apache Giraph harder maintenance
Neo4J harder scalability
AWS Neptune too new

^ Except Giraph, most options available are graph databases and not graph computation engines. The difference matters for our problem: we want to find connected components, not query. Graph databases are optimised for querying (and offer custom languages for it, like Gremlin)


[.hide-footer]

right

Connected components in big data

The Large Star - Small Star algorithm

^ The algorithm converts each connected component into a star (a cartwheel). There are several alternative algorithms that improve on large star - small star, like union-find-shuffle and partition-aware connected components


[.footer: left]

fit

^ Start with a graph, directed or undirected


[.footer: left]

fit

^ Randomly assign a different integer to all nodes. In GraphFrames this is done by adding a monotonically-increasing id to each node. The next step is just a visual aid for humans; it is effectively the first step of large star


[.footer: left]

fit

^ We start with the Large Star step, which works on the local neighborhood of each node. To make it clearer, let's point from large to small first.


[.footer: left]

fit

^ The large star step is done per node, where we need to consider the immediate neighborhood. For example, let's check node 7


[.footer: left]

fit

^ It has two neighbors, 10 and 3. In this step, we connect all strictly larger neighbors (including self) to the minimum neighbor


[.footer: left]

fit

^ I.e. we connect 10 and 7 itself to 3


[.footer: left]

fit

^ This is done for all nodes. You can imagine water flowing down the slopes. 3 doesn't go to 1 because it's smaller than 9, for example.


[.footer: left]

fit

^ After the large star step, we come to the small star step.


[.footer: left]

fit

^ This is again a node-local algorithm. Let's focus on node 9


[.footer: left]

fit

^ and its neighbors.


[.footer: left]

fit

^ Here, we connect the strictly smaller neighbors (including self) to the minimal neighbor: we connect 3 and 9 to 1.


[.footer: left]

fit

^ And likewise for all other nodes and their neighbors (here there are no additional changes). All these node-local processes can easily be computed in a "SQL" way that Spark can parallelize


[.footer: left]

fit

^ Now we iterate, by applying Large Star again, which will link all neighbors of 3 (and 3) to 1.


[.footer: left]

fit

^ And we end up with a star, where all nodes are connected to the node with minimal id. We use this id as the connected component id. The algorithm is \mathcal{O}(\log^2 n) in the number of nodes, although in practice it is significantly faster, because convergence depends on the height (or diameter) of the worst component. It can suffer horrible last-reducer problems when there are very large components.
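The two steps above can be sketched on a single machine (plain Python with integer node ids, nothing Spark-specific; in Spark each step becomes a groupBy/join over the edge DataFrame):

```python
from collections import defaultdict

def large_star(edges):
    """Large star: connect every strictly larger neighbor of u to the
    minimum of u's neighborhood (including u itself)."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    out = set()
    for u, n in nbrs.items():
        m = min(n | {u})
        out |= {(v, m) for v in n if v > u}
    return out

def small_star(edges):
    """Small star: orient edges from larger to smaller, then connect the
    smaller neighbors (and u itself) to the minimal neighbor."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[max(u, v)].add(min(u, v))
    out = set()
    for u, n in nbrs.items():
        m = min(n)
        out |= {(v, m) for v in n | {u} if v != m}
    return out

def connected_components(edges):
    """Alternate large star / small star until the edge set stabilises.
    Returns a map from each non-root node to its component id (the
    minimum node id in its component)."""
    edges = set(map(tuple, edges))
    for _ in range(100):  # bound iterations for safety in this sketch
        new = small_star(large_star(edges))
        if new == edges:
            break
        edges = new
    return dict(edges)
```

Running it on the graph from the slides (1-9-3-7-10 plus a separate 2-5 pair) collapses everything into stars rooted at 1 and 2.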


[.hide-footer]

Input should be formatted as a DataFrame of edges

| src | dst | (…) |
| --- | --- | --- |
| partner_1_𝟷 | partner_2_ | 1617963647… |
| partner_1_2 | partner_3_ | 1617963647… |
| partner_2_𝛄 | partner_3_ | 1617963654… |

^ We can additionally pass any information related to an edge (generically called a label); the most useful is the timestamp of the event.


[.hide-footer]

Output layout

| Component Id | Partner / Cookie Id | Timestamp |
| --- | --- | --- |
| 10234 | partner_1_𝟷 | 1617963647 |
| 10234 | partner_2_ | 1617963647 |
| 5534 | partner_1_2 | 1617963654 |

[.hide-footer]

[.build-lists: true]

To map from Partner A to Partner B

  • Given an id Partner_A_X,
  • we find the connected component id for the node Partner_A_X,
  • we find all the nodes of the form Partner_B_* for the component above
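The lookup the bullets describe can be sketched in plain Python (toy ids; in production this is a pair of DataFrame joins against the output table):

```python
def map_partner(rows, source_id, target_partner):
    """rows: (component_id, node_id) pairs, as in the output layout.
    Returns all ids of target_partner in source_id's component."""
    component_of = {node: cid for cid, node in rows}
    cid = component_of.get(source_id)
    if cid is None:
        return []  # id never seen in the graph
    return [node for c, node in rows
            if c == cid and node.startswith(target_partner + "_")]
```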

[.hide-footer]

[.build-lists: true]

Impact of moving from an ad hoc process to a graph process

  • Partner integration: from 2 months to 1 week
  • Users mapped uplift: around 20%
  • Mapping "quality": competitive (within 5%) with industry leaders

[.hide-footer]

Part 3: Speed up and improvements
Data cleanup
Cheap refresh
Machine tuning
Potential improvement

[.hide-footer]

right

Data cleanup

^ There are several steps required as part of data cleaning for a graph computation like this one.


[.hide-footer]

Invalid identifiers

 


[.hide-footer]

Invalid identifiers

Like na or 0 or xyz

(or fraudulent calls to a mapping server)

^ You can analyze your graph data before doing anything and remove the most glaring invalid identifiers, but as your graph grows you'll find more and more edge cases to clean. Luckily, cleaning a graph is easy: you just destroy a component


[.hide-footer]

Node pruning

 


[.hide-footer]

Node pruning

To prevent huge components

In the cookie case, by expiring cookies not seen in N days

^ Any node you haven't seen in N days is basically useless in advertising (for some value of N), and we leverage this here to prevent having large components
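A sketch of this expiry (plain Python, epoch-second timestamps; the age threshold is the N from the slide):

```python
def prune_stale_edges(edges, now_ts, max_age_days):
    """Drop edges whose cookie was last seen more than max_age_days ago.
    edges: (src, dst, last_seen_epoch_seconds) triples."""
    cutoff = now_ts - max_age_days * 86400  # seconds in a day
    return [(s, d, ts) for s, d, ts in edges if ts >= cutoff]
```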


[.hide-footer]

Component destruction

 

^ 🤘


[.hide-footer]

Component destruction

To limit component size artificially

If the data is fully clean we can assume no user has more than M identifiers

^ Welcome to the connected components, we've got fun and games. Destroying a component is the last resort, and only to be done for very large components, and sparingly


[.hide-footer]

What is the fastest way to build a 2-billion-node graph daily?

 

🤔


[.hide-footer]

What is the fastest way to build a 2-billion-node graph daily?

Not doing it

^ 🥁


[.hide-footer]

right

The easy way


[.footer: left]

fit

^ We have an existing graph. We can assume it exists in some form. We have the chain data from a batch, maybe daily, maybe a few weeks or hours depending on your problem. We run the connected components algorithm on it


[.footer: left]

fit

^ And now we have two sets of stars, the existing ones and the new ones. But not all of them are alike


[.footer: left]

fit

^ Some have nodes in common between existing and new, some do not


[.footer: left]

fit


[.footer: left]

fit

^ We process them separately: those that have no nodes in common are clean, the others are tainted


[.footer: left]

fit

^ The clean ones are good to go, but for the tainted ones, we repeat the process of running large star - small star, with these new edges


[.footer: left]

fit


[.footer: left]

fit

^ And we end up with a (very large) consolidated graph
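The clean/tainted split above can be sketched as follows (plain Python; components as id → node-set maps, which is what the star output gives you):

```python
def split_components(existing, new):
    """Split freshly computed components into 'clean' (no node in common
    with the existing graph) and 'tainted' (at least one shared node).
    Only the tainted ones need another large star - small star pass."""
    existing_nodes = set().union(*existing.values()) if existing else set()
    clean, tainted = {}, {}
    for cid, nodes in new.items():
        bucket = tainted if nodes & existing_nodes else clean
        bucket[cid] = nodes
    return clean, tainted
```

The clean components are appended to the graph as-is; the tainted ones get merged with the existing stars they touch and re-run, which is a much smaller problem than rebuilding the whole graph.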


[.hide-footer]

left

Machine tuning

for large graphs

^ In Apache Spark of course


[.hide-footer]

[.build-lists: true]

left

Go large and tune up

  • the process is memory hungry
  • the process is shuffle hungry

[.hide-footer]

[.build-lists: false]

left

Go large and tune up

  • the process is memory hungry
  • the process is shuffle hungry

better to have a few large machines

^ This is a rule of thumb: start with large machines, see how it behaves and what kind of query plans appear and then tune from there


[.hide-footer]

[.build-lists: false]

left

Go large and tune up

  • the process is memory hungry
  • the process is shuffle hungry

better to have a few large machines

and give executors more memory than you'd think


[.hide-footer]

right

Impact of Adaptive Query Execution (AQE)

AQE uses runtime statistics to help the Cost Based Optimizer (CBO) and speed up Spark

 

^ The CBO tries to re-arrange queries depending on cost statistics, but needs updated information on all the tables. AQE keeps these statistics up to date as the computation flows, feeding the CBO with fresh data


[.hide-footer]

right

Impact of Adaptive Query Execution (AQE)

AQE uses runtime statistics to help the Cost Based Optimizer (CBO) and speed up Spark

Using Spark 3.x with AQE active has a 30-40% speed up

^ This just requires setting a flag in your SparkConf (spark.sql.adaptive.enabled=true)
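For reference, the relevant properties (Spark 3.x names; the second flag is optional but often helps with shuffle-heavy jobs like this one):

```
# spark-defaults.conf
spark.sql.adaptive.enabled                      true
spark.sql.adaptive.coalescePartitions.enabled   true
```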


[.hide-footer]

[.build-lists: true]

Further improvements

  • Easy: Move storage to Delta Lake
  • Hard: implement union-find-shuffle instead of large star - small star

^ With Delta Lake we'd have the additional advantage of Z-ordering when executing certain joins, but it should be a small win. UFS is supposed to be significantly faster than large star - small star, but implementing something like this requires a good reason. And here comes the end!


[.hide-footer]

Thanks!


[.hide-footer]

right fit

Get the slides from my github:

github.com/rberenguel/

The repository is

identity-graphs



[.hide-footer]

References
Connected Components in MapReduce and Beyond (ACM)
Connected Components in MapReduce and Beyond (slides)
Partition Aware Connected Component Computation in Distributed Systems
Building Graphs at a Large Scale: Union Find Shuffle
Adaptive Query Execution: Speeding up SparkSQL at runtime
Pregel: A System for Large-Scale Graph Processing
GraphX
GraphFrames
Apache Giraph
Neo4J
AWS Neptune
Databricks' Delta Lake: high on ACID

[.hide-footer]

Related talks
Massive-Scale Entity Resolution Using the Power of Apache Spark and Graph
Maps and Meaning: Graph-based Entity Resolution in Apache Spark & GraphX
Building Identity Graph at Scale for Programmatic Media Buying Using Apache Spark and Delta Lake
Building Identity Graphs over Heterogeneous Data
Optimize the Large Scale Graph Applications by using Apache Spark with 4-5x Performance Improvements
GraphFrames: Graph Queries In Spark SQL
Using GraphX/Pregel on Browsing History to Discover Purchase Intent

[.hide-footer]

Reference Image attribution
Graphs Ruben Berenguel 😎 (Generative art with p5js)
Bulb Alessandro Bianchi (Unsplash)
Bubbles Marko Blažević (Unsplash)
Chair Volodymyr Tokar (Unsplash)
Cookie Dex Ezekiel (Unsplash)
Loupe Agence Olloweb (Unsplash)
Map Timo Wielink (Unsplash)
Mask Adnan Khan (Unsplash)
Newspaper Rishabh Sharma (Unsplash)
Party Adi Goldstein (Unsplash)
Socket Kelly Sikkema (Unsplash)
Spray JESHOOTS.COM (Unsplash)
Tuning Gustavo Campos (Unsplash)
Web Shannon Potter (Unsplash)

[.hide-footer]

Resources
Unicode table

[.hide-footer]

EOF

[^1]: Event logs with cookies provided in batch by data providers

[^2]: Event logs with cookies generated from our servers, via our pixels

[^3]: Like the Pregel API