footer:
[.footer-style: #2F2F2F, alignment(right), line-height(1), text-scale(3.5), z-index(10000)]
[.hide-footer]
[.hide-footer]
^ ...
[.hide-footer]
- Ruben Berenguel (@berenguel)
- PhD in Mathematics
- Lead Data Engineer at Hybrid Theory
- Preferred stack is Python, Go and Scala
^ I have divided this presentation into three sections: first I will talk about adtech, cookies and the identity problem. Then I will explain how we can solve the problem using an identity graph, and finally how we can process this graph fast with Apache Spark
[.hide-footer]
| Part 1: Set up |
|---|
| Adtech |
| What are cookies, really? |
| What is cookie mapping? |
| The identity problem |
^ I will start by setting up the problem: how programmatic advertising uses cookies to create targeted ads, and the user-identity problem that follows from it
[.hide-footer]
[.build-lists: true]
- Visited pages of category ABC
- Are interested in concept XYZ
- Are likely to want to buy from our client RST
^ Note that the third bullet needs some kind of machine learning, or very smart humans. The final goal is that if you are going to see an ad, it had better be a relevant one
[.hide-footer]
[.hide-footer]
^ It all boils down to identifying specific users online. But what identifies a user online, so we can show an ad only to someone who is going to be interested in it? That's cookies 🍪🥠
[.hide-footer]
^ Note that most cookies are going away over the next few years due to privacy concerns and regulations.
^ Session/state would be your basket, whether you are logged in, etc. Event tracking ranges from advertising (or related) stuff to analytics (like Google)
^ A user is browsing online
^ A first party webserver is serving a webpage
^ A _third party_ webserver is serving a pixel on the page
^ This server sets a cookie on the user/browser combination
^ Stamp!
^ When the user goes to the login area…
^ The first party server keeps track of that state by setting a cookie on the user/browser
^ Stamp!
^ Cookies are associated with the domain that set them, and they are not accessible from other domains. So, the first party server knows nothing about the third party cookie (and conversely)
[.hide-footer]
We get browse data from users on the web from data providers¹
^ So, in the end we get large amounts of batch data of what users do
[.hide-footer]
We get browse data from users browsing our client website²
^ And data from what users do on our clients' websites, which we handle with our third party 🍪
[.hide-footer]
^ Those wall sockets look scared
[.hide-footer]
^ For advertising to be effective, we need to connect these two data sources, what happens on our clients' websites and what happens around the world
^ In cookie mapping, a user is browsing
^ A pixel fires, a cookie is set
^ The destination server (mapping server) redirects to another server
^ that sets a cookie
^ and calls back to the initial server, reporting back the identifier that has been set in the cookie
^ This can repeat any number of times (although fewer is better)
^ This is what the chain looks like then: a chain of identifiers (or cookies) that are tied to a user. This is as seen from the server initiating the redirections
[.hide-footer]
^ 🥸 Greetings, good man. Might I trouble you for a drink? Homer? Who is Homer?
^ These are a few chains. A virtual id for each chain is added when the redirections start and keeps track of the callbacks. The identity problem appears when you try to keep the chains up to date as days pass: some cookies degrade fast, and a user may have several identifiers for each partner. Handling this ad hoc results in mapping issues
[.hide-footer]
[.build-lists: true]
- Coalesce (merge on nulls) chains based on one id
- Is not as complete as the graph approach because…
- Requires one stable identifier
^ (Or stable enough identifier). This solution can be applied to batches of chains without requiring any lookback. The coalescing either kills other identifiers (and requires a stable identifier) or results in an overwrite of identifiers
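As a toy illustration of this coalescing (the data shape and partner names here are made up for the example, not our production schema):

```python
# Sketch of the coalescing approach: chains are partner -> identifier
# rows, merged on a stable id by filling nulls only.

def coalesce_chains(chains, stable_key="partner_1"):
    """Merge chains that share the same stable identifier."""
    merged = {}
    for chain in chains:
        key = chain.get(stable_key)
        if key is None:
            continue  # without the stable id we cannot merge this chain
        row = merged.setdefault(key, {})
        for partner, identifier in chain.items():
            # Coalesce: keep the existing value, fill only nulls.
            # A second identifier for the same partner is lost here,
            # which is exactly the weakness of this approach.
            row.setdefault(partner, identifier)
    return merged

chains = [
    {"partner_1": "1", "partner_2": "a"},
    {"partner_1": "1", "partner_3": "x"},
    {"partner_2": "g", "partner_3": "x"},  # no stable id: dropped
]
print(coalesce_chains(chains))
# → {'1': {'partner_1': '1', 'partner_2': 'a', 'partner_3': 'x'}}
```

Note how the third chain, which shares `partner_3_x` with the second, is silently dropped: that shared node is exactly what the graph approach will exploit.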
[.hide-footer]
^ What do we do in this situation? Either id 77 goes with circle or with gamma. Unless…
[.hide-footer]
| Part 2: The identity graph |
|---|
| Rethink the problem as a graph |
| Connected components in big data |
^ Recall the table with chains
^ Think them as nodes in graphs
^ Remove useless info
^ Three connected components, three users
^ Ignore useless sources that add no information
^ This new information goes here
^ And here
^ With a coalescing solution you would have 4 users, or, best case, the system would resolve user 42 to either user 2 or user gamma.
^ By looking for connected components you realise there are actually 2 users instead of 3. How do we find connected components with Spark?
[.hide-footer]
[.hide-footer]
It is message-propagation³, graph-parallel, low level
^ The Pregel "message passing" model is very handy in its flexibility. It also allows creating non-deterministic identity graphs, like the graph you could build to figure out cross-device identities (since cookies are set per browser)
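A toy, single-machine rendition of the message-passing idea for connected components (a real Pregel superstep is synchronous and distributed; this sketch updates labels in place for brevity):

```python
# Pregel-style label propagation: each node repeatedly adopts the
# minimum id among its own and its neighbors' labels (the "messages"),
# until no label changes. Nodes sharing a final label share a component.

def pregel_min_id(edges):
    nbrs = {}
    for a, b in edges:
        nbrs.setdefault(a, set()).add(b)
        nbrs.setdefault(b, set()).add(a)
    label = {v: v for v in nbrs}  # initial vertex state: own id
    changed = True
    while changed:
        changed = False
        # "superstep": every node receives its neighbors' labels
        for v in nbrs:
            best = min([label[v]] + [label[n] for n in nbrs[v]])
            if best < label[v]:
                label[v] = best
                changed = True
    return label

print(pregel_min_id([(1, 3), (3, 9), (7, 10)]))
# → {1: 1, 3: 1, 9: 1, 7: 7, 10: 7}
```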
[.hide-footer]
^ But a higher level API is more convenient
[.hide-footer]
| Option | Drawback |
|---|---|
| Apache Giraph | harder maintenance |
| Neo4J | harder scalability |
| AWS Neptune | too new |
^ Except for Giraph, most options available are graph databases and not graph computation engines. The difference is important for our problem: we want to find connected components, not query the graph. Graph databases are optimised for querying (and offer custom languages for it, like Gremlin)
[.hide-footer]
^ The algorithm converts each connected component into a star (a cartwheel). There are several alternative algorithms that improve on large star - small star, like union-find-shuffle and partition-aware connected components
^ Start with a graph, directed or undirected
^ Randomly assign a different integer to all nodes. In GraphFrames this is done by adding a monotonically-increasing id to each node. The next step is just a preparation step for us humans, before the first large star step
^ We start with the Large Star step, which operates on the local neighborhood of each node. To make it clearer, let's first orient each edge to point from the larger to the smaller node.
^ The large star step is done per node, where we need to consider the immediate neighborhood. For example, let's check node 7
^ It has two neighbors, 10 and 3. In this step, we connect every strictly larger neighbor, and the node itself, to the minimum of the neighborhood
^ I.e. we connect 10 and 7 itself to 3
^ This is done for all nodes. You can imagine water flowing down the slopes. For example, 3 doesn't get linked to 1 here because 3 is smaller than 9.
^ After the large star step, we come to the small star step.
^ This is again a node-local algorithm. Let's focus on node 9
^ and its neighbors.
^ Here we need to connect the not-larger neighbors, and the node itself, to the minimal neighbor: we connect 3 and 9 to 1.
^ And likewise for all other nodes and their neighbors (in this case there are no additional changes). All these node-local processes can be easily computed in a "SQL" way that Spark can parallelize
^ Now we iterate, by applying Large Star again, which will link all neighbors of 3 (and 3) to 1.
^ And we end up with a star, where all nodes are connected to the node with minimal id. We use this id as the connected component id. The algorithm needs O(log² n) rounds in the number of nodes n, although in practice it is significantly faster, because convergence depends on the height (or diameter) of the worst component. It can have horrible last-reducer problems when that worst component is very large.
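The steps above can be sketched in a few lines of plain Python (illustrative only; the production version expresses the same steps as Spark DataFrame joins and aggregations over the edge table):

```python
# Minimal single-machine sketch of the large star - small star iteration.

def _star_step(edges, keep):
    nbrs = {}
    for a, b in edges:
        nbrs.setdefault(a, set()).add(b)
        nbrs.setdefault(b, set()).add(a)
    out = set()
    for u, ns in nbrs.items():
        m = min(ns | {u})  # minimum of the neighborhood, node included
        for v in ns | {u}:
            if keep(u, v) and v != m:
                out.add((v, m))
    return out

def large_star(edges):
    # link strictly larger neighbors to the neighborhood minimum
    return _star_step(edges, keep=lambda u, v: v > u)

def small_star(edges):
    # link the node and its not-larger neighbors to the neighborhood minimum
    return _star_step(edges, keep=lambda u, v: v <= u)

def connected_components(edges):
    edges = {tuple(sorted(e)) for e in edges}
    while True:
        new = {tuple(sorted(e)) for e in small_star(large_star(edges))}
        if new == edges:  # converged: every component is a star
            return sorted(new)
        edges = new

# Each node ends up linked to its component's minimum id:
print(connected_components([(1, 3), (3, 9), (9, 7), (7, 10)]))
# → [(1, 3), (1, 7), (1, 9), (1, 10)]
```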
[.hide-footer]
| src | dst | (…) |
|---|---|---|
| partner_1_𝟷 | partner_2_⍺ | 1617963647… |
| partner_1_2 | partner_3_⭘ | 1617963647… |
| partner_2_𝛄 | partner_3_△ | 1617963654… |
| ⁞ | ⁞ | ⁞ |
^ We can additionally pass any information related to an edge (generically, a label); the most useful would be the timestamp of the event.
[.hide-footer]
| Component Id | Partner / Cookie Id | Timestamp |
|---|---|---|
| 10234 | partner_1_𝟷 | 1617963647 |
| 10234 | partner_2_⍺ | 1617963647 |
| 5534 | partner_1_2 | 1617963654 |
| ⁞ | ⁞ | ⁞ |
[.hide-footer]
[.build-lists: true]
To map from Partner A to Partner B
- Given an id Partner_A_X,
- we find the connected component id for the node Partner_A_X,
- we find all the nodes of the form Partner_B_* for the component above
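A sketch of this lookup over the component table (the table layout and the partner-prefix naming convention are illustrative):

```python
# Mapping from Partner A to Partner B via the component table:
# rows of (component_id, partner_cookie_id).

def map_partner(table, source_id, target_partner):
    """All target_partner ids living in the same component as source_id."""
    component = {cookie: comp for comp, cookie in table}.get(source_id)
    if component is None:
        return []  # unknown id: nothing to map
    return [cookie for comp, cookie in table
            if comp == component and cookie.startswith(target_partner)]

table = [
    (10234, "partner_1_1"),
    (10234, "partner_2_a"),
    (5534, "partner_1_2"),
]
print(map_partner(table, "partner_1_1", "partner_2"))
# → ['partner_2_a']
```

In production this is a join against the component id rather than a scan, but the logic is the same two lookups.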
[.hide-footer]
[.build-lists: true]
- Partner integration: from 2 months to 1 week
- Users mapped uplift: around 20%
- Mapping "quality": competitive (within 5%) with industry leaders
[.hide-footer]
| Part 3: Speed up and improvements |
|---|
| Data cleanup |
| Cheap refresh |
| Machine tuning |
| Potential improvement |
[.hide-footer]
^ There are several steps required as part of data cleaning for a graph computation like this one.
[.hide-footer]
[.hide-footer]
^ You can analyze your graph data before doing anything and remove the most glaring invalid identifiers, but as your graph grows you'll find more and more edge cases to clean. Luckily, cleaning a graph is easy: you just destroy a component
[.hide-footer]
[.hide-footer]
^ Any node you haven't seen in M days is basically useless in advertising (for some value of M) and we leverage this here to prevent having large components
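A minimal sketch of this expiry filter, assuming edges carry an epoch-seconds timestamp label (field names are made up for the example):

```python
# Drop edges not seen in the last M days before rebuilding the graph.
import time

def prune_stale_edges(edges, now=None, max_age_days=30):
    """Keep only (src, dst, ts) edges whose timestamp is recent enough."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds in a day
    return [(src, dst, ts) for src, dst, ts in edges if ts >= cutoff]

now = 1_700_000_000
edges = [("a", "b", now - 5 * 86400), ("b", "c", now - 90 * 86400)]
print(prune_stale_edges(edges, now=now))  # only the recent edge survives
```

Nodes left with no edges after this filter simply disappear from the rebuilt graph, which is what keeps components from growing without bound.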
[.hide-footer]
^ 🤘
[.hide-footer]
^ Welcome to the connected components, we've got fun and games. Destroying a component is the last resort, and only to be done for very large components, and sparingly
[.hide-footer]
[.hide-footer]
^ 🥁
[.hide-footer]
^ We have an existing graph. We can assume it exists in some form. We have the chain data from a batch, maybe daily, maybe a few weeks or hours depending on your problem. We run the connected components algorithm on it
^ And now we have two sets of stars, the existing ones and the new ones. But not all of them are alike
^ Some have nodes in common between existing and new, some do not
^ We process them separately: those that have no nodes in common are clean, the others are tainted
^ The clean ones are good to go, but for the tainted ones, we repeat the process of running large star - small star, with these new edges
^ And we end up with a (very large) consolidated graph
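The clean/tainted split at the heart of this refresh can be sketched like this (sets of node ids stand in for the real component tables):

```python
# Cheap-refresh split: new components with no node in common with the
# existing graph are "clean" and can be appended as-is; the rest are
# "tainted" and must be re-run through large star - small star together
# with the existing components they touch.

def split_components(existing_nodes, new_components):
    clean, tainted = [], []
    for comp in new_components:
        if comp & existing_nodes:  # shares at least one node
            tainted.append(comp)
        else:
            clean.append(comp)
    return clean, tainted

existing = {"a", "b", "c"}
new = [{"c", "d"}, {"x", "y"}]
clean, tainted = split_components(existing, new)
print(clean, tainted)
# → [{'x', 'y'}] [{'c', 'd'}]
```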
[.hide-footer]
^ In Apache Spark of course
[.hide-footer]
[.build-lists: true]
- the process is memory hungry
- the process is shuffle hungry
[.hide-footer]
[.build-lists: false]
- the process is memory hungry
- the process is shuffle hungry
^ This is a rule of thumb: start with large machines, see how it behaves and what kind of query plans appear and then tune from there
[.hide-footer]
[.build-lists: false]
- the process is memory hungry
- the process is shuffle hungry
[.hide-footer]
^ The CBO tries to re-arrange queries depending on cost statistics, but needs to have updated information on all the tables. AQE keeps these up to date as the computations flow, feeding the CBO with fresh data
[.hide-footer]
^ This just requires setting a flag in your SparkConf (spark.sql.adaptive.enabled=true)
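For reference, the equivalent one-liner in `spark-defaults.conf` form (it can equally be set on the session builder or with `--conf` at submit time):

```
spark.sql.adaptive.enabled  true
```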
[.hide-footer]
[.build-lists: true]
- Easy: Move storage to Delta Lake
- Hard: implement union-find-shuffle instead of large star - small star
^ With Delta Lake we'd have the additional advantage of Z-ordering when executing certain joins, but it should be a small win. UFS is supposed to be significantly faster than large star - small star, but implementing something like this requires a good reason. And here comes the end!
[.hide-footer]
[.hide-footer]
Get the slides from my github:
github.com/rberenguel/
The repository is
identity-graphs
[.hide-footer]
[.hide-footer]
[.hide-footer]
| Reference | Image attribution |
|---|---|
| Graphs | Ruben Berenguel 😎 (Generative art with p5js) |
| Bulb | Alessandro Bianchi (Unsplash) |
| Bubbles | Marko Blažević (Unsplash) |
| Chair | Volodymyr Tokar (Unsplash) |
| Cookie | Dex Ezekiel (Unsplash) |
| Loupe | Agence Olloweb (Unsplash) |
| Map | Timo Wielink (Unsplash) |
| Mask | Adnan Khan (Unsplash) |
| Newspaper | Rishabh Sharma (Unsplash) |
| Party | Adi Goldstein (Unsplash) |
| Socket | Kelly Sikkema (Unsplash) |
| Spray | JESHOOTS.COM (Unsplash) |
| Tuning | gustavo Campos (Unsplash) |
| Web | Shannon Potter (Unsplash) |
[.hide-footer]
| Resources |
|---|
| Unicode table |
[.hide-footer]
EOF