Add client identifier header to HTTP requests #1075
Creates an anonymized repository identifier based on the SHA of the first commit, hashed with SHA256 for additional privacy. The identifier is sent as an `x-hex-client-id` header when available.
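As a rough illustration of the derivation described above, here is a minimal sketch. It is not the code from this PR: the module name, the choice of `git rev-list --max-parents=0 HEAD` to find the first commit, and taking the first line when a repository has several root commits are all assumptions.

```elixir
defmodule RepoIdentifierSketch do
  # Hypothetical module; names and the exact git invocation are assumptions.

  # Returns a lowercase hex SHA256 digest of the root commit SHA, or nil
  # when git is unavailable or `dir` is not inside a repository with commits.
  def fetch(dir \\ ".") do
    args = ["rev-list", "--max-parents=0", "HEAD"]

    # stderr_to_stdout keeps git's error output off the user's terminal.
    case System.cmd("git", args, cd: dir, stderr_to_stdout: true) do
      {output, 0} -> output |> String.split("\n", trim: true) |> List.first() |> hash()
      _ -> nil
    end
  rescue
    # System.cmd/3 raises when the git executable cannot be found.
    _error -> nil
  end

  # Hash the commit SHA so the raw value never leaves the machine.
  def hash(nil), do: nil

  def hash(sha) when is_binary(sha) do
    :crypto.hash(:sha256, String.trim(sha)) |> Base.encode16(case: :lower)
  end

  # Only attach the header when an identifier is available.
  def headers(nil), do: []
  def headers(id) when is_binary(id), do: [{"x-hex-client-id", id}]
end
```

Calling `RepoIdentifierSketch.fetch() |> RepoIdentifierSketch.headers()` yields either an empty list or a single `x-hex-client-id` header holding a 64-character hex string.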
The new variable allows users to opt out of identification by setting the value to anything other than `1` or `true`. It remains enabled by default.
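The opt-out rule can be sketched as a small predicate. This is an illustration only: the module and function names are hypothetical, and the variable name is passed in rather than assumed, since it changed over the life of the PR.

```elixir
defmodule OptOutSketch do
  # Hypothetical helper; not the PR's implementation.

  # Reads the named environment variable and applies the rule below.
  def enabled?(var_name) when is_binary(var_name) do
    var_name |> System.get_env() |> enabled_value?()
  end

  # Unset means enabled (the default). When set, only "1" or "true"
  # keep the identifier on; any other value opts out.
  def enabled_value?(nil), do: true
  def enabled_value?(value) when is_binary(value), do: value in ["1", "true"]
end
```

Keeping the string rule in `enabled_value?/1` separates it from the environment lookup, so it can be checked without touching the real environment.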
lib/hex/utils.ex
Outdated
    - The current directory isn't within a git repository
    """
    def repo_identifier do
      with :unset <- Process.get(:hex_repo_identifier, :unset),
It's not clear how useful this will be in practice, as I imagine fetching deps is parallelized across multiple processes. However, the codebase appears to support back to Elixir v1.6, hence OTP 19, which is before persistent_term was available and we can't cache between processes very easily.
What does everybody think? Is this useful? Is it worth looking at ETS for caching?
could you look into using Hex.State (an agent) for storage? We even have some affordances there for env variables but not sure if they'd help in this particular case, i.e. I don't think we need to be able to set a custom repo id via env.
Thanks for the tip. I switched to Hex.State to check whether the identifier is enabled or disabled, as well as caching the value, which is far more convenient.
Do you think they should be combined into a single repo_identifier value that's either nil, false, or the cached binary?
Prevents fetching the same identifier on every client call by caching the value in the current process dictionary. While the git command isn't particularly slow, this should avoid spawning during repeated http calls.
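A minimal sketch of per-process caching with the process dictionary follows. The module name and key are hypothetical, and `fun` stands in for the git lookup; the `:unset` sentinel lets a `nil` result be cached too.

```elixir
defmodule ProcessCacheSketch do
  # Hypothetical key; any term unique to this cache would do.
  @key :repo_identifier_sketch

  # Runs `fun` at most once per process. Later calls in the same process
  # return the stored value, including a stored nil.
  def cached(fun) when is_function(fun, 0) do
    case Process.get(@key, :unset) do
      :unset ->
        value = fun.()
        Process.put(@key, value)
        value

      value ->
        value
    end
  end
end
```

Because the process dictionary is local to each process, a second process calling `cached/1` runs the lookup again, which is the limitation raised in the review thread above.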
Switch from a raw `System.get_env` to using the normalized state agent. Also prevent STDERR from leaking when called outside of a git repository.
I'd love to see this make it in, would be very impactful information for a lot of libraries and frameworks. Anything I can do to help?
ericmj
left a comment
Is this considered PII and how do we need to inform users about this change other than a note in the changelog?
Other tracking we've talked about adding is:
- Session ID header, which would be a unique ID for each `mix deps.get`
- A header if the env var `CI` is set, to help identify real users vs CI runs
The repo identifier value is now cached with an agent, which also serializes access to fetch the state. This also removes the value from `Hex.State`, which was previously used for caching but is ultimately unnecessary.
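One way an agent can both cache the value and serialize access is sketched below. This is an assumption about the general shape, not the PR's code: the module name is hypothetical and `fetch_fun` stands in for the git lookup.

```elixir
defmodule AgentCacheSketch do
  use Agent

  # Hypothetical names; `fetch_fun` stands in for the git lookup.
  def start_link(fetch_fun) when is_function(fetch_fun, 0) do
    Agent.start_link(fn -> {:unset, fetch_fun} end, name: __MODULE__)
  end

  # The lookup runs inside the agent process, so concurrent callers are
  # handled one at a time and only the first call triggers the lookup.
  def get do
    Agent.get_and_update(__MODULE__, fn
      {:unset, fetch_fun} ->
        value = fetch_fun.()
        {value, {:cached, value}}

      {:cached, value} = state ->
        {value, state}
    end)
  end
end
```

Unlike the process dictionary, the cached value here is shared by every process that calls `get/0`. The trade-off is that a slow lookup blocks other callers until the agent replies.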
@ericmj Took me a while, but I've addressed both of your requests.
Thank you @sorentwo 💜
    Returns `nil` when:

    - The `HEX_REPO_IDENTIFIER` environment variable is set to anything other than `1` or `true`
Maybe it should be "HEX_NO_REPO_IDENTIFIER"?
Either version is fine by me (HEX_REPO_IDENTIFIER=false or HEX_NO_REPO_IDENTIFIER=true)
I mean, the doc here says: HEX_REPO_IDENTIFIER
But the code is actually using NO_HEX_REPO_IDENTIFIER:
So maybe the docs should be changed to NO_HEX_REPO_IDENTIFIER.
Or, I'm missing something. 😅
Oh, it's already fixed: https://github.com/hexpm/hex/blob/68758fbffda1069fda61fa503b83844b39513fbe/lib/hex/repo_identifier.ex#L10C10-L10C32
I'm sorry. 😅
My mistake! This PR drifted a while and I completely forgot about the name change. Glad @ericmj took care of it =)
We need to ask for consent before sending this telemetry according to GDPR because it creates a persistent identifier and is not required for the functionality of Hex. Since it has limited usefulness when it's opt-in I am inclined to remove it. cc @sorentwo
@ericmj I’m not sure this identifier would fall under GDPR. It’s a git repo–scoped value, not tied to any user or machine. Different people or machines using the same repo would generate the same identifier, reinforcing that it’s linked to the project/git repo, not to a person. I guess GDPR only applies to data that can identify a natural person, not to identifiers of software artifacts like a Git repository. If it were a machine-specific identifier (like a device ID, IP, or cookie), then it could make a person indirectly identifiable — that’s when GDPR applies. But this one isn’t, so it should be outside GDPR’s scope. But I'm not an expert on GDPR. Only saying that based on my research and personal experience.
@hugobarauna I see your point but it's enough of an edge case where I think the risk is not worth it. In most cases a repository is not identifiable to a single person but I could imagine cases where it could be argued that a repository identifier is tied to a single person. For example if you start working on a new project for a month before publishing it to github under your username, any |
Creates an anonymized repository identifier based on the SHA of the first commit, hashed with SHA256 for additional privacy. The identifier is sent as an `x-hex-client-id` header when available.
For example, in the `hex` repo itself:
The value can then be used by hex.pm and private hex servers to more accurately count requests. That will give a clearer picture of how many unique applications are using a particular package, rather than just the raw number of downloads.
Some followup steps if this approach is accepted:
/cc @wojtekmach