Add telemetry data catalog collection to wrangler deployments#12708
Add telemetry data catalog collection to wrangler deployments#12708dario-piotrowicz wants to merge 15 commits intomainfrom
Conversation
🦋 Changeset detectedLatest commit: 6cf4b27 The changes in this PR will be included in the next version bump. Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
|
create-cloudflare
@cloudflare/kv-asset-handler
miniflare
@cloudflare/pages-shared
@cloudflare/unenv-preset
@cloudflare/vite-plugin
@cloudflare/vitest-pool-workers
@cloudflare/workers-editor-shared
wrangler
commit: |
e8f9344 to
3d335e5
Compare
3d335e5 to
7a7311a
Compare
There was a problem hiding this comment.
i think we have a telemetry.md file, can you add this to that?
There was a problem hiding this comment.
maybe.... but in regards to TELEMETRY_DATA_CATALOG_WORKER_URL I thought that maybe we could not surface this too much? it is mainly for us (for manual testing and unit testing). It would be part of the public API but users shouldn't really have a reason to mess with this (and if they want to disable this they could just do so by disabling all the wrangler telemetry). what do you think? 🙂
Regarding the other info, I checked and I think telemetry.md more or less generally included this type of information being collected, but I'll double check to see if that's actually the case 👍
There was a problem hiding this comment.
oh i don't mean the env var, i meant it would be good to include this list of what we collect in the telemetry.md file for transparency - dependency info (and account ids) is the big new one i guess
There was a problem hiding this comment.
Info added: 6b4c759
Please let me if you think this is good 🙂
| | "OPEN_NEXT_DEPLOY" | ||
|
|
||
| /** URL for the telemetry data catalog worker (defaults to `""` in vitest, the production data collector worker otherwise; falsy disables collection). */ | ||
| | "TELEMETRY_DATA_CATALOG_WORKER_URL"; |
There was a problem hiding this comment.
| | "TELEMETRY_DATA_CATALOG_WORKER_URL"; | |
| | "WRANGLER_TELEMETRY_WORKER_URL"; |
? to make it clear its for wrangler i guess
There was a problem hiding this comment.
I was actually trying to distinguish this from "standard telemetry"... if we're going to have two different types of telemetry it'd be nice if we could have name them differently.... 🤔
PS: DATA_CATALOG is also probably not a hugely helpful term here? 😅
There was a problem hiding this comment.
package catalog?
I do agree with Emily that we should count all of this stuff as the same currently. If we do break it up in the future we should do that more deliberately.
There was a problem hiding this comment.
package catalog?
That's what you sort of initially suggested I think, the issue with that for me is that we also include the bindings count info though 😓 (which seems a bit unrelated to packages to me?)
I do agree with Emily that we should count all of this stuff as the same currently. If we do break it up in the future we should do that more deliberately.
sorry, when I said "two different types of telemetry" I meant this new telemetry data collection vs amplitude, these are going to be differently right of the bat, right?
There was a problem hiding this comment.
for a user, i don't think we should make any distinction in different types of telemetry
but this env var is for internal use, and we might want to use it for stuff outside wrangler in the future, so idm whatever its called 🤷
There was a problem hiding this comment.
I don't have a super strong preference either 😅
I think it's unlikely for this to be used by other tools and not just wrangler... but it might happen? 🤷
I'm also happy with whatever name here 😄
|
Hey folks, as a heavy wrangler user and someone who is generally a fan of using data for informed decisions, including all of my private packages names and versions in your telemetry data feels a little heavy-handed to me. Telemetry data should be anonymous and Cloudflare's own telemetry docs call this out:
and
Can package names be considered truly anonymous data, or not similar to file names/paths? They often include uniquely identifying things like company names in npm namespaces via What exactly will this data be used for, and why do you need it? This is concerning, and likely to mean I disable telemetry on all of my projects moving forward and recommend others do the same. Perhaps a safer approach would be to send only specific framework detected versions, and/or specifically allowlisted packages/namespaces? |
.changeset/tidy-carrots-dance.md
Outdated
| Wrangler now sends anonymized usage data to a telemetry data catalog during `wrangler deploy`. This includes: | ||
|
|
||
| - Account ID and Worker name | ||
| - Wrangler version and package manager used | ||
| - Deployment timestamp | ||
| - Binding type counts (how many KV namespaces, R2 buckets, D1 databases, etc.) | ||
| - Project dependency information (package names and versions from package.json) |
There was a problem hiding this comment.
anonymized usage data
package names
These seem contradictory.
There was a problem hiding this comment.
anonymized usage data
Account ID and Worker name
This also. This can't really be called anonymous when you're tracking exactly who a user is, and what Worker they're touching.
|
Hey @Cherry @MattIPv4 , thanks for the feedback. I'm just consulting with some team members internally on how we word this in the telemetry file as we will have to change that more so than @dario-piotrowicz has helpfully suggested in this PR (which is awaiting product review). You are all of course welcome to disable telemetry on your projects. We do this in public and with these controls because we believe it should be the user's choice as to whether they want to make the trade off from an informed position. In this case we have unannounced features that we believe provide real value to users that are dependent on us collecting this data. This is data that we already have (it is attached as part of your bundle to us) but is not queryable currently. We'll be very specific about what we use user identity for in an upcoming addition to this PR that aligns |
|
👍 I don't particularly agree with collecting this data generally, but if you do, I agree that much more specific wording is needed in the telemtry documentation around this, and I think it should be very explicit that it is not anonymous -- you're collecting package names, which will undoubtedly identify certain folks/companies based on private packages, and you're also stating you're collecting the account id + worker name, which is also identifying. |
|
I think some thought should also be given to whether you release this in a minor release, enabled by default, or whether this needs to be opt-in and/or a major release -- the existing telemetry, as I understand it, is anonymous in nature, and with this change, you'd be suddenly collecting identifying telemetry. |
|
Thanks @MattieTK.
I think that's a fundamental difference from "telemetry" data. That's essentially project metadata you store directly tied to my account/worker, and wrapping that into telemetry (which should be anonymous) definitely needs some more thought and consideration with how this is presented and released as you noted.
Is this accurate? Apologies if I'm misunderstanding, but this bundle includes what lands in my worker, which with sourcemaps might include some of this information - but that's again something directly tied to my worker metadata, not telemetry for a CLI tool. That's not really telemetry, but stuff needed at runtime for my worker, logs, etc. And it's not simply every single package I have installed in this project with their versions. As an aside, other common tooling has a dedicated page about specifically what is collected, and how it's used such as the following, and it'd be great to see Cloudflare have the same. |
Our opinion here is that this is data that we could collect server-side on deployment anyway and we are choosing to do this locally to provide you with more control and enable those that are already opting out of data collection to also opt out of this in the same command. All other telemetry captured about your usage (which is entirely separate, goes into a different system, and is not linked to your account at all) is not changing as part of this. |
|
The data you can collect server-side is for packages I bundle into my worker, absolutely. You'd get that from my bundle output, sourcemaps, etc. - that's a product feature, and knowing what is running on your edge via project metadata is an absolutely reasonable thing to do, at least for common framework versions, etc. That would not include all of my private packages sitting in my project for other non-worker related things, etc. though. And that's the fundamental difference here. You're essentially wrapping a product metadata feature behind "anonymous" telemetry data in a CLI tool which is not where it belongs, and has not been communicated up to this point would be part of the telemetry data that is very much no longer anonymous. If this is a fundamentally different system, which is sounds like via this:
then I feel like this would be better served and communicated as a separate thing outside of just "telemetry", with it's own opt-in. |
|
👀 I appreciate the pushed change to attempt to filter out private packages, but I don't think that approach will really work, and I don't think it is good enough to then assert in the documentation that you're only capturing public packages. Nothing about the logic change pushed up actually ensures the package is public, all it appears to be doing is looking for the I'd suggest instead of the approach that was just pushed up, you make an API call to the NPM registry for each package, and only track if it the package is publicly available on NPM, and ideally is above a certain downloads/month threshold (as to not store personal/company packages that are "public" but would still potentially identify a specific project due to limited package usage). |
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
bb4a1c4 to
2af4c3d
Compare
|
Codeowners approval required for this PR:
Show detailed file reviewers
|
|
🎉 Thank you! Only tracking public packages on NPM w/ over 10k downloads a year makes me a lot less resistant/nervous regarding this change, and it seems like it'd be unlikely to break the existing anonymous data statement. Still definitely needs very explicit documentation about what is being tracked now, and if account ID + name are still being collected as part of this same dataset, I would think that breaks the anonymity of it all. |
No worries, I was about to post a comment but you beat me to it 😄 Thanks for the suggestion, you were right that projects could have packages in their own registries, so it makes sense to me to check the npm registry 👍
Yeah account ID + name are still being collected, I guess you think the changes in |
Indeed. The document's first sentence is The new bullet on public dependencies could probably also be updated w/ your latest change to clarify that you're only tracking public dependencies on the NPM registry w/ >10k/yr downloads. |
|
This feels a lot better now, thank you! I might recommend increasing the threshold here by an order of magnitude to 100k though, which over the course of a year is much harder to hit (for anything that isn't used enough to be considered for "anonymous" reporting) than 10k. Even a small package we maintain and include in most projects, CI, etc. hits >13k downloads a year at https://www.npmjs.com/package/@nodecraft/eslint-config and would identify my team and projects pretty quickly. |
Thanks @Cherry 🙂, I'll check with the team if that's possible 🙂 The problem with your suggestion is that new packages can be created and might want to track their usage right away (for example we'd be interested here to track the use of our new |
|
I think in the case of vinext, the stats there are just lagging a bit due to the fresh release. It had ~12k downloads this month in just the last 7 days (https://www.npmcharts.com/compare/vinext) so will easily cross 100k/year soon. An allowlist of all cloudflare-owned packages also seems like a reasonable approach to me. |
Yes, of course we could handle (Well if the package is not owned by us and hasn't gone super viral maybe we wouldn't have that huge drive to start tracking it?) |
Better @MattIPv4? 😄 |
Is all the data being collected as a single dataset, and thus every item of data collected is associated with the account id + worker name? If so, I don't think that wording is sufficient, as the entire dataset being collected is now tied to the identity of the account/worker. If not, and the account id + worker name is being collected independently, then it is fine -- but at that point, I'm not sure why you'd collect them anyway, unless you wanted to do analysis on patterns in worker names or something. (Wording change on which packages are captured looks good 👍) I do also still think this should be a breaking change and/or opt-in, as I'm sure is clear now based on this updated wording, you're going from non-identifying telemetry to identifying telemetry... |
|
I feel like this might also need to go via y'all's legal department -- you're actively collecting identifying information now, so how would a user go about exercising the rights I suspect they'd now have with this to request that their identifying data be deleted, or request a copy of it? I imagine that should also be clarified in the telemetry document itself? |
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
|
Hey folks, After some discussion this morning, we've decided to remove the npm integration as we're concerned that in CI this might cause invisible failures if the npm api is rate limited given the number of builds we perform. This PR is still in progress and we'll be making further modifications to the messaging presented to the user on a new version, and the telemetry documentation, including any legal contact requirements. At this point if you don't want to be included in this data then you are welcome to opt out using the standard telemetry opt out. We will honour this moving forward and will intentionally opt out users from this feature and features like it if they have specified that they want to opt out of telemetry generally. We also recognise that you see this as different from fully anonymous local telemetry, which it is, and in the future we may offer a separate opt out for this feature. We'll use information from the first few weeks of capture to further tune this and hopefully reintroduce or come up with other alternatives for some of the filters suggested here. We also don't want to capture anything that doesn't work to the goals of the team here, which is to be able to make a better Workers product at Cloudflare. |
|
That is very unfortunate to hear -- what had been pushed up was a good solution to finding a balance between knowing about major packages to warn users regarding vulnerabilities, while avoiding capturing identifying private packages... and now you're back to capturing all packages, including those from private registries etc. If this is telemetry, why not just gracefully fail to capture package information if the NPM calls fail? And if this is a product offering instead, do it in the product instead of pretending it is just telemetry for an open-source CLI tool. I cannot possibly see how capturing private packages works towards the stated goal of letting users know about vulnerabilities, and it smells to me like there is some other unstated reason for gathering this information.
I find these two sentences from your most recent push completely incongruent. You cannot start the file by stating you do not collect sensitive information, and then immediatley walk that back by stating you do collect sensitive information (account id, worker name, dependencies including those that could be private and identify companies).
These two sentences also seem completely incongruent... you'll only use it for aggregate analysis, except you won't, you're also storing and going to use it to identify individual accounts... If you're walking back on the NPM logic that was added here, this document should make it absolutely crystal clear to users that their private dependencies will be captured and stored in a way that identifies them specifically. Again, can't see why you'd need to do that though unless you wanted to identify specific companies/users... |
Partially fixes https://jira.cfdata.org/browse/DEVX-2376 and https://jira.cfdata.org/browse/DEVX-2013
The changes in this PR send various details via a POST request regarding a deployment to
https://telemetry-data-catalog-collector.devprod.workers.devfor a new telemetry collection that will improve our observability on what's currently deployed.Note
The worker at
https://telemetry-data-catalog-collector.devprod.workers.devis not yet implemented, it will be soon, the worker not being implemented is not a problem since wrangler will not error in case the POST request failsA picture of a cute animal (not mandatory, but encouraged)