Machine learning without centralized training data (googleblog.com)
91 points by nealmueller 3 hours ago | 24 comments





This is one of those announcements that seems unremarkable on first read-through but could be industry-changing in a decade. The driving force behind consolidation & monopoly in the tech industry is that bigger firms with more data have an advantage over smaller firms, because they can deliver features (often using machine learning) that users want and small startups or individuals simply cannot implement. This, in theory, provides a way for users to maintain control of their data while granting permission for machine-learning algorithms to inspect it and "phone home" with an improved model, without revealing the individual data. Couple it with a P2P protocol and a good on-device UI platform and you could in theory construct something similar to the WWW, with data stored locally, but with all the convenience features of centralized cloud-based servers.

I think you could be spot on: there are new applications emerging in deep learning, like self-driving vehicles, where you have a powerful need for mountains of data to train complex models, yet a logistics problem in how to aggregate that data in a single place (I'm making this up, but imagine a car collecting 7 streams of 4K video at 60 fps). I really see a growing need for these types of distributed training models.

Which is why I am surprised to see this come from Google. Everyone has already shown they're fine with sending all of their data to Google, which benefits Google greatly.

I think this may be a Xerox Alto, IBM PC, or Sun Java moment. In the short term I can see a clear benefit to Google from this. They want to get machine learning into more aspects of the Android mobile experience, Android customers are justifiably paranoid about sending things like every keystroke on the device back to Google's cloud, and so this gives them a privacy-acceptable way to deliver the features that will make them more competitive in a new market. Remember that the vast majority of Google employees honestly want to do what's best for the user, not what preserves Google's monopoly.

The vast majority of people in any organisation are good people who want what's best in the sense of the greater good; that does not prevent organisations from doing bad things.

Google's internal training emphasizes doing the right thing and competing fairly%, going so far as to avoid terms in PR or even internal email such as 'crush the competition', 'dominate', or 'destroy', and always doing what's good for the user rather than what's bad for the competition.

% and often mentions competition/monopoly laws


Which has already been the case for something like 10 years. That is why no startup has ever challenged Google on search. It is basically the Matthew effect, but at web scale.

This is fascinating, and makes a lot of sense. There aren't too many companies in the world that could pull something like this off... amazing work.

Counterpoint: perhaps they don't need your data if they already have the model that describes you!

If the data is like oil, but the algorithm is like gold... then they can still extract the gold without extracting the oil. You're still giving it away in exchange for the use of their service.

For that matter, run the model in reverse, and while you might not get the exact data... we've seen that machine learning has the ability to generate something that simulates the original input...


Reminds me of a talk I saw by Stephen Boyd from Stanford a few years ago: https://www.youtube.com/watch?v=wqy-og_7SLs

(Slides only here: https://www.slideshare.net/0xdata/h2o-world-consensus-optimi...)

At that time I was working at a healthcare startup, and the ramifications of consensus algorithms blew my mind, especially given the constraints of HIPAA. This could be massive within the medical space, being able to train an algorithm with data from everyone, while still preserving privacy.


"Federated Learning enables mobile phones to collaboratively learn a shared prediction model while keeping all the training data on device, decoupling the ability to do machine learning from the need to store the data in the cloud."

So I assume this would help with privacy in the sense that you can train a model on user data without transmitting it to the server. Is this in any way similar to what Apple calls 'Differential Privacy' [0]?

"The key idea is to use the powerful processors in modern mobile devices to compute higher quality updates than simple gradient steps."

"Careful scheduling ensures training happens only when the device is idle, plugged in, and on a free wireless connection, so there is no impact on the phone's performance."

It's crazy what the phones of the near future will be doing while 'idle'.

------------------------

[0] https://www.wired.com/2016/06/apples-differential-privacy-co...


While I think you can definitely draw some parallels, differential privacy seems more targeted at metric collection. You have to be able to perturb the data in a way that makes it non-identifying, without corrupting the answer in aggregate. Apple would still do all their training in the cloud.

In contrast, what Google's proposing is more like distributed training. In regular SGD, you'd iterate over a bunch of tiny batches, sequentially through your whole training set. Sounds like Google's saying each device becomes its own mini-batch, beams up the result, and Google averages them all out in a smart way (I didn't read the paper, but this was the gist I got from the article).

Both ideas are in the same spirit, just the implementations are very different.
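
To make that concrete, here's a rough sketch in plain numpy of what that kind of federated averaging loop might look like. This is my own toy illustration, not the paper's algorithm: a linear model stands in for whatever the phone actually trains, and the function names are made up. The key point is that each client only ever sends back a weight delta and an example count, never its data.

    import numpy as np

    def local_update(global_w, X, y, lr=0.1, epochs=5):
        # A few local gradient steps on the device's private data
        # (plain linear regression standing in for the real model).
        w = global_w.copy()
        for _ in range(epochs):
            grad = 2 * X.T @ (X @ w - y) / len(y)
            w -= lr * grad
        return w - global_w, len(y)   # send only the delta, never X or y

    def server_round(global_w, clients):
        # Average the deltas, weighted by how many examples each device saw.
        results = [local_update(global_w, X, y) for X, y in clients]
        total = sum(n for _, n in results)
        avg_delta = sum(delta * (n / total) for delta, n in results)
        return global_w + avg_delta

    # Toy usage: three "phones", each with its own private data.
    rng = np.random.default_rng(0)
    true_w = np.array([1.0, -2.0, 0.5])
    clients = []
    for _ in range(3):
        X = rng.normal(size=(20, 3))
        clients.append((X, X @ true_w + 0.1 * rng.normal(size=20)))

    w = np.zeros(3)
    for _ in range(10):   # ten communication rounds
        w = server_round(w, clients)
    print(w)              # approaches true_w without pooling any raw data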


Differential privacy is much more than what Apple's PR department says; differentially private SGD is already a thing.

This is different from differential privacy (which, btw, isn't just an Apple thing). Differential privacy essentially says some responses will be lies, but that we can still get truthful aggregate information. The canonical example is the following process: flip a coin; if it's heads, tell me truthfully whether you're a communist. If it's tails, flip another coin: if that one comes up heads, tell me you're a communist, and if it's tails, tell me you're not.

From one run, you can't tell if any individual is telling the truth, but you can still estimate the number of communists from the aggregate responses.
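
For anyone who wants to see the aggregate recovery work, here's a quick simulation of that coin-flip scheme (my own toy numbers, nothing from a paper): each answer is only truthful half the time, yet the population rate falls out of a simple inversion.

    import random

    def randomized_response(truth):
        if random.random() < 0.5:        # first coin heads: answer truthfully
            return truth
        return random.random() < 0.5     # tails: answer by a second coin flip

    def estimate_rate(answers):
        # P(yes) = 0.5 * true_rate + 0.25, so invert that relationship.
        p_yes = sum(answers) / len(answers)
        return 2 * (p_yes - 0.25)

    population = [random.random() < 0.3 for _ in range(100000)]   # 30% true rate
    answers = [randomized_response(x) for x in population]
    print(estimate_rate(answers))   # ~0.3, though no single answer can be trusted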

This is doing local model training, and sending the model updates, instead of the raw data that would usually be used for training.


Here's google doing this in November 2015:

http://download.tensorflow.org/paper/whitepaper2015.pdf


Chrome used differential privacy far before Apple. See the RAPPOR paper.

This is speculative, but it seems like the privacy aspect is oversold, as it may be possible to reverse-engineer the input data from the model updates. The point is that the model updates themselves are still specific to each user.

I would not be surprised if the specific contents of what you wrote were not accessible, but I would expect the general theme to be captured. The weights that get updated correspond to neurons, which correspond to specific themes/subjects.

Well, you obviously can't fully reverse-engineer it, since whatever model update is being sent is far, far smaller than the overall data. Now, could you theoretically extract "some" data? Maybe, but it is still strictly better than sending all of the data.

Even if this only enabled device-based training and had no privacy advantages, it would be exciting as a form of compression: rather than sucking up device upload bandwidth, you keep the data local and send only the tiny model weight delta!

Huge implications for distributed self-driving car training and improvement.

I think the implications go even beyond privacy and efficiency. One could estimate each user's contribution to the fidelity gains of the model, at least as an average within a batch. I imagine such an attribution being rewarded with money or credibility in the future.
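
A crude way to do that attribution (purely hypothetical, nothing like this is in the paper) would be leave-one-out scoring within a round: re-average the updates without each client and see how much a held-out loss degrades.

    import numpy as np

    def held_out_loss(w, X_val, y_val):
        return float(np.mean((X_val @ w - y_val) ** 2))

    def attribute_round(global_w, deltas, X_val, y_val):
        # Score each client's delta by how much the validation loss would
        # worsen if that client had not participated in this round.
        full_w = global_w + np.mean(deltas, axis=0)
        base = held_out_loss(full_w, X_val, y_val)
        scores = []
        for i in range(len(deltas)):
            others = [d for j, d in enumerate(deltas) if j != i]
            w_without = global_w + np.mean(others, axis=0)
            scores.append(held_out_loss(w_without, X_val, y_val) - base)
        return scores   # positive = that client helped this round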

While a neat architectural improvement, the cynic in me thinks this is a fig leaf for the voracious inhalation of your digital life they're already doing.

Particularly relevant: "Federated Learning allows for smarter models ... all while ensuring privacy". Reading the paper, Google would still receive model updates, so this statement seems based on the assumption that you can't learn anything meaningful about me from those higher-level features (which are far reduced in dimensionality from the raw data). I'm curious how they back up that argument.

This is literally non-stochastic gradient descent where the batch update simply comes from a single node and a correlated set of examples. Nothing mind-blowing about it.


