Basically a deer with a human face. Despite probably being some sort of magical nature spirit, his interests are primarily in technology and politics and science fiction.

Spent many years on Reddit and then some time on kbin.social.

  • 0 Posts
  • 79 Comments
Joined 3M ago
Cake day: Mar 03, 2024


It is impossible for them to contain more than random fragments; the models are far too small to hold the training data, even heavily compressed. Even the fragments that have been found are not exact; the AI is “lossy” and hallucinates.
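
As a rough back-of-the-envelope illustration (every number below is an assumption I'm picking for the sake of the arithmetic, not a figure from any particular model):

```python
# Rough illustration: a model's weights are far smaller than its training data,
# so it cannot be storing the corpus verbatim. All numbers are assumptions.
params = 7e9                     # a 7B-parameter model
bytes_per_param = 2              # fp16 weights
model_size_gb = params * bytes_per_param / 1e9

tokens_trained = 2e12            # ~2 trillion training tokens
bytes_per_token = 4              # very roughly 4 bytes of text per token
corpus_size_gb = tokens_trained * bytes_per_token / 1e9

print(f"model weights: ~{model_size_gb:.0f} GB")    # ~14 GB
print(f"training text: ~{corpus_size_gb:.0f} GB")   # ~8000 GB
print(f"ratio: ~{corpus_size_gb / model_size_gb:.0f}x more text than weights")
```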

The examples that have been found are cases of overfitting, a flaw in training where the same data gets fed into the training process hundreds or thousands of times over. This is something that modern AI training goes to great lengths to avoid.
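
And as a sketch of the kind of de-duplication step I mean (a toy exact-match version; real training pipelines use fancier near-duplicate detection, so treat this as illustrative only):

```python
import hashlib

def dedupe_documents(documents):
    """Drop exact duplicate documents so no text gets trained on many times over.
    Real pipelines also do near-duplicate detection (e.g. MinHash); this just
    shows the idea."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = ["some article", "some article", "a different article"]
print(dedupe_documents(corpus))  # the repeated article appears only once
```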


You could say it’s to “circumvent” the law or you could say it’s to comply with the law. As long as the PII is gone what’s the problem?


The GDPR says that information that has been anonymized, for example through statistical analysis, is fine. LLM training is essentially a form of statistical analysis. There’s hardly anything in law that is “simple.”


You don’t think LLMs are being trained off of this content too? Nobody needs to bother “announcing a deal” for it, it’s being freely broadcast.


Maybe it’s “simple as that” if you’re just expressing an opinion, but what’s the legal basis for it?


The analogy isn’t perfect, no analogy ever is.

In this case the content of the search is all that really matters for the quality of the search. What else would you suggest be recorded, the words-per-minute typing speed, the font size? If they want to improve the search system they need to know how it’s working, and that involves recording the searches.

It’s anonymized and you can opt out. Go ahead and opt out. There’ll still be enough telemetry for them to do their work.
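
To make “anonymized and you can opt out” concrete, here's a minimal sketch of what such a telemetry event could look like; the field names and the opt-out flag are hypothetical, not any product's actual schema:

```python
import uuid

TELEMETRY_OPT_OUT = False  # stand-in for the user's opt-out preference

def record_search_event(query: str, results_returned: int):
    """Build an anonymized event: the search text (which is what matters for
    judging search quality) plus a random session ID, with no account, IP,
    or other identifying details attached."""
    if TELEMETRY_OPT_OUT:
        return None  # opted out: nothing gets recorded at all
    return {
        "session": uuid.uuid4().hex,   # random per-session ID, not tied to a user
        "query": query,
        "results_returned": results_returned,
    }

print(record_search_event("privacy settings", 12))
```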


No, this analogy would make more sense if it was a matter of recording a large number of interactions between customers and tellers to ensure that the window isn’t interfering with their interactions. Is the window the right size? Can the customer and teller hear each other through it? Is that little hole at the bottom large enough to let through the things they need to physically exchange? If you deploy the windows and then never gather any telemetry you have no idea whether it’s working well or if it could be improved.


Buddy, I just want to type a search term and get results.

Telemetry can help them do better at providing that. Devs aren’t magical beings, they don’t know what’s working and what’s not unless someone tells them.


Fortunately it doesn’t have to be exactly like the real thing to be useful. Just ask machine learning scientists.


And that’s what Microsoft has apparently done in this case, yet it’s being spun negatively anyway.


So, Microsoft recognized and responded to all the complaints by removing the feature that people were objecting to.

Resulting headline: “Microsoft is trying to hide the evidence that they were thinking of doing that thing we hated! Hate them harder!”

Do people want companies to just ignore complaints completely because there’s no way to satisfy anyone anyway?


This is a bill that’s only just now being put up for signature.


If you want a music artist that is guaranteed to not come out in support of some political view, you should probably pick a dead one.

Alternately, you could sign up to udio.com or suno.com and generate some of your own. Since it’s an AI making the music you can be sure it holds no opinions.


People find a shiny new word that sounds clever. They start using it so that they sound clever too. They like sounding clever, so they use it a lot. They start using it for things that it doesn’t actually mean, until it loses all meaning other than the most generic “I don’t like that.” The word becomes enshittified and people eventually stop using it.

Needing a substitute, people find a new word…


Too late, there’s a little blood in the water so now everyone hates Microsoft and is pouncing on every drop they think they smell. Being part of an angry mob is fun!

Don’t worry, in a couple of weeks or months some other big company or rich person will become the focus and everyone will forget about Microsoft again.


Indeed, which is why I’m furious at the Internet Archive’s leadership for merrily dancing out into a minefield completely unbidden.


I’ve been saying this for years, this was an incredibly boneheaded move by the Internet Archive and they just keep on doubling down on it. They shouldn’t have done it in the first place. When they got sued, they should have immediately admitted they screwed up and settled - the publishers would probably have been fine with a token punishment and a promise to shut down their ebook library, it’s not like IA cost them anything significant. But they just keep on fighting, and it’s only making things worse.

This isn’t even IA’s purpose in the first place! They archive the Internet. They’re like a guy caring for a precious baby who decides to go poke a bear with a stick, and when the bear doesn’t respond at first, whacks it over the nose with the stick instead. Now the bear’s got his leg and he’s screaming “oh no, protect my baby!” And it’s entirely his fault the baby’s in danger.


But you’re claiming that there’s already no ladder. Your previous paragraph was about how nobody but the big players can actually start from scratch.

Adding cost only makes the threshold higher. The opposite of the way things should be going.

All this aside from the conceptual flaws of such legislation. You’d be effectively outlawing people from analyzing data that’s publicly available to anyone with eyes.

How? This is a copyright suit.

Yes, and I’m saying that it shouldn’t be. Analyzing data isn’t covered by copyright; only copying data is. Training an AI on data isn’t copying it. Copyright should have no hold here.

Like I said in my last comment, the gathering of the data isn’t in contention. That’s still perfectly legal and anyone can do it. The suit is about the use of that data in a paid product.

That’s the opposite of what copyright is for, though. Copyright is all about who can copy the data. One could try to sue some of these training operations for having made unauthorized copies of stuff, such as the situation with BookCorpus (a collection of ebooks, basically pirated, that many LLMs have been trained on). But even in that case the copyright violation isn’t the training of the LLM itself, it’s the distribution of BookCorpus. And one detail of piracy that the big copyright holders don’t like to talk about is that, generally speaking, downloading pirated material isn’t the illegal part; uploading it is. So even there an LLM trainer might be able to use BookCorpus, and it’s whoever gave them the copy of BookCorpus that’s in trouble.

Once you have a copy of some data, even if it’s copyrighted, there’s no further restriction on what you can do with that data in the privacy of your own home. You can read it. You can mulch it up and make paper mache sculptures out of it. You can search-and-replace the main character’s name with your own, and insert paragraphs with creepy stuff. Copyright is only concerned with you distributing copies of it. LLM training is not doing that.

If you want to expand copyright in such a way that rights-holders can tell you what analysis you can and cannot subject their works to, that’s a completely new thing and it’s going down a really weird and dark path for IP.


They’re the ones training “base” models. There are a lot of smaller base models floating around these days with open weights that individuals can fine-tune, but they can’t start from scratch.
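
For a sense of what that individual-scale fine-tuning looks like in practice, here's a minimal sketch using the Hugging Face transformers, datasets, and peft libraries; the model name, data file, and hyperparameters are placeholders, not a recipe for any particular model:

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

base = "some-open-base-model"  # placeholder: any open-weight causal LM
tokenizer = AutoTokenizer.from_pretrained(base)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # many causal LMs lack a pad token
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA trains a small set of adapter weights instead of the whole model,
# which is what makes fine-tuning feasible on consumer hardware.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

dataset = load_dataset("text", data_files={"train": "my_corpus.txt"})["train"]
dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512),
                      remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```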

What legislation like this would do is essentially let the biggest players pull the ladders up behind them - they’ve got their big models trained already, but nobody else will be able to afford to follow in their footsteps. The big established players will be locked in at the top by legal fiat.

All this aside from the conceptual flaws of such legislation. You’d be effectively outlawing people from analyzing data that’s publicly available to anyone with eyes. There’s no basic difference between training an LLM off of a website and indexing it for a search engine, for example. Both of them look at public data and build up a model based on an analysis of it. Neither makes a copy of the data itself, so existing copyright laws don’t prohibit it. People arguing for outlawing LLM training are arguing to dramatically expand the concept of copyright in a dangerous new direction it’s never covered before.
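
To make the search-engine comparison concrete, here's a toy inverted index: the index is a model built by analyzing pages, not a copy of the pages themselves (deliberately simplified, nothing like a production indexer):

```python
from collections import defaultdict

def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each word to the set of URLs it appears on. The index is derived
    from the pages but doesn't reproduce them, much like model weights
    derived from training text."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

pages = {
    "https://example.com/a": "public data anyone can read",
    "https://example.com/b": "more public data on another page",
}
print(build_index(pages)["public"])  # both URLs, but no page text stored verbatim
```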


I don’t think you’re familiar with the sort of resources necessary to train a useful LLM up from scratch. Individuals won’t have access to that for personal use.
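
For a rough sense of the scale involved, here's the common ~6·N·D FLOPs rule of thumb for transformer training; every number below is an assumption for illustration, not anyone's actual specs or pricing:

```python
# Back-of-the-envelope training cost for a from-scratch LLM. All inputs are
# assumptions for illustration only.
params = 70e9                 # 70B-parameter model
tokens = 1.4e12               # ~1.4 trillion training tokens
flops = 6 * params * tokens   # rule of thumb: ~6 FLOPs per parameter per token

gpu_sustained_flops = 3e14    # ~300 TFLOP/s sustained per high-end GPU (assumed)
gpu_hours = flops / gpu_sustained_flops / 3600
cost_per_gpu_hour = 2.0       # assumed cloud price in USD

print(f"GPU-hours: ~{gpu_hours:,.0f}")                     # hundreds of thousands
print(f"cost: ~${gpu_hours * cost_per_gpu_hour:,.0f}")     # millions of dollars
```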


You realize that if cases like this are won then only the “giant fucking corporations” are going to be able to afford the datasets to train AI with?




It’s both capable and willing, the problem is that not everyone agrees with the solutions being used. And so they say “we’re doing it wrong” instead of “I think we’re doing it wrong.”


Ah, that only happens right after launch when they’re still bunched together. Once the satellites get into their final orbits they spread out. The newer models also have anti-reflection systems that make them much harder to spot; SpaceX has been working with astronomers on that.


At some point one of the big countries that’s being negatively affected by this is going to say “screw it, let’s try solar geoengineering.” The cost is actually pretty small on this scale.

Sure would be nice if there was a bunch of research to draw on when that time comes, instead of just flailing away.


Low Earth orbit has been heavily commercialized for decades already. If you mean Starlink specifically, what’s wrong with it?


Reusable rocketry, specifically SpaceX Starship. If it pans out it’s going to completely change our access to space and make many of those old dreams from the 1970s plausible.

RNA vaccines for basically everything, including customized vaccines for cancer. There’s also actual progress happening in general cures for autoimmune diseases.

Is robotics too close to AI? There are multiple companies working on general-purpose humanoid robots intended for mass production, with price targets in the ten to twenty thousand dollar range; we may be getting within sight of actual robot butlers.



Yes, but it shows how an LLM can combine its own built-in knowledge with information pulled in from web searches.

The question I’m responding to was:

I wonder why nobody seems capable of making a LLM that knows how to do research and cite real sources.

And Bing Chat is one example of exactly that. It’s not perfect, but I wasn’t claiming it was. Only that it was an example of what the commenter was asking about.

As you pointed out, when it makes mistakes you can check them by following the citations it has provided.
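
For what it's worth, the underlying search-then-cite pattern is simple enough to sketch; the search and model calls below are placeholder functions standing in for whatever search API and LLM you'd actually wire up, not Bing Chat's real internals:

```python
def web_search(query: str) -> list[dict]:
    """Placeholder for a real search API; should return dicts with
    'title', 'url', and 'snippet' keys."""
    raise NotImplementedError

def ask_llm(prompt: str) -> str:
    """Placeholder for a call to whatever LLM you're using."""
    raise NotImplementedError

def answer_with_citations(question: str) -> str:
    # Fetch a handful of sources, number them, and ask the model to cite them,
    # so the reader can follow the citations and check the answer.
    results = web_search(question)[:5]
    sources = "\n".join(f"[{i + 1}] {r['title']} ({r['url']}): {r['snippet']}"
                        for i, r in enumerate(results))
    prompt = (
        "Answer the question using only the numbered sources below, and cite "
        "them like [1], [2] after each claim so the reader can check them.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )
    return ask_llm(prompt)
```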




Over the past month I feel like all I’ve been doing is writing tech design documents for systems I don’t actually know anything about because I haven’t had the opportunity to go in and do anything with them.

Fortunately I’ve finally reached the point where everyone agrees that we should just start implementing the basics and see how that goes, rather than trying to plan it all out ahead of time, since we’re surely going to have to throw out the later plans once we see what we’re actually dealing with.


As Churchill once famously said, democracy is the worst form of government except for all those other forms that have been tried from time to time.

I guess one of the nicest things about democracy is that it has a built-in mechanism for removing a government. It may not be reliable at getting good leaders in place, but at least when there’s a bad one it has a way of getting them back out again without having to go in shooting.


Israel elected those people. They bought what the peddlers of war were selling.


Ah, we’re doing one of those full circle things. I actually remember the time when AOL was “the internet.”


Going back to my original comment:

Sure, but the fact that not all “AI” is really AI doesn’t mean it isn’t real.

The fact that Amazon was faking it in this one instance doesn’t poof all the actual AI out of existence. There are plenty of off-the-shelf AI models that are good enough for various particular problems, they can go ahead and use them. You said it yourself, the chatbot “help staff” might be actual LLMs.

At that point you might just try to figure out how to offload the work to someone else.

As I said, most companies using AI will likely be hiring professional AI service providers for it. That’s where those hundreds of billions of dollars I mentioned above are going, where all the PhDs spending years on R&D are working.
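
As an example of the kind of off-the-shelf model I mean, here's the Hugging Face pipeline API applied to a narrow problem; it pulls down the library's default pretrained sentiment model on first run, so treat it as illustrative rather than a recommendation of any particular model:

```python
from transformers import pipeline

# Downloads a small pretrained sentiment model the first time it's run;
# no training, no PhDs, just an off-the-shelf model applied to a narrow problem.
classifier = pipeline("sentiment-analysis")

tickets = [
    "My order arrived broken and support hasn't replied in a week.",
    "Thanks, the replacement part fixed everything!",
]
for ticket in tickets:
    print(ticket, "->", classifier(ticket)[0])
```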


How is that Microsoft’s fault? Should they be forcing users to care, somehow? The warning is already getting people angry as it is.


Did you read the article? The popup warns users about it, yes. It’s a good thing to let them know there won’t be more security updates for their OS.


You’re still talking about training AIs, though. Using AIs doesn’t require years of work and PhDs to research. You just sign a contract with one of the AI service providers and they give you an API. You may need to do a little scripting to hook up a front end and some fiddling with prompts and parameters to get the AI to respond correctly, but as I said above, I’ve done this myself in my own home. Entirely on my own, entirely just for fun. It’s really not hard, I could point you to a couple of links for some free software you could use to do it yourself. Heck, even the training part isn’t hard if you’re starting with one of the existing open models and you’ve got the hardware for it.
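
For a concrete sense of how little scripting that “hook up a front end” step involves, here's a sketch that talks to an OpenAI-style chat completions endpoint; the URL and model name are placeholders, and local servers such as llama.cpp's or Ollama's expose this same kind of API:

```python
import requests

API_URL = "http://localhost:8080/v1/chat/completions"  # placeholder endpoint
MODEL = "your-model-name"                               # placeholder model name

def chat(user_message: str, temperature: float = 0.7) -> str:
    """One round trip to an OpenAI-style chat completion API."""
    payload = {
        "model": MODEL,
        "temperature": temperature,   # the kind of parameter fiddling I mean
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
    }
    response = requests.post(API_URL, json=payload, timeout=60)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

print(chat("Say hello in one sentence."))
```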

Do you really think the chatbot “help staff” at all those companies out there (the ones that speak perfect English and respond faster than a well-trained typist could type) are most likely just an outsourced workforce at some cheap foreign company? What is the hundreds of billions of dollars’ worth of computer hardware the AI service providers are running actually being used for, if not that?