Wednesday, July 3, 2024

Microsoft's AI Chief Says Your Content Is Fair Game If It's on the Open Web

The warning used to be that anything you put on the internet stays there--somewhere--forever. The advent of artificial intelligence models has put a twist on this: now it's that anything you post online will end up in an AI, whether you want it to or not. That is, at least, what Mustafa Suleyman, co-founder of Google's DeepMind AI unit and now CEO of AI at Microsoft, thinks. According to Suleyman, it's fine for AI companies to scour every corner of the open web--which is, arguably, anything on any website that's not protected behind a paywall or login interface--and to use what they find to train their algorithms. In an era of rapid growth by data-hungry generative AI services, it's a stark reminder that you or your company should never publish anything on your website or a social media service that you don't want treated as fair game.

Fair use or abuse?

Speaking to CNBC at the Aspen Ideas Festival last week, Suleyman referred to the widely accepted idea that whatever content is "on the open web" is freely available to be used by someone else under fair use principles, news site Windows Central reports. This is the notion that if you quote, review, reimagine, or remix small parts of someone else's content, that's OK. Specifically, Suleyman said that for fair use content, "anyone can copy it, recreate with it, reproduce it." Note his use of the word "copy," rather than simply "reference" or "remix." Suleyman is implying that if someone has published text, imagery, or any other material on a website, it's fine for companies like his to copy it wholesale. This is already somewhat questionable: Fair use isn't designed to enable outright copying, and one of the big no-nos is copying someone else's work for your own financial gain.

But Suleyman's next words will worry critics who think big AI's powers are already too vast. Suleyman acknowledged that a publisher or a news organization can explicitly mark its content so that systems like Google's web-indexing bots--the automated crawlers that tell Google's search index where everything is online--either can or cannot access the info. He also noted that some sites mark content as OK for bots to access only for search engines, but not "for any other reason," such as AI data training (a minimal example of this kind of marking appears below). He said this is a gray area, and one that he thinks is "going to work its way through the courts." Suleyman may be hinting that he thinks sites simply shouldn't be able to bar their content from being looked at by AI. This is timely, since the high-profile AI firm Perplexity is in the spotlight for allegedly ignoring exactly this sort of online content marking and scraping websites' data without permission.

Data, data everywhere, for any AI to "drink"

The big problem is that generative AI needs tons and tons of data to work; without data, it's just a big box of complicated math. With data, AI algorithms are shaped to reply with real-world information when you type in a query. In the hunt for this data, some companies, like Apple, are partnering with content publishing services, like Reddit, to gain access to the billions of pieces of text, photos, and more that users have uploaded over the years. But other companies have been accused of questionably or even illegally snaffling up data that they really shouldn't take. The New York Times and other newspapers have launched lawsuits over this, and three big record labels just sued music-generating AI companies that they claim have illegally copied their music archives.
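To make the content marking Suleyman describes concrete, here is a minimal sketch of a robots.txt file that welcomes conventional search indexing while opting out of AI-training crawlers. The crawler names are real user-agent tokens (Googlebot for Google Search, GPTBot for OpenAI, Google-Extended for Google's AI-training systems, CCBot for Common Crawl), but the blanket allow/disallow layout is only an illustrative assumption, not a recommendation for any particular site.

# robots.txt -- allow search indexing, refuse AI-training crawlers
User-agent: Googlebot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

Nothing in such a file technically blocks anything; it simply states a preference that well-behaved bots are expected to honor, which is exactly why the Perplexity allegations and Suleyman's "gray area" remark matter.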
As more and more companies, from one-person businesses to giant enterprises, embrace AI technology, this is yet another reminder that you need to be careful with the answers spit out by a chatbot: Someone needs to check that they don't violate another company's intellectual property rights before you use them. It's also a stark warning to be very careful that your website, social media posts, or any other content you publish online isn't giving away proprietary data you'd prefer to keep private, because it could simply end up being used to train someone's AI model.

BY KIT EATON @KITEATON
