Cutting-edge creativity is flourishing in a digital age driven by the unbridled power of AI – but that same power is also behind a bitter war over access to the raw material AI needs: any and all data made freely available. Now a nonprofit web archive called Common Crawl is in the crosshairs of a fight that could undermine the open web as we’ve known it. At stake are fundamental questions of copyright, the insatiable data appetite of AI, and the future of artists and researchers around the globe.
On the surface, Common Crawl’s mission sounds as harmless as it is commendable: to crawl the web and build an open archive of it as a whole. But by becoming the foremost data resource for AI development, it has come into direct confrontation with media outlets and copyright owners – a confrontation that began with Danish media outlets demanding that Common Crawl stop using their content in its dataset. Now backed by major international outlets such as The New York Times, the complaint has come to stand for resistance to AI companies appropriating copyrighted content for AI development without permission or payment.
This David and Goliath tale is part of a broader tug of war over the values of open access and innovation. Common Crawl’s surrender to these demands, made out of a pragmatic desire to avoid expensive litigation, appears to be a small loss, but it portends a devastating blow to the ideals of the open web, which many see as the central locus of innovation, free sharing of knowledge and democratic access to information.
The stakes are far greater than the legal dispute between Common Crawl and the media companies. No one stands to lose more than academics, researchers and small-scale AI developers, as Common Crawl’s openness has enabled countless research projects, from identifying patterns in internet censorship in authoritarian states to improving fraud detection methods. If that data becomes harder to obtain, the pipeline for innovation will dry up, hindering research across a wide range of scientific and technological fields.
Strictly enforcing copyright protections would, ironically, also strengthen the dominance of today’s market Goliaths, like OpenAI. Driving out not-for-profit operations such as Common Crawl would leave the field in the hands of the bigger, wealthier outfits that can afford to run their own armies of web crawlers. This might entrench established power relationships in the AI sector, making the field less competitive and stifling innovation by raising the barrier to entry for new and smaller players.
In many ways, the drama exemplifies the wider dilemma of the digital age: how to balance the rights of copyright-holders with the need for open access and innovation. The Danish media’s seemingly coordinated efforts to shut down AI developers’ access to their content reflect an entire sector’s attempt to secure fair remuneration. Yet the reciprocal aim of ensuring that AI developers retain enough access to data to keep the technology advancing remains a distant and elusive goal.
As people grapple with these issues, answers may emerge – whether through smarter copyright law or alternative licensing mechanisms – that strike a balance between compensating rights-holders and preserving the digital commons for research and innovation. However this battle plays out, the outcome could largely determine the trajectory of AI development and the rest of the digital world for decades to come.
Discussion of Common Crawl’s problems serves as a proxy for a broader debate about the future of the open web. ‘Open’ is not just tech speak; it is a philosophy, a call to make something available to (almost) everybody for the sake of collaboration and shared effort. As this debate plays out, the larger question of what happens to openness as the web becomes more closed is not just about legal status or the economic impact of access, but about what a future without openness might look like, and what might be lost for a society that looks to the web for so many things. The future of innovation depends on openness, and on a healthy ecosystem of makers, researchers and entrepreneurs.
The choice now facing publishers, copyright-holders and makers of AI is vital: whether to reassert control as their sole prerogative, at the risk of losing the potential benefits of the enormous changes in media and technology and, in these savage times, our collective sense of freedom. Openness remains the centrepiece of the digital dream. As we draw towards its culmination, we must keep the principle and its inspiration alive.
© 2024 UC Technology Inc. All Rights Reserved.