MLCommons debuts with public 86,000-hour speech dataset for AI researchers

3 years ago
Anonymous $y15ULlV7sG

https://techcrunch.com/2020/12/03/mlcommons-debuts-first-public-database-for-ai-researchers-with-86000-hours-of-speech/

If you want to make a machine learning system, you need data for it, but that data isn’t always easy to come by. MLCommons aims to unite disparate companies and organizations in the creation of large public databases for AI training, so that researchers around the world can work together at higher levels, and in doing so advance the nascent field as a whole. Its first effort, the People’s Speech dataset, is many times the size of others like it, and aims to be more diverse as well.

MLCommons is a new non-profit related to MLPerf, which has collected input from dozens of companies and academic institutions to create industry-standard benchmarks for machine learning performance. The endeavor has met with success, but in the process the team encountered a paucity of open datasets that everyone could use.

Dec 3, 2020, 5:40pm UTC