Over the past decade, the field of natural language processing (NLP) has been revolutionized by the advent of large language models (LLMs) such as BERT and the GPT family behind ChatGPT. These models have demonstrated unprecedented capabilities across a variety of tasks, owing to the massive training corpora and computational resources behind them. However, a growing number of researchers and developers are now shifting their focus towards smaller, curated datasets for training LLMs. This trend challenges the conventional wisdom that bigger is always better when it comes to data.
Recent practical implementations suggest there is more flexibility in data scaling laws than previously thought. Rather than relying on vast amounts of raw data, many projects now leverage smaller, carefully selected datasets to achieve similar or even better performance. This is particularly relevant for organizations and researchers outside major tech companies like Google, which have traditionally dominated the LLM landscape.
Two main approaches have emerged for creating these smaller, curated datasets: synthetic generation and scavenging from existing projects. Synthetic generation involves sampling an existing model and keeping only its best responses; the resulting data can then be used to fine-tune a model on specific tasks or domains (see the sketch below). Scavenging, on the other hand, involves reusing and repurposing data from other projects to assemble a new dataset.
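To make the synthetic approach concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not a description of any particular project's pipeline: `generate` and `score` are hypothetical placeholders standing in for a call to an existing model and for a quality filter such as a reward model or heuristic checks.

```python
# Minimal sketch of the "synthetic" approach: sample several candidate
# responses per prompt from an existing model, keep only the best-scoring
# candidate, and write the survivors out as a small fine-tuning dataset.
import json
import random
from typing import List

def generate(prompt: str, n_candidates: int = 4) -> List[str]:
    # Hypothetical placeholder: in practice, call an existing LLM here.
    return [f"{prompt} -> candidate {i}" for i in range(n_candidates)]

def score(response: str) -> float:
    # Hypothetical placeholder quality filter: in practice, use a reward
    # model, length/format heuristics, or human review.
    return random.random()

def build_dataset(prompts: List[str], threshold: float = 0.5) -> List[dict]:
    dataset = []
    for prompt in prompts:
        # Score every candidate once, then keep the best one.
        scored = [(score(c), c) for c in generate(prompt)]
        best_score, best = max(scored)
        if best_score >= threshold:  # drop prompts with no good candidate
            dataset.append({"prompt": prompt, "response": best})
    return dataset

if __name__ == "__main__":
    rows = build_dataset(["Explain data scaling laws", "Summarize BERT"])
    with open("curated.jsonl", "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
```

Scavenging is even simpler in code terms: it usually amounts to collecting, deduplicating, and reformatting records from existing open datasets into a single training corpus.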
Neither of these methods is currently dominant at companies like Google, which tend to rely on massive, raw datasets. However, curated datasets have proven highly effective for training LLMs, and they are becoming increasingly common in the broader AI community, particularly among open-source projects.
One of the most significant advantages of these smaller, curated datasets is their availability as open-source resources. This means that researchers and developers worldwide can access and utilize these datasets for their projects, fostering innovation and collaboration. Furthermore, the use of smaller datasets reduces the computational resources and costs associated with training LLMs, making the technology more accessible to smaller organizations and individuals.
The shift towards smaller, curated datasets in LLM training marks a significant turning point in the field of natural language processing. By challenging the conventional wisdom of big data, this approach has opened up new possibilities for more efficient and effective language model training. Synthetic generation and scavenging have made it possible to create high-quality datasets, which, combined with their open-source nature, have the potential to democratize access to cutting-edge AI technologies. As these small data approaches become increasingly prevalent, we can expect a more diverse and innovative landscape in natural language processing and AI as a whole.
John is the #1 global influencer in digital health and generally regarded as one of the top global strategic and creative thinkers in this important and expanding area. He is also one of the most popular speakers around the globe, presenting his vibrant and insightful perspective on the future of health innovation. His focus is on guiding companies, NGOs, and governments through the dynamics of exponential change in the health/tech marketplaces. He is also a member of the Google Health Advisory Board, pens HEALTH CRITICAL for Forbes, a top global blog on health & technology, and THE DIGITAL SELF for Psychology Today, a leading blog focused on the digital transformation of humanity. He is also on the faculty of Exponential Medicine. John has an established reputation as a vocal advocate for strategic thinking and creativity. He has built his career on the "science of advertising," a process where strategy and creativity work together for superior marketing. He has also been recognized for his ability to translate difficult medical and scientific concepts into material that can be more easily communicated to consumers, clinicians, and scientists. Additionally, John has distinguished himself as a scientific thinker. Earlier in his career, John was a research associate at Harvard Medical School and has co-authored several papers with global thought leaders in the field of cardiovascular physiology, with a focus on acute myocardial infarction, ventricular arrhythmias, and sudden cardiac death.