The Rise Of Small Data

John Nosta 11/05/2023

Natural language processing (NLP) has been revolutionized by large language models (LLMs), which rely on massive training datasets and computational resources.

A growing number of researchers and developers are shifting their focus towards smaller, curated datasets for training LLMs, challenging the conventional wisdom that bigger is always better when it comes to data. These datasets are built through synthetic generation or by scavenging existing projects, and they are gaining popularity because they are often open source and demand far less compute and cost. This shift towards small data has the potential to democratize access to cutting-edge AI technologies and to foster a more diverse, innovative landscape in natural language processing and beyond.

Over the past decade, the field of natural language processing has been revolutionized by the advent of large language models like ChatGPT and BERT. These models have demonstrated unprecedented capabilities across a variety of tasks, owing to their massive training datasets and the computational resources used to build them. However, a growing number of researchers and developers are now shifting their focus towards smaller, curated datasets for training LLMs. This trend challenges the conventional wisdom that bigger is always better when it comes to data.

Scaling Laws and Flexibility in Data Size

Recent research and practical implementations suggest that there is more flexibility in data scaling laws than previously thought. Rather than relying on vast amounts of raw data, many projects now leverage smaller, carefully selected datasets to achieve similar or even better performance. This is particularly relevant for organizations and researchers outside of major tech companies like Google, which have traditionally dominated the LLM landscape.
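
To ground the reference to scaling laws, here is a minimal Python sketch of the Chinchilla-style parametric loss, L(N, D) = E + A / N^alpha + B / D^beta, where N is the model's parameter count and D is the number of training tokens. The constants are roughly the fits reported by Hoffmann et al. (2022), and the two training budgets are purely illustrative assumptions; note that data quality, which curated datasets exploit, does not appear in the formula at all.

    # Chinchilla-style parametric loss: L(N, D) = E + A / N**ALPHA + B / D**BETA.
    # The constants are approximately the fits reported by Hoffmann et al. (2022);
    # treat them as illustrative rather than exact.
    E, A, B = 1.69, 406.4, 410.7
    ALPHA, BETA = 0.34, 0.28

    def predicted_loss(n_params: float, n_tokens: float) -> float:
        """Predicted pretraining loss for a model with n_params parameters
        trained on n_tokens tokens."""
        return E + A / n_params**ALPHA + B / n_tokens**BETA

    # Two hypothetical budgets: a larger model on fewer tokens versus a smaller
    # model on more tokens can land at a similar predicted loss, which is the
    # flexibility the scaling-law framing allows. Data quality is not modeled.
    for n, d in [(70e9, 300e9), (30e9, 1.0e12)]:
        print(f"N={n:.0e} params, D={d:.0e} tokens -> predicted loss ~ {predicted_loss(n, d):.2f}")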

Synthetic Methods and Scavenging Techniques

Two main approaches have emerged for creating these smaller, curated datasets: synthetic methods and scavenging from existing projects. Synthetic methods generate new data by sampling an existing model and keeping only its best responses; the resulting dataset can then be used to fine-tune a model for specific tasks or domains, as sketched below. Scavenging, on the other hand, reuses and repurposes data from other projects to assemble a new dataset.
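
As a concrete illustration, the following Python sketch shows what the synthetic filtering step might look like. The generation and scoring functions are hypothetical stand-ins: a real pipeline would call an actual LLM and score its outputs with a reward model, heuristics, or a stronger model acting as a judge. The shape of the workflow is what matters: sample several candidate responses per prompt, keep only the highest-scoring ones, and save the survivors as a small instruction-tuning dataset.

    import json
    import random

    def generate_response(prompt: str) -> str:
        """Hypothetical stand-in: sample one response from an existing model."""
        return f"response to: {prompt} ({random.random():.3f})"

    def quality_score(prompt: str, response: str) -> float:
        """Hypothetical stand-in: score a response (a real pipeline might use a
        reward model, format heuristics, or an LLM acting as a judge)."""
        return random.random()

    def build_synthetic_dataset(prompts, samples_per_prompt=4, keep_top_k=1, min_score=0.5):
        """Generate candidates per prompt, keep only the best, and return
        instruction/response pairs suitable for fine-tuning."""
        dataset = []
        for prompt in prompts:
            candidates = [generate_response(prompt) for _ in range(samples_per_prompt)]
            scored = sorted(((quality_score(prompt, c), c) for c in candidates), reverse=True)
            for score, response in scored[:keep_top_k]:
                if score >= min_score:
                    dataset.append({"instruction": prompt, "response": response})
        return dataset

    if __name__ == "__main__":
        prompts = [
            "Summarize the benefits of small, curated training datasets.",
            "Explain what a data scaling law is in one paragraph.",
        ]
        curated = build_synthetic_dataset(prompts)
        # JSON Lines is a common on-disk format for fine-tuning data.
        with open("curated_dataset.jsonl", "w") as f:
            for row in curated:
                f.write(json.dumps(row) + "\n")
        print(f"Kept {len(curated)} curated examples from {len(prompts)} prompts")

Scavenging is harder to show generically, since it depends on which existing datasets are being repurposed, but it typically ends with the same two steps: filter for quality and re-serialize into a common instruction format.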

Neither of these methods is currently dominant at companies like Google, which tend to rely on massive, raw datasets. However, these curated datasets have proven to be highly effective in training LLMs, and they are rapidly becoming the standard in the broader AI community.

Open Source and Accessibility

One of the most significant advantages of these smaller, curated datasets is their availability as open-source resources. This means that researchers and developers worldwide can access and utilize these datasets for their projects, fostering innovation and collaboration. Furthermore, the use of smaller datasets reduces the computational resources and costs associated with training LLMs, making the technology more accessible to smaller organizations and individuals.

Bigger Isn’t Necessarily Better

The shift towards smaller, curated datasets in LLM training marks an important and interesting turning point in the field of natural language processing. By challenging the conventional wisdom of big data, this approach has opened up new possibilities for more efficient and effective language model training. The use of synthetic methods and scavenging techniques has made it possible to create high-quality datasets, which, when combined with their open-source nature, have the potential to democratize access to cutting-edge AI technologies. As these small data approaches become increasingly prevalent, we can expect to see a more diverse and innovative landscape in natural language processing and AI as a whole. 
