Training Data Considerations
"How much data do I need?" is a common question. The answer is not a single number. What matters most is how representative the training data is of the patterns the model needs to learn and the area where the model will be applied.
Is your data AI-ready? Below are some important considerations for data used in fine-tuning.
More data is (usually) better
More data of good quality will generally improve model performance. Often there are multiple sources of relevant data that can benefit your model, either by adding more examples of what you hope to detect or by adding other classes that help the model distinguish targets from non-targets.
Data quality
As the old saying goes, "garbage in, garbage out." You should understand the quality of your labels, including visually inspecting the data before training a model. If your data quality is mixed, consider evaluating how the model performs with all the data versus only higher-confidence labels.
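One way to run that comparison is to split your labels by a confidence score and train on each subset. The sketch below is a minimal illustration; the `confidence` field and the 0.8 threshold are assumptions, not part of any OlmoEarth API.

```python
# Hypothetical labels with a per-label confidence score (e.g., annotator agreement).
labels = [
    {"class": "water", "confidence": 0.95},
    {"class": "water", "confidence": 0.40},
    {"class": "forest", "confidence": 0.88},
    {"class": "forest", "confidence": 0.55},
]

CONFIDENCE_THRESHOLD = 0.8  # assumed cutoff; tune for your dataset

all_labels = labels
high_conf_labels = [lab for lab in labels if lab["confidence"] >= CONFIDENCE_THRESHOLD]

# Train one model on `all_labels` and another on `high_conf_labels`,
# then compare their scores on the same held-out evaluation set.
print(len(all_labels), len(high_conf_labels))  # 4 2
```

Keep the evaluation set fixed (and ideally high-confidence) so the comparison reflects training-data quality rather than evaluation noise.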
Spatial distribution
Many points/polygons collected from a small location are often less helpful than the same number collected from a broader area, as spatial diversity helps the model generalize. Too much clustering can make some labels redundant. In general, ground truth observations should reflect the spatial heterogeneity of your area of interest, though the right balance depends on your task and deployment area.
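A quick way to spot over-clustering is to compare the average nearest-neighbor distance of your label points against the scale of your area of interest. This is a rough O(n²) sketch assuming projected coordinates in meters; for large datasets you would use a spatial index instead.

```python
import math

def mean_nearest_neighbor_distance(points):
    """Average distance from each point to its closest neighbor (O(n^2) sketch)."""
    total = 0.0
    for i, (x1, y1) in enumerate(points):
        nearest = min(
            math.hypot(x1 - x2, y1 - y2)
            for j, (x2, y2) in enumerate(points)
            if i != j
        )
        total += nearest
    return total / len(points)

# Same label count, very different spatial spread (coordinates in meters).
clustered = [(0, 0), (1, 0), (0, 1), (1, 1)]
spread = [(0, 0), (1000, 0), (0, 1000), (1000, 1000)]

print(mean_nearest_neighbor_distance(clustered))  # 1.0
print(mean_nearest_neighbor_distance(spread))     # 1000.0
```

If this distance is tiny relative to your deployment area, many labels are likely redundant and the model may see little of the area's true variability.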
Class distribution
Your labeled data (ground truth or annotations) should be reasonably balanced across classes so the model does not overfit to the majority classes and neglect the minority ones. Ideally, the spatial and categorical distribution of training samples should approximate the expected distribution across your area(s) of interest.
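Before fine-tuning, it is worth printing a simple class histogram to catch imbalance early. The class names below are hypothetical; in practice you would read them from your annotation file.

```python
from collections import Counter

# Hypothetical class labels pulled from an annotation file.
labels = ["wheat", "wheat", "wheat", "maize", "maize", "soy", "other"]

counts = Counter(labels)
total = sum(counts.values())

# Print counts and shares, most frequent first.
for cls, n in counts.most_common():
    print(f"{cls:>6}: {n} ({n / total:.0%})")
```

A heavily skewed histogram suggests collecting more minority-class labels, or at least checking per-class metrics rather than only overall accuracy.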
Negative / Non-target data
In addition to learning positive examples, models also require negative or non-target classes to produce high-quality results. A couple of examples:
- A quality model for detecting man-made water reservoirs needs examples of natural water bodies (e.g., lakes, rivers) so it can distinguish them from man-made reservoirs.
- If you run inference over a broad area but are only interested in certain features, you can group non-target classes into a single "other" class. For example, a crop type model may include crop classes (e.g., wheat, maize, soy) and an "other" class representing all other land cover, such as non-target crops, built-up areas, or water bodies. If inference is limited to a well-defined area (e.g., high-quality cropland extent), an "other" class may not be necessary if the training data is representative of the crops within that area.
If you do not have annotations of negative examples or other classes expected within the fine-tuning and inference area, the model may incorrectly classify unrelated features as the target class.
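Grouping non-target classes into a single "other" class can be done with a simple label remapping before fine-tuning. The class names here are hypothetical, following the crop type example above.

```python
# Classes we actually care about; everything else collapses to "other".
TARGET_CLASSES = {"wheat", "maize", "soy"}

def remap(label):
    """Map a raw land-cover label to a target class or 'other'."""
    return label if label in TARGET_CLASSES else "other"

raw = ["wheat", "water", "maize", "built-up", "soy", "grassland"]
remapped = [remap(lab) for lab in raw]
print(remapped)  # ['wheat', 'other', 'maize', 'other', 'soy', 'other']
```

The "other" class should be sampled from the same regions you plan to run inference over, so it actually represents the confusers the model will encounter.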
Volume of labeled data
The minimum number of labels per class should reflect the area of interest's heterogeneity and the task's complexity. For example, distinguishing water from land in a small, homogeneous area is simple compared to more complex or nuanced classes, like different types of trees.
Foundation models can reduce the amount of training data required to achieve reasonably accurate results; however, this should not be misconstrued to mean that extremely scarce data will produce good results.
- For global mangrove mapping, 100k labels achieved an F1 score of 0.981, while just 10k points achieved a similar score of 0.969.
- For crop type mapping in a 3,000 km² area, 1,000 labels yielded an average F1 score of 0.79; increasing the number of labels continued to improve the F1 score until all 3,000 labels were used, yielding an F1 score of 0.92.
Polygon data
Polygons should be clean, continuous shapes. Self-intersecting edges (e.g., figure-eight or bowtie shapes) are a common issue when collecting data with GPS devices or drawing polygons manually. OlmoEarth will attempt to repair these automatically, but for best results, review and clean your polygon data before uploading. Polygons made up of multiple separate parts are also supported.
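You can catch bowtie-style polygons before uploading. Libraries such as Shapely expose validity checks (`is_valid`, `explain_validity`), but the idea can be sketched in pure Python: a ring is self-intersecting if any two non-adjacent edges properly cross. This is a minimal illustration, not a full validity test (it ignores touching edges and holes).

```python
def _ccw(a, b, c):
    """Signed area test: >0 if a->b->c turns counter-clockwise."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def segments_cross(p1, p2, p3, p4):
    """True if segments p1-p2 and p3-p4 properly cross each other."""
    d1 = _ccw(p3, p4, p1)
    d2 = _ccw(p3, p4, p2)
    d3 = _ccw(p1, p2, p3)
    d4 = _ccw(p1, p2, p4)
    return d1 * d2 < 0 and d3 * d4 < 0

def is_self_intersecting(ring):
    """Check a closed polygon ring (list of (x, y); first point not repeated)."""
    n = len(ring)
    edges = [(ring[i], ring[(i + 1) % n]) for i in range(n)]
    for i in range(n):
        for j in range(i + 2, n):
            if i == 0 and j == n - 1:
                continue  # first and last edges are adjacent
            if segments_cross(*edges[i], *edges[j]):
                return True
    return False

square = [(0, 0), (2, 0), (2, 2), (0, 2)]   # clean ring
bowtie = [(0, 0), (2, 2), (2, 0), (0, 2)]   # figure-eight: edges cross at (1, 1)
print(is_self_intersecting(square))  # False
print(is_self_intersecting(bowtie))  # True
```

Flagged polygons can then be repaired in your GIS tool of choice before upload, rather than relying on automatic repair.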