Taming Noisy Satellite Data: My Kaggle Journey with NDVI Land Cover Classification 🌱
In Summer Analytics 2025, I participated in a Kaggle hackathon hosted by CAC IIT Guwahati and GeeksforGeeks. The challenge? Classify land cover types (like water, forest, urban areas) using NDVI time-series data from satellite imagery and OpenStreetMap labels.
As a beginner in Kaggle competitions, this hackathon taught me valuable lessons about data cleaning, feature engineering, and model generalization. Here’s how I approached it:
🛰️ The Problem
Each sample had 27 NDVI readings over time. Readings were often missing or noisy because of cloud cover, and the labels themselves could be wrong because they came from crowdsourced OpenStreetMap data. The goal was to predict the correct class:
- 🌊 Water
- 🏙️ Impervious (urban areas)
- 🌾 Farm
- 🌳 Forest
- 🍃 Grass
- 🍎 Orchard
Two leaderboards existed:
- Public leaderboard: Noisy data (89% of the test set).
- Private leaderboard: Clean data (11% of the test set, used for final ranking).
My ranks:
- 📈 Public Rank: #666 / 1395 (47%)
- 🏆 Private Rank: #446 / 1395 (32%)
🧹 Tackling Noisy Data
The dataset had:
✅ Missing values due to cloud cover
✅ Seasonal variations in vegetation
✅ Noisy labels from OpenStreetMap
I used:
- Mean Imputation: To fill missing NDVI values
- Statistical Summaries: Extracted mean, standard deviation, skewness, and trend from each series, so that a single noisy reading carries less weight than it would as a raw feature
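Here is a minimal sketch of the mean-imputation step. It is not the exact competition code: it assumes the NDVI readings sit in a 2-D array with one row per sample and one column per time step, and that each gap is filled with the mean of its own column (time step). The function name `impute_with_column_mean` and the toy array are made up for illustration.

```python
import numpy as np


def impute_with_column_mean(ndvi):
    """Fill NaNs with the mean of the same time step across all samples."""
    ndvi = np.asarray(ndvi, dtype=float)
    # nanmean ignores missing readings when averaging each column
    col_means = np.nanmean(ndvi, axis=0)
    # Keep observed values, substitute the column mean where a value is missing
    return np.where(np.isnan(ndvi), col_means, ndvi)


# Toy example: 3 samples x 3 time steps, with two cloud-covered readings
toy_ndvi = np.array([
    [0.2, np.nan, 0.6],
    [0.4, 0.5, np.nan],
    [0.6, 0.7, 0.8],
])
filled = impute_with_column_mean(toy_ndvi)
print(filled)
```

Filling per time step keeps seasonal structure better than a single global mean, but a per-sample mean or interpolation along time would be reasonable alternatives.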
🔥 Feature Engineering
To make the model robust, I engineered new features:
| Feature | Why? |
|---|---|
| Mean | Average vegetation over time |
| Standard Deviation | Variation in vegetation health |
| Trend | Seasonal growth or decline |
| Skewness | Asymmetry in vegetation signals |
| Kurtosis | Heaviness of the tails, which flags outliers (e.g., sudden changes) |
These features transformed raw time-series data into meaningful summaries.
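The features in the table can be computed roughly like this. It is a sketch under a few assumptions: the input is already imputed (no NaNs), "trend" means the slope of a straight line fitted to each series (the post does not say how trend was measured), and the helper name `summarize_ndvi` is hypothetical.

```python
import numpy as np
from scipy.stats import kurtosis, skew


def summarize_ndvi(ndvi):
    """Turn each NDVI series (one row) into 5 summary features."""
    ndvi = np.asarray(ndvi, dtype=float)
    # Time index 0..T-1, centered so the slope formula stays simple
    t = np.arange(ndvi.shape[1], dtype=float)
    t_centered = t - t.mean()
    centered = ndvi - ndvi.mean(axis=1, keepdims=True)
    # Least-squares slope of each row against time (assumed definition of "trend")
    slope = centered @ t_centered / (t_centered @ t_centered)
    return np.column_stack([
        ndvi.mean(axis=1),       # Mean
        ndvi.std(axis=1),        # Standard deviation
        slope,                   # Trend
        skew(ndvi, axis=1),      # Skewness
        kurtosis(ndvi, axis=1),  # Kurtosis (excess, scipy's default)
    ])


# Toy example: one steadily greening series and one random series, 27 readings each
rng = np.random.default_rng(0)
demo = np.vstack([np.linspace(0.1, 0.9, 27), rng.uniform(0.0, 1.0, 27)])
features = summarize_ndvi(demo)
print(features.shape)
```

Each sample goes from 27 raw readings to 5 numbers, which is what makes the downstream models less sensitive to any single bad reading.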
🤖 Modeling
- Started with Logistic Regression (Multiclass) for a simple baseline
- Improved with Random Forest Classifier to handle noisy, non-linear patterns
- Scaled features using `StandardScaler` for better convergence
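The two models can be set up in scikit-learn along these lines. This is a sketch, not the competition notebook: the data is synthetic (`make_classification` stands in for the five engineered features and six classes), and the hyperparameters such as `n_estimators=200` and `max_iter=1000` are placeholder assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 5 features, 6 classes (not the real NDVI data)
X, y = make_classification(
    n_samples=600, n_features=5, n_informative=5, n_redundant=0,
    n_classes=6, n_clusters_per_class=1, random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Baseline: scaling lives inside the pipeline so it is fit on training data only
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
# Tree ensembles do not need scaled inputs, so the forest is used as-is
forest = RandomForestClassifier(n_estimators=200, random_state=0)

for name, model in [("logistic_regression", baseline), ("random_forest", forest)]:
    model.fit(X_train, y_train)
    print(name, round(model.score(X_val, y_val), 3))
```

Putting the scaler in a pipeline avoids leaking validation statistics into training, which matters when the leaderboard data differs from the training data.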
📊 Results
| Model | Cross-Validation Accuracy | Private Rank Improvement |
|---|---|---|
| Logistic Regression | ~86% | - |
| Random Forest | ~88% | Improved by 220 places |
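Cross-validation accuracy like the numbers above can be measured as in the sketch below. The 5-fold stratified split is an assumption (the post does not state the fold setup), and the data is again synthetic, so the printed scores will not match the table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered features and six land cover classes
X, y = make_classification(
    n_samples=600, n_features=5, n_informative=5, n_redundant=0,
    n_classes=6, n_clusters_per_class=1, random_state=0,
)

models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Stratified folds keep the class mix similar in every split (assumed setup)
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_scores = {}
for name, model in models.items():
    cv_scores[name] = cross_val_score(model, X, y, cv=folds, scoring="accuracy")
    print(name, round(cv_scores[name].mean(), 3))
```

Averaging over folds gives a steadier estimate than a single split, which is a safer guide than the public leaderboard when that leaderboard is scored on noisy data.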
🎯 What I Learned
✅ How to clean and preprocess noisy real-world satellite data
✅ The power of feature engineering in improving model performance
✅ Why generalization matters more than chasing the public leaderboard
🌟 Final Thoughts
This Kaggle hackathon gave me hands-on experience with real-world data challenges. As a beginner, it taught me how to think like a data scientist: from cleaning messy data to building robust models.
👨‍💻 Next Steps: Try sequence models (like LSTMs), compare against gradient-boosted trees (like XGBoost), and explore spatial data visualization.
🏆 Proof of Participation
Here are my ranks in the Kaggle hackathon:
📊 Public Leaderboard Rank: #666 / 1395 (~47%)
📊 Private Leaderboard Rank: #446 / 1395 (~32%)