A new paper about cities 🏙️, their structure 🧩 and satellites 🛰️!
"Decoding (urban) form and function using spatially explicit deep learning" by me and @darribas
https://doi.org/10.1016/j.compenvurbsys.2024.102147
We took spatial signatures and built a neural network to pick them up from Sentinel 2 imagery. Since the network alone didn't show great results, we mixed in some geographic data science to learn if, and how much, it helps.
A quick 🧵 on work that took us 3 years to finish 🙃.
Including spatial modelling 'on top' of the CNN predictions consistently improved the performance of the modelling pipeline, while other techniques showed only occasional success.
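The paper spells out the exact spatial modelling used; as a flavour of the general idea (a hedged sketch, not necessarily the paper's method), one simple option is to blend each chip's CNN class probabilities with those of its neighbours before taking the argmax, so isolated misclassifications get pulled back toward their spatial context:

```python
import numpy as np

def spatially_smooth(probs, alpha=0.5):
    """Blend each cell's class probabilities with the mean of its
    4-neighbours on a regular grid of chips.

    probs: (rows, cols, n_classes) array of CNN softmax outputs.
    alpha: weight on the cell's own prediction (illustrative parameter,
           not from the paper).
    """
    padded = np.pad(probs, ((1, 1), (1, 1), (0, 0)), mode="edge")
    neigh = (padded[:-2, 1:-1] + padded[2:, 1:-1]
             + padded[1:-1, :-2] + padded[1:-1, 2:]) / 4
    smoothed = alpha * probs + (1 - alpha) * neigh
    return smoothed.argmax(axis=-1)  # final class per chip

# a 3x3 grid of 2-class predictions with one noisy outlier chip
p = np.full((3, 3, 2), [0.9, 0.1])
p[1, 1] = [0.4, 0.6]              # lone chip disagreeing with its context
print(spatially_smooth(p))        # outlier pulled back to class 0
```

The same logic extends to larger neighbourhoods or distance-weighted kernels; the point is only that predictions stop being treated as spatially independent.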
In the end, we improved prediction accuracy by almost 0.2 in some cases, bringing the tricky signature prediction up to the level of common land use / land cover prediction models.
The link to the open access paper is in the first post. The code is available at https://urbangrammarai.xyz/signature_ai/ and the intermediate data at https://doi.org/10.6084/m9.figshare.26203367.
@martinfleis Awesome, this seems like a good thread to ask geo x ai questions 🤓
I always wondered why it's not more popular to take multiple scenes as inputs as in: video model or a transformer with positional embeddings encoding a sequence of scenes.
In addition I expected positional encodings of time and space so that the model can learn: this chip is from North America in December, so e.g. I can expect snow in the image.
Most earth observation models still seem to work on image data only?
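The time/location conditioning you describe could be as simple as sinusoidal features over day-of-year, longitude, and latitude concatenated to the patch embeddings. A toy sketch (my own made-up design, not any specific EO model's API): the cyclic quantities get a single sin/cos pair, latitude gets a small frequency ladder.

```python
import numpy as np

def spacetime_encoding(lat, lon, day_of_year, dim=16):
    """Toy spatiotemporal positional encoding (hypothetical design).

    Gives the model explicit where/when features, so it could learn
    things like 'North America in December -> expect snow'."""
    feats = []
    # season is cyclic with period ~365 days
    t = 2 * np.pi * day_of_year / 365.0
    feats += [np.sin(t), np.cos(t)]
    # longitude wraps at +/-180 degrees, so encode it cyclically too
    l = np.deg2rad(lon)
    feats += [np.sin(l), np.cos(l)]
    # latitude is not cyclic; use a sin/cos frequency ladder
    for k in range((dim - 4) // 2):
        f = 2.0 ** k
        feats += [np.sin(f * np.deg2rad(lat)),
                  np.cos(f * np.deg2rad(lat))]
    return np.array(feats)

enc = spacetime_encoding(lat=45.0, lon=-75.0, day_of_year=350)
print(enc.shape)  # (16,)
```

In a transformer you would just concatenate (or add, after projection) this vector to each scene's embedding in the input sequence.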
@djh I think one barrier here is bootstrapping ViT training, which as I understand it can be quite tricky at scale – they don’t have the same convexity advantages as CNNs. I’m basing this mostly on https://www.sciencedirect.com/science/article/pii/S003442572300439X (presented well in https://www.youtube.com/watch?v=vacNCxMA_Gw).
@djh (I am kind of mildly anti-ViT, or anti ViTs being applied everywhere for no particular reason (see ConvNeXt, which I found pretty convincing), but I think the spatiotemporal stuff starts to get into the level of complexity where they might be justified.)
@djh For sure. I share this annoyance. Also:
– Not even trying to understand multispectral bands.
– Sample-normalizing calibrated, linear data, thereby losing information.
– Augmenting in ways that might make sense for snapshots but do not reflect anything that you would actually see in geospatial data.
– Assuming there is little information in fine details because they’re used to Bayer demosaicked images, where that’s interpolated anyway.
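To make the normalization point concrete: per-sample min-max scaling of calibrated reflectance makes a dark chip and a bright chip with the same relative structure literally identical, discarding the absolute signal the sensor was calibrated to provide. A tiny numpy demo (illustrative values, not real Sentinel 2 data):

```python
import numpy as np

def per_sample_minmax(chip):
    """Min-max normalize one chip to [0, 1] (a common preprocessing step)."""
    return (chip - chip.min()) / (chip.max() - chip.min())

# two chips with an identical spatial pattern but very different
# absolute reflectance (think dark water vs. bright snow)
pattern = np.array([[0.0, 1.0], [2.0, 3.0]])
dark = 0.02 + 0.01 * pattern    # low calibrated reflectance
bright = 0.60 + 0.10 * pattern  # high calibrated reflectance

# after per-sample normalization the two become indistinguishable
print(np.allclose(per_sample_minmax(dark),
                  per_sample_minmax(bright)))  # True
```

Dataset-level standardization (or no normalization at all, since the data are already on a physical scale) avoids this collapse.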