Building GIS Applications at Scale
6 min read
Building geospatial applications that interface with supercomputers presents unique architectural challenges. Here's what I learned while developing GIS systems at my previous job, where we built platforms to support geological research at scale.
The Challenge
Our team faced a complex challenge: creating a web application for a research group that could handle massive geological datasets while providing real-time analysis capabilities. The application needed to process terabytes of geological survey data collected from various sensors and field studies across multiple sites.
The real-time rendering requirement was particularly challenging because researchers needed to visualize complex geological formations and sensor data simultaneously. Additionally, multiple research teams needed concurrent access to the system, each running intensive computations on different datasets.
Most challenging was maintaining application responsiveness while executing resource-intensive computations. Traditional web architectures weren’t designed for this level of computational intensity, especially when dealing with geospatial data.
Key Architecture Decisions
1. Data Storage and Access Patterns
Our multi-tiered storage strategy was born from careful analysis of data access patterns. We discovered that researchers typically worked intensively with recent data (last 3-6 months) while occasionally needing historical data for comparative analysis.
The hot tier lived in PostGIS, where we kept recent and frequently accessed data optimized for spatial queries with carefully designed indexes. We implemented custom spatial partitioning schemes to handle the massive scale of geological data, which allowed us to maintain query performance even as datasets grew.
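To make this concrete, here is a rough sketch of what such a hot-tier table could look like: time-range partitions with a GiST spatial index per partition. The table, column, and connection details below are illustrative, not our actual schema.

```python
# Hypothetical sketch of a hot-tier schema: time-partitioned sensor data
# with a GiST spatial index per partition. Requires the PostGIS extension.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS sensor_readings (
    id           BIGSERIAL,
    recorded_at  TIMESTAMPTZ NOT NULL,
    region_id    INTEGER NOT NULL,
    geom         GEOMETRY(Point, 4326) NOT NULL,
    value        DOUBLE PRECISION
) PARTITION BY RANGE (recorded_at);

-- One partition per quarter keeps the actively queried data small.
CREATE TABLE IF NOT EXISTS sensor_readings_2024_q1
    PARTITION OF sensor_readings
    FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');

-- Spatial index so bounding-box queries stay fast as partitions grow.
CREATE INDEX IF NOT EXISTS sensor_readings_2024_q1_geom_idx
    ON sensor_readings_2024_q1 USING GIST (geom);
"""

with psycopg2.connect("dbname=gis") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
```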
The warm data tier in distributed object storage served as a middle ground. We used MinIO clusters to store data that wasn’t actively being analyzed but might be needed soon. This tier was particularly useful for storing processed results from supercomputer jobs that researchers might want to reference.
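Processed outputs typically landed in this tier straight from the job pipeline. As a sketch using the MinIO Python client (the endpoint, credentials, bucket, and object names are placeholders):

```python
# Sketch: pushing a processed supercomputer result into the MinIO-backed warm
# tier so researchers can pull it up later without touching the hot database.
# Endpoint, credentials, and names are illustrative only.
from minio import Minio

client = Minio(
    "minio.internal:9000",
    access_key="YOUR_ACCESS_KEY",
    secret_key="YOUR_SECRET_KEY",
    secure=False,
)

bucket = "processed-results"
if not client.bucket_exists(bucket):
    client.make_bucket(bucket)

# Store a finished job output keyed by job id so it is easy to look up later.
client.fput_object(bucket, "jobs/12345/surface_model.tif", "/tmp/surface_model.tif")
```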
For long-term storage, we leveraged the supercomputer’s parallel file system. This tier handled petabytes of historical data and raw sensor readings. We implemented automated data lifecycle management policies to move data between tiers based on access patterns and research project phases.
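In spirit, the lifecycle policies were simple: check how recently a dataset was touched and demote it a tier once it goes cold. A minimal sketch of that idea, with made-up thresholds and a hypothetical mover helper:

```python
# Minimal sketch of tier demotion based on last access time. The thresholds
# and the move_to helper are hypothetical stand-ins for the actual
# PostGIS -> MinIO -> parallel-file-system movers.
from datetime import datetime, timedelta, timezone

HOT_WINDOW = timedelta(days=180)    # roughly the "last 3-6 months" of active work
WARM_WINDOW = timedelta(days=540)

def choose_tier(last_accessed: datetime) -> str:
    age = datetime.now(timezone.utc) - last_accessed
    if age <= HOT_WINDOW:
        return "hot"      # stays in PostGIS
    if age <= WARM_WINDOW:
        return "warm"     # demote to MinIO object storage
    return "cold"         # archive on the parallel file system

def apply_policy(datasets):
    for ds in datasets:
        target = choose_tier(ds.last_accessed)
        if target != ds.current_tier:
            ds.move_to(target)   # hypothetical mover; copies data, then swaps pointers
```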
2. Computation Distribution
Our computation distribution strategy evolved from early performance bottlenecks. Here’s how we implemented our job distribution system:
```python
# Job Router for different computation types
class GeoJobRouter:
    def route_computation(self, job_request):
        # Route by estimated cost: cheap jobs run in-process against PostGIS,
        # medium jobs go to the Kubernetes cluster, heavy jobs to the supercomputer.
        if job_request.estimated_complexity < 1000:
            return self.process_locally(job_request)
        elif job_request.estimated_complexity < 10000:
            return self.send_to_kubernetes(job_request)
        else:
            return self.schedule_supercomputer(job_request)

    def process_locally(self, job):
        # Direct PostGIS query for simple calculations
        return self.db.execute("""
            SELECT ST_Area(geom), ST_Perimeter(geom)
            FROM geological_formations
            WHERE region_id = %s
        """, [job.region_id])


class SupercomputerJob:
    def submit(self):
        # Generate the batch script, hand it to the scheduler, and give the
        # caller a websocket that streams status updates for the job.
        job_script = self.prepare_job_script()
        job_id = self.scheduler.submit(job_script)
        return self.create_status_websocket(job_id)
```
For medium tasks, we built a dedicated computing cluster using Kubernetes. This handled operations like generating complex visualizations, preprocessing data for analysis, and running preliminary calculations. We used Redis for job queuing and implemented retry mechanisms for resilience.
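A stripped-down version of that Redis-backed worker loop, with bounded retries, could look like the following (the queue names, host, and retry limit are illustrative rather than our production values):

```python
# Sketch of the Kubernetes-tier worker loop: pull a job from a Redis list,
# run it, and re-queue with a bounded retry count on failure.
import json
import redis

r = redis.Redis(host="redis.internal", port=6379)
MAX_RETRIES = 3

def worker_loop(run_job):
    while True:
        # BRPOP blocks until a job is available on the queue.
        _, raw = r.brpop("geo:jobs:medium")
        job = json.loads(raw)
        try:
            run_job(job)
        except Exception:
            job["retries"] = job.get("retries", 0) + 1
            if job["retries"] <= MAX_RETRIES:
                r.lpush("geo:jobs:medium", json.dumps(job))   # retry later
            else:
                r.lpush("geo:jobs:failed", json.dumps(job))   # park for inspection
```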
Heavy processing tasks went to the supercomputer through a custom scheduler interface. We developed a job management system that could track long-running computations, handle failures gracefully, and provide real-time status updates to users.
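The status-update side can be sketched as a polling loop that forwards every state change over the job's websocket; the scheduler client and websocket object below are hypothetical stand-ins for our custom interface:

```python
# Sketch of tracking a long-running supercomputer job and surfacing state
# changes to the browser. scheduler.get_state and websocket.send_json are
# hypothetical stand-ins for the custom scheduler interface described above.
import asyncio

TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED"}

async def track_job(job_id, scheduler, websocket):
    """Poll the scheduler and push every state change to the client."""
    last_state = None
    while True:
        state = scheduler.get_state(job_id)      # e.g. PENDING, RUNNING, COMPLETED
        if state != last_state:
            await websocket.send_json({"job_id": job_id, "state": state})
            last_state = state
        if state in TERMINAL_STATES:
            break
        await asyncio.sleep(30)                  # scheduler queries are cheap but not free
```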
3. Map Tile Strategy
Our map tile implementation was crucial for performance. We developed a dynamic vector tile generation system that could handle complex geological data layers while maintaining responsive performance.
The MapboxGL integration required careful optimization. We implemented custom layer rendering logic to handle specialized geological symbolization requirements. Our team developed extensions to support real-time data updates and custom geological visualization types not available in standard mapping libraries.
The tile cache became sophisticated over time. We implemented multiple invalidation strategies based on data update patterns. For example, sensor data tiles had short cache lifetimes, while geological formation tiles could be cached longer. We built a predictive prefetching system that would prepare tiles based on user navigation patterns.
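In outline, tiles were keyed by layer and tile coordinates, and each layer type carried its own lifetime. A simplified sketch of that per-layer TTL idea, with illustrative layer names and cache lifetimes (the predictive prefetching is omitted here):

```python
# Simplified sketch of the per-layer tile cache: volatile sensor tiles expire
# quickly while stable geology tiles stick around. Keys, names, and TTLs are
# illustrative.
import redis

r = redis.Redis(host="redis.internal", port=6379)

TILE_TTL_SECONDS = {
    "sensor_readings": 60,            # updates constantly, keep fresh
    "geological_formations": 86_400,  # changes rarely, cache for a day
}

def cache_key(layer: str, z: int, x: int, y: int) -> str:
    return f"tile:{layer}:{z}:{x}:{y}"

def get_tile(layer, z, x, y, render):
    key = cache_key(layer, z, x, y)
    cached = r.get(key)
    if cached is not None:
        return cached
    tile = render(layer, z, x, y)                         # fall back to the tile renderer
    r.setex(key, TILE_TTL_SECONDS.get(layer, 3600), tile)
    return tile
```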
4. API Design for Scale
Our GraphQL API design focused on optimizing data transfer while maintaining flexibility. Here’s an example of our spatial data handling:
```typescript
// GraphQL schema with custom spatial directives
import gql from 'graphql-tag';

// Geometry, FeatureProperties, BoundingBox, and FeatureConnection are
// defined elsewhere in the schema.
const typeDefs = gql`
  directive @spatial(srid: Int = 4326) on FIELD_DEFINITION

  type GeologicalFeature {
    id: ID!
    geometry: Geometry! @spatial
    properties: FeatureProperties!
  }

  type Query {
    featuresInRegion(bounds: BoundingBox!, cursor: String): FeatureConnection!
  }
`;

// Binary protocol for real-time sensor data
interface SensorReading {
  timestamp: number;
  value: number;
  lat: number;
  lng: number;
}

class SensorDataProtocol {
  encode(data: SensorReading[]): ArrayBuffer {
    // Fixed size per reading: 8-byte timestamp + three 4-byte floats = 20 bytes.
    const buffer = new ArrayBuffer(data.length * 20);
    const view = new DataView(buffer);
    data.forEach((reading, i) => {
      view.setFloat64(i * 20, reading.timestamp);
      view.setFloat32(i * 20 + 8, reading.value);
      view.setFloat32(i * 20 + 12, reading.lat);
      view.setFloat32(i * 20 + 16, reading.lng);
    });
    return buffer;
  }
}
```
The cursor-based pagination system was designed specifically for large geospatial datasets. We implemented spatial cursors that could efficiently page through geographic regions while maintaining consistent ordering.
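I won't reproduce the production cursor format here, but the core idea was keyset pagination over a stable spatial ordering. One way to sketch it, assuming a (geohash, id) ordering and a base64-encoded cursor (the db helper and column names are assumptions):

```python
# Sketch of keyset ("cursor") pagination over features in a bounding box.
# Ordering by a stable (geohash, id) pair keeps pages consistent even as new
# features arrive. db.execute stands in for a query helper returning dict rows.
import base64
import json

def decode_cursor(cursor):
    return json.loads(base64.urlsafe_b64decode(cursor)) if cursor else None

def encode_cursor(last_row):
    payload = {"gh": last_row["gh"], "id": last_row["id"]}
    return base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()

def features_in_region(db, bounds, cursor=None, page_size=500):
    after = decode_cursor(cursor)
    sql = """
        SELECT id, ST_GeoHash(geom, 8) AS gh, ST_AsGeoJSON(geom) AS geometry
        FROM geological_features
        WHERE geom && ST_MakeEnvelope(%s, %s, %s, %s, 4326)
    """
    params = [bounds.min_lon, bounds.min_lat, bounds.max_lon, bounds.max_lat]
    if after:
        # Keyset condition: strictly after the last (geohash, id) already returned.
        sql += " AND (ST_GeoHash(geom, 8), id) > (%s, %s)"
        params += [after["gh"], after["id"]]
    sql += " ORDER BY gh, id LIMIT %s"
    params.append(page_size)
    rows = db.execute(sql, params)
    next_cursor = encode_cursor(rows[-1]) if rows else None
    return rows, next_cursor
```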
Monitoring and Performance
Our monitoring system became increasingly sophisticated as we identified performance bottlenecks. We tracked tile generation time across different zoom levels and data types, allowing us to optimize the most problematic cases.
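Assuming a Prometheus-style metrics setup (the specific tooling matters less than the labels), the per-layer, per-zoom timing looks roughly like this; metric and label names are illustrative:

```python
# Sketch of per-layer, per-zoom tile render timing using a Prometheus-style
# histogram, so slow zoom levels and layers stand out on a dashboard.
from prometheus_client import Histogram

TILE_RENDER_SECONDS = Histogram(
    "tile_render_seconds",
    "Time spent rendering a single vector tile",
    ["layer", "zoom"],
)

def render_tile_timed(layer, z, x, y, render):
    # The context manager records elapsed time into the bucket for this
    # layer/zoom combination.
    with TILE_RENDER_SECONDS.labels(layer=layer, zoom=str(z)).time():
        return render(layer, z, x, y)
```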
Spatial query performance monitoring helped identify slow queries before they impacted users. We implemented automated query analysis tools that would suggest index improvements and identify problematic data access patterns.
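One workable version of that analysis walks the JSON output of EXPLAIN and flags large sequential scans as index candidates; the threshold and wording below are illustrative heuristics, not our exact rules:

```python
# Rough sketch of automated query analysis: run EXPLAIN (FORMAT JSON) on a
# candidate query and flag large sequential scans as missing-index candidates.
# db.execute stands in for a query helper that returns parsed JSON plans.
def flag_missing_indexes(db, query, params):
    plan_rows = db.execute("EXPLAIN (FORMAT JSON) " + query, params)
    plan = plan_rows[0][0][0]["Plan"]

    suggestions = []

    def walk(node):
        if node.get("Node Type") == "Seq Scan" and node.get("Plan Rows", 0) > 100_000:
            suggestions.append(
                f"Sequential scan over {node.get('Relation Name')}: "
                "consider a GiST index on the geometry column or a partial index."
            )
        for child in node.get("Plans", []):
            walk(child)

    walk(plan)
    return suggestions
```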
Memory usage monitoring was crucial for preventing system instability. We developed custom metrics for tracking memory usage during large computations and implemented automatic job splitting when memory pressure became too high.
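Conceptually, the splitting logic compares a job's estimated footprint against a per-worker budget and recursively subdivides the work region. A sketch, where the budget and the estimator/splitter helpers are hypothetical:

```python
# Sketch of memory-pressure job splitting: if a computation's estimated
# footprint exceeds the budget, split its bounding box and recurse.
# The budget and the estimate_memory_bytes / split_bounds helpers are hypothetical.
MEMORY_BUDGET_BYTES = 8 * 1024**3   # e.g. 8 GiB per worker

def run_with_splitting(job, estimate_memory_bytes, run, split_bounds):
    if estimate_memory_bytes(job) <= MEMORY_BUDGET_BYTES:
        return [run(job)]

    results = []
    # split_bounds divides the job's bounding box into smaller sub-regions.
    for sub_job in split_bounds(job):
        results.extend(run_with_splitting(sub_job, estimate_memory_bytes, run, split_bounds))
    return results
```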
Lessons Learned
Understanding data access patterns proved crucial. We initially assumed uniform access across all data but discovered that researchers typically focused intensively on specific geographic regions during project phases. This insight led to more efficient caching strategies and data partitioning approaches.
Failure handling became a core architectural concern. Supercomputer jobs could fail for various reasons or get stuck in queues for hours. We implemented comprehensive retry logic and developed clear failure communication channels to keep researchers informed of job status.
Our caching strategy evolved significantly. We learned to cache aggressively but implemented sophisticated invalidation mechanisms. For example, when new sensor data arrived, we could selectively invalidate only affected geographic regions rather than entire tile sets.
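Selective invalidation comes down to mapping the updated bounding box to the tile coordinates it covers at each zoom level and deleting only those keys. The tile math below is the standard Web Mercator tiling scheme; the cache interface and key format are placeholders matching the earlier sketch:

```python
import math

def lonlat_to_tile(lon, lat, zoom):
    """Standard Web Mercator (slippy map) tile coordinates for a lon/lat point."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.log(math.tan(lat_rad) + 1 / math.cos(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

def tiles_covering(bbox, zoom):
    """All tile (x, y) pairs at a zoom level intersecting a (min_lon, min_lat, max_lon, max_lat) box."""
    min_lon, min_lat, max_lon, max_lat = bbox
    x0, y1 = lonlat_to_tile(min_lon, min_lat, zoom)   # y grows southward
    x1, y0 = lonlat_to_tile(max_lon, max_lat, zoom)
    for x in range(x0, x1 + 1):
        for y in range(y0, y1 + 1):
            yield x, y

def invalidate_region(cache, layer, bbox, zoom_levels):
    # Delete only the cached tiles that intersect the updated region.
    for z in zoom_levels:
        for x, y in tiles_covering(bbox, z):
            cache.delete(f"tile:{layer}:{z}:{x}:{y}")
```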
The monitoring system proved invaluable for proactive optimization. We could identify performance trends before they became problems and make data-driven decisions about architecture improvements.
Impact
The final architecture supported over 200 concurrent researchers working with datasets exceeding 50TB. Response times remained under one second for most queries, even during peak usage periods. The system maintained 99.9% uptime during critical research periods, becoming a reliable platform for geological research.
Future Improvements
We’re exploring vector tile compression techniques to reduce data transfer sizes further. Initial experiments with custom compression algorithms show promising results for geological data types.
WebAssembly integration could move more processing to the client side, reducing server load. We’re developing prototypes for common geological calculations that could run entirely in the browser.
Real-time collaboration features are in development, allowing multiple researchers to analyze the same geographic regions simultaneously while seeing each other’s annotations and calculations.
Building GIS applications at this scale taught us valuable lessons about balancing performance, cost, and complexity. The key was maintaining system reliability while pushing the boundaries of what’s possible with web-based geospatial applications.