Concurrency and Race Condition Management
Operating as a distributed system with asynchronous messaging, DroneFleet Optimizer faces inherent concurrency challenges. Multiple components interact simultaneously — the Path Optimizer runs optimization cycles every ~10 seconds, the State Manager processes telemetry, orders, and decisions concurrently, Pub/Sub delivers messages with no ordering guarantees, and Firestore serves as the single source of truth. The system uses eventual consistency for reads combined with strong consistency at write time to ensure correctness without sacrificing performance.
The Core Challenge: Concurrent Optimization Cycles
The primary race condition arises when two optimization cycles overlap and both include the same drone or order in their snapshots:
Timeline:
=========
T0: Cycle A starts, calls getSnapshot()
    -> Drone D1 is IDLE -> included in snapshot A
T1: Cycle B starts, calls getSnapshot()
    -> Drone D1 is STILL IDLE -> included in snapshot B
       (A hasn't finished yet, so D1 status unchanged)
T2: Cycle A computes solution -> Assigns D1 to Order O1
T3: Cycle B computes solution -> Assigns D1 to Order O2
T4: Cycle A publishes decision (D1 -> O1)
T5: State Manager processes A's decision
    -> Transaction: D1 is IDLE? YES
    -> SUCCESS: D1.status = MOVING, Mission M1 created
T6: Cycle B publishes decision (D1 -> O2)
T7: State Manager processes B's decision
    -> Transaction: D1 is IDLE? NO (it's MOVING)
    -> REJECTED: BusinessRejectionException thrown
    -> Order O2 remains PENDING for next cycle
The system correctly prevents double-assignment through write-time validation inside a Firestore transaction.
First-Write-Wins Strategy
The conflict resolution model is first-write-wins, not first-start-wins. The first transaction to commit wins, regardless of which optimization cycle started first. This is a deliberate design choice:
- Simpler implementation: No distributed lock management, no session-based reservation system.
- No deadlock risk: Since there is no pessimistic locking, there is no possibility of deadlock.
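- Acceptable waste: Optimization cycles run every ~10 seconds and the solver takes ~8 seconds, so a cycle normally has about 2 seconds to publish and commit its decision before the next snapshot is taken. Overlapping cycles are therefore infrequent, and the occasional rejected decision is recovered naturally in the next cycle.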
An alternative approach (first-start-wins with pessimistic locking via RESERVED/SOLVING states) was considered. While it would reduce wasted computation, it introduces significant complexity: distributed lock management, session cleanup for crashed optimizers, and potential deadlocks.
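The difference between commit order and start order can be illustrated with a small in-memory sketch. It is not the State Manager's code: the class `FirstWriteWinsSketch` and its methods are hypothetical, and a `synchronized` method stands in for the atomic check-and-write that the Firestore transaction provides.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative in-memory stand-in for the drone collection; all names are hypothetical.
class FirstWriteWinsSketch {
    enum DroneStatus { IDLE, MOVING }

    private final Map<String, DroneStatus> drones = new HashMap<>();
    private final Map<String, String> missions = new HashMap<>(); // droneId -> orderId

    FirstWriteWinsSketch(String... droneIds) {
        for (String id : droneIds) {
            drones.put(id, DroneStatus.IDLE);
        }
    }

    // Write-time validation: the status check and the update form one atomic step,
    // so the outcome depends only on commit order, not on when a cycle started.
    synchronized boolean commitAssignment(String droneId, String orderId) {
        if (drones.get(droneId) != DroneStatus.IDLE) {
            return false; // rejected: an earlier commit already claimed the drone
        }
        drones.put(droneId, DroneStatus.MOVING);
        missions.put(droneId, orderId);
        return true;
    }

    synchronized String assignedOrder(String droneId) {
        return missions.get(droneId);
    }

    public static void main(String[] args) {
        FirstWriteWinsSketch store = new FirstWriteWinsSketch("D1");
        // Cycle A started first, but cycle B happens to commit first.
        boolean bCommitted = store.commitAssignment("D1", "O2");
        boolean aCommitted = store.commitAssignment("D1", "O1");
        System.out.println("B committed: " + bCommitted);
        System.out.println("A committed: " + aCommitted);
        System.out.println("D1 serves: " + store.assignedOrder("D1"));
    }
}
```

In this run cycle B wins even though cycle A started earlier, which is the behaviour the first-write-wins model accepts in exchange for having no locks to manage.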
Firestore Transaction Pattern: Mission Assignment (Critical Path)
The mission assignment is the most complex transaction in the system. It validates and applies decisions atomically across multiple documents:
firestore.runTransaction(transaction -> {
    // Read all documents FIRST (Firestore requirement)
    DocumentSnapshot droneDoc = transaction.get(droneRef).get();
    List<DocumentSnapshot> orderDocs = /* read all orders */;

    // Convert to domain objects
    Drone drone = FirestoreMapper.toDrone(droneDoc);
    List<Order> orders = /* convert all orders */;

    // Execute business logic (MissionAssignmentPolicy)
    // - drone.status == IDLE (DronePolicy.canAcceptMission)
    // - all orders.status == PENDING
    // - If ANY validation fails -> BusinessRejectionException

    // Write all changes atomically
    transaction.set(missionRef, missionData);   // Create Mission
    transaction.update(droneRef, droneUpdates); // drone.status = MOVING
    for (DocumentSnapshot orderDoc : orderDocs) {
        // Update each order through its own reference, not one shared orderRef
        transaction.update(orderDoc.getReference(), orderUpdates); // order.status = ASSIGNED
    }
    return result;
});
Key properties:
- All reads happen before any writes (Firestore requirement for optimistic concurrency)
- If any document was modified by another transaction between read and commit, Firestore automatically retries the entire transaction
- Either all writes succeed, or none do (atomicity)
- For multi-order missions, if validation fails for any single order, the entire transaction is rejected: the drone remains IDLE and all orders remain PENDING
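The all-or-nothing behaviour for multi-order missions follows from validating every order before the first write. A minimal in-memory sketch of that ordering is shown here, assuming a plain `Map` in place of Firestore and a boolean result in place of `BusinessRejectionException`; `AllOrNothingSketch` and `assignAll` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of validate-everything-then-write; not the real policy class.
class AllOrNothingSketch {
    enum OrderStatus { PENDING, ASSIGNED }

    // Returns true if the mission was applied, false if it was rejected with no changes.
    static boolean assignAll(Map<String, OrderStatus> orders, List<String> orderIds) {
        // Phase 1: validate every order. Nothing has been written yet,
        // so a failure here leaves all orders exactly as they were.
        for (String id : orderIds) {
            if (orders.get(id) != OrderStatus.PENDING) {
                return false;
            }
        }
        // Phase 2: all checks passed, so apply every write.
        for (String id : orderIds) {
            orders.put(id, OrderStatus.ASSIGNED);
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, OrderStatus> orders = new LinkedHashMap<>();
        orders.put("O1", OrderStatus.PENDING);
        orders.put("O2", OrderStatus.ASSIGNED); // already taken by another mission
        boolean applied = assignAll(orders, List.of("O1", "O2"));
        System.out.println("applied=" + applied + ", orders=" + orders);
    }
}
```

Because O2 fails validation, O1 is left PENDING as well and can be picked up by the next cycle.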
Optimistic Locking: How Firestore Handles Contention
Firestore implements optimistic concurrency control natively. When two transactions attempt to modify the same document concurrently:
- Transaction A reads drone D1 (status = IDLE), Transaction B reads drone D1 (status = IDLE)
- Transaction A commits first → SUCCESS: D1.status = MOVING
- Transaction B attempts to commit
- Firestore detects that D1 was modified since B's read
- Firestore automatically retries Transaction B from the beginning
- Transaction B re-reads D1 (now status = MOVING)
- MissionAssignmentPolicy validation fails → BusinessRejectionException
- The rejected decision is logged, and the affected entities are picked up in the next optimization cycle
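From the application's point of view this is a form of optimistic locking: the code acquires no explicit locks, and a conflicting transaction is re-run so that it validates against the committed state. How contention is detected depends on the client library. Google's documentation describes the mobile/web SDKs as using optimistic concurrency and the server client libraries (including Java) as locking the documents a transaction reads, so the exact retry sequence above should be checked against the SDK and database configuration in use. In either mode, Transaction B ends up validating against D1 = MOVING. The MissionAssignmentPolicy acts as the business-level guard, ensuring that only valid state transitions are committed.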
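The read, validate, commit, retry sequence can be sketched as a generic optimistic-concurrency loop. This is an illustration of the technique only, not Firestore's internal mechanism: `OptimisticRetrySketch`, `Versioned`, `MAX_ATTEMPTS`, and the `afterFirstRead` hook (used to simulate a competing commit deterministically) are all assumptions introduced here.

```java
import java.util.concurrent.atomic.AtomicReference;

// Generic optimistic-concurrency sketch; all names are hypothetical.
class OptimisticRetrySketch {
    enum DroneStatus { IDLE, MOVING }

    // Immutable versioned value: every successful commit installs a new instance.
    static final class Versioned {
        final DroneStatus status;
        final long version;

        Versioned(DroneStatus status, long version) {
            this.status = status;
            this.version = version;
        }
    }

    static final int MAX_ATTEMPTS = 5; // assumed retry bound

    // Returns true if the drone was assigned, false if the business rule rejected it.
    // afterFirstRead runs once, between the first read and its commit attempt,
    // to simulate another transaction committing in that window.
    static boolean assign(AtomicReference<Versioned> drone, Runnable afterFirstRead) {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            Versioned read = drone.get(); // read phase
            if (attempt == 1) {
                afterFirstRead.run();
            }
            if (read.status != DroneStatus.IDLE) {
                return false; // business-level rejection: retrying cannot help
            }
            Versioned next = new Versioned(DroneStatus.MOVING, read.version + 1);
            if (drone.compareAndSet(read, next)) {
                return true; // commit succeeded: nobody wrote since our read
            }
            // Commit failed because the value changed; loop to re-read and re-validate.
        }
        throw new IllegalStateException("gave up after " + MAX_ATTEMPTS + " attempts");
    }

    public static void main(String[] args) {
        AtomicReference<Versioned> d1 =
                new AtomicReference<>(new Versioned(DroneStatus.IDLE, 0));
        // Transaction A commits while transaction B is between its read and its commit.
        Runnable transactionA = () -> assign(d1, () -> { });
        boolean bAssigned = assign(d1, transactionA);
        System.out.println("B assigned: " + bAssigned);
        System.out.println("D1 status: " + d1.get().status + ", version: " + d1.get().version);
    }
}
```

Transaction B's first commit attempt fails because A changed the value, and on the retry B's validation sees MOVING and rejects the decision, matching the sequence listed above.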
Telemetry Ordering Protection
Network conditions can cause telemetry messages to arrive out of order. The State Manager protects against stale data via timestamp comparison:
firestore.runTransaction(transaction -> {
    DocumentSnapshot doc = transaction.get(droneRef).get();
    // Stays null for a drone seen for the first time
    // (assumption: DronePolicy.applyTelemetryUpdate handles that case)
    Drone drone = null;
    if (doc.exists()) {
        Instant existingTimestamp = /* get lastUpdate from doc */;
        Instant incomingTimestamp = telemetry.getTimestamp();
        if (incomingTimestamp.isBefore(existingTimestamp)) {
            return null; // Skip stale telemetry: do not apply older data
        }
        drone = FirestoreMapper.toDrone(doc); // current state to merge the telemetry into
    }
    Drone updated = DronePolicy.applyTelemetryUpdate(drone, telemetry);
    transaction.set(droneRef, FirestoreMapper.toMap(updated));
    return updated;
});
If a telemetry message T1 (timestamp=10:00:01) arrives after T2 (timestamp=10:00:02), T1 is silently discarded. The drone state always reflects the most recent known data.
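The timestamp guard can be isolated as a small pure function. The sketch below assumes a hypothetical helper `shouldApply` and an arbitrary calendar date for the two example timestamps. It mirrors the `isBefore` check above, so a message whose timestamp equals the stored one is still applied.

```java
import java.time.Instant;

// Hypothetical helper isolating the stale-telemetry check.
class StaleTelemetrySketch {
    // Apply the update unless the incoming reading is strictly older than the stored one.
    // A null existing timestamp means no state has been stored for the drone yet.
    static boolean shouldApply(Instant existing, Instant incoming) {
        return existing == null || !incoming.isBefore(existing);
    }

    public static void main(String[] args) {
        // The date is arbitrary; only the ordering of the two timestamps matters.
        Instant t1 = Instant.parse("2024-01-01T10:00:01Z");
        Instant t2 = Instant.parse("2024-01-01T10:00:02Z");

        Instant lastUpdate = null;
        // T2 arrives first and is applied.
        if (shouldApply(lastUpdate, t2)) {
            lastUpdate = t2;
        }
        // T1 arrives late and is skipped, so lastUpdate still points at T2.
        boolean t1Applied = shouldApply(lastUpdate, t1);
        System.out.println("T1 applied: " + t1Applied + ", lastUpdate: " + lastUpdate);
    }
}
```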
Order Ingestion Idempotency
Pub/Sub guarantees at-least-once delivery, meaning the same order creation message may be delivered multiple times. The order ingestion handler includes an idempotency guard:
firestore.runTransaction(transaction -> {
    DocumentSnapshot doc = transaction.get(orderRef).get();
    if (doc.exists()) {
        OrderStatus currentStatus = /* get status from doc */;
        if (currentStatus != PENDING && currentStatus != UNSPECIFIED) {
            return null; // Do not overwrite: order already processed
        }
    }
    Order order = /* build order with PENDING status */;
    transaction.set(orderRef, FirestoreMapper.toMap(order));
    return order;
});
This prevents a redelivered message from resetting an ASSIGNED order back to PENDING, which would cause it to be re-assigned and potentially create duplicate missions.
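The effect of the guard under redelivery can be reproduced with an in-memory stand-in. In this sketch `OrderIngestionSketch`, `ingest`, `markAssigned`, and `statusOf` are hypothetical names, and a `Map` replaces the Firestore collection.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of the order ingestion guard.
class OrderIngestionSketch {
    enum OrderStatus { UNSPECIFIED, PENDING, ASSIGNED }

    private final Map<String, OrderStatus> orders = new HashMap<>();

    // Returns true if the order was written as PENDING, false if the message was skipped.
    synchronized boolean ingest(String orderId) {
        OrderStatus current = orders.get(orderId);
        if (current != null
                && current != OrderStatus.PENDING
                && current != OrderStatus.UNSPECIFIED) {
            return false; // already processed: a redelivery must not reset it
        }
        orders.put(orderId, OrderStatus.PENDING);
        return true;
    }

    synchronized void markAssigned(String orderId) {
        orders.put(orderId, OrderStatus.ASSIGNED);
    }

    synchronized OrderStatus statusOf(String orderId) {
        return orders.get(orderId);
    }

    public static void main(String[] args) {
        OrderIngestionSketch handler = new OrderIngestionSketch();
        handler.ingest("O1");        // first delivery creates the order
        handler.markAssigned("O1");  // a mission picks it up
        boolean rewritten = handler.ingest("O1"); // the same message is delivered again
        System.out.println("rewritten=" + rewritten + ", status=" + handler.statusOf("O1"));
    }
}
```

A redelivery that arrives while the order is still PENDING simply rewrites the same PENDING record, which is harmless as long as the message content is unchanged.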
Protection Mechanisms Summary
| Mechanism | Location | Protection Provided |
|---|---|---|
| Firestore Transaction | FirestoreStateTransactionAdapter | Atomic multi-document writes |
| Optimistic Concurrency | Firestore built-in | Automatic retry on contention |
| Write-time Validation | MissionAssignmentPolicy | Status checks before assignment |
| Drone Status Guard | DronePolicy.canAcceptMission() | Only IDLE drones can accept missions |
| Order Status Guard | MissionAssignmentPolicy | Only PENDING orders can be assigned |
| Timestamp Ordering | runTelemetryUpdateTransaction | Reject stale telemetry |
| Idempotency Guard | runOrderIngestionTransaction | Prevent resetting processed orders |
Known Limitations and Future Improvements
- Non-transactional snapshot acquisition: The snapshot queries for drones and orders are separate, non-transactional reads. State may change between the two queries. This is acceptable because write-time validation catches any inconsistencies, but a future runSnapshotAcquisitionTransaction() could atomically read all entities.
- Unused session tracking: The solvingSessionId field exists in protobuf definitions but is not yet implemented. It would allow marking entities as RESERVED/SOLVING during optimization, reducing wasted computation from overlapping cycles.
- No abandoned session cleanup: If an optimizer crashes mid-cycle, a TTL-based or heartbeat-based cleanup mechanism would be needed to release reserved entities.
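If session-based reservation were added, the TTL variant of the cleanup could be as simple as comparing a reservation's age with a fixed limit. The sketch below is speculative: no such mechanism exists in the system today, and `ReservationTtlSketch`, `isAbandoned`, and the 30-second TTL are assumptions chosen for illustration.

```java
import java.time.Duration;
import java.time.Instant;

// Speculative sketch of a TTL check for reserved entities; not part of the current system.
class ReservationTtlSketch {
    // A reservation counts as abandoned once it is strictly older than the TTL.
    static boolean isAbandoned(Instant reservedAt, Instant now, Duration ttl) {
        return Duration.between(reservedAt, now).compareTo(ttl) > 0;
    }

    public static void main(String[] args) {
        Duration ttl = Duration.ofSeconds(30); // assumed value, roughly three cycle intervals
        Instant reservedAt = Instant.parse("2024-01-01T10:00:00Z"); // arbitrary example time

        System.out.println(isAbandoned(reservedAt, reservedAt.plusSeconds(10), ttl));
        System.out.println(isAbandoned(reservedAt, reservedAt.plusSeconds(31), ttl));
    }
}
```

A cleanup job would release any entity for which this check returns true. A heartbeat-based design would instead refresh `reservedAt` while the optimizer is alive.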
Correctness guarantee: No race condition can cause incorrect data (double-assignment, lost orders). The system may waste computation on rejected decisions, but this is an acceptable trade-off for simplicity and reliability.