The home above is divided into four security zones.
We can think of them as a sequence, ordered by their importance: [A, B, C, D].
If we have n live robots, we want the first n zones in the list to be patrolled by at least one robot.
n == 2 → patrolled(A, B)
n == 3 → patrolled(A, B, C)
1 <= n <= 4
The goal is for this condition to be an invariant for our program, despite robot failures.
Let robotZ refer to the robot patrolling zone Z.
Illustration:
n == 4 → patrolled(A, B, C, D)
robotA dies
n == 3 → patrolled(A, B, C)
robotB dies
n == 2 → patrolled(A, B)
robotA dies
n == 1 → patrolled(A)
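The invariant is easy to state in code. Here is a minimal Python sketch; the zone list and function names are illustrative, not taken from the project:

```python
# Illustrative sketch of the patrol invariant: with n live robots,
# the first n zones (by priority) must each have a patroller.
ZONES = ["A", "B", "C", "D"]  # ordered by importance

def required_zones(n):
    """Zones that must be patrolled when n robots are alive (1 <= n <= 4)."""
    assert 1 <= n <= len(ZONES)
    return ZONES[:n]

def invariant_holds(patrolled, n):
    """True iff every required zone currently has at least one robot on it."""
    return all(zone in patrolled for zone in required_zones(n))

# Example: after robotA dies (n drops from 4 to 3), zones A, B, C must be covered.
assert invariant_holds({"A", "B", "C"}, 3)
```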
We need to:
map the home
SLAM (Simultaneous Localization and Mapping) with the gmapping package
have multiple robots patrol the security zones in parallel
AMCL (Adaptive Monte Carlo Localization) and the move_base package
One thread per robot with a SimpleActionClient sending move_base goals (see the sketch after this list)
have the system be resilient to robot failures, and adjust for the invariant appropriately
Rely on Raft’s Leader Election Algorithm (‘LEA’) to elect a leader robot.
The remaining robots will be followers.
The leader is responsible for maintaining the invariant among followers.
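A hedged sketch of what one such patrol thread might look like, assuming per-robot namespaces like /roba, an illustrative waypoint list, and a threading.Event as the interrupt mechanism; it is not the project's exact code:

```python
import threading

import actionlib
import rospy
from move_base_msgs.msg import MoveBaseAction, MoveBaseGoal

def patrol(robot_ns, waypoints, stop_event):
    """Cycle through one zone's (x, y) waypoints until interrupted."""
    client = actionlib.SimpleActionClient(robot_ns + "/move_base", MoveBaseAction)
    client.wait_for_server()
    while not rospy.is_shutdown() and not stop_event.is_set():
        for x, y in waypoints:
            goal = MoveBaseGoal()
            goal.target_pose.header.frame_id = "map"
            goal.target_pose.header.stamp = rospy.Time.now()
            goal.target_pose.pose.position.x = x
            goal.target_pose.pose.position.y = y
            goal.target_pose.pose.orientation.w = 1.0
            client.send_goal(goal)
            # Poll so the thread can react to a local or global interrupt mid-goal.
            while not client.wait_for_result(rospy.Duration(0.5)):
                if stop_event.is_set():
                    client.cancel_all_goals()  # "interrupt": stop the robot
                    return

# One thread and one stop event per robot, e.g. for the robot assigned to zone A
# (ZONE_A_WAYPOINTS is a hypothetical list of (x, y) pairs):
# stop_a = threading.Event()
# threading.Thread(target=patrol, args=("/roba", ZONE_A_WAYPOINTS, stop_a)).start()
```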
More on system resiliency:
The leader always patrols zone A, which has the highest priority.
On election, the leader assigns the remaining zones to followers to maintain the invariant.
On leader failure:
global interrupts issued to all patrol threads (all robots stop).
the LEA elects a new leader
On follower failure:
local interrupts issued by leader to relevant patrol threads (some robots stop).
Example:
n == 4 → patrolled(A, B, C, D)
robotB dies
leader interrupts robotD
leader assigns robotD to zone B
n == 3 → patrolled(A, B, C)
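Below is a minimal sketch of the reassignment rule this example illustrates, assuming a zone → robot dictionary and illustrative robot names; it is not the project's exact implementation:

```python
# When a follower dies, the robot on the lowest-priority occupied zone is
# interrupted and moved up to fill the vacated zone, restoring the invariant.
ZONES = ["A", "B", "C", "D"]  # priority order

def reassign_after_death(assignment, dead_robot):
    """assignment maps zone -> robot; returns the repaired assignment."""
    vacated = next(z for z, r in assignment.items() if r == dead_robot)
    del assignment[vacated]
    if assignment:
        lowest = max(assignment, key=ZONES.index)  # lowest-priority occupied zone
        if ZONES.index(lowest) > ZONES.index(vacated):
            # e.g. robotB dies: the robot on D is interrupted and reassigned to B
            assignment[vacated] = assignment.pop(lowest)
    return assignment

print(reassign_after_death(
    {"A": "roba", "B": "robb", "C": "robc", "D": "robd"}, dead_robot="robb"))
# -> {'A': 'roba', 'C': 'robc', 'B': 'robd'}  (zones A, B, C remain patrolled)
```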
Two points of entry: real_main.launch and sim_main.launch.
real_main.launch:
used for real-world surveillance of our home environment
runs the map_server with the home.yaml and home.png map files obtained by SLAM gmapping.
All robots will share this map_server
launches robots via the real_robots.launch file (to be shown in the next slide)
launches rviz with the real_nav.rviz configuration
sim_main.launch:
used for simulated surveillance of the Gazebo turtlebot3_stage_4.world
works like real_main.launch, except that it uses a map of stage_4 and sim_robots.launch to spawn Gazebo models of the robots via tailored URDF files.
Based on Diego Ongaro and John Ousterhout, 2014, “In Search of an Understandable Consensus Algorithm”
The full Raft algorithm uses a replicated state machine approach to maintain a set of nodes in the same state. Roughly:
Every node (process) starts in the same state, and keeps a log of the commands it is going to execute.
The algorithm elects a leader node, which ensures that the logs are properly replicated so that they stay identical to each other.
Raft’s LEA and its log replication algorithm are fairly independent of each other. We will only use the LEA.
Background:
Each of the robots is represented as an mp (member of parliament) node.
mp_roba, mp_rafael, etc.
Each node is in one of three states: follower, candidate, or leader.
If a node is a leader, it starts a homage_request_thread that periodically publishes messages to the other nodes to maintain its status and prevent new elections.
Every mp node has a term number, and the term numbers of mp nodes are exchanged every time they communicate via ROS topics.
If an mp node’s term number is less than a received term number, it updates its term number to the received term number.
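A small Python sketch of this term-exchange rule, with illustrative class and field names; the actual exchange happens over ROS topics, which is not shown here:

```python
FOLLOWER, CANDIDATE, LEADER = range(3)

class MPNode(object):
    """Minimal stand-in for an mp node's term bookkeeping."""

    def __init__(self, name):
        self.name = name
        self.term = 0
        self.state = FOLLOWER
        self.voted_for = None

    def on_message(self, received_term):
        """Apply the rule above to every incoming homage request or vote message."""
        if self.term < received_term:
            self.term = received_term
            # (In full Raft the node would also step back to follower here;
            # the slides only describe updating the term number.)
```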
LEA:
On startup, every mp node is in the follower state, and has:
a term number; and
a duration of time called the election timeout.
When a follower node receives no homage request messages from a leader for the duration of its election timeout, it:
increments its term number;
becomes a candidate;
votes for itself; and
publishes vote request messages to all the other mp nodes in parallel.
When a node receives a vote request message, it will vote for the candidate iff:
the term number of the candidate is at least as great as its own; and
the node hasn’t already voted for another candidate.
Once a candidate mp node receives a majority of votes it will:
become a leader;
send homage request messages to other nodes to declare its status and prevent new elections.
The election timeout duration of each mp node is a random quantity within 2 to 3 seconds.
This prevents split votes, as it’s likely that some follower will time out first, acquire the necessary votes, and become the leader.
Still, in the event of a split vote, candidates will:
reset their election timeout durations to a random quantity within the mentioned interval;
wait for their election timeout durations to expire before starting a new election.
The election timeout feature is part of what enables leader resiliency in my code.
When a leader node is killed, one of the follower nodes times out and becomes the new leader.
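A compressed, plain-Python sketch of these LEA transitions; the synchronous vote requests, the peers list, and the method names are simplifications of the project's ROS-topic-based messaging:

```python
import random

FOLLOWER, CANDIDATE, LEADER = "follower", "candidate", "leader"

class MP(object):
    def __init__(self, name, peers):
        self.name, self.peers = name, peers
        self.term, self.state, self.voted_for = 0, FOLLOWER, None
        self.reset_election_timeout()

    def reset_election_timeout(self):
        # Random duration in the 2-3 second window mentioned above.
        self.election_timeout = random.uniform(2.0, 3.0)

    def on_election_timeout(self):
        # No homage request heard for election_timeout seconds: start an election.
        self.term += 1
        self.state = CANDIDATE
        self.voted_for = self.name                    # vote for itself
        votes = 1 + sum(p.handle_vote_request(self.term, self.name)
                        for p in self.peers)          # done in parallel via topics in the project
        if votes > (len(self.peers) + 1) // 2:        # strict majority of all mp nodes
            self.state = LEADER                       # start publishing homage requests
        else:
            self.reset_election_timeout()             # split vote: wait, then retry

    def handle_vote_request(self, candidate_term, candidate_name):
        # Grant the vote iff the candidate's term is at least as great as ours
        # and we haven't already voted for another candidate.
        if candidate_term > self.term:
            self.term, self.state, self.voted_for = candidate_term, FOLLOWER, None
        if candidate_term >= self.term and self.voted_for in (None, candidate_name):
            self.voted_for = candidate_name
            return True
        return False
```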
roscore keeps a list of currently active nodes. So:
followers, before transitioning to candidate state, update their list of live mp nodes by consulting roscore.
leaders do the same via a follower_death_handler thread, which runs concurrently with the main thread and the homage_request_thread, and maintains the invariant.
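A sketch of that consultation using the rosnode Python API; the /mp_ name prefix is an assumed convention:

```python
import rosnode

def live_mp_nodes():
    """mp_* nodes that roscore currently lists as running."""
    return [n for n in rosnode.get_node_names() if n.startswith("/mp_")]

# Followers call something like this before becoming candidates; the leader's
# follower_death_handler thread can diff successive results to spot dead robots.
```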
The mp nodes are all running on the remote VNC computer, which also runs the roscore.
So killing the mp node is just a simulation of robot failure, not true robot failure.
We’re just killing a process on the VNC computer, not the robot itself.
We can remedy this by binding the mp node’s life to any other nodes that are crucial to our robot fulfilling its task.
E.g., for our project the /roba/amcl and /roba/move_base nodes are critical.
So we can consult roscore periodically to see whether any of these nodes has died, and if so, kill mp_roba.
We can also define and run more fine-grained custom nodes that keep track of any functional status of the robot we’re interested in, and bind the life of those custom nodes to the mp nodes.
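A minimal sketch of that remedy, assuming it runs as a thread inside mp_roba (which has already called rospy.init_node) and that the two node names below are the critical ones:

```python
import rosnode
import rospy

# Node names assumed from the example above; adjust per robot.
CRITICAL_NODES = ["/roba/amcl", "/roba/move_base"]

def critical_node_watchdog():
    """Shut this mp node down if any critical node disappears from roscore."""
    rate = rospy.Rate(1.0)  # check once per second
    while not rospy.is_shutdown():
        alive = set(rosnode.get_node_names())
        if any(node not in alive for node in CRITICAL_NODES):
            rospy.signal_shutdown("a critical node died; taking mp_roba down too")
        rate.sleep()
```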
Running each mp node on its corresponding robot is a bad idea!
Even then, the algorithm would only react if the robot were totally destroyed or the mp process itself failed; other kinds of robot failure would still go undetected.
Performance will take a hit:
bandwidth must be consumed for messages between mp nodes
each robot’s computational load will increase
Our system has a single point of failure: the VNC computer running roscore.
So we have to keep the VNC safe from whatever hostile environment the robots are exposed to.
Maybe this can be remedied in ROS 2, which does not rely on a single roscore.
Our patrol algorithm is very brittle and non-adversarial
Depends on AMCL and move_base, which do not tolerate even slight changes to the environment, and which do not deal well with moving obstacles.
There are no consequences to disturbing the patrol of the robots, or the robots themselves (e.g. no alarm or ‘arrests’)
The LEA is fairly modular, and the algorithm can be adapted to tasks other than patrolling.
It would work best for tasks that are clearly ordered by priority, and that are hazardous or likely to cause robot failures.
Completely fanciful scenarios:
Drone strikes: Maybe certain military targets are ordered by priority, and drones can be assigned to them so that higher-priority targets receive stubborn attention.
Planetary exploration: Perhaps some areas of a dangerous planet have a higher exploration priority than others. Our system could be adapted so that the highest-priority areas get covered first.
James Lee (leejs8128@brandeis.edu)
A special thanks to Adam Ring, the TA for the course, who helped with the project at crucial junctures.