Planet Linux Australia

Planet Linux Australia - http://planet.linux.org.au

Colin Charles: Premier Open Source Database Conference Call for Papers closing January 12 2018

Tue, 2018-01-02 22:01

The call for papers for Percona Live Santa Clara 2018 has been extended until January 12 2018. This means you still have time to get a submission in.

Topics of interest: MySQL, MongoDB, PostgreSQL & other open source databases. Don’t forget all the upcoming databases too (there’s a long list at db-engines).

I think, to be fair, that in the catch-all “other” category we should also be thinking about things like containerisation (Docker), Kubernetes, Mesosphere, the cloud (Amazon AWS RDS, Microsoft Azure, Google Cloud SQL, etc.), analytics (ClickHouse, MariaDB ColumnStore), and a lot more. Basically, anything that would benefit an audience of database geeks who are looking at it from all aspects.

That’s not to say case studies shouldn’t be considered. People always love to hear stories from the trenches. This is your chance to talk about just that.

Craige McWhirter: Resolving a Partitioned RabbitMQ Cluster with JuJu

Tue, 2018-01-02 16:44

On occasion, a RabbitMQ cluster may partition itself. In an OpenStack environment this can often first present itself as nova-compute services stopping with errors such as these:

ERROR nova.openstack.common.periodic_task [-] Error during ComputeManager._sync_power_states: Timed out waiting for a reply to message ID 8fc8ea15c5d445f983fba98664b53d0c
...
TRACE nova.openstack.common.periodic_task     self._raise_timeout_exception(msg_id)
TRACE nova.openstack.common.periodic_task   File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 218, in _raise_timeout_exception
TRACE nova.openstack.common.periodic_task     'Timed out waiting for a reply to message ID %s' % msg_id)
TRACE nova.openstack.common.periodic_task MessagingTimeout: Timed out waiting for a reply to message ID 8fc8ea15c5d445f983fba98664b53d0c

Merely restarting the stopped nova-compute services will not resolve this issue.
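If you want to confirm which compute hosts are affected, checking the service state on each of them is a quick first step; a minimal sketch, assuming the Upstart-managed trusty compute hosts used in this example:

# service nova-compute status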

You may also find that querying the RabbitMQ service either does not return or takes an awfully long time to return:

$ sudo rabbitmqctl -p openstack list_queues name messages consumers status
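If you would rather not have that query hang indefinitely while the cluster is partitioned, you can bound it with a timeout; a small sketch, assuming GNU coreutils timeout is available on the host:

$ sudo timeout 30 rabbitmqctl -p openstack list_queues name messages consumers status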

...and in an environment managed by JuJu, you could also see JuJu trying to correct RabbitMQ but failing:

$ juju stat --format tabular | grep rabbit
rabbitmq-server            false  local:trusty/rabbitmq-server-128
rabbitmq-server/0   idle   1.25.13.1  0/lxc/12   5672/tcp  192.168.7.148
rabbitmq-server/1   error  idle       1.25.13.1  1/lxc/8   5672/tcp  192.168.7.163  hook failed: "config-changed"
rabbitmq-server/2   error  idle       1.25.13.1  2/lxc/10  5672/tcp  192.168.7.174  hook failed: "config-changed"

You should now run rabbitmqctl cluster_status on each of your rabbit instances and review the output. If the cluster is partitioned, you will see something like the following:

ubuntu@my_juju_lxc:~$ sudo rabbitmqctl cluster_status
Cluster status of node 'rabbit@192-168-7-148' ...
[{nodes,[{disc,['rabbit@192-168-7-148','rabbit@192-168-7-163',
                'rabbit@192-168-7-174']}]},
 {running_nodes,['rabbit@192-168-7-174','rabbit@192-168-7-148']},
 {partitions,[{'rabbit@192-168-7-174',['rabbit@192-168-7-163']},
              {'rabbit@192-168-7-148',['rabbit@192-168-7-163']}]}]
...done.
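Rather than logging in to every rabbit unit, you can gather the same output from all of them in one pass with juju run; a sketch, assuming the rabbitmq-server service name used in this deployment and that the command runs as root in the hook context, as juju run normally does:

$ juju run --service rabbitmq-server "rabbitmqctl cluster_status"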

You can clearly see from the above that there are two partitions in RabbitMQ. We now need to identify which unit JuJu considers the leader:

maas-my_cloud:~$ juju run --service rabbitmq-server "is-leader"
- MachineId: 0/lxc/12
  Stderr: |
  Stdout: |
    True
  UnitId: rabbitmq-server/0
- MachineId: 1/lxc/8
  Stderr: |
  Stdout: |
    False
  UnitId: rabbitmq-server/1
- MachineId: 2/lxc/10
  Stderr: |
  Stdout: |
    False
  UnitId: rabbitmq-server/2

As you can see above, in this example machine 0/lxc/12 (unit rabbitmq-server/0) is the leader, indicated by its is-leader status of "True". Now we need to hit the other two servers and shut down RabbitMQ:

# service rabbitmq-server stop
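If you prefer to drive this through JuJu instead of logging in to each machine, the same stop can be issued against the two non-leader units; a sketch, assuming the unit names from this example:

$ juju run --unit rabbitmq-server/1 "service rabbitmq-server stop"
$ juju run --unit rabbitmq-server/2 "service rabbitmq-server stop"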

Once both services have completed shutting down, we can resolve the partitioning by running:

$ juju resolved -r rabbitmq-server/<whichever is leader>

Substitute <whichever is leader> with the unit number of the leader identified earlier.
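In this example, where rabbitmq-server/0 was reported as the leader, that would be:

$ juju resolved -r rabbitmq-server/0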

Once that has completed, you can start the previously stopped services by running the following on each host:

# service rabbitmq-server start

and verify the result with:

$ sudo rabbitmqctl cluster_status
Cluster status of node 'rabbit@192-168-7-148' ...
[{nodes,[{disc,['rabbit@192-168-7-148','rabbit@192-168-7-163',
                'rabbit@192-168-7-174']}]},
 {running_nodes,['rabbit@192-168-7-163','rabbit@192-168-7-174',
                 'rabbit@192-168-7-148']},
 {partitions,[]}]
...done.

No partitions \o/

The JuJu errors for RabbitMQ should clear within a few minutes:

$ juju stat --format tabular | grep rabbit
rabbitmq-server            false    local:trusty/rabbitmq-server-128
rabbitmq-server/0   idle     1.25.13.1  0/lxc/12   5672/tcp  192.168.1.148
rabbitmq-server/1   unknown  idle       1.25.13.1  1/lxc/8   5672/tcp  192.168.1.163
rabbitmq-server/2   unknown  idle       1.25.13.1  2/lxc/10  5672/tcp  192.168.1.174

You should also find the nova-compute services starting up fine.
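If any of them are still stopped, a quick way to confirm recovery is to restart nova-compute on the affected compute hosts and then check its state from the control plane; a sketch, assuming the nova client is installed and admin credentials have been sourced:

# service nova-compute restart
$ nova service-list | grep nova-compute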