Volumes of trouble
Disclaimer: this is Juno on CentOS 7 (with CentOS 6 on the cinder-volume host). The service currently serves LVM-backed cinder volumes over iSCSI.
So, a few days ago our customers started contacting us about problems with cinder volumes. Some new volumes just wouldn't attach to VMs: they went to "attaching" and then fell back to "available". When the customers then tried to delete those volumes, they ended up in the "Error deleting" state.
Ok, put on the debug gloves. Let's see. The cinder API logs had entries about rabbit message timeouts. Hmm, is rabbit being flaky again? Those issues are hard to debug, but no, it didn't seem to be rabbit: other services were fine, and the problem was too consistent to be rabbit's fault.
After that we did some more digging, trying to find a common denominator for the problems. And there it was, hidden in the flood of debug logs from the cinder-volume service on one of our cinder volume hosts (in retrospect, I should have found it sooner):
2015-12-15 14:45:28.718 31835 ERROR oslo.messaging.rpc.dispatcher [req-f6179d9f-094d-4daf-822b-625bfc943ddc username tenantid - - -] Exception during message handling: 'ascii' codec can't decode byte 0xa8 in position 15466: ordinal not in range(128)
2015-12-15 14:45:28.718 31835 TRACE oslo.messaging.rpc.dispatcher Traceback (most recent call last):
2015-12-15 14:45:28.718 31835 TRACE oslo.messaging.rpc.dispatcher File "/opt/venv-cinder/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 134, in _dispatch_and_reply
2015-12-15 14:45:28.718 31835 TRACE oslo.messaging.rpc.dispatcher incoming.message))
2015-12-15 14:45:28.718 31835 TRACE oslo.messaging.rpc.dispatcher File "/opt/venv-cinder/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 177, in _dispatch
2015-12-15 14:45:28.718 31835 TRACE oslo.messaging.rpc.dispatcher return self._do_dispatch(endpoint, method, ctxt, args)
2015-12-15 14:45:28.718 31835 TRACE oslo.messaging.rpc.dispatcher File "/opt/venv-cinder/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 123, in _do_dispatch
2015-12-15 14:45:28.718 31835 TRACE oslo.messaging.rpc.dispatcher result = getattr(endpoint, method)(ctxt, **new_args)
2015-12-15 14:45:28.718 31835 TRACE oslo.messaging.rpc.dispatcher File "/opt/venv-cinder/lib/python2.7/site-packages/osprofiler/profiler.py", line 105, in wrapper
2015-12-15 14:45:28.718 31835 TRACE oslo.messaging.rpc.dispatcher return f(*args, **kwargs)
2015-12-15 14:45:28.718 31835 TRACE oslo.messaging.rpc.dispatcher File "/opt/venv-cinder/lib/python2.7/site-packages/cinder/volume/manager.py", line 902, in initialize_connection
2015-12-15 14:45:28.718 31835 TRACE oslo.messaging.rpc.dispatcher volume)
2015-12-15 14:45:28.718 31835 TRACE oslo.messaging.rpc.dispatcher File "/opt/venv-cinder/lib/python2.7/site-packages/osprofiler/profiler.py", line 105, in wrapper
2015-12-15 14:45:28.718 31835 TRACE oslo.messaging.rpc.dispatcher return f(*args, **kwargs)
2015-12-15 14:45:28.718 31835 TRACE oslo.messaging.rpc.dispatcher File "/opt/venv-cinder/lib/python2.7/site-packages/cinder/volume/drivers/lvm.py", line 548, in create_export
2015-12-15 14:45:28.718 31835 TRACE oslo.messaging.rpc.dispatcher return self._create_export(context, volume)
2015-12-15 14:45:28.718 31835 TRACE oslo.messaging.rpc.dispatcher File "/opt/venv-cinder/lib/python2.7/site-packages/cinder/volume/drivers/lvm.py", line 560, in _create_export
2015-12-15 14:45:28.718 31835 TRACE oslo.messaging.rpc.dispatcher self.configuration)
2015-12-15 14:45:28.718 31835 TRACE oslo.messaging.rpc.dispatcher File "/opt/venv-cinder/lib/python2.7/site-packages/cinder/volume/iscsi.py", line 61, in create_export
2015-12-15 14:45:28.718 31835 TRACE oslo.messaging.rpc.dispatcher conf.iscsi_write_cache)
2015-12-15 14:45:28.718 31835 TRACE oslo.messaging.rpc.dispatcher File "/opt/venv-cinder/lib/python2.7/site-packages/cinder/brick/iscsi/iscsi.py", line 241, in create_iscsi_target
2015-12-15 14:45:28.718 31835 TRACE oslo.messaging.rpc.dispatcher LOG.debug("Targets after update: %s" % out)
2015-12-15 14:45:28.718 31835 TRACE oslo.messaging.rpc.dispatcher File "/opt/rh/python27/root/usr/lib64/python2.7/logging/__init__.py", line 1421, in debug
2015-12-15 14:45:28.718 31835 TRACE oslo.messaging.rpc.dispatcher msg, kwargs = self.process(msg, kwargs)
2015-12-15 14:45:28.718 31835 TRACE oslo.messaging.rpc.dispatcher File "/opt/venv-cinder/lib/python2.7/site-packages/cinder/openstack/common/log.py", line 310, in process
2015-12-15 14:45:28.718 31835 TRACE oslo.messaging.rpc.dispatcher msg = six.text_type(msg)
2015-12-15 14:45:28.718 31835 TRACE oslo.messaging.rpc.dispatcher UnicodeDecodeError: 'ascii' codec can't decode byte 0xa8 in position 15466: ordinal not in range(128)
2015-12-15 14:45:28.718 31835 TRACE oslo.messaging.rpc.dispatcher
2015-12-15 14:45:28.722 31835 ERROR oslo.messaging._drivers.common [req-f6179d9f-094d-4daf-822b-625bfc943ddc username tenantid - - -] Returning exception 'ascii' codec can't decode byte 0xa8 in position 15466: ordinal not in range(128) to caller
Now that's a head-scratcher. Non-ASCII characters while handling RPC messages, and that makes cinder-volume throw an exception? No wonder there are message timeouts. What on earth could cause this, though? This line might give a hint:
2015-12-15 14:45:28.718 31835 TRACE oslo.messaging.rpc.dispatcher LOG.debug("Targets after update: %s" % out)
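So the debug call formats the raw tgtadm output straight into the log message, and cinder's log adapter then pushes it through six.text_type(), which on Python 2 means decoding a byte string with the ASCII codec. Here's a minimal sketch of that failure mode, with a made-up byte string standing in for the real tgtadm output:

import six

out = b"Target 19: ...\xa8..."              # stand-in for the tgtadm output, with one non-ASCII byte
msg = "Targets after update: %s" % out      # on Python 2 this is still a byte string

try:
    six.text_type(msg)                      # what cinder's log adapter does (unicode() on Python 2)
except UnicodeDecodeError as exc:
    print("UnicodeDecodeError: %s" % exc)   # 'ascii' codec can't decode byte 0xa8 ...

Any non-ASCII garbage in the tgtadm output is enough to blow up the whole RPC call.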
Targets after update? Uhm, ok. After some time spent searching for tgt commands and the correct grep syntax, my colleague pasted this command to find non-ASCII characters in the tgtadm output (it was not my first try at getting this info, but I'll spare you my bash history). As a side note, the tgt-admin command actually converts the non-ASCII characters to ASCII, so you can't grep for them there.
tgtadm --op show --mode target | grep --color='auto' -P -n "[\x80-\xFF]" -C 100
466-Target 19: iqn.2010-10.org.openstack:volume-<volid>
467- System information:
468- Driver: iscsi
469- State: ready
470- I_T nexus information:
471- I_T nexus: 664
472- Initiator: iqn.1994-05.com.redhat:a6cffabd21c1
473- Connection: 0
474: ��� LUN information:
475- LUN: 0
476- Type: controller
477- SCSI ID: IET 00130000
478- SCSI SN: beaf190
479- Size: 0 MB, Block size: 1
480- Online: Yes
481- Removable media: No
482- Prevent removal: No
483- Readonly: No
484- Backing store type: null
485- Backing store path: None
486- Backing store flags:
487- LUN: 1
488- Type: disk
489- SCSI ID: IET 00130001
490- SCSI SN: beaf191
491- Size: 214748 MB, Block size: 512
492- Online: Yes
493- Removable media: No
494- Prevent removal: No
495- Readonly: No
496- Backing store type: rdwr
497- Backing store path: /dev/vg_cinder_02/volume-<volid>
498- Backing store flags:
499- Account information:
500- ACL information:
501- ALL
Uhm, what was that? For one volume, the iSCSI connection information shows random garbage where the IP info for the connection should be. Ok, let's see.
tgtadm --lld iscsi --op show --mode conn --tid 19
Session: 664
Connection: 0
Initiator: iqn.1994-05.com.redhat:a6cffabd21c1
IP Address: 10.100.3.14
Well that's strange, it... actually doesn't show anything wrong? One command gives garbled data, the other one doesn't. We did some more digging after this, but really didn't find anything on the tgt side that we could debug to solve the issue.
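If you want to keep an eye out for this happening again, the same check the grep does is easy to script. A rough sketch (we didn't actually run this, and tgtadm needs root):

import subprocess

def garbled_lines():
    # Same command as the grep above; flag any line containing non-ASCII bytes.
    out = subprocess.check_output(["tgtadm", "--op", "show", "--mode", "target"])
    for lineno, line in enumerate(out.splitlines(), 1):
        if any(byte > 127 for byte in bytearray(line)):
            yield lineno, line

if __name__ == "__main__":
    for lineno, line in garbled_lines():
        print("non-ASCII bytes on line %d: %r" % (lineno, line))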
What we ended up doing was powering off the VM that the troublesome volume was attached to. Then we logged out of the iSCSI session on the target machine to get rid of the connection, and after that we started the VM up again.
It worked. There was a new iSCSI connection without errors, no more strange garbled tgt data, and our volume service started working normally again. Now we just needed to clean up the pile of "error deleting" volumes.
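Volumes stuck in "error_deleting" can't normally be deleted again before their state is reset with the admin reset-state action. Something along these lines with python-cinderclient would do it (a sketch, not what we actually ran; admin credentials come from the usual OS_* environment variables):

import os
from cinderclient import client

cinder = client.Client('2',
                       os.environ['OS_USERNAME'],
                       os.environ['OS_PASSWORD'],
                       os.environ['OS_TENANT_NAME'],
                       os.environ['OS_AUTH_URL'])

# Find every volume stuck in "error_deleting", reset it to "error" and retry the delete.
stuck = cinder.volumes.list(search_opts={'all_tenants': 1, 'status': 'error_deleting'})
for vol in stuck:
    cinder.volumes.reset_state(vol, 'error')   # same admin action as "cinder reset-state --state error"
    cinder.volumes.delete(vol)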
We still have no idea what the root cause for this was, and we possibly never will.