Here’s a quick post — it should be a nice short one by my standards. This weekend I decided to upgrade a couple of my Ubuntu servers from 18.04 to 20.04. I ran into a bit of a problem with a really tiny cheap VPS that I keep mainly for playing around. It only has 256 MB of RAM and 5 GB of storage. It was an interesting challenge finding enough free disk space to complete the upgrade process to begin with, but that ended up being the easy part.

Read the rest of this entry

I’ve been involved a little bit with the process of porting RPiPlay to run on desktop Linux. RPiPlay is a program originally designed for the Raspberry Pi that acts as an AirPlay Mirroring server and supports mirroring your iOS device’s screen to your Raspberry Pi’s video out. Originally it only supported the Raspberry Pi, but antimof reworked the code to also work on desktop Linux with GStreamer, and I helped get it across the finish line and merged back into the main project.

A while ago, I noticed that when I ran RPiPlay in a VMware virtual machine during development, the video was messed up. It looked like some kind of horizontal synchronization issue. The image looked like it was stretching out further and further to the right on each successive line.

It worked fine on my laptop running Linux directly, which is probably the way most people use RPiPlay, so I didn’t think much more about it at the time. It bothered me though. It seemed to be a problem at a level deeper than RPiPlay, and I really wanted to understand why it was happening. So of course, I recently dug myself deep into a rabbit hole to try to figure it out.

Read the rest of this entry

I wanted to share a story of a segmentation fault I helped track down this weekend. I thought the final root cause of the segfault was interesting because of how unrelated it was to the code I was trying to debug.

I’ve been maintaining a Linux fork of obs-ios-camera-source, which is an OBS plugin that allows you to use an iPhone or iPad’s camera and microphone as a video and audio source in OBS. It works in conjunction with the “Camera for OBS Studio” app in the App Store. This kind of thing is useful for online streamers who want to use their phone’s camera instead of buying a separate camera. For those of you who don’t know, OBS is short for Open Broadcaster Software. A lot of streamers use it to handle broadcasting their stream. It allows you to capture audio and video, mix it all together, do all kinds of cool things with it, and then record the final result and/or stream it to sites such as YouTube and Twitch.

Getting this plugin working on Linux wasn’t really complicated, because it was already well-written without much platform-specific code. After all, the existing codebase was already operational on both macOS and Windows. It mostly just required tweaking a few compile/link options to make the code run happily on Linux.

Anyway, I’m pretty sure a good number of people have been using my Linux port of this plugin without issues. I know it works fine for me when I test with it in Ubuntu 18.04 or 20.04. I’ve helped people on other distros get it working too. I don’t really do any streaming myself — maybe someday though!

On Friday, GitHub user rrondeau reported an issue: after half a year of the obs-ios-camera-source plugin working without a problem, it suddenly started causing OBS to segfault on his computer (currently running Fedora 33). He provided a stack trace showing that the segfault was triggered by something the plugin initiated. Afterward, he used GDB to get a better stack trace that provided more info about the functions being called and the parameters being passed:

#0  0x00007fffee7abc64 in socket_send () at /usr/lib64/samba/libsamba-sockets-samba4.so
#1  0x00007fff88b7813c in send_packet (sfd=50, message=8, tag=1, payload=0x1b22e60, payload_size=488) at /home/rrondeau/git/perso/obs-ios-camera-source/deps/libusbmuxd/src/libusbmuxd.c:400
#2  0x00007fff88b782a6 in send_plist_packet (sfd=50, tag=1, message=0x1ae53e0) at /home/rrondeau/git/perso/obs-ios-camera-source/deps/libusbmuxd/src/libusbmuxd.c:431
#3  0x00007fff88b7851b in send_list_devices_packet (sfd=50, tag=1) at /home/rrondeau/git/perso/obs-ios-camera-source/deps/libusbmuxd/src/libusbmuxd.c:499
#4  0x00007fff88b79367 in usbmuxd_get_device_list (device_list=0x7fffffffc740) at /home/rrondeau/git/perso/obs-ios-camera-source/deps/libusbmuxd/src/libusbmuxd.c:938
#5  0x00007fff88b725e1 in portal::Portal::addConnectedDevices() (this=0x1909378) at /home/rrondeau/git/perso/obs-ios-camera-source/deps/portal/src/Portal.cpp:109
#6  0x00007fff88b72684 in portal::Portal::reloadDeviceList() (this=0x1909378) at /home/rrondeau/git/perso/obs-ios-camera-source/deps/portal/src/Portal.cpp:126
#7  0x00007fff88b722db in portal::Portal::Portal(portal::PortalDelegate*) (this=0x1909378, delegate=0x1909240) at /home/rrondeau/git/perso/obs-ios-camera-source/deps/portal/src/Portal.cpp:57
#8  0x00007fff88b67053 in IOSCameraInput::IOSCameraInput(obs_source*, obs_data*) (this=0x1909240, source_=0x1aee000, settings=0x19210a0)
    at /home/rrondeau/git/perso/obs-ios-camera-source/src/obs-ios-camera-source.cpp:74
#9  0x00007fff88b66358 in CreateIOSCameraInput(obs_data_t*, obs_source_t*) (settings=0x19210a0, source=0x1aee000) at /home/rrondeau/git/perso/obs-ios-camera-source/src/obs-ios-camera-source.cpp:371
#10 0x00007ffff6259c2a in obs_source_create_internal () at /lib64/libobs.so.0
#11 0x00007ffff626bb81 in obs_load_source_type () at /lib64/libobs.so.0
#12 0x00007ffff626e3c2 in obs_load_sources () at /lib64/libobs.so.0
#13 0x000000000049e750 in OBSBasic::Load(char const*) (this=0xa370b0, file=0x7fffffffd040 "/home/rrondeau/.config/obs-studio/basic/scenes/Untitled.json")
    at /home/rrondeau/git/perso/obs-studio/UI/window-basic-main.cpp:973
#14 0x00000000004a2976 in OBSBasic::OBSInit() (this=0xa370b0) at /home/rrondeau/git/perso/obs-studio/UI/window-basic-main.cpp:1783
#15 0x000000000047feff in OBSApp::OBSInit() (this=0x7fffffffd690) at /home/rrondeau/git/perso/obs-studio/UI/obs-app.cpp:1415
#16 0x0000000000482503 in run_program(std::fstream&, int, char**) (logFile=..., argc=1, argv=0x7fffffffdd68) at /home/rrondeau/git/perso/obs-studio/UI/obs-app.cpp:2052
#17 0x0000000000484203 in main(int, char**) (argc=1, argv=0x7fffffffdd68) at /home/rrondeau/git/perso/obs-studio/UI/obs-app.cpp:2697

The actual segfault was happening inside of a function called “socket_send” in libsamba-sockets-samba4.so, which was being called by a function in libusbmuxd. libusbmuxd is bundled as part of the obs-ios-camera-source plugin source code and is used for communicating with iOS devices over USB. When I first saw this in the stack trace, I thought “Huh…that’s weird. Why does libusbmuxd use Samba’s library for its socket code instead of providing its own?” (Samba is an implementation of the Windows file sharing protocol used by pretty much every Linux distribution.)

I tested and couldn’t reproduce the issue in Ubuntu. I know basically nothing about Fedora, but I faked my way through grabbing a Fedora 33 virtual machine, installing OBS, and compiling the plugin. I ran into the exact same issue that he was seeing.

Before I had a chance to look deeper and understand what was going on, rrondeau beat me to the correct conclusion: code in Samba’s library was mistakenly being called. libusbmuxd has a function called socket_send, but clearly libsamba-sockets-samba4’s function that is also named socket_send was accidentally being called instead.

Honestly, that’s all we really needed to know. Renaming libusbmuxd’s socket_send function to something else, and updating all references to it to use the new name, fixed the issue. I still wanted to understand why this suddenly became an issue when it had been working fine prior to that. Why were we calling into Samba libraries? Why does an iOS USB multiplexing library even consider talking to a library associated with Windows file sharing?

Not knowing the answer to that question bothered me. I decided to dig deeper and understand exactly what was going on. I started by using ldd, which lists all dynamic libraries used by a program or library:

[fedora@fedora33 build]$ ldd obs-ios-camera-source.so 
	linux-vdso.so.1 (0x00007fffa599a000)
	libobs.so.0 => /lib64/libobs.so.0 (0x00007f0a3f688000)
	libavcodec.so.58 => /lib64/libavcodec.so.58 (0x00007f0a3e2db000)
	libavutil.so.56 => /lib64/libavutil.so.56 (0x00007f0a3e036000)
...
	libsamba-sockets-samba4.so => /usr/lib64/samba/libsamba-sockets-samba4.so (0x00007fd6af4b7000)
...

I truncated the output because it spit out a very long list of libraries. As we can see from ldd’s output, obs-ios-camera-source.so depends on libsamba-sockets-samba4.so. ldd lists all recursive dependencies as well, and I couldn’t find any references to “samba” in the plugin source code, so this was likely an indirect dependency instead. I confirmed this by using readelf to show only the direct dependencies:

[fedora@fedora33 build]$ readelf -d obs-ios-camera-source.so | grep NEEDED
 0x0000000000000001 (NEEDED)             Shared library: [libobs.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libavcodec.so.58]
 0x0000000000000001 (NEEDED)             Shared library: [libavutil.so.56]
 0x0000000000000001 (NEEDED)             Shared library: [libobs-frontend-api.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libgcc_s.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libpthread.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]

At this point I used ldd and readelf to walk through the tree of dependencies and figure out what was actually linking against the Samba libraries. I later learned that I could have installed lddtree (part of the pax-utils package) to do this automatically. Either way, this led me to discover that the Samba libraries were being included through libsmbclient, which was a dependency of libavformat (part of FFmpeg). libavformat is a dependency of libobs.
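If you want to script the manual walk, here is a minimal sketch of the first step: pulling the direct dependencies out of `readelf -d` output. The function name `parse_needed` is just an illustrative choice, and the sketch assumes the output format shown above.

```python
import re

# Matches lines such as:
#  0x0000000000000001 (NEEDED)   Shared library: [libobs.so.0]
NEEDED_PATTERN = re.compile(r"\(NEEDED\)\s+Shared library:\s+\[([^\]]+)\]")


def parse_needed(readelf_output):
    """Return the direct dependencies listed in `readelf -d` output."""
    return NEEDED_PATTERN.findall(readelf_output)


sample = """\
 0x0000000000000001 (NEEDED)             Shared library: [libobs.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libavcodec.so.58]
 0x000000000000000e (SONAME)             Library soname: [example.so]
"""

for name in parse_needed(sample):
    print(name)
```

To walk the whole tree, you would run readelf on each library's resolved path (ldd prints those paths) and repeat until something lists a Samba library.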

Repeating this experiment on Ubuntu showed that libavformat on Ubuntu does not depend on libsmbclient. This explains why I couldn’t reproduce the issue on Ubuntu. So why does Fedora’s (well, RPM Fusion’s) version of libavformat depend on libsmbclient?

It turns out that it’s a compile-time option for FFmpeg. libavformat contains code for talking with Windows servers using libsmbclient, but it’s an optional thing that you can choose to enable at compile time. Clearly Ubuntu chooses not to enable it, but RPM Fusion does. Actually, I found the exact post on RPM Fusion’s commits mailing list where the patch was added for enabling SMB support in FFmpeg. This patch is what led to the whole issue happening. If Ubuntu’s version of FFmpeg was being built with SMB support, we would have seen this a long time ago. This commit to RPM Fusion was made on December 31, 2020, which explains why rrondeau had only recently begun seeing the problem.

The root cause here is that the obs-ios-camera-source plugin was linking against two libraries that both provided a function named socket_send: libsamba-sockets-samba4 (indirectly through libobs) and libusbmuxd. libusbmuxd was being linked statically, but that doesn’t prevent functions in it from being resolved through dynamic linking rules anyway. So even though libusbmuxd was a static library with its own internal implementation of socket_send, it was using libsamba-sockets-samba4’s implementation instead.
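To make the "first definition wins" idea concrete, here is a toy model of the lookup. It is not how the real dynamic linker is implemented (that also involves lookup scopes, symbol binding, and versioning). The search order and the shortened library names below are assumptions chosen to match the behavior we observed: libraries already loaded by OBS get searched before the plugin itself.

```python
def resolve(symbol, lookup_order, exports):
    """Return the first object in lookup_order that exports the symbol."""
    for obj in lookup_order:
        if symbol in exports.get(obj, set()):
            return obj
    return None


# Assumed search order: OBS and its dependencies first, the plugin last.
lookup_order = [
    "obs",
    "libobs",
    "libavformat",
    "libsmbclient",
    "libsamba-sockets-samba4",
    "obs-ios-camera-source",
]

# Before the fix: both objects export a symbol named socket_send.
before = {
    "libsamba-sockets-samba4": {"socket_send"},
    "obs-ios-camera-source": {"socket_send"},
}

# After the fix: the plugin's copy has a name that nothing else exports.
after = {
    "libsamba-sockets-samba4": {"socket_send"},
    "obs-ios-camera-source": {"usbmuxd_socket_send"},
}

print(resolve("socket_send", lookup_order, before))
print(resolve("usbmuxd_socket_send", lookup_order, after))
```

In this model, the plugin's own socket_send is never reached as long as anything earlier in the search order exports the same name.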

rrondeau and I settled on changing what we had control over: the libusbmuxd source code embedded inside of the plugin’s source code. We went with simply adding a “usbmuxd_” prefix before all of the socket_ functions. There may be a more complex way of forcing it to use its own internal version of socket_send through linker options, but I feel that this is probably the simplest solution. It’s easy to implement, and it gets the job done.

This segfault turned out to be a pretty simple issue to solve and diagnose. Is it really worthy of a blog post? Maybe, maybe not. I could definitely foresee someone else running into this issue with another combination of libraries. socket_create, socket_close, socket_send, etc. are such generic names that it may happen again. This is a great opportunity to remind everyone: don’t use generic function names like this in your shared libraries, at least not in your exported symbols! You could easily run into a situation similar to this one. In my opinion, prefixes are definitely a good idea for your library’s exported symbols. In this case, both libusbmuxd and Samba were breaking that guideline.

This can be tricky because dynamic libraries on Linux export all symbols by default unless you specify otherwise. This is backwards from how Windows works with DLLs. Windows DLLs require you to specify which functions are being exported. I actually like that approach better! Here’s an interesting reference on how to customize the visibility of your Linux dynamic library’s symbols.

libusbmuxd already fixed this on their end quite a while ago: they now only export functions intended to be public, which have a usbmuxd_ or libusbmuxd_ prefix. I think the version included with the plugin’s source code is quite a bit older. For fun, I tried applying the visibility fixes from the linked patch to the plugin’s embedded libusbmuxd source code. The patches didn’t apply cleanly because the embedded libusbmuxd code is actually built using CMake, so I had to add the compiler flags to CMakeLists.txt by hand. After doing that, libusbmuxd’s internal socket_send function was indeed called instead, which fixed the segfault.

What do you think? Would it make sense to try to convince the Samba project to rename their exported socket functions, or would I be barking up the wrong tree? I suspect that Samba’s socket library is actually intentionally exporting these functions so that other Samba libraries can call the socket functions. Would renaming Samba’s exported socket functions to give them less generic names cause a ton of incompatibilities given how long those function names have existed? Is it too late at this point? Am I wrong to think that Samba’s exported socket functions should have a “samba_” prefix or something like that?

I like my EarPods. Yeah, not the new fancy wireless ones. Just the standard wired earbuds that have come with iPhones for a long time. At one point I realized I prefer to use my EarPods during meetings. I was actually using my iPad instead of my computer to join meetings simply because I could use the EarPods.

Read the rest of this entry

I was looking at one of my classic Macs a few weeks ago, and noticed that my Ubuntu 18.04 netatalk server wasn’t showing up in the Chooser anymore. If you’re not familiar with netatalk, it’s an implementation of Apple Filing Protocol (AFP) that runs on Unix-like operating systems such as Linux and NetBSD. It allows other operating systems to act as Mac file servers. Version 2.x, which I use, supports the ancient AppleTalk protocol. This allows it to work with really old classic Macs that don’t even have a TCP/IP stack installed. Support for AppleTalk was removed in version 3.x, so that’s why I’m still using 2.x.

I checked out the server, and noticed that atalkd wasn’t running.

doug@miniserver:~$ ps ax | grep atalkd
3351 pts/0 R+ 0:00 grep --color=auto atalkd

Hmmm….why wouldn’t atalkd be running? I went ahead and tried to restart netatalk:

Read the rest of this entry

My favorite mouse is my Microsoft IntelliMouse Explorer 4.0. I bought my first one back in 2009. I love how the scroll wheel smoothly spins. I’ve never had another mouse like it. I want to keep using it forever and ever.

Read the rest of this entry

Last week, my main Linux computer died. It has an ancient Intel DX58SO motherboard from 2009 with an LGA 1366 CPU socket. A couple of years ago, I replaced its original Core i7-920 processor with a Core i7-980 from eBay. Considering its age, it’s actually a pretty powerful computer: six 3.33 GHz cores.

Anyway, here’s what happened. I was working, and just as I was about to join a meeting, I heard all of the fans in the computer stop spinning. The power LED remained on, but other than that, the machine looked like it was powered off. I tried power cycling it, but it was completely dead. After power cycling, the power LED wouldn’t turn on either.

Read the rest of this entry


Long story short: Dell recently released a bad BIOS update (3.9.0) for the Inspiron 3650 that seemingly bricked people’s computers. Luckily somebody discovered an easy fix you can do yourself by changing a jumper on the motherboard. If you’re interested in hearing how I fixed it in a much more convoluted way before this info about the jumper was available, keep on reading.

Read the rest of this entry

A while ago, I put 16 GB of RAM into one of my computers. The computer is using a Foxconn P55MX motherboard with a Core i5 750. It’s old and could probably be replaced, but it still works for what I need.

Here’s the interesting part. This motherboard doesn’t officially support 16 GB of RAM. The specs on the page I linked indicate that it supports a maximum of 8 GB. It only has 2 slots, so I had a suspicion that 8 GB sticks just weren’t as common back when this motherboard first came out. I decided to try anyway. In a lot of cases, motherboards do support more RAM than the manufacturer officially claims to support.

I made sure the BIOS was completely updated (version 946F1P06) and put in my two 8 gig sticks. Then, I booted it up into my Ubuntu 16.04 install and everything worked perfectly. I decided that my theory about the motherboard actually supporting more RAM than the documentation claimed was correct and forgot about it. I enjoyed having all the extra RAM to work with and was happy that my gamble paid off.

Then, a few months later, I tried to boot into Windows 10. I mostly use this computer in Linux. I only occasionally need to boot into Windows to check something out. That’s when the fun really started.

Read the rest of this entry

I work on embedded devices that have the capability of installing a firmware update by plugging in a USB flash drive containing an update file. These devices can also save reports onto an attached flash drive. Historically, these devices have worked with the various drives I’ve been able to test in house, but there have been occasional reports of incompatible drives in the field. I just used the sample code provided with the microcontroller manufacturer’s USB library, so I had no idea what I could do to improve compatibility.

Sometimes the problem is outside of my control. If the drive is larger than 32 GB, Windows 10 formats it as exFAT, and I don’t currently have exFAT support enabled (it’s too expensive to license from Microsoft). But most of the time, the problem isn’t the filesystem. The problem is that each drive behaves a little differently.

I decided to dedicate some time to improving USB drive compatibility in the embedded devices I work on. I researched USB mass storage devices, the USB specification, and the SCSI protocol that flash drives speak over USB. I bought an old Ellisys USB Tracker 110b, which is capable of recording raw USB 1.1 traffic (this is suitable for my needs, because the embedded devices I work on are only capable of full speed). Then, I bought a ton of used flash drives on eBay. My goal was to try as many drives as possible and discover various quirks. I also got a USB 1.1 hub in order to limit the drives to full speed on modern computers, and recorded what happened when I plugged the drives into Windows XP, Windows 10, Linux, Mac OS X, and Mac OS 9. 

I was successful. I found lots of little differences in how USB drives work. The purpose of this post is to share my findings to help others in the future who might have trouble with USB flash drive support in their embedded products.

Specifications

Here is a list of links to relevant specifications that will be useful as a reference:

Brief overview of how USB mass storage works

Before I get into specific details, I want to start with a quick overview of how all of the protocols explained in the specifications above combine together to enable computers to communicate with flash drives over USB.

When a flash drive is plugged in, the computer looks at its device, configuration, interface, and endpoint descriptors to determine what type of device it is. Flash drives use the mass storage class (0x08), SCSI transparent command set subclass (0x06), and the bulk-only transport protocol (0x50). The specification indicates that this should be specified in the interface descriptor, so the device descriptor should indicate the class is defined at the interface level.

What does this all mean? It just means that there will be two bulk endpoints: one for sending data from the host computer to the flash drive (OUT) and one for receiving data from the flash drive to the computer (IN). Data sent and received on these endpoints will adhere to the bulk-only transport protocol specification linked above. In addition, there are a few commands (read max LUN and bulk-only reset) that are sent over the control endpoint.
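As a rough illustration, here is how a host might check a raw 9-byte interface descriptor for those three values. The function name and the sample descriptors are made up for this sketch; the field layout is the standard USB interface descriptor.

```python
import struct

INTERFACE_DESCRIPTOR_TYPE = 0x04
MASS_STORAGE_CLASS = 0x08
SCSI_TRANSPARENT_SUBCLASS = 0x06
BULK_ONLY_PROTOCOL = 0x50


def is_bulk_only_mass_storage(descriptor):
    """Check a raw interface descriptor for class 0x08 / 0x06 / 0x50."""
    if len(descriptor) < 9:
        return False
    # bLength, bDescriptorType, bInterfaceNumber, bAlternateSetting,
    # bNumEndpoints, bInterfaceClass, bInterfaceSubClass,
    # bInterfaceProtocol, iInterface
    fields = struct.unpack_from("<9B", descriptor)
    descriptor_type = fields[1]
    triple = fields[5:8]
    return descriptor_type == INTERFACE_DESCRIPTOR_TYPE and triple == (
        MASS_STORAGE_CLASS,
        SCSI_TRANSPARENT_SUBCLASS,
        BULK_ONLY_PROTOCOL,
    )


# Hypothetical descriptors: a flash drive and a HID boot keyboard.
flash_drive = bytes([9, 4, 0, 0, 2, 0x08, 0x06, 0x50, 0])
keyboard = bytes([9, 4, 0, 0, 1, 0x03, 0x01, 0x01, 0])
print(is_bulk_only_mass_storage(flash_drive), is_bulk_only_mass_storage(keyboard))
```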

The host starts out by sending a 31-byte command block wrapper (CBW) to the drive, optionally sending or receiving data depending on what command it is, and then reading a 13-byte command status wrapper (CSW) containing the result of the command. The CBW and CSW are simply wrappers around Small Computer System Interface (SCSI) commands. Descriptions of the SCSI commands are available in the last two specifications I linked above.
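To show what the two wrappers look like on the wire, here is a small sketch that packs a CBW and unpacks a CSW using the field layout from the bulk-only transport specification. The function names are my own, and real firmware would do the same thing in C with packed structs.

```python
import struct

CBW_SIGNATURE = 0x43425355  # bytes "USBC" on the wire (little-endian)
CSW_SIGNATURE = 0x53425355  # bytes "USBS" on the wire (little-endian)


def build_cbw(tag, data_length, data_in, lun, cdb):
    """Pack a 31-byte command block wrapper around a SCSI command."""
    if not 1 <= len(cdb) <= 16:
        raise ValueError("SCSI command must be 1 to 16 bytes long")
    flags = 0x80 if data_in else 0x00  # bit 7 set: data flows device to host
    # "16s" pads the command with zero bytes up to 16 bytes.
    return struct.pack("<IIIBBB16s", CBW_SIGNATURE, tag, data_length,
                       flags, lun, len(cdb), cdb)


def parse_csw(raw, expected_tag):
    """Unpack a 13-byte command status wrapper into (residue, status)."""
    if len(raw) != 13:
        raise ValueError("a CSW is exactly 13 bytes long")
    signature, tag, residue, status = struct.unpack("<IIIB", raw)
    if signature != CSW_SIGNATURE or tag != expected_tag:
        raise ValueError("CSW does not match the CBW that was sent")
    return residue, status


# TEST UNIT READY is six zero bytes and has no data stage.
cbw = build_cbw(tag=1, data_length=0, data_in=False, lun=0, cdb=bytes(6))
print(len(cbw), cbw[:4])
```

A status of 0 in the CSW means the command passed, 1 means it failed, and 2 indicates a phase error.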

That’s all there is to it…except I haven’t said anything about which SCSI commands you’re supposed to use, or when. SCSI is a huge standard. Reading the entire standard document would take a ridiculous amount of time, and it wouldn’t really help you much anyway. Unfortunately, the standards don’t provide a section entitled “recommended sequence of commands for talking to flash drives over USB”.

This is where I originally hit a roadblock when I was implementing USB support, and it’s also why I simply stuck with the sample source code provided with the USB library I used. Unfortunately the sample source code was not good enough. What it comes down to in practice is you should try to do something similar to what Windows does, because pretty much every flash drive is compatible with Windows.

Initialization sequence

The first important thing to do when a USB flash drive is detected is to figure out information about it. Is it actually a flash drive? How big is it? How many logical units (LUNs) does it have? I found that if I didn’t send the same kind of preliminary commands that operating systems send, a SanDisk drive with a bunch of files on it would crash when I first attempted to write to it. Interestingly, the same drive didn’t have the same problem when it was empty. You may be thinking I’m an idiot and a problem like this obviously has to be in the filesystem library and not the USB library, but I swear that the problem was the USB communication to the drive itself, because imaging the drive to my computer with “dd” and using the same filesystem library on the raw drive image worked fine. The Ellisys USB Tracker confirmed that the drive responded with a stall condition after the write. After I cleared the stall, the drive stayed hung, even after a mass storage bulk-only reset command, which is supposed to prepare the drive to receive a new command.

Based on what I observed Windows, Mac OS X, and Linux doing, I changed my initialization sequence, and that problem completely went away. The sequence I do now is not an exact clone of any other OS, and it’s probably doing extra overkill commands, but hey, it works:

  1. Request the maximum LUN. If this request stalls, assume the maximum LUN is 0. Start working with LUN 0 in either case.
  2. Keep trying the sequence of “TEST UNIT READY” followed by “INQUIRY” until they both return success back-to-back. At this point you can look at the returned inquiry data to get more information about the name of the drive, if you care. For the inquiry, request 36 bytes. That’s what pretty much every OS does, so it’s best not to deviate from that.
  3. If the “INQUIRY” response data indicates that the peripheral device type is not 0 (meaning “direct-access device”), and there is more than one LUN, repeat step #2 with additional LUNs until you find one that is a direct-access device. Some promotional flash drives use LUN 0 as an emulated CD-ROM and the flash drive is on LUN 1, so you’d want to use LUN 1 instead of LUN 0 in that case. After this process is complete, use the LUN you discovered for all of the rest of your commands going forward. If you don’t find anything matching, just stick with LUN 0 in case the inquiry data is wrong.
  4. Attempt a “PREVENT ALLOW MEDIUM REMOVAL” command. A lot of operating systems do this, and most drives don’t support that command. It’s no big deal if the command fails. Just continue on. Interestingly I didn’t observe Windows XP sending this command on the drive I tested, but Windows 10, Mac OS X, and Linux did. I don’t know whether to include this command or not. It works for me.
  5. Keep attempting a “READ CAPACITY (10)” command until it succeeds. This will tell you the size of the drive in blocks (minus 1, because it returns the address of the last block), as well as the block size in bytes.
  6. Try a “MODE SENSE (6)” command, requesting 192 bytes of data on mode page 0x3F. The 192 matches what other operating systems request, so it’s best to match that. In the response data, if bit 7 of byte 2 is set, the drive is read-only. If the mode sense command fails, just move on. I haven’t found a drive that fails this command though.
  7. Just to be safe, do “TEST UNIT READY” again, repeating until it returns success. Now you are ready to send all the “READ (10)” and “WRITE (10)” commands you want to send.
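Here is a sketch of the SCSI command bytes and response parsing that the steps above rely on. The opcodes and field offsets are the standard ones from the SCSI specifications, but the helper names and the `pick_lun` fallback logic are just my illustration of steps 2, 3, and 5, not code from any particular USB library.

```python
import struct

# Command descriptor blocks (CDBs) used during initialization.
TEST_UNIT_READY = bytes(6)                             # opcode 0x00
INQUIRY = bytes([0x12, 0, 0, 0, 36, 0])                # request 36 bytes
PREVENT_MEDIUM_REMOVAL = bytes([0x1E, 0, 0, 0, 1, 0])  # prevent = 1
READ_CAPACITY_10 = bytes([0x25]) + bytes(9)
MODE_SENSE_6 = bytes([0x1A, 0, 0x3F, 0, 192, 0])       # page 0x3F, 192 bytes

DIRECT_ACCESS_DEVICE = 0x00


def parse_inquiry(data):
    """Return (peripheral device type, vendor, product) from INQUIRY data."""
    if len(data) < 36:
        raise ValueError("expected 36 bytes of INQUIRY data")
    device_type = data[0] & 0x1F
    vendor = data[8:16].decode("ascii", "replace").strip()
    product = data[16:32].decode("ascii", "replace").strip()
    return device_type, vendor, product


def parse_read_capacity(data):
    """Return (block count, block size) from READ CAPACITY (10) data."""
    last_block_address, block_size = struct.unpack(">II", data[:8])
    return last_block_address + 1, block_size


def pick_lun(device_types):
    """Pick the first direct-access LUN, falling back to LUN 0."""
    for lun, device_type in enumerate(device_types):
        if device_type == DIRECT_ACCESS_DEVICE:
            return lun
    return 0


# Hypothetical promotional drive: LUN 0 is a CD-ROM (type 5), LUN 1 is flash.
print(pick_lun([0x05, 0x00]))
```

Note that SCSI fields are big-endian, unlike the little-endian CBW and CSW fields that wrap them.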

In all of the places where I said to keep attempting something, I have a timeout of 5 seconds. If I don’t succeed and 5 seconds have elapsed in that step, I bail with an error. No drive I’ve tested so far has caused the 5 second timer to elapse, but if you’re extra worried you could try increasing the timeout.
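The "keep attempting" logic can be sketched as a small helper like this one. The 5 second default matches what I described above; the short delay between attempts and the injectable `clock` and `sleep` parameters are assumptions I added to keep the sketch easy to test.

```python
import time


def retry_until(attempt, timeout_s=5.0, delay_s=0.01,
                clock=time.monotonic, sleep=time.sleep):
    """Call attempt() until it returns True or timeout_s has elapsed."""
    deadline = clock() + timeout_s
    while True:
        if attempt():
            return True
        if clock() >= deadline:
            return False
        sleep(delay_s)


# Hypothetical drive that fails TEST UNIT READY 14 times before succeeding.
results = [False] * 14 + [True]


def fake_test_unit_ready():
    return results.pop(0)


print(retry_until(fake_test_unit_ready, delay_s=0))
```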

Things to watch out for

As I tested various drives, I noted strange behaviors in certain cases. Here is a list of things you should probably watch out for.

Mass storage reset

The sample code that came with the USB library I use performed a “Bulk-Only Mass Storage Reset” command as the first step. None of the operating systems I tested did this, so I removed it. I think you should only use this command as a last resort if you have lost communication with the drive and it has stopped responding to your CBWs. (Make sure the drive isn’t simply waiting for you to read back a CSW after a stall or something too…)

Drives that are both a CD and flash drive

As I mentioned above, some promotional drives are both a CD-ROM and a flash drive. Check for that type of drive with the “Get Max LUN” command, and use the INQUIRY data on each LUN to find the one that’s actually the flash drive.

MODE SENSE (6) or MODE SENSE (10)?

While I was checking out various operating systems, I noted that sometimes Windows and Mac OS X use “MODE SENSE (6)” and sometimes they use “MODE SENSE (10)”. Linux seems to always do “MODE SENSE (6)”. I couldn’t figure out how Windows and OS X were making that determination.

I originally tried just always using “MODE SENSE (10)”, since I also always use “READ (10)” and “WRITE (10)”, so I figured why not the same with mode sense? However, that was a mistake. Some drives don’t support that command, and others return incorrect results in it. One drive I tested was particularly frustrating. Its “MODE SENSE (10)” response indicated that it was locked, even though it wasn’t. Its “MODE SENSE (6)” response correctly said it was not locked.

The moral of this story? Just stick with “MODE SENSE (6)”. Every drive I’ve tested supports it and returns correct-ish data. One drive I tested returns an incorrect data length as the first byte of the SCSI response data (the mode parameter header), so if the drive only returns 4 bytes but claims there are 70 in the response, you might want to limit your parser to only check the first 4.
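One way to guard against a bogus length byte is to clamp it to the number of bytes that actually arrived. This is a sketch of that idea for the 4-byte mode parameter header; the function name and the returned dictionary are my own invention.

```python
def parse_mode_sense_6_header(received):
    """Parse the 4-byte mode parameter header of a MODE SENSE (6) response.

    The length the drive claims is clamped to the bytes actually received.
    """
    if len(received) < 4:
        raise ValueError("mode parameter header is 4 bytes long")
    # Byte 0 counts the bytes that follow it, so the claimed total is one more.
    claimed_length = received[0] + 1
    usable_length = min(claimed_length, len(received))
    return {
        "usable_length": usable_length,
        "write_protected": bool(received[2] & 0x80),  # bit 7 of byte 2
        "block_descriptor_length": min(received[3], usable_length - 4),
    }


# Hypothetical drive that sends only the header but claims far more data.
header = parse_mode_sense_6_header(bytes([69, 0x00, 0x80, 0x00]))
print(header["usable_length"], header["write_protected"])
```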

As I said earlier, you should request 192 bytes of data in this command to match what other operating systems do. I’d recommend requesting mode page 0x3F, which means “all pages”. I’ve read online that some misbehaving drives may get confused if you send a mode sense request for any other page or data length.

First TEST UNIT READY command fails

On a lot of drives, the first “TEST UNIT READY” command returns failure. The sense data (obtained with “REQUEST SENSE”; see below) indicates it’s a temporary condition and to try again. On most of these drives, the next attempt succeeds. On one drive I tested, it failed the first 14 attempts.

This is why my initialization sequence says to keep trying “TEST UNIT READY” until it succeeds. I added “INQUIRY” in there as well because it seemed other OSes would intermix “INQUIRY” with it too. If after 5 seconds (or whatever time limit you’re comfortable with) it still hasn’t responded with success, then maybe something really is wrong.

Repeat failed commands

Maybe I should have just generalized the above section, but I’ll repeat it here. If a command fails that you really care about, just try again. I have it set up so if a “READ (10)” or “WRITE (10)” fails, I try again a few times before bailing with an error.

In general, if an error occurs because the CSW indicates failure (and this rule of thumb also applies during the initialization sequence I described above), you should follow up with a “REQUEST SENSE” command to read information back about the failure before sending any other commands. Why? Because all the other operating systems do it too, so it’s a good idea. You are guaranteed that the drive has been tested under that behavior.

Theoretically the drive will tell you in the returned sense data whether the command failed due to a temporary condition or if the command is not supported. In practice, I do the “REQUEST SENSE” command and read the response from the device, but I ignore the content of the response. I simply repeat commands that I know are important, and I live with failure and move on if they’re not important (e.g. “PREVENT ALLOW MEDIUM REMOVAL”).
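The retry approach described above might look something like this. The `execute` callback is hypothetical: it stands in for whatever function sends a CBW, handles the data stage, and returns the CSW status along with any data. The 18-byte sense length and the default of 3 attempts are assumptions on my part.

```python
REQUEST_SENSE = bytes([0x03, 0, 0, 0, 18, 0])  # ask for 18 bytes of sense data
CSW_STATUS_PASSED = 0


def run_with_retries(execute, cdb, data_in_length=0, attempts=3):
    """Run a command, sending REQUEST SENSE after each failed attempt.

    execute(cdb, data_in_length) must return (csw_status, data).
    Returns the data on success, or None if every attempt failed.
    """
    for _ in range(attempts):
        status, data = execute(cdb, data_in_length)
        if status == CSW_STATUS_PASSED:
            return data
        # Read the sense data like the big operating systems do,
        # then ignore its content and simply try again.
        execute(REQUEST_SENSE, 18)
    return None
```

For unimportant commands such as “PREVENT ALLOW MEDIUM REMOVAL”, you could pass `attempts=1` and carry on regardless of the result.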

Write delays

This is probably obvious, but I’m pointing it out anyway. Sometimes “WRITE (10)” operations take a while to complete. When this happens, the drive will respond with NAKs until it’s finished. The NAKs could occur at any point — maybe while trying to read back the CSW, or while sending the next CBW, or in the middle of the data transfer process. If you’re designing a communication protocol that receives data and writes it to disk, make sure it has the ability to pause if the disk is too slow to keep up.

Handling short responses

In some cases, you will request data from the flash drive, but it will respond with less data than you requested. A perfect example of this situation is the “MODE SENSE (6)” command when requesting 192 bytes of data. I haven’t found a flash drive yet that has 192 bytes of mode pages to respond with.

According to the bulk-only transport specification, if the host requests more data than the device can provide in this situation, the device is allowed to pad the response with extra data to match the requested length. If it doesn’t do this, it must stall the BULK IN endpoint after transmitting as much data as it can. I was able to observe different drives that implemented each of these behaviors.

It turns out that some flash drives ignore the above requirement. They don’t pad the response with extra data, and they don’t stall the endpoint afterward; they simply send as much data as they have. Although these devices aren’t following the bulk-only transport specification correctly, it’s not too hard to handle this situation. If you receive a USB data packet shorter than the endpoint’s maximum packet size (a “short packet”) when reading a response, but you haven’t received all the data you expected, you know the drive has ended the response early, and the next read you attempt will give you the CSW. The USB library I use didn’t handle this case properly and would hang when trying to read a “MODE SENSE (6)” response from a misbehaving drive. It kept reading after the short packet. The next packet was the CSW, which it treated as more mode sense data, and then it hung waiting for the rest of the mode sense data to arrive. I fixed the library to look for short packets and terminate the transfer early.
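
To make the short packet rule concrete, here is a rough sketch in C of a data-stage read loop that stops at the first short packet. This is not the actual library fix. The `read_packet_fn` callback and the `fake_drive` stand-in are hypothetical and only exist so the loop can be shown in isolation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical hook into the USB stack: read one BULK IN packet of up to
 * max_len bytes. Returns the number of bytes received, or a negative value
 * on a stall or other error. */
typedef int (*read_packet_fn)(void *ctx, uint8_t *buf, size_t max_len);

/* Read up to `expected` bytes of a data stage, stopping early at the first
 * short packet. Returns the number of bytes actually received. */
static size_t read_data_stage(read_packet_fn read_packet, void *ctx,
                              uint8_t *buf, size_t expected,
                              size_t max_packet)
{
    size_t total = 0;

    assert(max_packet > 0);
    while (total < expected) {
        size_t want = expected - total;
        if (want > max_packet)
            want = max_packet;

        int got = read_packet(ctx, buf + total, want);
        if (got < 0)
            break;                  /* stall or error: the caller deals with it */
        total += (size_t)got;

        if ((size_t)got < max_packet)
            break;                  /* short packet: the drive is done, and the
                                       next packet on this endpoint is the CSW */
    }
    return total;
}

/* Stand-in for a drive that has only `data_left` bytes to send. */
struct fake_drive {
    size_t data_left;
    int reads;
};

static int fake_read_packet(void *ctx, uint8_t *buf, size_t max_len)
{
    struct fake_drive *d = ctx;
    size_t n = d->data_left < max_len ? d->data_left : max_len;

    memset(buf, 0xAA, n);
    d->data_left -= n;
    d->reads++;
    return (int)n;
}
```

With a 64-byte endpoint and a drive that only has 36 bytes of mode data, `read_data_stage(fake_read_packet, &drive, buf, 192, 64)` performs a single packet read and returns 36, leaving the CSW untouched for the status stage.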

Because of the situation above, you might not realize a well-behaved drive has correctly stalled the endpoint until you try reading the CSW. So if you notice a stall condition after attempting to read back a CSW, clear the stall and try again. The specification explicitly allows retrying a CSW read after a stall. I suspect they allow it because of situations like the one I just described.
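
A sketch of that retry in C might look like the following. The `bulk_in_ops` callbacks are hypothetical placeholders for your USB stack, and the `fake_drive` helper only simulates a drive that stalls a set number of times. If the second attempt also stalls, the sketch gives up and returns the stall so the caller can fall back to the specification's reset recovery procedure.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum xfer_result { XFER_OK = 0, XFER_STALL = 1, XFER_ERROR = 2 };

/* Hypothetical hooks into the USB stack. */
struct bulk_in_ops {
    enum xfer_result (*read_csw)(void *ctx, uint8_t *csw); /* reads 13 bytes */
    void (*clear_stall)(void *ctx);  /* clears the halt on the BULK IN endpoint */
    void *ctx;
};

/* Try to read the CSW. If the endpoint is stalled, clear the stall and try
 * exactly once more. */
static enum xfer_result read_csw_with_retry(const struct bulk_in_ops *ops,
                                            uint8_t *csw)
{
    assert(ops && ops->read_csw && ops->clear_stall);

    enum xfer_result r = ops->read_csw(ops->ctx, csw);
    if (r == XFER_STALL) {
        ops->clear_stall(ops->ctx);
        r = ops->read_csw(ops->ctx, csw);
    }
    return r;   /* still XFER_STALL here means it's time for reset recovery */
}

/* Stand-in for a drive that stalls the next `stalls_left` CSW reads. */
struct fake_drive {
    int stalls_left;
    int clears;
};

static enum xfer_result fake_read_csw(void *ctx, uint8_t *csw)
{
    struct fake_drive *d = ctx;

    if (d->stalls_left > 0) {
        d->stalls_left--;
        return XFER_STALL;
    }
    memset(csw, 0, 13);
    return XFER_OK;
}

static void fake_clear_stall(void *ctx)
{
    struct fake_drive *d = ctx;
    d->clears++;
}
```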

Safely ejecting

Some drives I tested didn’t necessarily finish saving changes before they were unplugged. For example, if I wrote data to a temporary file, and then renamed the file as my last write operation, plugging the drive into the computer showed that the rename operation hadn’t completed successfully, even though I had definitely asked the filesystem library to unmount the filesystem.

This happened because the drive in question (a Samsung flash drive) implements caching, and I hadn’t told the drive to flush its cache to disk.

There is probably a complicated way to set up the cache properly. I believe one of the mode pages optionally returned by the “MODE SENSE (6)” command contains cache information which you might be able to configure. I came up with a simpler solution that seems to work. I send a “SYNCHRONIZE CACHE (10)” command after I’m done and have told the filesystem library to unmount the filesystem. This fixed the issue with the Samsung drive. Some drives don’t support this command and return an error, but it doesn’t seem to hurt anything.
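
For reference, here is a small sketch in C of a CBW carrying that command. The function name is made up for illustration. I'm assuming the simplest form of the command: a logical block address of 0 and a block count of 0, which asks the drive to flush everything, with no data stage.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fill in a 31-byte CBW that carries SYNCHRONIZE CACHE (10). */
static void build_sync_cache_cbw(uint8_t cbw[31], uint32_t tag, uint8_t lun)
{
    assert(lun <= 15);              /* bCBWLUN is a 4-bit field */
    memset(cbw, 0, 31);

    cbw[0] = 0x55;                  /* signature "USBC" */
    cbw[1] = 0x53;
    cbw[2] = 0x42;
    cbw[3] = 0x43;
    cbw[4] = (uint8_t)tag;          /* tag, little-endian; echoed in the CSW */
    cbw[5] = (uint8_t)(tag >> 8);
    cbw[6] = (uint8_t)(tag >> 16);
    cbw[7] = (uint8_t)(tag >> 24);
    /* Bytes 8..11 (data transfer length) and byte 12 (flags) stay zero
     * because this command has no data stage. */
    cbw[13] = lun;
    cbw[14] = 10;                   /* length of the command block */
    cbw[15] = 0x35;                 /* SYNCHRONIZE CACHE (10) opcode */
    /* The rest of the command block stays zero: LBA 0 and a block count
     * of 0, meaning "everything from the start of the medium". */
}
```

After sending this, read the CSW as usual. A failed status from a drive that doesn't support the command can be handled like any other unimportant failure.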

Unsupported commands

I have read online that some poorly-designed flash drives will stop responding if you send them a command they don’t support. I haven’t observed any such drives in action. If you are concerned about this, you might want to consider removing the “PREVENT ALLOW MEDIUM REMOVAL” command from the initialization sequence, because I’ve seen many drives that don’t support it. You might also want to find a safer way of flushing the write cache than what I chose. In my use case, the very last command I send prior to the drive being ejected is the “SYNCHRONIZE CACHE (10)” command. If that command is unsupported and causes the drive firmware to crash, the drive probably doesn’t support caching anyway, so I can assume the data is already safely written to the disk, and the user is about to unplug the drive regardless.

Final thoughts

If at all possible, get a hardware USB analyzer. They’re typically expensive, but they give you so much detail about everything that’s going on. I was lucky enough to find one on eBay for a reasonable price. I can’t imagine that I would have been able to do this level of troubleshooting without one.

Even if you don’t have a hardware analyzer, if you follow some of my suggestions above, your device will be much more likely to be compatible with all of the random flash drives your end users happen to try out.

For 100% compatibility, it would probably be best to try to replicate exactly what Windows does when a drive is plugged in. However, doing that is difficult. It’s sometimes hard to understand why Windows sends the commands it does.