ECE 5760: Gesture-controlled iPod/iPhone music dock

Introduction

For our ECE 5760 final project, we felt it would be interesting to build a gesture-controlled iPod/iPhone music dock. In our imagined setting, a user with a green-tipped wand waves patterns at a composite TV camera connected to the music dock. These patterns are then interpreted by the dock and translated into device playback actions on the iPod/iPhone. Current track, artist and album information, as well as the iDevice name, is displayed on a VGA output screen and refreshed in real time.

Our project was built entirely in Verilog, without NiosII support.

Design

Figure 1

The high level design diagram is shown in Figure 1. A video camera captures gestures that are then translated into commands to be issued to the iPod/iPhone (which for simplicity we will subsequently refer to generically as 'iPhone'). For our project, we used the Altera DE2 FPGA board to process the video and communicate with the iPhone. There are three principal components in our design: an iPhone controller module, a VGA controller and a camera input system. The iPhone controller is responsible for the I/O with the iPhone via a TTL serial interface. The VGA controller outputs a video containing a composite of the input camera image, iPhone metadata and gesture tracking information.

Hardware

A detailed architecture of our gesture-controlled music dock is displayed in Figure 2. Our iPhone is connected to the Altera DE2 board via a 30-pin connector that permits serial communications between the FPGA and the phone.

Figure 2

The steps we took to generate iPhone commands from input video are as follows:

1. Capture the NTSC video from the video camera into SDRAM.
2. Read video from SDRAM and convert it into YCbCr color space.
3. A threshold algorithm is run to select green regions from the YCbCr video stream. These regions are replaced by black and sent on to the VGA controller.
4. The VGA controller, which receives the thresholded input video, performs noise rejection and gesture recognition. It also interfaces with the iPhone controller module to perform I/O on the device.
5. The resultant data is displayed onto the VGA screen. The VGA controller uses a multiplexer to select from SDRAM data (corresponding to the video camera feed) or an M4K partial frame buffer (used to store information about the current audio track), depending on the row requested by the controller. In our setup, rows 0-399 correspond to the video feed and rows 400-479 are used to display track metadata.
6. Gestures, if recognized, are translated by the VGA controller and their respective commands are dispatched to the iPhone controller. The iPhone controller is responsible for handling the TTL serial I/O with the actual iDevice. Commands that the iPhone controller wishes to send to the device are then placed into the transmit register of the RS232 module. Correspondingly, responses from the device are stored in various state-holding registers within the iPhone controller. These states are accessible from the VGA controller, which allows the latter to display track information.

The following subsections describe various technical aspects of our work in greater detail.

Threshold calculation in the YCbCr color space

YCbCr is a color space that separates the luminance (i.e. light intensity) component of an image from its color (chroma) components. Working in this color space has an advantage over RGB because colors in YCbCr are represented independently of their brightness, which makes threshold-based color detection easier. In our design, we converted green colors that were over a certain variable threshold (settable via the DE2 board hardware flip switches) into a nearly black RGB color (r=0, b=0, g=1). Other pixels were directly converted into RGB without modification. We picked this specific RGB color to differentiate thresholded pixels from black pixels that could be the result of camera defects. The modified RGB frame is then forwarded to the VGA controller, which interprets pixels with r=0, b=0, g=1 as candidate pixels for gesture recognition.
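
The thresholding step can be illustrated with a small behavioral model in Python. This is only a sketch and not our Verilog: the exact green test is not spelled out above, so the rule used here (both Cb and Cr must sit at least `threshold` below the neutral value 128) is an assumption, as is the full-range BT.601 conversion and the names `threshold_pixel`, `ycbcr_to_rgb` and `MARKER`.

```python
# Behavioral sketch of the green threshold stage (assumptions noted inline).
MARKER = (0, 0, 1)  # nearly black RGB value reserved for thresholded pixels


def clamp8(value):
    """Round and clamp a value to the 8-bit range 0..255."""
    return max(0, min(255, int(round(value))))


def ycbcr_to_rgb(y, cb, cr):
    """Full-range BT.601 conversion (assumed; the hardware may use video range)."""
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return (clamp8(r), clamp8(g), clamp8(b))


def threshold_pixel(y, cb, cr, threshold):
    """Return MARKER for 'green enough' pixels, else the converted RGB pixel.

    Assumed test: saturated green has low Cb and low Cr, so both must be
    at least `threshold` below the neutral chroma value of 128.
    """
    if cb < 128 - threshold and cr < 128 - threshold:
        return MARKER
    return ycbcr_to_rgb(y, cb, cr)
```

In this model the `threshold` argument plays the role of the value set on the flip switches: raising it demands a more saturated green before a pixel is marked.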

Partial frame buffer for audio metadata

To display information such as the track, artist and album name on the screen, we built a partial frame buffer out of M4K memory blocks. This is a partial frame buffer because it consists of only 80 rows of 1-bit pixel data, for a total of 640*80*1 = 51200 bits. The M4K memory block we used had two ports: one for reading by the VGA controller and one for writing by the character renderer module.
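
As a rough illustration of how a requested screen coordinate could map onto the two memories, the Python sketch below models the row-based source selection and the bit address inside the partial frame buffer. The row-major address layout and the name `select_source` are assumptions made for the example; only the row split (0-399 video, 400-479 metadata) and the buffer size come from our design.

```python
WIDTH, HEIGHT = 640, 480
META_FIRST_ROW = 400                      # rows 400-479 show track metadata
META_ROWS = HEIGHT - META_FIRST_ROW       # 80 rows of 1-bit pixels


def select_source(x, y):
    """Pick the pixel source for screen coordinate (x, y).

    Returns ("sdram", None) for camera rows, or ("m4k", bit_address) for
    metadata rows, assuming a row-major layout in the partial frame buffer.
    """
    if y < META_FIRST_ROW:
        return ("sdram", None)
    return ("m4k", (y - META_FIRST_ROW) * WIDTH + x)
```

With this layout the last metadata pixel lands on bit address 51199, which matches the 51200-bit buffer size.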

Character renderer module

The character renderer module (called print_char in our project) abstracts away the messy details of blitting text onto a frame buffer. Each text character is represented by an 8x8 bitmap. This renderer module presents a simple interface: it takes a pair of text coordinates, an ASCII value to display and a frame buffer. It then blits appropriate bits onto the frame buffer so as to provide the illusion that the frame buffer is divided into discrete cells for printable characters. Internally, the character renderer module consults a lookup table that holds the bitmap values for the printable characters, then iteratively sets or unsets the relevant bits in the frame buffer, correctly converting between text coordinates and pixel coordinates at each step.
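
The idea behind the renderer can be sketched in a few lines of Python. This is a software model rather than the print_char Verilog module itself: the one-entry `FONT` table, its glyph data and the MSB-is-leftmost bit order are hypothetical and stand in for the real 8x8 font lookup table.

```python
CELL = 8                          # each glyph is an 8x8 bitmap
FB_WIDTH, FB_HEIGHT = 640, 80     # partial frame buffer: 80 x 10 text cells

# Hypothetical font table: ASCII code -> 8 row bytes, MSB = leftmost pixel.
FONT = {
    ord("T"): [0xFF, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x00],
}


def make_frame_buffer():
    """Create an all-zero 1-bit frame buffer indexed as fb[y][x]."""
    return [[0] * FB_WIDTH for _ in range(FB_HEIGHT)]


def print_char(fb, col, row, code):
    """Blit the glyph for ASCII `code` into text cell (col, row) of `fb`."""
    glyph = FONT.get(code, [0] * CELL)    # unknown characters render blank
    x0, y0 = col * CELL, row * CELL       # text coordinates -> pixel coordinates
    for dy, bits in enumerate(glyph):
        for dx in range(CELL):
            # Set or clear each pixel, so a cell is fully overwritten.
            fb[y0 + dy][x0 + dx] = (bits >> (CELL - 1 - dx)) & 1
```

Because every pixel in the cell is written (set or cleared), printing a new character over an old one needs no separate erase step.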

Periodic metadata frame buffer refresh

A finite state machine controls the periodic refresh of the audio metadata onto the VGA screen. Whenever the iPhone controller receives any displayable metadata, a valid signal is sent to this FSM, which then redraws the partial frame buffer with the updated text (i.e. track, artist and album name, along with the iDevice name).

VGA controller

Our VGA controller was an adaptation of an existing VGA controller from the ECE5760 website. Our modifications were designed to perform 4 additional functions beyond drawing to the screen: (1) green centroid calculation, (2) mean filtering of the centroid for noise reduction, (3) gesture recognition, and (4) periodic iPhone metadata polling. These functions are described in the following subsections.

Green centroid calculation

To calculate the green centroid, we find the maximum and minimum x and y pixel coordinates that were marked as green from the annotated output of the YCbCr module. In other words, our simple algorithm takes the minimum bounding box that encompasses all pixels marked as green. The centroid is then the coordinate corresponding to the middle of the rectangle, i.e. (Min_X + Max_X) >> 1 and (Min_Y + Max_Y) >> 1. This centroid can be rather jittery between frames because of noise in the input, so we perform mean filtering on the centroid.
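
A minimal Python model of this bounding-box calculation is shown below. The function name and the choice to return `None` for a frame with no green pixels are assumptions for the sketch; the midpoint arithmetic follows the shifts given above.

```python
def bounding_box_centroid(green_pixels):
    """Return the centre of the bounding box of (x, y) pixels, or None if empty."""
    min_x = max_x = min_y = max_y = None
    for x, y in green_pixels:
        # Running min/max, updated once per marked pixel as the frame streams by.
        min_x = x if min_x is None else min(min_x, x)
        max_x = x if max_x is None else max(max_x, x)
        min_y = y if min_y is None else min(min_y, y)
        max_y = y if max_y is None else max(max_y, y)
    if min_x is None:
        return None
    return ((min_x + max_x) >> 1, (min_y + max_y) >> 1)
```

For example, pixels at (10, 20), (30, 40) and (20, 25) give a bounding box from (10, 20) to (30, 40) and a centroid of (20, 30). A single stray pixel far from the wand moves the box edge, which is one reason the result is jittery.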

Mean filtering for noise reduction

Noise in the centroid location can adversely affect gesture recognition, because a jittery centroid may hop into gesture hotspots that the user never intended to touch. To prevent this jitter from causing gesture mispredictions, we compute the mean centroid over a history of the most recent 8 frames. To do this, we maintain an 8-position circular shift register that stores the centroid coordinate history. The mean centroid is then the unweighted average of the stored coordinates. While naively simple as a solution, in practice this yields acceptable results. We also attempted other means of noise rejection by detecting outliers across successive frames. However, this did not provide much benefit and in fact became a liability when tracked regions moved very quickly across the camera.
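
The filter can be modeled in Python as follows. This is a sketch under two assumptions that are not stated above: the history starts out filled with zeros, and the division by 8 is done with a right shift, as would be natural in hardware. The class name `CentroidFilter` is hypothetical.

```python
from collections import deque


class CentroidFilter:
    """Unweighted mean over the 8 most recent centroids."""

    DEPTH = 8  # history length; a power of two so the mean is a shift

    def __init__(self):
        # Assumed reset state: the history is pre-filled with (0, 0).
        self.history = deque([(0, 0)] * self.DEPTH, maxlen=self.DEPTH)

    def update(self, centroid):
        """Push the newest centroid (dropping the oldest) and return the mean."""
        self.history.append(centroid)
        sum_x = sum(x for x, _ in self.history)
        sum_y = sum(y for _, y in self.history)
        return (sum_x >> 3, sum_y >> 3)
```

With a steady centroid at (80, 160), a single frame that jumps to (160, 160) only moves the filtered output to (90, 160), which shows how one noisy frame is damped.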

Gesture recognition

For gesture recognition, we designated a portion of the input video as an active region. This active region does not occupy the entire frame, as we anticipated that users would want to see where they are on the camera before performing gestures. The active region spans 200x200 pixels on the VGA screen and is divided into 4 quadrants measuring 100x100 pixels each. The quadrants are marked on the VGA screen, and a tracked object that wanders into the active region causes the color of the relevant quadrant to become inverted, thus providing visual feedback to the user. A diagram of our active region is shown in Figure 3.

Figure 3
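
To make the quadrant layout concrete, here is a small Python sketch that maps a centroid to a quadrant number. The position of the active region on screen (`ACTIVE_X0`, `ACTIVE_Y0`) is a hypothetical value, and the numbering (1 top-left, 2 top-right, 3 bottom-left, 4 bottom-right) is inferred from the swipe directions in the command table later in this section rather than stated explicitly.

```python
ACTIVE_X0, ACTIVE_Y0 = 220, 100   # hypothetical top-left corner of the active region
ACTIVE_SIZE = 200                 # active region is 200x200 pixels
QUAD_SIZE = 100                   # each quadrant is 100x100 pixels


def quadrant_of(x, y):
    """Return 1..4 for a point inside the active region, or 0 if outside.

    Assumed numbering: 1 = top-left, 2 = top-right,
    3 = bottom-left, 4 = bottom-right.
    """
    dx, dy = x - ACTIVE_X0, y - ACTIVE_Y0
    if not (0 <= dx < ACTIVE_SIZE and 0 <= dy < ACTIVE_SIZE):
        return 0
    return 1 + (dx // QUAD_SIZE) + 2 * (dy // QUAD_SIZE)
```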

A gesture is defined as a consecutive traversal from an inactive region to one or more quadrants within the active region, and finally out of the active region. Thus, after entering the active region, gestures are not finalized until the user completely moves the tracked object outside. Gestures are stored as a finite history of quadrant traversals, with capacity for up to 4 quadrants. Figure 4 depicts the array we use to store gestures.

Figure 4

After a complete gesture has been registered, the array is sent to a lookup table for command translation. To prevent accidental events, gestures containing only one quadrant history are discarded.

Command	Intuitive gesture	Acceptable quadrant history
Play/Pause	down-to-up swipe	3, 1 or 4, 2
Stop	up-to-down swipe	1, 3 or 2, 4
Next track	left-to-right swipe	1, 2 or 3, 4
Previous track	right-to-left swipe	2, 1 or 4, 3
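
The table above translates directly into a lookup structure. The Python sketch below is a software stand-in for the hardware lookup table; the command names are placeholders, and treating any history not listed in the table (including 3- and 4-quadrant histories) as "no command" is an assumption.

```python
# Quadrant history -> command, transcribed from the table above.
GESTURES = {
    (3, 1): "play_pause", (4, 2): "play_pause",          # down-to-up swipe
    (1, 3): "stop", (2, 4): "stop",                      # up-to-down swipe
    (1, 2): "next_track", (3, 4): "next_track",          # left-to-right swipe
    (2, 1): "previous_track", (4, 3): "previous_track",  # right-to-left swipe
}


def translate(history):
    """Return the command for a quadrant history, or None if there is none."""
    if len(history) < 2:
        return None  # single-quadrant histories are discarded as accidental
    return GESTURES.get(tuple(history))
```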

The high-level finite state machine for gesture recognition is shown in Figure 5. The gesture recognition FSM waits for the refresh FSM (described in the next subsection) to be ready before it will register any quadrant history. When a user moves a detectable object across the video camera, the FSM tracks and updates the history as the object transits the individual quadrants. Once the object exits the active region entirely, the FSM looks up the command corresponding to the quadrant history and issues the appropriate command to the iPhone controller.

Figure 5
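
The behavior of this FSM can be approximated by the Python model below, which is fed one quadrant number per frame (0 meaning outside the active region). It is a sketch, not a transcription of Figure 5: the class and method names are hypothetical, and the handling of details such as ignoring input while a gesture is still pending, or dropping quadrants beyond the fourth, is assumed.

```python
class GestureTracker:
    """Tracks quadrant traversals and emits a history when the region is left."""

    MAX_HISTORY = 4  # capacity of the quadrant history

    def __init__(self):
        self.history = []
        self.pending = None  # completed gesture not yet consumed by the refresh FSM

    def step(self, quadrant):
        """Feed the current quadrant (0 = outside); return a finished history or None."""
        if self.pending is not None:
            return None  # assumed: wait until the previous gesture is consumed
        if quadrant != 0:
            is_new = not self.history or self.history[-1] != quadrant
            if is_new and len(self.history) < self.MAX_HISTORY:
                self.history.append(quadrant)
            return None
        if self.history:
            done, self.history = tuple(self.history), []
            if len(done) >= 2:  # single-quadrant histories are discarded
                self.pending = done
                return done
        return None

    def consume(self):
        """Called once the command has been sent, to accept a new gesture."""
        self.pending = None
```

Feeding the sequence 0, 1, 1, 2, 2, 0 yields the history (1, 2) on the final step, which corresponds to a left-to-right swipe.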

Polling metadata from the iDevice

A refresh FSM is dedicated to polling metadata from the iPod/iPhone about twice every second. In each poll, the current track name, artist and album are retrieved from the device, along with the device name. At the end of the poll, the FSM checks whether a gesture command is pending. If so, the translated gesture command is sent to the device and the gesture recognition FSM is notified that the gesture was consumed, so that the system can begin to accept a new gesture. Figure 6 shows the state machine for the refresh FSM.

Figure 6

iPhone Controller

The iPhone controller module is responsible for I/O with the iPhone hardware and abstracting away device-specific details from other modules. The controller coordinates with the RS232/serial module and a precomputed command table held in block memory. To interface with other modules, it provides a command input line and an array of output lines that correspond to parsed data received from the phone.

RS232 to TTL serial controller

Our serial controller was based on the RS232 code from John Loomis (see references). This controller was designed to ‘bit bang’ the RS232 lines at a given baud rate implied by a clock divisor. Although we were able to verify that the code worked by communicating via a COM port on a computer, we were initially unable to get it to communicate with the iPod.

We discovered that the iPhone in fact relied on TTL serial logic, which is a different signaling standard from RS232. In TTL serial, a high bit is represented by +3.3V and low is represented by 0V, whereas a high in RS232 is +6V (measured) and a low is -6V. Thankfully, this signaling level difference did not cause damage to our iPhone although it did make communications impossible.

We were faced with a choice of implementing our own TTL serial controller by adapting the RS232 bit-banging code, or otherwise purchasing an RS232 to TTL serial converter. In the end, we devised a clever means to convert between the two signaling standards, by observing that the GPIO pins on the DE2 board also worked at 3.3V/0V levels. Thus, by copying high/low signal levels from the RS232 port to a GPIO pin, and mirroring GPIO input, we were able to communicate with the iPhone via GPIO pins. It took just two lines of Verilog to mirror the I/O, and we were able to save ourselves time and money on the project.

iPhone accessory protocol and precomputed command sequences

The iPhone speaks with accessories at 19200 baud, 8N1, using a well-established protocol. Although official Apple documentation on the topic is scant, a number of enthusiasts have reverse-engineered the protocol and published some unofficial specifications (see references). Even though these unofficial specifications are incomplete, they are sufficient for designing a comprehensive iPhone dock.

Although we do not detail all the specifics of the protocol here, we will summarize by stating that it operates with predictable structure. The iPhone communicates in frames of data. The framing format is similar in both sending and receiving mode: it starts with a 0xff 0x55 header, followed by a byte indicating the size of the frame, a byte indicating the mode of operation, two bytes (big-endian format) indicating the command type, and then a variable number of parameter bytes before a single checksum byte (see Request/Response structure in the iPod accessory serial protocol; link is in the references).

Our iPhone dock primarily operates in AiR (Advanced iPod Remote) mode. We observed that most commands in this mode were fixed by nature since they did not have variable parameters. Thus, with the exception of commands that depend on a track ID, all command sequences could be precomputed and stored in block memory. For example, the command to set the iPhone into AiR mode is always 0xff 0x55 0x03 0x00 0x01 0x04 0xf8. We mapped a simple numeric ID to each command sequence, and constructed a lookup table that translated between the numeric ID and the block memory offset corresponding to the command sequence.
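
To show how such a sequence is put together, the Python sketch below builds a frame from its mode, command and parameter bytes. The checksum rule used here (the byte that makes the length, mode, command and parameter bytes sum to zero modulo 256) is taken from the unofficial specifications rather than from the text above, although it does reproduce the AiR mode example. The function name, the table and the numeric ID 0 are hypothetical.

```python
HEADER = bytes([0xFF, 0x55])


def build_frame(mode, command, params=b""):
    """Assemble header, length, mode, 2-byte command, parameters and checksum."""
    payload = bytes([mode]) + bytes(command) + bytes(params)
    length = len(payload)
    # Assumed checksum rule from the unofficial specifications:
    # length + payload bytes + checksum == 0 (mod 256).
    checksum = (0x100 - (length + sum(payload))) & 0xFF
    return HEADER + bytes([length]) + payload + bytes([checksum])


# Precomputed table in the spirit of the block-memory lookup
# (hypothetical ID 0 = switch to AiR mode).
COMMAND_TABLE = {
    0: build_frame(0x00, [0x01, 0x04]),
}
```

In this sketch, `COMMAND_TABLE[0]` comes out as 0xff 0x55 0x03 0x00 0x01 0x04 0xf8, the same byte sequence quoted above.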

Data received from the iPhone is parsed through a finite state machine that interprets the response and copies appropriate data into the relevant registers that are exposed from the module.
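
A byte-at-a-time parser for the framing format described above might look like the following Python sketch. It is not the FSM from our Verilog: the state names are hypothetical, the checksum check follows the unofficial specifications, and the parser simply returns the decoded fields where the hardware would copy them into registers.

```python
class FrameParser:
    """Consumes one received byte at a time and reports complete, valid frames."""

    def __init__(self):
        self.state = "HDR1"
        self.length = 0
        self.payload = []

    def feed(self, byte):
        """Return (mode, command, params) when a valid frame completes, else None."""
        if self.state == "HDR1":
            if byte == 0xFF:
                self.state = "HDR2"
        elif self.state == "HDR2":
            if byte == 0x55:
                self.state = "LEN"
            elif byte != 0xFF:
                self.state = "HDR1"
        elif self.state == "LEN":
            self.length, self.payload = byte, []
            # A frame needs at least a mode byte and two command bytes.
            self.state = "PAYLOAD" if byte >= 3 else "HDR1"
        elif self.state == "PAYLOAD":
            self.payload.append(byte)
            if len(self.payload) == self.length:
                self.state = "CSUM"
        elif self.state == "CSUM":
            self.state = "HDR1"
            # Assumed rule: length + payload + checksum == 0 (mod 256).
            if (self.length + sum(self.payload) + byte) & 0xFF == 0:
                mode = self.payload[0]
                command = (self.payload[1] << 8) | self.payload[2]  # big-endian
                return mode, command, bytes(self.payload[3:])
        return None
```

Frames with a bad checksum are silently dropped in this sketch, and the parser returns to hunting for the next 0xff 0x55 header.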

Pictures

Our FPGA, iPod and video camera setup.

Enjoying music passively in front of the camera.

Disgruntled neighbor wants some other song.

A close-up look at the metadata printed onto the screen.

The wand used for gesture detection. Expecto next track!

Code

A copy of our code can be obtained here. It is based off a previous ECE final project and may contain unrelated material.

Conclusion

We met all the objectives that we set out to accomplish. Although we could have made our lives simpler by designing the project in Nios-II, we decided to go the ‘hard’ way and implement this entirely in Verilog. Given the complexity of the project, we surprised ourselves by managing to build and test it in a relatively short time. Writing code at the Verilog level definitely gave us valuable insight into the low level hardware details of the DE2 board, and we felt a definite sense of (geek) engineering achievement when we delivered our product as promised.

References

1. (unofficial) Apple accessory protocol: link
2. RS232 code: link
3. 8x8 bitmap font set: link
4. Whack-a-mole project: link

Special thanks

1. Bruce Land for the awesome instruction both inside and outside of the classroom. Also, he has a lot of interesting anecdotes.
2. John Loomis for the RS232 controller. It was hard to find a good one online!
3. Adrian Game. We referenced your web material for the Apple accessory protocol even though there were other sources available.
4. OpenGameArt.org for the 8x8 bitmap font.
5. EnfigCarStereo.com for the iPod/iPhone cable.

About the authors

Mohit Modi is an M.Eng student at the ECE department. His interests are in reconfigurable computing and heterogeneous processor architectures. Click here for his LinkedIn profile and here for his personal website.

Zhiyuan Teo is a Ph.D candidate at the Computer Science department. His interests are in software-defined networking and rapid prototyping. Click here for his LinkedIn profile and here for his personal website.

Zhiyuan Teo (zt27@cornell.edu)

Mohit Yogesh Modi (mm2675@cornell.edu)