Cornell University ECE4760
Direct Memory Access computing machine
RP2040

DMA on RP2040

DMA uses memory controllers separate from the CPU to accelerate data movement between memory locations, or between peripherals and memory. The RP2040 has 12 DMA channels which can stream an aggregate of over 100 megabytes/sec, in many cases without affecting CPU performance. There are a huge number of options available to set up a DMA transfer. You can think of a DMA channel controller as a separate, programmable processor with the main job of moving data. Memory on the RP2040 is arranged as a bus matrix with separate memory bus control masters for each ARM core and for the DMA system, and several memory bus targets accessed by the masters. Each bus target can be accessed on each machine cycle.

Here we use the DMA subsystem to produce a complete computing system, independent of the main ARM CPUs. The DMA machine makes use of memory-copy ability, transport-triggered operations, and self-modifying code. The code consists of a sequence of DMA block descriptors stored in an array. The implemented operations are Turing Complete. About 8 million DMA blocks/second can be fetched/executed. There is a history of using only memory-moves to build a general CPU. In 2013 Stephen Dolan published x86 mov is Turing-Complete, describing an example of a one-opcode machine. The paper Run-DMA by Michael Rushanan and Stephen Checkoway shows how to do this with one version (Raspberry Pi 2) of ARM DMA. The DMA system on the RP2040 has more transport-triggered functions and is a little easier to build with. Joseph Primmer and I built a DMA processor using the Microchip PIC32 DMA system, where addition and branching had to be based on table-lookup. See DMA Weird machine.

The DMAcpu

The DMA machine is a fetch-execute cpu where the fetch function is done by one DMA channel, which loads DMA control block images from RAM into another (execute) DMA channel. The 'program' which is loaded consists of a carefully crafted series of DMA control blocks which together act as a general purpose computer. By using DMA1 blocks to modify following DMA1 control block images in the array, just before they are transferred to the hardware DMA1 control registers, we can perform addition, increments, conditional jumps, and/or/not logic operations, and any other operations required. The design is made easier by several transport-triggered actions in the DMA subsystem. These include an adder in the 'channel sniffer' and atomic SET/CLEAR/XOR write functions on all SFRs. The basic fetch/execute machine uses the channel DMA0 read address as a program counter. Every fetch that occurs leaves the read address pointing to the next block location. DMA0 reads the next block from the RAM array and copies it to the DMA1 channel hardware control registers, then chains to the newly loaded DMA1 channel. The DMA1 channel performs whatever data move is specified, then chains to DMA2. DMA2 resets the DMA0 write_address to point to the DMA1 control registers. Program branching is implemented by using DMA1 to load a new DMA0 read address to the DMA0 control registers. Writing a program of DMA blocks is very much like programming in some strange assembly language for a machine with one accumulator register and only memory-to-accumulator operations.
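
The fetch/execute cycle described above can be modeled on a host PC in plain C. This is only a sketch, not the RP2040 code: memory is a small word array, addresses are word indices rather than byte addresses, the ctrl word is ignored, and the names dmacpu_t, put_block, dmacpu_step and PC_ADDR are made up for illustration. The one idea it tries to show is that the program counter is just another writable location, so a jump is an ordinary move.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_WORDS   64
#define BLOCK_WORDS 4                 /* read addr, write addr, count, ctrl */
#define PC_ADDR     (MEM_WORDS - 1)   /* hypothetical stand-in for the DMA0 read address */

typedef struct {
    uint32_t mem[MEM_WORDS];          /* blocks, variables, and the PC share one memory */
} dmacpu_t;

/* Store one four-word block image at word index `at`. */
void put_block(dmacpu_t *m, uint32_t at, uint32_t rd, uint32_t wr, uint32_t count)
{
    assert(at + BLOCK_WORDS <= PC_ADDR);
    m->mem[at + 0] = rd;
    m->mem[at + 1] = wr;
    m->mem[at + 2] = count;
    m->mem[at + 3] = 0;               /* ctrl word is unused in this model */
}

/* One fetch/execute cycle. */
void dmacpu_step(dmacpu_t *m)
{
    uint32_t pc = m->mem[PC_ADDR];
    uint32_t exec[BLOCK_WORDS];       /* stands in for the DMA1 control registers */

    assert(pc + BLOCK_WORDS <= PC_ADDR);
    for (int i = 0; i < BLOCK_WORDS; i++)
        exec[i] = m->mem[pc + i];     /* fetch: "DMA0" copies the block image */
    m->mem[PC_ADDR] = pc + BLOCK_WORDS;   /* read address now points at the next block */

    assert(exec[0] < MEM_WORDS && exec[1] < MEM_WORDS);
    for (uint32_t n = 0; n < exec[2]; n++)
        m->mem[exec[1]] = m->mem[exec[0]];   /* execute: a plain move, no address increment */
}
```

In this model a block built with put_block(&m, 4, 42, PC_ADDR, 1) is an unconditional jump to whatever block index is stored in word 42, because the move overwrites the program counter after the fetch has already advanced it.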

The following diagram is an attempt to summarize all this madness.
Black arrows are data flow. Blue arrows represent chaining between channels.
The control block array is just an array of ints that are read by DMA0 in sets of four.

In addition to a straight copy operation, there are a few transport-triggered operations in the RP2040 DMA system which happen as a side effect of reading or writing a specific address:

The programming process has to map these unusual primitive operations into familiar mathematical and logical operations, and some form of conditional branch or jump. The sniffer add operation and the bitwise SFR operations mean we can directly implement these functions.
But remember that every basic operation is just a data move.

For add a sequence of DMA blocks could be:

  1. move one operand to the sniffer_data register
  2. move the other operand to the bit_bucket (discard) with sniffer enabled (this does the add)
  3. move sniffer_data register to the result address.

For shift-left we just do an ADD of a variable to itself (multiply by two).
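
The add and shift-left recipes can be checked with a small host-side model. The sketch assumes the sniffer in add mode simply accumulates each 32-bit word that passes through it, wrapping modulo 2^32. The type sniffer_t and the function names are hypothetical, chosen only to mirror the three steps above.

```c
#include <stdint.h>

typedef struct {
    uint32_t data;                    /* models the sniffer data register */
} sniffer_t;

/* Step 1: a plain move into the sniffer data register. */
void sniff_load(sniffer_t *s, uint32_t value) { s->data = value; }

/* Step 2: a move to the bit bucket with the sniffer enabled in add mode. */
void sniff_pass(sniffer_t *s, uint32_t value) { s->data += value; }

/* Step 3: a plain move out of the sniffer data register. */
uint32_t sniff_store(const sniffer_t *s) { return s->data; }

uint32_t dma_add(uint32_t a, uint32_t b)
{
    sniffer_t s;
    sniff_load(&s, a);
    sniff_pass(&s, b);
    return sniff_store(&s);
}

/* Shift left by one bit is an add of the variable to itself. */
uint32_t dma_shift_left(uint32_t a) { return dma_add(a, a); }
```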

For a logic operation (OR, AND, XOR, etc):

  1. move one operand to the sniffer_data register
  2. move the other operand to the sniffer SET, CLR, or XOR write address (e.g. DMA_SNIFF_DATA_CLR)
  3. move sniffer_data register to the result address.
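
A host-side sketch of the three write aliases follows. SET ORs the written word into the register, CLR clears every bit that is 1 in the written word, and XOR toggles those bits. OR and XOR fall out directly. The AND recipe shown (invert one operand, then CLR) is my assumption of one way to get AND from these primitives, not necessarily how the DMAasm and_sniff macro does it. All function names are hypothetical.

```c
#include <stdint.h>

/* Alias writes modeled on a plain variable standing in for sniff_data. */
void reg_set(uint32_t *reg, uint32_t mask) { *reg |= mask; }
void reg_clr(uint32_t *reg, uint32_t mask) { *reg &= ~mask; }
void reg_xor(uint32_t *reg, uint32_t mask) { *reg ^= mask; }

uint32_t dma_or(uint32_t a, uint32_t b)
{
    uint32_t sniff = a;               /* load first operand */
    reg_set(&sniff, b);               /* write second operand to the SET alias */
    return sniff;
}

uint32_t dma_xor(uint32_t a, uint32_t b)
{
    uint32_t sniff = a;
    reg_xor(&sniff, b);               /* write second operand to the XOR alias */
    return sniff;
}

/* Assumed recipe: a AND b == a with the bits of (NOT b) cleared. */
uint32_t dma_and(uint32_t a, uint32_t b)
{
    uint32_t not_b = b;
    reg_xor(&not_b, 0xFFFFFFFFu);     /* invert b using the XOR alias */
    uint32_t sniff = a;
    reg_clr(&sniff, not_b);           /* clear the bits that are 0 in b */
    return sniff;
}
```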

For subtract of (A-B) we have to explicitly compute the 2's complement negative of B:

  1. move B to the sniffer_data register
  2. move 0xFFFFFFFF to the XOR write address ( DMA_SNIFF_DATA_XOR) to invert bits
  3. move unity to the bit_bucket (discard) with sniffer enabled (this adds 1 to form the 2's complement)
  4. move the A operand to the bit_bucket (discard) with sniffer enabled (this does an add)
  5. move sniffer_data register to the result address.
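
The five subtract steps map one-to-one onto ordinary 32-bit arithmetic, which makes them easy to sanity-check on a host. In this sketch the local variable sniff stands in for the sniffer data register and dma_sub is a hypothetical name. The result wraps modulo 2^32, so a negative difference comes back in 2's complement form.

```c
#include <stdint.h>

uint32_t dma_sub(uint32_t a, uint32_t b)
{
    uint32_t sniff = b;               /* 1. load B */
    sniff ^= 0xFFFFFFFFu;             /* 2. XOR alias with all ones: invert the bits */
    sniff += 1u;                      /* 3. pass 1 through the adder: now holds -B */
    sniff += a;                       /* 4. pass A through the adder: now holds A-B */
    return sniff;                     /* 5. store the result */
}
```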

For a shift-right the process is much more annoying. A right-shift is a bit-reversed left-shift :

  1. move the variable to the sniffer_data register
  2. move the sniff_rev_mask to the DMA_SNIFF_CTRL_SET.
    This will cause a write from the sniff_data register to reverse the order of the bits in the word.
  3. move the sniff_data register to a temp_register (with bits reverse-order)
  4. move the temp_register back to sniff_data
  5. move the temp_register to the bit_bucket (discard) with sniffer enabled (doubling it; left-shift)
  6. move the sniff_data register to the result address (with bits reverse-order, restoring the correct order)
  7. move the sniff_rev_mask to the DMA_SNIFF_CTRL_CLR.
    This turns off the bit-reverse option
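
The claim that a right-shift is a bit-reversed left-shift can be verified with a short host-side model. Here bitrev32 stands in for the sniffer's reverse-on-read option and dma_lsr1 follows steps 1 to 6 above. Both names are hypothetical, and the result is a logical (zero-filling) one-bit shift.

```c
#include <stdint.h>

/* Reverse the order of all 32 bits. */
uint32_t bitrev32(uint32_t x)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}

uint32_t dma_lsr1(uint32_t x)
{
    uint32_t sniff = x;                   /* 1. load the variable */
    uint32_t temp = bitrev32(sniff);      /* 2-3. store to temp with bit reverse on */
    sniff = temp;                         /* 4. reload the reversed value */
    sniff += temp;                        /* 5. add it to itself: a left shift */
    return bitrev32(sniff);               /* 6. store with reverse on: order restored */
}
```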

An unconditional jump is easy.
One step: move the jump target address to the DMA0 hardware read address control word.

The hardest operation to get right is a conditional jump. Every jump condition (e.g. jump on negative number) must be converted into an absolute address and all data possibilities (e.g. positive, zero, negative) MUST JUMP! This is because the last step of setting up the conditional jump is to push data to the DMA0 hardware. This weird constraint means that jump conditions need to be converted to small integers representing block addresses. I will outline the jump-on-negative-number-in-variable scheme.

  1. move variable to be tested to the sniffer_data register with DMA byte-swap turned on.
    This moves the sign-bit to bit 7. Bits 4-6 will be the same as the sign bit, as long as the absolute value
    of the register is less than pow(2,28).
  2. move 0xFFFFFFeF to the CLR write address ( DMA_SNIFF_DATA_CLR) to isolate bit 4.
    (Or any bits from 4 to 7). The result will be zero for a positive number (or zero) and 16 for a negative number, if you chose bit 4.
  3. move the desired address ADDR of a jump for positive input to the bit_bucket (discard) with sniffer enabled.
    The result will be an address of either ADDR or ADDR+16, with 16 being the size of one block in the program array.
    Each of these addresses may contain an unconditional jump to anywhere else in the program.
  4. move the sniff_data register to the DMA0 hardware read address control word to force the actual jump to one of the two locations.
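
The address arithmetic behind the jump-on-negative scheme can be sketched on a host as follows. bswap32 models the DMA byte-swap option, the CLR mask 0xFFFFFFEF leaves only bit 4, and the result is ADDR for zero or positive input and ADDR+16 for negative input. The function names are hypothetical, and the sketch relies on the same restriction as the text: the magnitude of the tested value must be below pow(2,28).

```c
#include <stdint.h>

/* Reverse the order of the four bytes in a word. */
uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* Returns the block address the fetch channel would be pointed at. */
uint32_t jump_neg_target(int32_t value, uint32_t addr_not_neg)
{
    uint32_t sniff = bswap32((uint32_t)value);  /* 1. sign bit lands in bit 7 */
    sniff &= ~0xFFFFFFEFu;                      /* 2. CLR alias: only bit 4 survives (0 or 16) */
    sniff += addr_not_neg;                      /* 3. add the not-negative target address */
    return sniff;                               /* 4. this value goes to the DMA0 read address */
}
```

The block at addr_not_neg handles the zero or positive case and the block 16 bytes later handles the negative case. Each is typically an unconditional jump to the real destination.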

Simplifying Programming with DMAasm

It gets tedious to build programs by building DMA control structures. One solution is to package up one or more channel definitions into macros with better notation and more intuitive names. What I have come up with so far is in the table below. The latest program in the list below will have implementations. Each of the macros in the table below resolves to one or more sequential DMA control structures that make the DMAcpu look like a 1980's style, single accumulator, memory-to-ALU, architecture. The transport-triggered functions of the DMA sniffer register make up most of the ALU. Most of the DMAasm statements execute in about 125 nSec (8 million/sec), but a vector add or a multiply can be much slower, on the order of 100 adds/uSec. Multiplying by 256 therefore costs over 2 uSec. Shift-right by more than one bit uses the same add mechanism and can be slower.
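
To make the multiply cost concrete, here is a rough host-side sketch of multiplication by repeated addition together with a time estimate. It assumes the accumulator starts at zero and the operand is passed through the adder once per count, which may not match the actual macro implementation. The time estimate just applies the rough 100 adds/uSec figure quoted above. Both function names are hypothetical.

```c
#include <stdint.h>

/* Multiply by passing x through the adder `times` times (wraps modulo 2^32). */
uint32_t dma_mult(uint32_t x, uint32_t times)
{
    uint32_t sniff = 0;               /* assumed: accumulator cleared first */
    for (uint32_t i = 0; i < times; i++)
        sniff += x;
    return sniff;
}

/* Very rough execution time in microseconds at about 100 adds per microsecond. */
double mult_time_usec(uint32_t times)
{
    return times / 100.0;
}
```

With these assumptions a multiply by 20 costs roughly 0.2 uSec and a multiply by 256 roughly 2.6 uSec, which is why the macros are only recommended for relatively small multipliers.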

Each DMAcpu macro is listed below, followed by its use and function.

nop()
    You guessed it! Skips one fetch/execute cycle.
gpio_out(gpio_num, state)
    Sets or clears one gpio pin given by gpio_num.
    The state is a variable which evaluates to a valid control word, e.g.
    int pin_hi = 0x3300 ;
    int pin_lo = 0x3200 ;
load_sniff(var_name)
    Moves one 32-bit value from memory to the sniff_data register.
store_sniff(var_name)
    Stores one 32-bit value back to memory from sniff_data.
store_sniff_bitrev(var_name)
    The sniff_data register has the ability to bit-reverse a number as it is being stored.
    This is occasionally useful, e.g. for the right-shift function.
move(source, destination, count, width, source_inc, dest_inc)
    Move count items of bit-width width from source to destination.
    width must be DMA_SIZE_32 or DMA_SIZE_16 or DMA_SIZE_8.
    source_inc must have value var_read or array_read.
    dest_inc must have value var_write or array_write.
    If you move an array the source should be array_name[0].
move_bswap(source, destination, count, width)
    Move count items of bit-width width from source to destination, reversing the order of bytes.
    width must be DMA_SIZE_32 or DMA_SIZE_16 or DMA_SIZE_8.
    If you move an array the source should be array_name[0].
set_sniff_CRC32()
    Sets the sniffer to perform CRC32 on data passing through it, using the current sniffer_data value as seed.
set_sniff_add()
    Sets the sniffer to perform a cumulative add on data passing through it.
crc32_sniff(var_name, count)
    Do the actual CRC32 operation, taking count data items from var_name.
    The sniffer must be set to CRC32.
add_sniff(var_name, count, source_inc)
    Do the actual cumulative add operation, taking count data items from var_name. This is a vector add.
    The sniffer must be set to add. If the variable is an array, use array_name[0] as the source.
    source_inc must have value var_read or array_read.
neg_sniff()
    Take the 2's complement of sniff_data.
inv_sniff()
    Take the logical not of each bit of sniff_data.
clr_sniff(mask)
set_sniff(mask)
xor_sniff(mask)
    Low-level SFR direct write functions in which the mask indicates which bits to change.
or_sniff(var_name)
    Logical OR of var_name into the sniffer.
and_sniff(var_name)
    Logical AND of var_name into the sniffer.
shift_left_sniff()
    One bit shift left.
logical_shift_right_sniff()
logical_shift_right4_sniff()
logical_shift_right8_sniff()
    One, 4, or 8 bit shift right - logical.
arith_shift_right_sniff()
    One bit shift right - arithmetic.
mult_sniff(constant)
    Multiply by a LITERAL constant.
    Multiply ONLY by relatively small, CONSTANT, positive ints, e.g. 4 or 20, but NOT 1000000.
mult_sniff_var(var_name)
    Multiply by a variable, but remember that this is done by repeated addition and is slower for bigger numbers.
jump(block_addr)
    Unconditional jump to block_addr, which must be an int* variable assigned to an address, usually using something like
    label(extra_pulse_addr)
    where the label macro sets the address of the following block.
jump_neg(block_addr_neg, block_addr_not_neg)
    Conditional jump.
    Jump to two different locations on EITHER negative or positive/zero.
    The argument addresses MUST be int* variables with block addresses.
pacer(dreq)
    Execution pacer. This just waits for the next DREQ event.
    If the event is a DMA clock tick then the timer must be one of DREQ_DMA_TIMER0, DREQ_DMA_TIMER1, DREQ_DMA_TIMER2, or DREQ_DMA_TIMER3,
    and the timer must be set up.
label(label_name)
    The label_name variable is assigned the address of the current block. Used for specifying jump and function addresses. The variable must be an int*.
jump_link(function_addr)
    Jump to the specified address, and store a return address in the link register.
jump_return()
    Return from the function to the address in the link register.
ptr_to_value_sniff()
    This macro is used to dereference a pointer contained in the sniff_data register and return the value to sniff_data. Useful for converting array addresses into data.

 


The program descriptions further down are in blog-style reverse time order, newest stuff at the top.
The numbered program list immediately below is in chronological order.

  1. Test program to validate basic execution model and test GPIO output, add, OR operation, conditional branch, and unconditional jump. (23dec2022)
  2. Direct Digital Synthesis is used to test timer-regulated execution speed, SPI output, and combining the DMA channel byte-swap function and CLR-masking to isolate the top byte of the 32-bit accumulator to use as an index into a sine-table. Insertion of the pointer to the sine table requires self-modifying code. Performance is good enough to use for audio synthesis rates. (28dec2022)
  3. Updated test program which implements add, subtract, shift-left, shift-right, and a couple of different ways of generating a conditional jump. (2jan2023)
  4. Refactored and generalized version. DMA channel dependencies are cleaned up for compatibility with other software (e.g. VGA generation). The fetch/execute architecture is separated from the DMA program definition. (3jan2023)
  5. Use the DMAcpu machine to read the ROSC random-bit, shift it into the sniffer, then use the result to compute a CRC32 value using the sniffer hardware, and then output that to an SPI channel to make audio white noise. (7jan2023)
  6. The white noise generator was low-pass filtered using a 1-pole IIR, mostly just to see if the DMAcpu could do the arithmetic. It took a 38 step program about 4 uSec to compute a low-passed sample. (11jan2023)
  7. Merging the DMAcpu with VGA generation. Since both use the DMA system heavily, a test was necessary to see if either one broke when merged. Video also gave a way to visually test the random number generation quality. (11jan2023)
  8. Refining the DMAcpu random number generator and simulating Diffusion-Limited Aggregation (DLA). While testing the DLA, I noticed that there is some serial correlation in the DMAcpu random number generation. This code eliminates the correlation. (13jan2023)
  9. DMAasm test. This packages up the block definitions necessary to define a DMA program into a form that may be easier to read. (23jan2023)
  10. DMAasm test with multiply. (1feb2023)
  11. DDS with AM modulation to make a sine wave burst. (3feb2023)
  12. Noise generation using hardware CRC32, mixed with tone and AM modulated. (10feb2023)

DMAasm noise+tone burst. (10feb2023)
The DMA program generates noise samples and DDS samples at a 40 KHz sample rate, sums them, then AM modulates the sum into a burst. The main C program sets up the DMA program, generates a seed for the DMA random number generator, then drops into a serial thread that prompts for sine wave frequency, burst duration, burst repeat time, and relative noise amplitude. The images show a 400 Hz sine wave with medium noise, and just noise with no sine wave.

Code, ZIP


DMAasm tone-burst using DDS and amplitude envelope. (3feb2023)
This example tests many features of the DMAasm, while meeting a strict timing deadline at a 40 KHz sample rate. The 88 DMA blocks generated by the assembler execute in between 5 and 13 uSec, well within the deadline time of 25 uSec. The tone bursts sound clean and glitch-free. Three code snippets show how the DMAasm is written. The first snippet is the DDS algorithm which increments the phase accumulator and gets a sine sample from a table. The second shows how the amplitude envelope is computed. The envelope is just a linear rise, constant, then linear fall. The third snippet shows the SPI output and update of the burst timer. Two images show the DMAasm timing and a 1 KHz burst. The sample compute duration (pulse-high) is approximately the longest sample execution time when the inefficient multiplies dominate the timing. There is exactly 25 uSec between pulses, set by the DMA pacing timer3. You can see the small steps in the output waveform as the DAC updates the voltage at the end of each pulse. It appears that the settling time of the DAC is <1 uSec.

Code, ZIP


DMAasm -- test of multiply, looping, subroutines (1feb2023)
The program just asks for two test values and a loop increment maximum. It produces a pulse train and some printed numerical results. The pulse train depends on the loop maximum and the sign of the first test value.

Code, ZIP


DMAasm -- Using mnemonic macros to make DMA programming a little easier (23jan2023)
After writing a few programs, the most useful repeated constructions became more obvious and suggested that it should be possible to make easier-to-use macros which generate the DMAcpu program consisting of a list of blocks. As always, moving to a higher level means a little inefficiency, but it seems to be worth it. Each of the macros in the table above generates one or more sequential DMA control structures that make the DMAcpu look like a 1980's style, single accumulator, memory-to-ALU, architecture. The transport-triggered functions of the DMA sniffer register make up most of the ALU. The test program just exercises each of the DMAasm instructions, uses a serial thread to enter two test values and print results, and displays a normal distribution on the VGA to test the CRC32 code. GPIO2 is toggled. The pacing timer is set to 50 KHz. Entering a negative number at the serial command line for the first test value will exercise the conditional jump. There is support for a single level of subroutine call (no stack).

Code, ZIP


DMAcpu and DLA, with refined random number generation. (13jan2023)
DLA runs for hundreds of millions of random number evaluations, rather than the hundreds of thousands used to generate the distributions in projects below. The older random generator produced a slightly biased DLA, so I wrote a test program that just plotted sequential rands as points in 2D. A clear, but rare, diagonal line was produced. Introducing a slight delay decorrelated the ROSC bit, but slows down the random number generator to about one uSec per value. This speed is about the same as the C rand() function, but it is a true (as opposed to pseudorandom) random number generator. The two images use the improved DMAcpu random function. The left image has a one-pixel seed in the center of the screen. The image on the right has the text "ECE 4760" as a seed.

DLA code, ZIP (also correlation test code, random distributions code)

DISCLAIMER! The ROSC is not shown by the manufacturer or by me to have any reliable level of random generation. Further, of the three rp2040's I have tested, each gives a somewhat different oscillator speed, and different distributions of bits. Do not use this for any critical project without doing your own tests!


DMAcpu and 256 color VGA. -- Distribution testing. (11jan2023)
Test programs from the random number generation page were converted to use the DMAcpu-generated random numbers. The ROSC ring oscillator rnd_reg is used to drive a CRC32 in the DMAcpu. The tests chosen were a 20-coin toss binomial distribution and summing several uniform random numbers to form a normal distribution. The serial interface allows the user to choose normal/binomial, a scale factor, and (for normal) the number of uniform numbers to be added to make one approximate normal sample. The images below show a binomial distribution with 1.3 million total events, with each event being the number of heads from 20 coin tosses. The normal distribution is built from 3 million events each consisting of the sum of 12 uniformly distributed numbers. The red dot and blue dots are the expected distributions. There may be slightly too few samples near the peak in the normal distribution. The DMAcpu program is just four blocks which perform a CRC operation on the ROSC random bit and the last CRC result, store the result, signal the thread that a new value is ready, then jump back to the beginning block.

DISCLAIMER! The ROSC is not shown by the manufacturer or by me to have any reliable level of random generation. Further, of the three rp2040's I have tested, each gives a somewhat different oscillator speed, and different distributions of bits. Do not use this for any critical project without doing your own tests!

Code, ZIP


Filtering white noise using DMAcpu. (11jan2023)
The white noise generator was low-pass filtered using a 1-pole IIR, mostly just to see if the DMAcpu could do the arithmetic. I don't think this is a practical use for the DMAcpu, but it did test a lot of functions. The filter implemented is a simple one-pole IIR filter with a filter coefficient set by right shifting:
output = old_output + [(input - old_output) >> n]
It took a 38 DMA block program about 4 uSec to compute a low-passed sample. Both signals are sent to the DAC with the filtered output on channel A and unfiltered on channel B. Since my right-shift function only works with positive numbers, the actual function computed was:
output = old_output + (input>>n) - (old_output >> n)
Shown below are time domain and spectra waveforms for an n=4.
Top trace is the unfiltered noise in both the time domain, and the magenta spectra on the right.
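
The rearranged filter equation is easy to try on a host. The sketch below assumes unsigned (positive-only) samples, matching the restriction on the right-shift function, and a shift count n below 32. The names iir_step and iir_run are hypothetical.

```c
#include <stdint.h>

/* One filter update: output = old_output + (input >> n) - (old_output >> n). */
uint32_t iir_step(uint32_t old_output, uint32_t input, unsigned n)
{
    return old_output + (input >> n) - (old_output >> n);
}

/* Apply a constant input for `steps` samples, starting from zero output. */
uint32_t iir_run(uint32_t input, unsigned n, unsigned steps)
{
    uint32_t y = 0;
    for (unsigned i = 0; i < steps; i++)
        y = iir_step(y, input, n);
    return y;
}
```

With n=4 each step moves the output about 1/16 of the way toward the input, so a step input settles after a few tens of samples. Shifting each term separately discards a few low bits compared with shifting the difference, which is the price of avoiding negative intermediate values.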

Code, ZIP


Generating white noise from the DMAcpu. (7jan2023)
The program reads the ROSC random-bit and shifts it into the sniffer data register. This shift register is then used as a seed for CRC32 hardware computation and the resulting scrambled bits are truncated to 12-bits for the SPI DAC to produce good sounding white noise. The machine runs at a sample rate of 50 KHz in the example code, but will run as fast as 500 KHz. The image below shows the spectrum at a sample rate of 100 KHz. The spectrum is down about 6 dB at 50 KHz, but flat through the audio spectrum. The time to generate a new audio sample is about 0.9 uSec. (The timing code in the linked program, but not shown below, adds 0.3 uSec.)

The code shows the simplified block syntax.
Note that the sniffer function alternates between add and CRC32.
Also, sending a value through the sniffer twice, in add mode, doubles it (shift-left)
// dma_sniffer_ set to add: add function code is 0xf in the calc field
build_block(&sniff_calc_mask, DMA_SNIFF_CTRL_SET, 1, STANDARD_CTRL);
// load a random bit from ROSC to sniff data reg: dma_hw->sniff_data
build_block(rnd_reg, &dma_hw->sniff_data, 1, STANDARD_CTRL);
// pass shift-var thru the sniffer twice to the bit_bucket
build_block(&dma_noise_temp, &bit_bucket, 2, STANDARD_CTRL | SNIFF_EN) ;
// store back to shift-var
build_block( &dma_hw->sniff_data, &dma_noise_temp, 1, STANDARD_CTRL);
// dma_sniffer_ set to CRC32: CRC32 function code is 0x0 in the calc field
build_block(&sniff_calc_mask, DMA_SNIFF_CTRL_CLR, 1, STANDARD_CTRL);
// compute CRC32
build_block(&dma_noise_temp, &bit_bucket, 1, STANDARD_CTRL | SNIFF_EN) ;
// dma_sniffer_ set to add
build_block(&sniff_calc_mask, DMA_SNIFF_CTRL_SET, 1, STANDARD_CTRL);
// limit to 12 bit data: mask value is 0xffff000
build_block( &sniff_dac_data_mask, DMA_SNIFF_DATA_CLR, 1, STANDARD_CTRL);
// OR in the DAC control word
build_block( &dac_config_mask, DMA_SNIFF_DATA_SET, 1, STANDARD_CTRL);
// send to DAC
build_block(&dma_hw->sniff_data, &spi0_hw->dr, 1, (DMA_CHAIN_TO(fix_chan) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_16) | DMA_IRQ_QUIET | DMA_EN));
// unconditional jump to start of program
// push the DMA_blocks[0] address into the program counter (fetch channel read pointer)
// !!NOTE that this block throttles the machine to the frequency of Timer 3 !!
// To run at full DMA speed, change to DMA_TREQ(DREQ_FORCE)
build_block(&DMA_blocks_addr, &dma_hw->ch[fetch_chan].read_addr, 1, DMA_CHAIN_TO(fix_chan) | DMA_TREQ( DREQ_DMA_TIMER3) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_IRQ_QUIET | DMA_EN) ;

Code, ZIP


Improved organization of the test program (3jan2023)
The explicit DMA channel numbers were replaced with macros defining the actual channels so that the machine can be more easily used with other DMA-based protocols. The fetch/execute architecture is separated from the DMA program definition to make it easier to modify for new DMA programs. A macro is added to make the DMA channel control register specification more compact and easier to read.

Code, ZIP


More operations and better conditional jumps (2jan2023)
The program asks the user for two numerical values then computes the sum, difference, shifted values and the sign of the first input value. Most of the explanation of this program is above, describing how each operation is decomposed into a series of data-moves.

Test program, project ZIP


An application: Direct Digital Synthesis (12/28/2022)
DDS is an attractive example because it requires fast, precise timing, but is essentially a pointer increment followed by a table-lookup, then a 16-bit SPI load. In other words, mostly data motion. The DMA program takes the form of a linear set of instructions with no branching, just a loop back to the beginning of the program. Rather than running the DMAcpu at full speed, the loop back block is paced by one of the high precision DMA timers set to a fixed frequency. The program ran correctly up to the limit of the SPI DAC chip, 500,000 samples/sec, but for audio synthesis I used a lower frequency pace of 200,000 loops(samples)/sec. At that rate, the DMA machine ran about 25% of the time. The generated frequency matched the math to within the accuracy of my scope (about 0.1%).
Algorithm:

  1. dds_accum += dds_inc (32 bits) where dds_accum is the DDS phase accumulator
    and dds_inc is incremental speed of rotation of the phasor (proportional to the frequency)
    where: dds_inc = Fout * pow(2,32 )/ Fs ; with Fs = 2e5 and Fout the desired sinewave frequency
  2. The high byte of dds_accum becomes the index into sine table:
    Use DMA BSWAP to move it to low byte of sniffer data register
    clear upper bytes using the transport triggered CLR write with mask 0xffffff00
    Multiply by 2 to convert index into a byte-count of short ints
  3. add byte-count to the base address of the sine_table to form a pointer to the next entry
  4. Shove the pointer just computed into the NEXT BLOCK read address, so it can copy the table value to the SPI channel
  5. Do a 2-byte transfer from sine table to SPI_data register.
    The SPI transfer takes about 0.8 uSec, but if the pacing timer is set to 200 KHz no wait is necessary here.
  6. Stall waiting for pacing timer, then jump back to step 1
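
The arithmetic in steps 1 to 3 can be sketched on a host as follows. dds_increment evaluates the dds_inc formula (it assumes Fout is below Fs so the result fits in 32 bits), and sine_table_offset reproduces the byte-swap, CLR-mask and doubling sequence that turns the top byte of the accumulator into a byte offset into a table of 256 short ints. The function names are hypothetical.

```c
#include <stdint.h>

/* dds_inc = Fout * 2^32 / Fs, truncated to an integer. */
uint32_t dds_increment(double f_out, double f_s)
{
    return (uint32_t)(f_out * 4294967296.0 / f_s);
}

/* Reverse the order of the four bytes in a word (models DMA BSWAP). */
uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* Byte offset into the sine table for the current phase accumulator. */
uint32_t sine_table_offset(uint32_t dds_accum)
{
    uint32_t sniff = bswap32(dds_accum);  /* high byte moves to the low byte */
    sniff &= ~0xFFFFFF00u;                /* CLR alias with mask 0xffffff00 */
    sniff += sniff;                       /* times 2: index to byte count of shorts */
    return sniff;
}
```

Adding the returned offset to the base address of the sine table gives the pointer that step 4 pushes into the next block's read address.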

The DMA-machine program:

  1. Send a timing pulse to GPIO2. The length of the pulse will be the execution time of the loop.
    build_block(&pin_on, &iobank0_hw->io[2].ctrl, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;

  2. ADD increment to the accumulator. This is the phasor used to look up a sine value
    // === add dds_accum and dds_inc by transport-triggered operation in sniff reg
    build_block(&dds_accum, &dma_hw->sniff_data, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;
    // pass dds_inc thru the sniffer to the bit_bucket
    build_block(&dds_inc, &bit_bucket, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN | SNIFF_EN) ;
    // store the sniff data reg back to dds_accum
    build_block( &dma_hw->sniff_data, &dds_accum, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;


  3. Form pointer to next sine-table entry from the accumulator
    // load dds_accum to sniffer BUT byte reversed! see BSWAP
    build_block(&dds_accum, &dma_hw->sniff_data, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN | BSWAP) ;
    // clear high bytes -- leave low byte alone -- the clear_high_bytes mask is 0xffffff00
    build_block(&clear_high_bytes, DMA_SNIFF_DATA_CLR, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN ) ;
    // mult by 2 for 'short' array pointer by adding sniffer to itself
    build_block( &dma_hw->sniff_data, &bit_bucket, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN | SNIFF_EN) ;
    // add to sine table base address
    build_block(&sine_table_addr, &bit_bucket, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN | SNIFF_EN) ;


  4. Move the just-formed sine table pointer into the NEXT BLOCK read address
    build_block(&dma_hw->sniff_data, next_block_addr, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN ) ;


  5. Move sine table entry to SPI connected to DAC-- spi0_hw->dr
    // NOTE that the read address is just a place-holder for the previous block to overwrite.
    // NOTE that the SPI CS line is driven automatically by the write to the SPI data reg
    build_block(sine_table_addr, &spi0_hw->dr, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_16) | DMA_EN ) ;


  6. Clear the timing GPIO pin
    build_block(&pin_off, &iobank0_hw->io[2].ctrl, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;


  7. Jump back to the beginning, but wait for the DMA pacing timer.
    // push the DMA_blocks[0] address into the program counter (DMA0 read pointer)
    // !!NOTE that this block throttles the machine to the frequency of Timer 3 !!
    // To run at full DMA speed, change TREQ to DREQ_FORCE
    build_block(&DMA_blocks_addr, &dma_hw->ch[0].read_addr, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_DMA_TIMER3) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_IRQ_QUIET | DMA_EN) ;

DDS program, ZIP


First DMA test program. (12/23/2022)
This program just toggles an i/o pin to trigger an oscilloscope, then runs through basic proof-of-concept constructions. This program performs several operations using only DMA-logic. The DMA machine is completely asynchronous and independent of MAIN, once started. MAIN sets up the DMA machine program, defines variables for the machine, then prints the results of an ADD and OR operation on the serial console. No other microcontroller resources are needed (except memory, of course) to make the machine run. The execution speed is about 8 million blocks/sec. To make life easier I defined a macro to insert DMA control blocks into the array defining the program.
build_block(read_addr, write_addr, count, ctrl)
builds a DMA control block according to the specs in the data sheet. Remember that control blocks are pulled one-at-a-time from the array, placed in the DMA1 hardware registers by DMA0, then triggered to perform the desired data move. After the move, DMA1 chains to DMA2 to reset the DMA0 write address to point to the DMA1 control registers.

The DMA-machine program:

  1. Sends two trigger pulses to GPIO2, for an oscilloscope, and to time the machine.
    // === set the GPIO2 pin by transferring a control word directly to the pad control register.
    The parameters on line 2 configure the DMA control register so that the channel runs as soon
    as possible, with a width of 32 bits, no increments, and chaining to channel DMA2 when done.
    The word transferred is 0x3300 which enables output and writes a 1.
    build_block(&pin_on, &iobank0_hw->io[2].ctrl, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;

    // === clear the pin
    The word transferred is 0x3200 which enables output and writes a 0.
    build_block(&pin_off, &iobank0_hw->io[2].ctrl, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;

    // === set pin -- repeat
    build_block(&pin_on, &iobank0_hw->io[2].ctrl, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;

    // === clear the pin -- repeat
    build_block(&pin_off, &iobank0_hw->io[2].ctrl, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;

  2. Adds two 32-bit variables and stores the result back to a variable.
    Three DMA1 blocks:
    // === add two variables by transport-triggered operation in sniff reg
    // assumes: dma_sniffer_enable(1, sniffer_add, true);
    // === load a var to sniff data reg: dma_hw->sniff_data
    build_block(&dma_var_1, &dma_hw->sniff_data, 1, DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;
    // == pass another var through the sniffer to the bit_bucket; data has to pass through for the add to work.
    // the bit_bucket is just a dummy variable to discard transferred data. The add occurs as the data passes
    // through the sniffer.
    // result is in the sniff data register
    // notice that SNIFF_EN is set to turn on the add function for one block
    build_block(&dma_var_2, &bit_bucket, 1, DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN | SNIFF_EN) ;
    // == store the sniff data reg back to var_2
    build_block( &dma_hw->sniff_data, &dma_var_2, 1, DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;
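
The three-block add can be modeled on a desktop to check the data flow. This is only a sketch, assuming the sniffer is in its sum mode and adds every word moved by a block that has sniffing enabled. `sniff_data`, `dma_block`, and `dma_add` are stand-in names for this model, not hardware or SDK identifiers.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t sniff_data;   /* stands in for dma_hw->sniff_data */

/* One block with no address increments: move *src to *dst `count` times.
   With sniff_en set, each word moved is also added to sniff_data. */
static void dma_block(const uint32_t *src, uint32_t *dst,
                      uint32_t count, int sniff_en)
{
    assert(src && dst);
    while (count--) {
        uint32_t word = *src;
        *dst = word;
        if (sniff_en) sniff_data += word;
    }
}

/* The three-block add: var_2 = var_1 + var_2 (modulo 2^32). */
static void dma_add(const uint32_t *var_1, uint32_t *var_2)
{
    uint32_t bit_bucket;                     /* discarded destination   */
    dma_block(var_1, &sniff_data, 1, 0);     /* load var_1 into sniffer */
    dma_block(var_2, &bit_bucket, 1, 1);     /* pass var_2 through, add */
    dma_block(&sniff_data, var_2, 1, 0);     /* store the sum           */
    (void)bit_bucket;
}
```

The first block overwrites the sniff data register rather than adding to it, so no separate clear is needed before the add.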

  3. Computes the OR of two 32-bit variables and stores the result
    These operations use transport-triggered operations built into SFR to implement logic.
    Three DMA1 blocks:
    // === OR two variables by transport-triggered operation in sniff reg
    // == load a var to sniff data reg: dma_hw->sniff_data
    build_block(&dma_var_3, &dma_hw->sniff_data, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;
    // == load another var to the SET reg. EVERY SFR has SET, CLR, XOR

    build_block(&dma_var_4, DMA_SNIFF_DATA_SET, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;

    // == store the sniff data reg back to var_5
    build_block( &dma_hw->sniff_data, &dma_var_5, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;
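
RP2040 registers have atomic alias addresses at offsets 0x1000 (XOR), 0x2000 (SET), and 0x3000 (CLR) from the normal address, so `DMA_SNIFF_DATA_SET` is presumably the sniff data register address plus 0x2000. The sketch below models what a write through each alias does to the register value. `alias_write`, `dma_or`, and the `ALIAS_*` names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Atomic alias offsets from the normal register address. */
#define ALIAS_NONE 0x0000u
#define ALIAS_XOR  0x1000u
#define ALIAS_SET  0x2000u
#define ALIAS_CLR  0x3000u

/* New register value after writing `data` through the given alias. */
static uint32_t alias_write(uint32_t reg, uint32_t alias, uint32_t data)
{
    switch (alias) {
    case ALIAS_XOR: return reg ^ data;
    case ALIAS_SET: return reg | data;
    case ALIAS_CLR: return reg & ~data;
    default:        return data;     /* plain write replaces the value */
    }
}

/* The three-block OR: load a, write b to the SET alias, read the result. */
static uint32_t dma_or(uint32_t a, uint32_t b)
{
    uint32_t sniff = alias_write(0, ALIAS_NONE, a);   /* block 1 */
    sniff = alias_write(sniff, ALIAS_SET, b);         /* block 2 */
    return sniff;                                     /* block 3 stores it */
}

static void dma_or_example(void)
{
    assert(dma_or(0x0F0Fu, 0x00FFu) == 0x0FFFu);
}
```

In this model, swapping the SET alias for XOR or CLR in the second block would give XOR and AND-NOT with the same three-block pattern.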


  4. Multiplies a variable by a constant.
    Three blocks:
    // === mult a variable by a constant in sniff reg
    // in this case, times 4
    // by replacing the '4' with a variable, you can do general mult
    // == clear sniff data reg: dma_hw->sniff_data (without the clear you get a MAC operation)
    build_block(&dma_var_0, &dma_hw->sniff_data, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;

    // == pass the var thru the sniffer to the bit_bucket 4 times. note SNIFF_EN is on
    build_block(&dma_var_6, &bit_bucket, 4,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN | SNIFF_EN) ;

    // == store the sniff data reg back to var_7
    build_block( &dma_hw->sniff_data, &dma_var_7, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;
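
The multiply works because the read address does not increment: a transfer count of 4 reads the same variable four times, and the sniffer sums each word. A minimal arithmetic model follows, with hypothetical function names, assuming the sniffer is in sum mode:

```c
#include <assert.h>
#include <stdint.h>

/* Sniffer in sum mode: the same source word is read `count` times
   (no read increment) and added to the running sniff value. */
static uint32_t sniff_accumulate(uint32_t sniff_data, uint32_t var, uint32_t count)
{
    while (count--) sniff_data += var;
    return sniff_data;
}

/* Clear first, then accumulate: var * count (modulo 2^32). */
static uint32_t dma_mult(uint32_t var, uint32_t count)
{
    return sniff_accumulate(0, var, count);
}

static void dma_mult_example(void)
{
    assert(dma_mult(7, 4) == 28);               /* times 4              */
    assert(sniff_accumulate(10, 7, 4) == 38);   /* no clear: 10 + 7*4   */
}
```

Skipping the clear block leaves the previous sniff value in place, which is the multiply-accumulate case. A general multiply would presumably have an earlier block write the multiplier into the count word of the multiply block image before it is fetched.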


  5. Computes a conditional skip based on user input from a thread.
    Three DMA blocks to compute branch:
    // === conditional skip
    // the dma_flag variable can take only values 0, 16, 32, or 48 as set by user thread
    // these numbers correspond to jumping 0, 1, 2, or 3 blocks ahead.
    // == read flag to sniffer
    build_block(&dma_flag, &dma_hw->sniff_data, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;

    // == form the target block address by adding the jump-on-zero address; jump_zero_addr = block_addr(16) ;
    // I had to count the blocks to find that the next one AFTER the DMA0 load is block number 16.
    build_block( &jump_zero_addr, &bit_bucket, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN | SNIFF_EN) ;

    // move sniffer data to the read addr of DMA0 to force the next read from the new location
    // sniffer contains jump_zero_addr + dma_flag (the byte offset to the chosen target)
    // == push the new block address to DMA0 block
    build_block(&dma_hw->sniff_data, &dma_hw->ch[0].read_addr, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_IRQ_QUIET | DMA_EN) ;


  6. The skip targets are in steps of 16 bytes/DMA block stored in the array.
    The targets just change the length of a pulse on GPIO2.

    // === TARGET if dma_flag == 0 -- THIS is block number 16 in the program list
    // == set pin
    build_block(&pin_on, &iobank0_hw->io[2].ctrl, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;


    // TARGET if dma_flag == 16
    // == set pin
    build_block(&pin_on, &iobank0_hw->io[2].ctrl, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;


    // TARGET if dma_flag== 32
    // === clear the pin
    build_block(&pin_off, &iobank0_hw->io[2].ctrl, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;


    // === TARGET if dma_flag == 48
    // === unconditional jump to start of program (#1 in this list)
    // push the DMA_blocks[0] address into the program counter (DMA0 read pointer)
    build_block(&DMA_blocks_addr, &dma_hw->ch[0].read_addr, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_IRQ_QUIET | DMA_EN) ;
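
The skip arithmetic can be checked with a small model: the sniffer holds `dma_flag`, the address of the flag==0 target is added to it, and the sum becomes the DMA0 read address (the program counter). The two-argument `block_address`, `skip_target`, `next_block`, and the base address used in the example are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_BYTES 16u   /* four 32-bit words per block image */

/* Address of block n, given the address of block 0. */
static uint32_t block_address(uint32_t base, uint32_t n)
{
    return base + n * BLOCK_BYTES;
}

/* Value written to the DMA0 read address by the three skip blocks. */
static uint32_t skip_target(uint32_t base, uint32_t zero_block, uint32_t dma_flag)
{
    uint32_t sniff = dma_flag;                   /* block 1: load flag    */
    sniff += block_address(base, zero_block);    /* block 2: add via sniff */
    return sniff;                                /* block 3: to read_addr */
}

/* Index of the block that DMA0 fetches next. */
static uint32_t next_block(uint32_t base, uint32_t zero_block, uint32_t dma_flag)
{
    return (skip_target(base, zero_block, dma_flag) - base) / BLOCK_BYTES;
}

static void skip_example(void)
{
    uint32_t base = 0x20001000u;                 /* made-up array address */
    assert(next_block(base, 16, 0)  == 16);      /* set, set, clear, jump */
    assert(next_block(base, 16, 16) == 17);      /* set, clear, jump      */
    assert(next_block(base, 16, 32) == 18);      /* clear, jump           */
    assert(next_block(base, 16, 48) == 19);      /* jump only             */
}
```

Since the four targets fall through in sequence, a smaller flag executes more pin-set blocks before the clear, which is what changes the pulse length on GPIO2.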

Test code, ZIP

Not current below this line!

The DMA test program performs several operations using only DMA logic. The DMA machine is completely asynchronous and independent of MAIN, once started. MAIN sets up the DMA machine program, defines tables for the machine, then prints the results of an ADD and NOR operation on the serial console. No other microcontroller resources are needed (except memory, of course) to make the machine run. The execution speed is about 200,000 blocks/sec.
The DMA weird-machine program:

  1. Sends a trigger pulse to PortA, for an oscilloscope
    One DMA2 block to transfer one byte to LATA.
  2. Increments a variable to be used in the arithmetic below
    Two DMA2 blocks:
    -- move variable value to low byte of source address field (increment array) of next DMA block.
    -- move value in source address to variable. Contents of increment array is (source addr low-byte value)+1.
  3. Adds two 8-bit variables and stores the result (terminal image showing sum, modulo 256, and result of NOR operation below)
    Four DMA2 blocks:
    -- move variable_1 value to low byte of source address field (increment array) of DMA block 3 blocks later.
    -- move variable_2 value to low byte of source length field (increment array) of DMA block 2 blocks later.
    -- move variable_2 value to low byte of cell length field (increment array) of DMA block 1 block later.
    -- increment through the table specified by the previous three blocks and store result into variable_4.
  4. Computes the NOR of the two 8-bit variables and stores the result
    These operations use transport-triggered operations built into SFR to implement logic.
    Four DMA2 blocks:
    -- move variable1 value to an SFR that supports (CLEAR, SET, INVERT) write locations.
    -- move variable2 value to the SFR SET write location.
    -- move 0xff value to the SFR INVERT write location.
    -- move NOR value in the SFR to variable_3
  5. Sets a print strobe, to be cleared by MAIN when the variables are printed.
    This is necessary because the DMA machine is completely asynchronous with the CPU
    -- One DMA2 block to move a 0x01 to the print strobe, which is cleared by the CPU in MAIN.
  6. Computes a conditional branch to see if the print strobe is cleared, and loop until it is cleared.
    Five DMA blocks to compute branch:
    -- move print_strobe to low byte of source address field (offset array) of next DMA block.
    This effectively multiplies the logical 0/1 to 0/4 because the jump address is 4 bytes
    -- move the offset to low byte of source address field (jump array) of next DMA block.
    This will select the jump address entry from the jump table.
    -- move the actual target block address to the low two bytes of the source address field of the DMA0 block two blocks ahead.
    -- move the next block to DMA0 control registers.
    -- define the DMA0 block to be moved by the previous block.
  7. Increments a variable, modulo 3, to choose one of three output waveforms to send to PortB
  8. Computes a conditional branch to one of three waveform generators based on the mod 3 variable:
    -- 1 microsec pulse
    -- 2 microsec pulse
    -- 8 microsec pulse
  9. Unconditional jump back to the beginning of the program (item 1 on this list)
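
Two of the table-lookup and SFR operations in this list can be sketched as ordinary C to show the idea. The sketch assumes the increment table is aligned so that a variable written into the low byte of a source address selects one table entry (modeled here as an array index), and that the SFR INVERT location XORs the written value into the register. All names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t inc_table[256];

/* Entry i holds i+1 (modulo 256). */
static void init_inc_table(void)
{
    for (int i = 0; i < 256; i++) inc_table[i] = (uint8_t)(i + 1);
}

/* Increment: the variable selects the table entry that the next
   block copies back to the variable. */
static uint8_t dma_increment(uint8_t v)
{
    return inc_table[v];
}

/* NOR: plain write of v1, v2 to SET, 0xff to INVERT, then read back. */
static uint8_t dma_nor(uint8_t v1, uint8_t v2)
{
    uint8_t sfr = v1;
    sfr = (uint8_t)(sfr | v2);
    sfr = (uint8_t)(sfr ^ 0xffu);
    return sfr;
}

static void weird_machine_example(void)
{
    init_inc_table();
    assert(dma_increment(41) == 42);
    assert(dma_increment(255) == 0);
    assert(dma_nor(0x01, 0x02) == 0xFC);
}
```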

Fetch/execute machine details:
The syntax given below assumes that DMA0 or DMA2 block images can be built by defining them using PLIB DMA commands, then copying the blocks (in the preparation stage) to the large array or arrays of block images. Macros hide the actual preparation and abstract it to specifying the source address, destination address, source length (in bytes), destination length (in bytes), and the cell transfer length (in bytes). For example:
make_DMA2_block(LED_pattern2, (void*)&LATB, 64, 1, 64);
constructs and stores a block destined to be copied to the DMA2 control block which moves 64 bytes from a memory array, in a burst of 64, to the 1 byte port B output latch. The effect is to generate a burst of output transitions on the port, when the block is later loaded into the DMA2 control register and executed.

Direct Digital Synthesis -- A possible practical use for the DMA machine (and optimizing execution)
DDS uses a table-lookup to send sine values to a SPI-attached DAC. It is possible to do DMA transfer to the SPI using framed mode, which autogenerates a chip select on the channel slave-select line. However, the chip select is limited to one pin and there can only be one peripheral on the channel. The serial DMA machine allows you to define an arbitrary chip select pin and manipulate it. The downside is that the maximum speed for the transfer is around 11.4 Ksamples/sec (when using the standard 192 byte full DMA block definition). The example code waits for a timer event, toggles the chip select, sends two bytes through SPI to the DAC, increments an array pointer, then loops back to the beginning to wait for a timer event. To turn off the machine, just freeze timer3 so that another SPI device can access the bus. The demo code does this with a serial command.
-- The rate-limiting step in the DMA weird machine execution is loading the 192 byte blocks for every operation. Careful consideration of the contents of the DMA control block suggests that the last two words are not needed for this machine (unless you try to use transport-triggered compare). This shaves 32 bytes off. Another 12 bytes can be pulled off the end because each control register has three shadow registers for transport-triggered logic operations. The first word of the block is constant and can be set once, saving 16 bytes. The net result is 132 byte transfers which speeds up execution by about 1.5 times. The sample rate jumps to 18 Ksamples/second. Code.
-- Just running the DAC transfer as fast as possible with NO time-trigger control speeds up the sample rate to 23 Ksamples/sec. The speed up happens because the block size is cut to 100 bytes (minimum). The minimum size does not include the ability to set up a time trigger using the DMA block interrupt detect hardware. Code.
-- Changing the code to use 2-byte transfers to the SPI channel requires a modified increment table which limits the maximum sine resolution to 128 samples/cycle. The scheme makes a table in which the increment is two, rather than one. The effect is to remove two blocks from the DMA-block DDS loop, raising the maximum synthesis rate to 23.6 KHz (still with timer control). For the DDS sinewave this gives a frequency range from 2.95 KHz for an 8-sample sine to 184 Hz for a 128-sample sine. The 23.6 KHz synthesis rate corresponds to a timer interval of 1700 cycles. This means that changing the sample rate allows frequency control of better than 0.1%. Changing the length of the sine table by one sample yields frequency control of 1/(sine_table_size).
Code. <<use this version for DDS>>
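
The frequency figures follow from simple arithmetic: the output frequency is the sample rate divided by the table length, and changing the timer period by one cycle changes the rate by about one part in the period. A short sketch with hypothetical function names:

```c
#include <assert.h>

/* Output frequency of a sine table stepped once per sample. */
static double dds_freq_hz(double sample_rate_hz, unsigned table_len)
{
    assert(table_len > 0);
    return sample_rate_hz / (double)table_len;
}

/* Fractional frequency change for a one-cycle change in timer period. */
static double timer_step_fraction(unsigned timer_cycles)
{
    assert(timer_cycles > 0);
    return 1.0 / (double)timer_cycles;
}
```

With a 23.6 kHz sample rate, `dds_freq_hz(23600.0, 128)` is about 184 Hz and `dds_freq_hz(23600.0, 8)` is 2950 Hz. `timer_step_fraction(1700)` is about 0.0006, which is under 0.1%.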

Pseudorandom or random sequence generation
This example uses the CRC hardware module to generate a pseudorandom 16-bit number sequence. Optionally, reading a floating ADC input adds some entropy to make the sequence truly random, but not cryptographic quality. The sequence is output through the SPI DAC interface for spectral analysis. If the ADC is used, it is read every eighth iteration of the LFSR, with 8 bits of the ADC reading XORed with the lower 8 bits of the LFSR seed. Running the CRC LFSR, emitting the SPI data, and computing the conditional ADC read all run at about 10 KHz. The code needs a 16-bit SFR to use as a 16-bit ALU. The OCR5 set/reset registers were used. This version of the code optimizes for speed by eliminating possible timer control, so the system just runs as fast as it can. Eliminating SPI output would speed up random number generation about 30%. Eliminating the ADC read would speed it up by about 25%, but makes the sequence completely repeatable, and dependent on the initial seed chosen. The output noise spectrum drops with a 3 dB point at about 25% of the sample frequency and a minimum at the sample frequency at least 30 dB down.
Code (with ADC read every three LFSR operations)
Spectrum of DAC output with no ADC reads. Sample rate is about 16.8 KHz.
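
A software sketch of the same idea, for comparison. The polynomial used by the CRC module is not stated here, so this example assumes a common maximal-length 16-bit Galois LFSR (tap mask 0xB400). The `entropy` callback stands in for the ADC read, and the zero-state guard is an addition for this sketch, because an all-zero LFSR state never changes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One step of a 16-bit Galois LFSR with tap mask 0xB400 (assumed). */
static uint16_t lfsr_step(uint16_t s)
{
    uint16_t lsb = (uint16_t)(s & 1u);
    s = (uint16_t)(s >> 1);
    if (lsb) s = (uint16_t)(s ^ 0xB400u);
    return s;
}

/* Run `steps` iterations; on every eighth, XOR 8 bits of entropy
   into the low byte of the state. Pass NULL for no entropy. */
static uint16_t lfsr_run(uint16_t seed, unsigned steps, uint8_t (*entropy)(void))
{
    uint16_t s = seed;
    assert(seed != 0);
    for (unsigned i = 1; i <= steps; i++) {
        s = lfsr_step(s);
        if (entropy != NULL && (i % 8u) == 0) {
            s = (uint16_t)(s ^ entropy());
            if (s == 0) s = seed;    /* avoid the stuck all-zero state */
        }
    }
    return s;
}
```

Without the entropy callback the sequence is fully determined by the seed and, for this tap mask, repeats after 65535 steps.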


Older versions:

Time synced operation:
It is possible to sync overall machine operation to a timer by modifying one DMA2 block definition to trigger a transfer on a timer event. Note that this is a blocking wait, which stops DMA execution until the timer event. This could be useful for a small program that, for example, sends a word to the SPI channel on a regular schedule to run a DAC. The sequential machine would wait for a timer event, drop the chip-select line, transfer a word to the SPI buffer, raise the chip-select line, then loop to wait for the next timer event. The DMA2 block definition which waits, then executes a NOP could be:
DmaChnOpen(2, 0, DMA_OPEN_AUTO);
DmaChnSetTxfer(2, &inc_value, &bit_bucket, 1, 1, 1);
DmaChnSetEventControl(2, DMA_EV_START_IRQ(_TIMER_3_IRQ));
DmaChnSetEvEnableFlags(2, DMA_EV_CELL_DONE);
DmaChnEnable(2);
memcpy(DMA_blocks+length_of_block*N, &DCH2CON, length_of_block);
N++;

This code runs the main DMA loop at 100 Hz by waiting for timer3 event.

Optimizing test code execution speed
-- The execution speed of the DMA machine is limited by the need to load a 192 byte control block for each operation. By reducing the flexibility of the machine, certain chunks of the DMA2 block do not need to be reloaded each time. An optimized version with about 1.4 speed-up minimizes DMA2 block updates, but still allows the full functions described above. optimized code. The minimum execution time for one block dropped from 10 µsec to 7 µsec because the bytes per block were reduced from 192 to 132.
-- It is possible to optimize further, but the ability to trigger a block from an outside source (perhaps a timer) is lost. By eliminating the copy of the interrupt control registers, the copy count drops to 100 bytes, and the minimum block execution time drops to 5.5 µsec. The general test code above still runs, but time sync is much harder. The DCHxSSA, source address, register is the first address copied and the DCHxCSIZ, cell size register, is the last (see datasheet page 52).
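
The quoted timings are roughly what a simple proportional model predicts, if block time scales with the number of bytes copied. The sketch below uses hypothetical function names and takes the 192-byte, 10 µsec figure as the reference point.

```c
#include <assert.h>

/* Speed-up from copying fewer bytes per block, if time scales with size. */
static double speedup(unsigned old_bytes, unsigned new_bytes)
{
    assert(new_bytes > 0);
    return (double)old_bytes / (double)new_bytes;
}

/* Linear estimate of block time from a measured reference point. */
static double est_block_time_us(double ref_time_us, unsigned ref_bytes,
                                unsigned bytes)
{
    assert(ref_bytes > 0);
    return ref_time_us * (double)bytes / (double)ref_bytes;
}
```

`speedup(192, 132)` is about 1.45. The linear estimates for 132 and 100 bytes are about 6.9 and 5.2 µsec, slightly below the measured 7 and 5.5 µsec, which would be consistent with some fixed overhead per block.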

A different (and probably inferior) way to run the Fetch/Execute cycle
The method used above is optimal in terms of wasting no cycles because the fetch/execute cycle is asynchronous. As soon as an operation finishes, the next one can start. However, all four DMA channels are needed to make the machine run. One channel is the fetch unit, another is the execute unit, and two others are just used to clear interrupt flags in the first two channels. If a timer and output compare unit are used to generate two time-synched interrupt flags, then the two (fetch and execute) DMA channels can be triggered by the interrupt flags. The up side of this scheme is that it frees up two DMA channels. The down side is that the slowest operation determines the execution rate of the machine. Most operations are fairly fast, but add is much slower and branch is a little slower. Including the add operation drops performance by a factor of 10. The branch operation drops performance by a factor of 2.5. Tuning becomes quite difficult. But for reference, a running code (without add operation) is included which runs about 0.4 as fast as the async code. Code.


Copyright Cornell University April 10, 2024