JIT bytecode optimizer #162
I have some initial experimental results. Objective: measure how much time Lua needs to respond to a GPIO interrupt. Test code:
Wire: GPIO26 to GPIO14
CASE A: Lua RTOS with Lua locks & without JIT bytecode optimizer
CASE B: Lua RTOS with Lua locks & JIT bytecode optimizer
CASE C: Lua RTOS without Lua locks & JIT bytecode optimizer
Conclusions:
CASE A: use C if you need a response time < 60 usecs
It is clear that the JIT has a good impact on performance, and we need to determine in which situations it is safe to disable Lua locks. |
Hi guys, A first version of the JIT byte-code optimizer is available in 6763f11. |
@jolivepetrus what exactly are Lua locks? When is it safe to disable them? Are they needed only when multiple threads access the same hardware? If I use a single dedicated thread to access a specific piece of hardware, or I use explicit mutexes in my code to prevent simultaneous access to it, can I safely disable the Lua locks? |
Lua locks are recursive mutexes that are used in the Lua API to protect the Lua state against concurrent access. For example, when calling lua_newuserdata or lua_pushinteger, a lock is acquired before modifying the Lua state (the structure that holds, for example, the global variables) and released once the Lua state has been modified. Lua RTOS is programmed in such a way that Lua locks can be omitted in certain circumstances:
With the current Lua RTOS version, it should be safe to disable locks if:
Disabling Lua locks, or minimizing their use, is feasible, but it requires some internal work to make it transparent to the programmer. For now, please follow the above rules, and just program in the usual way. |
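To make the term "recursive mutex" concrete, here is a minimal pure-Lua model of one. It is only a sketch: the real Lua locks live in the C side of Lua RTOS, and the `RecursiveLock` table, its methods, and the "thread A"/"thread B" owner names are all hypothetical. It shows that the same owner can re-enter the lock from a nested API call, while a different owner is refused until every level has been released.

```lua
-- Hypothetical pure-Lua model of a recursive lock; NOT the Lua RTOS implementation.
local RecursiveLock = {}
RecursiveLock.__index = RecursiveLock

function RecursiveLock.new()
  return setmetatable({ owner = nil, depth = 0 }, RecursiveLock)
end

-- Returns true if `who` now holds the lock, false if another owner holds it.
function RecursiveLock:try_acquire(who)
  if self.owner == nil or self.owner == who then
    self.owner = who
    self.depth = self.depth + 1
    return true
  end
  return false
end

-- Releases one nesting level; the lock becomes free when depth reaches zero.
function RecursiveLock:release(who)
  assert(self.owner == who, "lock released by non-owner")
  self.depth = self.depth - 1
  if self.depth == 0 then
    self.owner = nil
  end
end

local lock = RecursiveLock.new()
local first = lock:try_acquire("thread A")   -- outer API call
local nested = lock:try_acquire("thread A")  -- nested API call, same owner
local other = lock:try_acquire("thread B")   -- different owner is refused
lock:release("thread A")
lock:release("thread A")
local after = lock:try_acquire("thread B")   -- free again, so B succeeds
print(first, nested, other, after)           -- true  true  false  true
```

Every API call that touches the Lua state pays for one acquire/release pair like this, which is why removing the locks shows up in the timing measurements.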
@jolivepetrus thanks for the explanation. I've been testing this by executing 10-12 consecutive spi:readwrite operations, each sending 2 bytes. There is nothing more in the program right now (wifi and network services disabled). I recorded the time between NSS going low and going high again, and the time until NSS goes low again.
So I got the best performance with both LL and JIT off. It's strange that with JIT on, there's almost no difference on whether the LL is on or off. BTW is disabling hardware locks supported? I wanted to try disabling them in Kconfig, but then I got some error with i2c unlock function missing during compilation. |
Please attach your test code so that I can check the JIT optimizations. |
In theory, the JIT commits disable Lua locks and the read-only table cache when the JIT is enabled:
Cache: When accessing read-only tables, Lua RTOS can get the key/value pair from a cache. This can
Lua locks: Use locks when the program enters the Lua core. This option is disabled when the JIT bytecode
For this reason you see no difference with JIT=on & (LL=on | LL=off). If JIT=on, LL are always disabled. The LL setting is currently kept in Kconfig for compatibility, and will be removed from Kconfig soon. |
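The interaction between these options can be modelled in a few lines. This is a hypothetical sketch, not the Kconfig logic itself: the `effective` function and the `jit`, `locks`, and `cache` field names are made up for illustration. It only encodes the rule stated above, that enabling the JIT overrides the other two settings.

```lua
-- Hypothetical model of how the build options interact; names are illustrative only.
local function effective(opts)
  return {
    jit = opts.jit,
    locks = opts.locks and not opts.jit,  -- JIT on forces Lua locks off
    cache = opts.cache and not opts.jit,  -- JIT on forces the read-only table cache off
  }
end

local a = effective({ jit = true, locks = true, cache = true })
local b = effective({ jit = true, locks = false, cache = true })
local c = effective({ jit = false, locks = true, cache = true })
print(a.locks, b.locks, c.locks)  -- false  false  true
```

Cases `a` and `b` end up with identical effective settings, which matches the observation that timings barely change between LL on and LL off once the JIT is enabled.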
Ok, here's a little twist. I have two programs: one in which I started implementing communication with the RFM12B module, and a stripped-down one which only enables SPI and sends some bytes. Here's the stripped-down code:
local spiData = {
  0x0123,
  0x4567,
  0x89ab,
  0xcdef,
  0x0000,
  0x1111,
  0x2222,
  0x3333,
  0x4444,
  0x5555,
  0x6666,
}
local dev
local function readwrite(data)
  dev:select()
  local ret = dev:readwrite((data & 0xff00) >> 8, data & 0xff)
  dev:deselect()
  return ret[0] << 8 | ret[1]
end
function runtest()
  dev = spi.attach(spi.SPI3, spi.MASTER, pio.GPIO5, 2000000, 8, 0)
  for i, cmd in ipairs(spiData) do
    readwrite(cmd)
  end
end
And here's the full one:
-- RFM12B
local FREQUENCY = 1664 -- math.floor((868320000 - 860000000) / 5000) -- 868.320 MHz
local FSK_SHIFT = 48 -- (math.floor(60000 / 15000) - 1) << 4 -- 60000
local DATARATE = 35 -- math.floor((10000000 / 29 / 9600) - 0.5)
local CMD_CFG = 0x8000
local CFG_EL = 0x80
local CFG_EF = 0x40
--local CFG_BAND_315 = 0x00
--local CFG_BAND_433 = 0x10
local CFG_BAND_868 = 0x20
--local CFG_BAND_915 = 0x30
--local CFG_XTAL_8_5PF = 0x00
--local CFG_XTAL_9_0PF = 0x01
--local CFG_XTAL_9_5PF = 0x02
--local CFG_XTAL_10_0PF = 0x03
--local CFG_XTAL_10_5PF = 0x04
--local CFG_XTAL_11_0PF = 0x05
--local CFG_XTAL_11_5PF = 0x06
--local CFG_XTAL_12_0PF = 0x07
local CFG_XTAL_12_5PF = 0x08
--local CFG_XTAL_13_0PF = 0x09
--local CFG_XTAL_13_5PF = 0x0A
--local CFG_XTAL_14_0PF = 0x0B
--local CFG_XTAL_14_5PF = 0x0C
--local CFG_XTAL_15_0PF = 0x0D
--local CFG_XTAL_15_5PF = 0x0E
--local CFG_XTAL_16_0PF = 0x0F
local CMD_PWRMGT = 0x8200
local PWRMGT_ER = 0x80
--local PWRMGT_EBB = 0x40
local PWRMGT_ET = 0x20
--local PWRMGT_ES = 0x10
--local PWRMGT_EX = 0x08
--local PWRMGT_EB = 0x04
--local PWRMGT_EW = 0x02
local PWRMGT_DC = 0x01
local CMD_FREQUENCY = 0xA000
local CMD_DATARATE = 0xC600
--local DATARATE_CS = 0x80
local CMD_RXCTRL = 0x9000
local RXCTRL_P16_VDI = 0x400
local RXCTRL_VDI_FAST = 0x000
--local RXCTRL_VDI_MEDIUM = 0x100
--local RXCTRL_VDI_SLOW = 0x200
--local RXCTRL_VDI_ALWAYS_ON = 0x300
--local RXCTRL_BW_400 = 0x20
--local RXCTRL_BW_340 = 0x40
--local RXCTRL_BW_270 = 0x60
local RXCTRL_BW_200 = 0x80
--local RXCTRL_BW_134 = 0xA0
--local RXCTRL_BW_67 = 0xC0
local RXCTRL_LNA_0 = 0x00
--local RXCTRL_LNA_6 = 0x08
--local RXCTRL_LNA_14 = 0x10
--local RXCTRL_LNA_20 = 0x18
local RXCTRL_RSSI_103 = 0x00
--local RXCTRL_RSSI_97 = 0x01
--local RXCTRL_RSSI_91 = 0x02
--local RXCTRL_RSSI_85 = 0x03
--local RXCTRL_RSSI_79 = 0x04
--local RXCTRL_RSSI_73 = 0x05
--local RXCTRL_RSSI_67 = 0x06
--local RXCTRL_RSSI_61 = 0x07
local CMD_DATAFILTER = 0xC228
local DATAFILTER_AL = 0x80
--local DATAFILTER_ML = 0x40
--local DATAFILTER_S = 0x10
local CMD_FIFORESET = 0xCA00
--local FIFORESET_SP = 0x08
--local FIFORESET_AL = 0x04
local FIFORESET_FF = 0x02
local FIFORESET_DR = 0x01
--local CMD_SYNCPATTERN = 0xCE00
local CMD_READ = 0xB000
local CMD_AFC = 0xC400
--local AFC_AUTO_OFF = 0x00
--local AFC_AUTO_ONCE = 0x40
local AFC_AUTO_VDI = 0x80
--local AFC_AUTO_KEEP = 0xC0
local AFC_LIMIT_OFF = 0x00
--local AFC_LIMIT_16 = 0x10
--local AFC_LIMIT_8 = 0x20
--local AFC_LIMIT_4 = 0x30
--local AFC_ST = 0x08
--local AFC_FI = 0x04
local AFC_OE = 0x02
local AFC_EN = 0x01
local CMD_TXCONF = 0x9800
--local TXCONF_MP = 0x100
local TXCONF_POWER_0 = 0x00
--local TXCONF_POWER_3 = 0x01
--local TXCONF_POWER_6 = 0x02
--local TXCONF_POWER_9 = 0x03
--local TXCONF_POWER_12 = 0x04
--local TXCONF_POWER_15 = 0x05
--local TXCONF_POWER_18 = 0x06
--local TXCONF_POWER_21 = 0x07
local CMD_PLL = 0xCC02
--local PLL_DDY = 0x08
--local PLL_DDIT = 0x04
local PLL_BW0 = 0x01
local CMD_TX = 0xB800
--local CMD_WAKEUP = 0xE000
--local CMD_DUTYCYCLE = 0xC800
--local DUTYCYCLE_ENABLE = 0x01
local CMD_STATUS = 0x0000
--local STATUS_RGIT = 0x8000
local STATUS_FFIT = 0x8000
--local STATUS_POR = 0x4000
--local STATUS_RGUR = 0x2000
--local STATUS_FFOV = 0x2000
--local STATUS_WKUP = 0x1000
--local STATUS_EXT = 0x0800
--local STATUS_LBD = 0x0400
--local STATUS_FFEM = 0x0200
--local STATUS_ATS = 0x0100
--local STATUS_RSSI = 0x0100
--local STATUS_DQD = 0x0080
--local STATUS_CRL = 0x0040
--local STATUS_ATGL = 0x0020
local CMD_RESET = 0xffff
local CMD_PWRMGT_DEFAULT = CMD_PWRMGT | PWRMGT_DC
--local CMD_PWRMGT_TRANSMIT = CMD_PWRMGT_DEFAULT | PWRMGT_ET
local CMD_PWRMGT_RECEIVE = CMD_PWRMGT_DEFAULT | PWRMGT_ER
local CMD_CLEAR_FIFO = CMD_FIFORESET | FIFORESET_DR | (8 << 4)
local CMD_ACCEPT_DATA = CMD_CLEAR_FIFO | FIFORESET_FF
local INIT_COMMANDS = {
  CMD_CFG | CFG_EL | CFG_EF | CFG_BAND_868 | CFG_XTAL_12_5PF,
  CMD_PWRMGT_DEFAULT,
  CMD_FREQUENCY | FREQUENCY,
  CMD_DATARATE | DATARATE,
  CMD_RXCTRL | RXCTRL_P16_VDI | RXCTRL_VDI_FAST | RXCTRL_BW_200 | RXCTRL_LNA_0 | RXCTRL_RSSI_103,
  CMD_DATAFILTER | DATAFILTER_AL | 4,
  CMD_CLEAR_FIFO,
  CMD_AFC | AFC_AUTO_VDI | AFC_LIMIT_OFF | AFC_OE | AFC_EN,
  CMD_TXCONF | TXCONF_POWER_0 | FSK_SHIFT,
  CMD_PLL | PLL_BW0,
  CMD_PWRMGT_RECEIVE,
}
local dev
--local buffer = {}
buffer = {}
local packets = {}
--local recvd = 0
recvd = 0
local recheck = false
--local status
local packet_rcvd = event.create()
function rfm12_readwrite(data)
  dev:select()
  local ret = dev:readwrite((data & 0xff00) >> 8, data & 0xff)
  dev:deselect()
  return ret[0] << 8 | ret[1]
end
function rfm12_init()
  dev = spi.attach(spi.SPI3, spi.MASTER, pio.GPIO5, 2000000, 8, 0)
  thread.start(extafree_packet, 8 * 1024)
  pio.pin.setdir(pio.OUTPUT, pio.GPIO16)
  pio.pin.setdir(pio.OUTPUT, pio.GPIO12)
  pio.pin.setlow(pio.GPIO12)
  thread.sleepms(1)
  pio.pin.sethigh(pio.GPIO12)
  for i, cmd in ipairs(INIT_COMMANDS) do
    rfm12_readwrite(cmd)
  end
  rfm12_readwrite(CMD_STATUS)
  --pio.pin.interrupt(pio.GPIO21, rfm12_callback, pio.pin.IntrNegEdge, 100, 8 * 1024, 22)
  pio.pin.interrupt(pio.GPIO21, rfm12_callback, pio.pin.IntrLowLevel, 100, 8 * 1024, 22)
  rfm12_readwrite(CMD_CLEAR_FIFO)
  rfm12_readwrite(CMD_ACCEPT_DATA)
end
function extafree_packet()
  while true do
    packet_rcvd:wait()
    while #packets > 0 do
      local packet = table.remove(packets, 1)
      local cs = 0
      for i = 1, 10 do
        cs = cs + packet[i]
      end
      cs = cs & 0xff
      print(table.concat(packet, ' '), cs == packet[11])
    end
    packet_rcvd:done()
  end
end
function rfm12_callback()
  while true do
    status = rfm12_readwrite(CMD_STATUS)
    recheck = false
    if status & STATUS_FFIT ~= 0 then
      recheck = true
      recvd = recvd + 1
      buffer[recvd] = rfm12_readwrite(CMD_READ) & 0xff
      if recvd == 11 then
        rfm12_readwrite(CMD_CLEAR_FIFO)
        rfm12_readwrite(CMD_ACCEPT_DATA)
        table.insert(packets, buffer)
        buffer = {}
        recvd = 0
        packet_rcvd:broadcast()
      end
    end
    if not recheck then break end
  end
end
Don't worry about the rest of the program in the full code. I'm testing it right now without anything connected to the SPI except the logic analyser. When you run |
New optimizations have been added to the JIT (see f88aa17). There are also changes in the Lua spi module (readwrite function). The function now returns an array t[k] with k >= 1, which ensures that the result table is a true array and that no table resizes occur while the table is built. In the previous version the result was t[k] with k >= 0, which is not an array (in Lua, arrays are indexed from 1 to the array size) and caused continuous table resizes. You should see a time of 150 usecs between NSS with the JIT enabled. It is not feasible to get lower values without changes in the SPI driver, but I have some ideas to implement in the driver to reduce the transaction duration. |
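The difference between the two layouts can be seen in plain Lua 5.3, without any hardware. The sketch below uses made-up byte values, and `count_ipairs` and `to_word` are hypothetical helpers, not part of the spi module. It shows that a table starting at index 0 is only partly visible to the length operator and to ipairs, and how a caller would combine two received bytes when the result starts at index 1.

```lua
-- Two ways to lay out a 2-element result table (byte values are made up).
local zero_based = { [0] = 0x12, [1] = 0x34 }  -- old layout: index 0 is outside the sequence
local one_based = { 0x12, 0x34 }               -- new layout: a proper 1-based sequence

-- Hypothetical helper: count the elements that ipairs visits.
local function count_ipairs(t)
  local n = 0
  for _ in ipairs(t) do n = n + 1 end
  return n
end

print(#zero_based, count_ipairs(zero_based))  -- 1  1
print(#one_based, count_ipairs(one_based))    -- 2  2

-- Hypothetical helper: combine two received bytes into a 16-bit word,
-- reading the result table from index 1.
local function to_word(ret)
  return ret[1] << 8 | ret[2]
end

print(string.format("0x%04x", to_word(one_based)))  -- 0x1234
```

Code written against the earlier layout, such as `ret[0] << 8 | ret[1]`, would presumably need to become `ret[1] << 8 | ret[2]` after this change.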
@jolivepetrus |
@jolivepetrus I don't know if this is the right issue to post this in; maybe I should open a new issue for the SPI? I've tested the SPI performance and my conclusion is that the moment the NSS pin is activated/deactivated only helps partially. I've tried with and without the new
BTW I've noticed that NSS now rises at exactly the same time as the last clock edge falls at the end of the transfer. Is it possible to introduce a little delay? The time between NSS going low and the first clock edge is 1/2 of a clock cycle, which should also be OK for NSS going high after the last clock edge. Alternatively, add the ability to append a number of dummy bits at the end. Some devices don't like NSS going high on the last clock edge. For example, the RFM12B datasheet calls this time "Select hold time" and requires it to be at least 25 ns. |
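As a quick check of the numbers, the following plain Lua snippet computes the proposed half-cycle delay. It assumes the 2 MHz clock used in the spi.attach calls earlier in this thread and the 25 ns figure quoted from the datasheet; the variable names are illustrative only.

```lua
-- Assumed values: 2 MHz SPI clock (from the spi.attach calls in this thread)
-- and the 25 ns minimum select hold time quoted from the RFM12B datasheet.
local clock_hz = 2000000
local min_hold_ns = 25

local period_ns = 1e9 / clock_hz     -- one full clock cycle in nanoseconds
local half_cycle_ns = period_ns / 2  -- proposed delay before NSS rises

print(period_ns, half_cycle_ns, half_cycle_ns >= min_hold_ns)  -- 500.0  250.0  true
```

At this clock rate a half-cycle delay would give 250 ns, ten times the required minimum. Under the same assumption, a half cycle stays at or above 25 ns for clocks up to 20 MHz.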
We can introduce a small delay before NSS goes high, but not when NSS goes low due to an ESP32 errata (this only works in half-duplex). Maybe it's better to make SPI_FLAG_CS_AUTO = 0 by default, this introduces a small delay. |
As Lua RTOS makes intensive use of read-only tables, a series of optimizations can be performed on the program's bytecode to speed up program execution. This can help programs written for Lua RTOS achieve performance similar to those written in C, and is especially important when the programmer uses the Lua RTOS hardware-access modules.
We have started work on an initial version of the Lua RTOS JIT bytecode optimizer, with excellent results.