Latency, error finishing read, and IRQ affinity

10 Feb 2024 02:51 - 01 Mar 2024 19:10 #292969 by mozmck
****** UPDATE - there is a setup script at the end of this post now that will set up kernel boot options and IRQ affinity scripts ******

Well I just had the "error finishing read" problem with a new system I'm setting up, and the problem turned out to be an issue with the script I use to set IRQ affinity for the network card which connects to the Mesa board.

This particular system is a mini PC with a Realtek RTL8168 ethernet chip and a quad core N95 CPU, running a custom built 6.6.2-rt kernel and LinuxCNC 2.9.2.  I had it working fine for a while, and then it started giving the "error finishing read" error right after starting LinuxCNC every time.  After fixing my script to set IRQ affinity properly, I ran a 2.5 hour cut file while playing a 4K video on YouTube over the wireless link without an error, and with a servo-thread.tmax of about 588000 on a 1697 MHz CPU.

In the past I have used this same process with great results with 5.4 and 5.10 rt kernels running on various mini PCs, mostly with Realtek network chips.  There are several things that must be in place for this to work properly, so I'm writing this post in hopes that it may help.

First, the kernel command line option 'isolcpus' must be used to isolate the last (highest numbered) CPU so that the kernel does not run any tasks on it unless they are specifically made to run on it.  LinuxCNC always runs its realtime task on that last CPU.
The only time any CPU other than the last one needs to be isolated is when another CPU shares a cache with it.  The command below run in a terminal will print a comma separated list of the last CPU and any others that share a cache with it.
lscpu -p | tac | awk -F, '{ if(FNR==1) {lastcore=$2; cpu=$1; cpu0=$1} else if ($2==lastcore && cpu0>1) {cpu = $1 "," cpu} } END {if(cpu0>0) {print cpu} }'

On most quad core Intel CPUs I've seen recently, such as the N95, that command will return "3", which is the last core.  On my Ryzen 6-core machine it returns "5,11".
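For anyone who finds the one-liner above hard to read, here is a longer sketch of the same idea. It is not the command from the post: the function name isolcpus_list is made up for illustration, and it assumes lscpu numbers CPUs without gaps and that CPUs reporting the same core ID are the ones sharing a cache.

```shell
#!/bin/sh
# Reads "CPU,CORE" lines (the format printed by `lscpu -p=CPU,CORE`) on stdin
# and prints the last CPU plus any other CPU that sits on the same core.
# isolcpus_list is a hypothetical helper name, not part of any package.
isolcpus_list() {
    grep -v '^#' | awk -F, '
        { core[$1] = $2; if ($1 + 0 >= last) last = $1 + 0 }
        END {
            # Nothing to isolate when there is no input or only CPU 0.
            if (NR == 0 || last == 0) exit
            out = ""
            for (cpu = 0; cpu <= last; cpu++) {
                if (!(cpu in core)) continue
                # Siblings are only added when there are more than two CPUs,
                # so a small machine never ends up with every CPU isolated.
                if (cpu == last || (last > 1 && core[cpu] == core[last]))
                    out = (out == "" ? cpu : out "," cpu)
            }
            print out
        }'
}
```

On a live system it would be used as: lscpu -p=CPU,CORE | isolcpus_list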

Edit the file /etc/default/grub:
sudo nano /etc/default/grub
Add isolcpus={whatever the command above printed} to the line beginning with GRUB_CMDLINE_LINUX_DEFAULT=
For example:
GRUB_CMDLINE_LINUX_DEFAULT="quiet isolcpus=5,11"
Now run: 
sudo update-grub
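If you would rather script the edit than do it by hand in nano, something along these lines might work. This is only a sketch: the function name add_isolcpus is made up, it assumes GNU sed, and it assumes the GRUB_CMDLINE_LINUX_DEFAULT line uses double quotes and has nothing after the closing quote. Try it on a copy of the file first.

```shell
#!/bin/sh
# add_isolcpus FILE CPULIST   (hypothetical helper, not part of grub)
# Puts isolcpus=CPULIST inside the quotes of GRUB_CMDLINE_LINUX_DEFAULT,
# replacing any isolcpus= value that is already there.
add_isolcpus() {
    file=$1
    cpus=$2
    # Drop an existing isolcpus=... token (and the space before it), if any.
    sed -i -E '/^GRUB_CMDLINE_LINUX_DEFAULT=/ s/ ?isolcpus=[^ "]*//' "$file"
    # Append the new token just before the closing quote.
    sed -i -E "/^GRUB_CMDLINE_LINUX_DEFAULT=/ s/\"\$/ isolcpus=${cpus}\"/" "$file"
}
```

For example: cp /etc/default/grub /tmp/grub.test, then add_isolcpus /tmp/grub.test 5,11 and look at the result before touching the real file. sudo update-grub is still needed afterwards.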

Now IRQ affinity is a little bit tricky.  If the irqbalance daemon is running, it typically runs every 10 seconds and moves IRQ handlers based on their activity to different CPUs to balance the load as evenly as it can.  It also automatically keeps IRQs off of the CPUs isolated by 'isolcpus'.  In this case we want the IRQ of the realtime network card to be handled on the isolated last CPU along with the LinuxCNC realtime task.
The IRQ can be moved by finding out which number it is, and then running the command: echo 3 | sudo tee /proc/irq/#/smp_affinity_list  where 3 is the last CPU and # is the number of the IRQ for the network card.  (A plain sudo echo "3" > ... fails from a normal user shell, because the redirection is done by your shell before sudo runs.)  The problem with this is that after a short while irqbalance will move it back off that CPU because it is isolated!
You can disable irqbalance, but another problem is that setting the smp_affinity is only good until you reboot even if irqbalance is not running.
Yet another problem is that if you change the network card or port connected to the Mesa card, the IRQ number will change, meaning that if you hard-coded an IRQ number in a script it will have to be changed.
Fortunately irqbalance has a way to be banned from moving certain IRQs, and a way to run a script when it starts up as it sets up each IRQ.  So my solution was to write a script which figures out which IRQ we need, bans irqbalance from moving it, and then sets the affinity of that IRQ to the last CPU.

To use the script, I saved it in /etc/irqbalance.d (I think I created the directory), and made it executable.  Remove the .txt from the end.  It assumes the card is set up in the /etc/network/interfaces file with the IP 10.10.10.1 - just change the IP to match the one you use, or you could even simply change the NIC= line and enter the name of the network device directly, such as: NIC="enp1s0"

Then you have to change a line in /etc/default/irqbalance to use this script:
IRQBALANCE_ARGS="--policyscript=/etc/irqbalance.d/lcnc_irqpolicy.sh"
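The attachment below shows as 0 KB here, so for reference this is a rough sketch of what a policy script of this kind could look like. It is not the attached script. It relies on irqbalance calling the policy script once per IRQ with the sysfs device path as the first argument and the IRQ number as the second, and reading key=value pairs such as ban=true from its output. The NIC name, the INTERRUPTS and IRQ_DIR variables (there only so it can be tried against test files) and the function names are all assumptions.

```shell
#!/bin/sh
NIC="enp1s0"                                # assumption: NIC wired to the Mesa card
INTERRUPTS=${INTERRUPTS:-/proc/interrupts}  # overridable for dry runs
IRQ_DIR=${IRQ_DIR:-/proc/irq}               # overridable for dry runs

# Succeeds if the /proc/interrupts line for IRQ $1 mentions interface $2.
irq_belongs_to_nic() {
    awk -v irq="$1:" -v nic="$2" \
        '$1 == irq && index($0, nic) { found = 1 } END { exit !found }' "$INTERRUPTS"
}

# policy IRQ LASTCPU: pin the NIC's IRQ to LASTCPU and tell irqbalance to
# leave it alone. Prints nothing for every other IRQ.
policy() {
    if irq_belongs_to_nic "$1" "$NIC"; then
        { echo "$2" > "$IRQ_DIR/$1/smp_affinity_list"; } 2>/dev/null
        echo "ban=true"
    fi
}

# irqbalance passes the device path as $1 and the IRQ number as $2.
if [ $# -ge 2 ]; then
    policy "$2" "$(lscpu -e=CPU | tail -1 | tr -d ' ')"
    exit 0   # a non-zero exit status caused trouble on newer systems
fi
```

Whether irqbalance on a given distribution behaves exactly this way should be checked with the journalctl command further down.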

 

File Attachment:

File Name: lcnc_irqpolicy.sh.txt
File Size:0 KB


I hope this is helpful, and I might be able to post a script to automate setting some of this stuff up soon.
Moses

[Edit]
I forgot to add a couple of things useful for test and debug:
After you reboot with the script in place, you can run the following command to see if it worked:
journalctl -b -u irqbalance.service

You should see a line that says something like: ... /usr/sbin/irqbalance[611]: IRQ 125: Override ban to true

Another way to see what is happening with interrupts is to run the following command and look for the ethernet card in the list and see which CPU is handling the interrupts for it.
watch -n1 -d cat /proc/interrupts

That can be run in a terminal while running LinuxCNC, or while pinging the Mesa card, and you will see the count for the ethernet IRQ climbing on the CPU which is handling it.
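If the full /proc/interrupts table is too wide to read comfortably, a small filter like the one below can print just the busiest CPU for each line that mentions the network interface. It is only a sketch: busiest_cpu is a made-up name, "enp1s0" is an example interface name, and it assumes the first line of the file lists one column per CPU.

```shell
#!/bin/sh
# busiest_cpu NIC [FILE]   (hypothetical helper)
# For each interrupt line mentioning NIC, prints the CPU with the highest count.
busiest_cpu() {
    awk -v nic="$1" '
        NR == 1 { ncpu = NF; next }     # header row: one column per CPU
        index($0, nic) {
            best = 0                    # CPU index; its count is field best+2
            for (i = 1; i <= ncpu; i++)
                if ($(i + 1) + 0 > $(best + 2) + 0) best = i - 1
            printf "IRQ %s busiest CPU%d (%d)\n", $1, best, $(best + 2)
        }' "${2:-/proc/interrupts}"
}
```

Running busiest_cpu enp1s0 a few seconds apart while LinuxCNC is running should show the count rising on the isolated CPU if the affinity is set as intended.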

[Edit 2] ****** UPDATE ******
Here is a script which will set up everything automatically. It will do the following:
  1. Set up isolcpus and a couple of other kernel arguments
  2. Run update-grub
  3. For systems with irqbalance running: create the IRQ affinity policy script for irqbalance and set up irqbalance to use it.
  4. For systems without irqbalance running: create a script which manually sets all IRQ affinities and set it up to be run at boot from /etc/rc.local

There are a few kernel options that *might* be helpful in reducing latency that can be enabled by uncommenting the relevant sections.
There are a couple of lines at the end which disable syncing the system clock to internet time servers.  This syncing has in the past been one cause of realtime errors.  Perhaps this should be made the default?

Rename the script to "rt_setup.sh" and make it executable, then run it through sudo:
sudo ./rt_setup.sh
After running this you will need to reboot the PC.

File Attachment:

File Name: rt_setup.sh.txt
File Size:9 KB
Last edit: 01 Mar 2024 19:10 by mozmck. Reason: updated setup script

10 Feb 2024 07:56 #292978 by Mecanix
EPIC! It works. Regardless of the app you start while Lcnc is running (browser, audio, whatever), both the Servo Thread & Base Thread remain identical, low and unaffected. Thank YOU!

10 Feb 2024 16:05 #293001 by Mecanix
Kinda strange it works flawlessly on an i7 (4 cores) but not on a Core Duo (2 cores). Unsupported with only 2 cores?

Intel i7 Ok!!
cnc@debian-i7:~$ sudo /etc/init.d/irqbalance status
● irqbalance.service - irqbalance daemon
     Loaded: loaded (/lib/systemd/system/irqbalance.service; enabled; preset: enabled)
     Active: active (running) since Sat 2024-02-10 10:41:42 EST; 12min ago
       Docs: man:irqbalance(1)
             https://github.com/Irqbalance/irqbalance
   Main PID: 484 (irqbalance)
      Tasks: 2 (limit: 28409)
     Memory: 2.7M
        CPU: 220ms
     CGroup: /system.slice/irqbalance.service
             └─484 /usr/sbin/irqbalance --foreground --policyscript=/etc/irqbalance.d/lcnc_irqpolicy.sh

Feb 10 10:41:42 debian-i7 systemd[1]: Started irqbalance.service - irqbalance daemon.
Feb 10 10:41:52 debian-i7 /usr/sbin/irqbalance[484]: IRQ 25: Override ban to true
 


Intel Core Duo, Not Ok
cnc@debian-duo:~$ sudo /etc/init.d/irqbalance status
○ irqbalance.service - irqbalance daemon
     Loaded: loaded (/lib/systemd/system/irqbalance.service; enabled; preset: enabled)
     Active: inactive (dead)
  Condition: start condition failed at Sat 2024-02-10 10:53:46 EST; 7min ago
             └─ ConditionCPUs=>1 was not met
       Docs: man:irqbalance(1)
             https://github.com/Irqbalance/irqbalance

Feb 10 10:53:46 debian-linuxcnc systemd[1]: irqbalance.service - irqbalance …1).
Hint: Some lines were ellipsized, use -l to show in full.
 

10 Feb 2024 16:12 #293002 by mozmck
I'll have to research that more, but it kinda looks like it's saying irqbalance won't run unless there is more than one CPU, which makes sense. If isolcpus is used on a dual core then there is only 1 CPU irqbalance could use, so it doesn't make sense to run it.

If irqbalance is not running you will need to run a script similar to mine to find the right IRQ and set the affinity at boot time. I think that can be done using rc.local.

10 Feb 2024 16:25 #293004 by Mecanix
I'm personally not too bothered with the limitation. The i7 serves the network/mesa interfaces, which I need your irq hack to work for. While the Core Duo is used for a parport tool. Enhancement on the i7 is phenomenal in my case (btw, thanks!).

Just thought you'd like to know about that vintage Core Duo limitation.

10 Feb 2024 16:33 #293006 by mozmck
Something like this should work:
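#!/bin/bash
# Affinity bitmask for the last (highest numbered) CPU, e.g. CPU 1 -> 2, CPU 3 -> 8.
MASK=$(printf "%X" $((1 << $(lscpu -e=CPU | tail -1))))
# Name of the interface configured with the Mesa-side address in /etc/network/interfaces.
NIC=$(awk '$1=="iface" {tmpnic=$2} $1=="address" && $2=="10.10.10.1" {nic=tmpnic} END {print nic}' /etc/network/interfaces)

if [ -n "$NIC" ]; then
    # Pin every IRQ whose /proc/interrupts line mentions the interface.
    grep "$NIC" /proc/interrupts | cut -d ":" -f 1 | while read -r IRQ; do
        echo "$MASK" > "/proc/irq/$IRQ/smp_affinity"
    done
fi
exit 0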

Save the above as a script somewhere and make it executable, then call it from /etc/rc.local

You can see if it's working by running the following command in a terminal while LinuxCNC is running.  You should see the network card IRQ numbers increasing on the second CPU.
watch -n1 -d cat /proc/interrupts

The only thing I'm not sure of is if the rest of the IRQs will be handled by default on the first CPU.  This is the way it was in the past but I saw some mention that newer kernels may distribute IRQ handlers across CPUs.  If this is the case then the script needs to go through every other interrupt and move it to the first CPU.
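If the other IRQs do turn out to be spread across CPUs, the extra step could look roughly like this. It is an untested sketch along the lines described above, not something from the setup script: the function names, "enp1s0" as the interface name, and the INTERRUPTS and IRQ_DIR variables (there so it can be tried against test files) are all assumptions. Some interrupts cannot be moved, so write errors are ignored.

```shell
#!/bin/sh
INTERRUPTS=${INTERRUPTS:-/proc/interrupts}  # overridable for dry runs
IRQ_DIR=${IRQ_DIR:-/proc/irq}               # overridable for dry runs

# Lists numbered IRQs whose /proc/interrupts line does not mention interface $1.
# Named entries such as NMI or LOC are skipped.
other_irqs() {
    awk -v nic="$1" \
        '$1 ~ /^[0-9]+:$/ && !index($0, nic) { sub(":", "", $1); print $1 }' "$INTERRUPTS"
}

# Asks the kernel to handle each of those IRQs on CPU 0.
move_others_to_cpu0() {
    other_irqs "$1" | while read -r irq; do
        { echo 0 > "$IRQ_DIR/$irq/smp_affinity_list"; } 2>/dev/null || true
    done
}
```

It would be called as move_others_to_cpu0 enp1s0 from the same rc.local script, after the network card IRQ has been pinned.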

10 Feb 2024 16:37 #293008 by mozmck
It seems like in the past I had tested on dual core and had it working, but it could be that irqbalance, or maybe the service that starts it, has been changed in newer versions/distributions. The problem I had recently was due to the fact that my policy script did 'exit 1', which worked fine in the past, but now it has to be 'exit 0'.

10 Feb 2024 17:02 #293011 by Mecanix
Nice one! Unfortunately the great Ice-Age, Medieval, Antiquated infamous Intel Core Duo's not happy with it. I can only get the service to start and the script executed with GRUB_CMDLINE_LINUX_DEFAULT="quiet", i.e. with no isolcpus, but that defeats the purpose. The service fails to start with isolcpus, so I think you were initially right. Or perhaps I'm not doing it right; let's see if someone else with a similar Core Duo setup can.
cnc@debian-duo:~$ sudo /etc/init.d/irqbalance status
● irqbalance.service - irqbalance daemon
     Loaded: loaded (/lib/systemd/system/irqbalance.service; enabled; preset: enabled)
     Active: active (running) since Sat 2024-02-10 11:56:07 EST; 2min 23s ago
       Docs: man:irqbalance(1)
             https://github.com/Irqbalance/irqbalance
   Main PID: 370 (irqbalance)
      Tasks: 2 (limit: 6868)
     Memory: 1.7M
        CPU: 222ms
     CGroup: /system.slice/irqbalance.service
             └─370 /usr/sbin/irqbalance --foreground --policyscript=/etc/irqbalance.d/lcnc_irqpolicy.sh

Feb 10 11:56:07 debian-duo systemd[1]: Started irqbalance.service - irqbalance daemon.
 

10 Feb 2024 17:11 #293012 by tommylight
Made this sticky, thank you.

10 Feb 2024 18:03 #293018 by mozmck
Mecanix, the second script is for use without irqbalance running. It should be run from /etc/rc.local and not as a policyscript for irqbalance. I haven't tested it myself yet, but I know I have a dual-core or two around here that I can try it on. Probably be next week or so before I can find time to do that.
