Systematic approach to tracking down latency issue
06 Aug 2015 20:10 #61187
by hase
Systematic approach to tracking down latency issue was created by hase
Hi all,
I am fed up.
Actually, I am *very* happy with LinuxCNC and essentially love it.
And that is why I am so thankful for all the work that so many developers are putting in. Big Hug!
But latency issues keep driving me mad.
Interrupt: I am a firmware and hardware engineer by profession and I'm using LinuxCNC-driven homegrown milling machines as a hobby: sometimes hacking on bits and bytes isn't cutting it for me; then I need the sound of metal being chipped or the smell and sound of wood made into sawdust.
Anyway.
One of my PCs driving my experimental mill needed an upgrade: I put in an SSD, installed Kubuntu 12.04, installed the rtai kernel and linuxcnc and restored my configurations from a backup.
A failed hard disk made this operation necessary and I used the chance to upgrade to SSD.
This operation beamed me right back to latency hell, on a machine that I had working just fine with genuinely good latency values (measured with latency-test).
I would love to do a systematic search for the problem, but I can find no documentation on this.
There are a number of tips-n-tricks type docs like wiki.linuxcnc.org/cgi-bin/wiki.pl?RealTime but is there a systematic approach to find out what caused latency?
When it comes to task-invocation latency, I am basically in the dark (which feels bad for an engineer).
From the RTAI documentation I would assume that once the rtai module is loaded, it takes priority over interrupts and task scheduling.
I therefore understand how power-saving features could ruin latency: if the CPU has to wake up from deep sleep before servicing the timer interrupt, that will cause latency. This is why the Core2Duo in one of my machine controllers benefits so much from the idle=poll parameter on the kernel command line (without idle=poll => unusable, with idle=poll => good latency values measured).
But why would swapping the hard disk for an SSD (and hooking up a secondary hard disk) impact latency?
tl;dr:
- is there a way to figure out what caused a scheduling delay?
- would it help to fiddle with SMP interrupt affinity? Is that even possible under RTAI?
merci bien
Hartmut "hase" Semken
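After a reinstall it is worth confirming that parameters such as idle=poll actually made it onto the new kernel command line. The following is a minimal Python sketch, not part of the original post: it parses text in the format of /proc/cmdline (a standard Linux file). The function names and the default list of parameters to check are assumptions chosen for illustration.

```python
def parse_cmdline(text):
    """Split a kernel command line into a {name: value} dict.

    Flags without '=' (for example 'quiet') map to None.
    """
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def report(text, wanted=("idle", "isolcpus")):
    """Return one line per parameter of interest, whether found or not."""
    params = parse_cmdline(text)
    lines = []
    for name in wanted:
        if name in params:
            lines.append(f"{name}={params[name]}")
        else:
            lines.append(f"{name} not set")
    return lines
```

On a running system one could call `report(open("/proc/cmdline").read())` and compare the result between the old and the new installation.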
06 Aug 2015 23:56 #61195
by BigJohnT
Replied by BigJohnT on topic Systematic approach to tracking down latency issue
Is there some reason you didn't use the LiveCD and install LinuxCNC with Debian Wheezy?
JT
07 Aug 2015 01:35 #61198
by hase
Replied by hase on topic Systematic approach to tracking down latency issue
Well, kinda.
I started with LinuxCNC 2.4 on Ubuntu 10.04 back in the day and just followed the upgrade path from there.
The Ubuntu Startup Disk Creator will not make a USB installer from the LiveCD (it complains about a bad distro name, iirc), so I went for the Ubuntu version again.
Do you think that Wheezy would make a difference?
It would be the same kernel running the same hardware, right?
merci
hase
07 Aug 2015 15:27 #61201
by ArcEye
Replied by ArcEye on topic Systematic approach to tracking down latency issue
hase wrote: "I would love to do a systematic search for the problem, but I can find no documentation on this."
www.linuxcnc.org/index.php/english/forum...-the-latency-problem
www.linuxcnc.org/index.php/english/forum...me-latency-solutions
07 Aug 2015 19:02 #61204
by BigJohnT
Replied by BigJohnT on topic Systematic approach to tracking down latency issue
hase wrote: "Do you think that Wheezy would make a difference? It would be the same kernel running the same hardware, right?"
I don't have a clue but it seems the normal way is to start with what everyone else is using instead of rolling your own system.
JT
09 Aug 2015 19:47 #61238
by hase
Replied by hase on topic Systematic approach to tracking down latency issue
I followed your advice and am now running off the LiveCD.
Latency values as measured by latency-test and by /usr/realtime-3.4-9-rtai-686-pae/testsuite/kern/latency/run are getting worse (200+ milliseconds; my 12.04+RTAI setup peaked at 65ms before I added the SSD and at 120ms with the SSD).
This comes as no surprise:
- the kernels in all my experiments are the same: uname -a => Linux debian 3.4-9-rtai-686-pae #1 SMP PREEMPT Debian 3.4.55-4linuxcnc i686 GNU/Linux
- my kernel command line includes isolcpus=1, the Live-CD does not isolate one core from interrupts; this has helped in the past
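On the interrupt-affinity side of this: stock Linux exposes the CPUs allowed to service each IRQ as a hex bitmask in /proc/irq/<N>/smp_affinity. Whether RTAI honours that setting for its own timer handling is not established anywhere in this thread, so the sketch below only decodes and encodes such masks. The function names are made up for illustration.

```python
def mask_to_cpus(mask_text):
    """Decode a hex CPU mask such as '3' or '00000000,00000002' into CPU numbers."""
    # Large systems print the mask as comma-separated 32-bit groups.
    value = int(mask_text.strip().replace(",", ""), 16)
    cpus = []
    bit = 0
    while value:
        if value & 1:
            cpus.append(bit)
        value >>= 1
        bit += 1
    return cpus


def cpus_to_mask(cpus):
    """Encode CPU numbers as the hex string that smp_affinity accepts."""
    value = 0
    for cpu in cpus:
        value |= 1 << cpu
    return format(value, "x")
```

With isolcpus=1 the idea would be to keep device interrupts on CPU0, i.e. mask `cpus_to_mask([0])`, which is "1". Writing that value to an IRQ's smp_affinity file requires root.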
I am still looking for a systematic approach to find out what is causing the interrupt delays.
My current suspect is SMI, as this seems to be the only interrupt that is beyond RTAI control.
OTOH there is no periodic latency spike when I run /usr/realtime-3.4-9-rtai-686-pae/testsuite/kern/latency/run.
I can reliably cause a pretty big latency spike (70..120ms) when I do disk-IO, e.g.
tar cvf /dev/null /usr
Reading the SSD causes smaller spikes; reading the rotating disk causes larger ones.
Repeating the test does not cause any spikes (!).
And the repeat test on the disk also creates much less head-travel noise; obviously a lot of stuff has been cached by the kernel.
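Because the second run is served from the page cache, the disk test is only repeatable if the cache is emptied between runs. A minimal sketch, assuming root privileges and the standard Linux interface /proc/sys/vm/drop_caches; the function name and the `path` parameter (there so the function can be tried against an ordinary file) are illustrative.

```python
import os


def drop_page_cache(path="/proc/sys/vm/drop_caches"):
    """Ask the kernel to drop clean page cache, dentries and inodes (needs root)."""
    os.sync()                      # flush dirty pages first so more cache can be freed
    with open(path, "w") as f:
        f.write("3\n")             # 1 = page cache, 2 = dentries/inodes, 3 = both
```

Calling this before each `tar cvf /dev/null /usr` run should make the disk actually get read again, so the spike can be provoked on demand rather than only once per boot.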
How can the disk-IO cause RT-delays?
The SATA controller does not even share its interrupt:
cat /proc/interrupts
            CPU0       CPU1
   0:         40          0   IO-APIC-edge      timer
   1:         46       4720   IO-APIC-edge      i8042
   7:          0          0   IO-APIC-edge      parport0
   8:          0          1   IO-APIC-edge      rtc0
   9:          0          0   IO-APIC-fasteoi   acpi
  12:        236      23380   IO-APIC-edge      i8042
  14:          0          0   IO-APIC-edge      pata_atiixp
  15:          0          0   IO-APIC-edge      pata_atiixp
  16:       3671     155861   IO-APIC-fasteoi   ohci_hcd:usb3, ohci_hcd:usb4, eth0
  17:        171      21973   IO-APIC-fasteoi   ehci_hcd:usb1
  18:        294      33188   IO-APIC-fasteoi   ohci_hcd:usb5, ohci_hcd:usb6, ohci_hcd:usb7, radeon
  19:          0         30   IO-APIC-fasteoi   ehci_hcd:usb2, snd_hda_intel
  21:          0          4   IO-APIC-fasteoi
  22:         20       5644   IO-APIC-fasteoi   ahci
 NMI:          7         33   Non-maskable interrupts
 LOC:      40913      54300   Local timer interrupts
 SPU:          0          0   Spurious interrupts
 PMI:          7         33   Performance monitoring interrupts
 IWI:          0          0   IRQ work interrupts
 RTR:          0          0   APIC ICR read retries
 RES:     274100     204536   Rescheduling interrupts
 CAL:        288        235   Function call interrupts
 TLB:       5539       5572   TLB shootdowns
 TRM:          0          0   Thermal event interrupts
 THR:          0          0   Threshold APIC interrupts
 MCE:          0          0   Machine check exceptions
 MCP:          6          6   Machine check polls
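A single listing like this only shows totals since boot. One way to narrow things down might be to take a snapshot before and after a test that provokes a spike and look at what changed on each CPU. The sketch below does that for text in the /proc/interrupts format; the function names are hypothetical. Note that SMIs are handled by the firmware and never show up in this file, so a clean diff would not rule them out.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts text into {label: [count_cpu0, count_cpu1, ...]}."""
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())          # header row: CPU0 CPU1 ...
    counts = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        nums = []
        # Some rows carry fewer counters than there are CPUs; stop at the
        # first field that is not a number.
        for field in rest.split()[:ncpus]:
            if not field.isdigit():
                break
            nums.append(int(field))
        counts[label.strip()] = nums
    return counts


def diff_interrupts(before, after):
    """Return per-CPU increases between two parsed snapshots, skipping idle rows."""
    delta = {}
    for label, new in after.items():
        old = before.get(label, [0] * len(new))
        change = [n - o for n, o in zip(new, old)]
        if any(change):
            delta[label] = change
    return delta
```

Typical use would be: read the file, run the disk test, read the file again, then print `diff_interrupts(parse_interrupts(first), parse_interrupts(second))`. Rows with large increases on the core that runs the realtime threads are the first candidates to move or disable.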
Another set of experiments points somewhat towards the VGA controller.
I am using an older ATI card (ATI RV620 LE [Radeon HD 3450]), and Debian Wheezy, like Ubuntu 12.04, loads the "radeon" driver.
I remember vaguely from other experiments (years ago on 10.04/LinuxCNC 2.4) that in some computers switching to fglrx did improve latency a lot, in other setups the fglrx driver ruined latency completely.
I am still fishing in the dark here.
Actually, I feel like looking for the fishing pole in a pitch-black room.
I have skimmed over/read lots of RTAI docs in the past few days, checked all the "this might help"-style recipes in the LinuxCNC Wiki (including the docs pointed to by ArcEye) again and still have no conclusive picture.
What really puzzles me is the fact that this exact machine with exactly these settings in the BIOS (APM off, Cool'n'Quiet off etc.) did work just fine with latency peaking below 50ms and now shows values above 200ms.
Where would I look for the cause?
merci
hase
09 Aug 2015 22:58 #61243
by BigJohnT
Replied by BigJohnT on topic Systematic approach to tracking down latency issue
05 Mar 2019 19:33 #127814
by RichJordan
Replied by RichJordan on topic Systematic approach to tracking down latency issue
Hi Hase,
I know this thread stopped a few years ago. Did you ever solve this issue? I have just realised I have a similar problem: good latency but with unbearable spikes. Mostly, spikes seem to coincide with actions such as disk access, and repeating the action doesn’t repeat the spike. I’ve tried 2 graphics cards, with ATI better than NVidia but clearly the cause of spikes.
Anyway, I’d love to know if you had any joy here.
Cheers
Richard
05 Mar 2019 19:36 #127815
by RichJordan
Replied by RichJordan on topic Systematic approach to tracking down latency issue
* but not clearly the cause of spikes. I should add these spikes do not occur very often. Maybe once in 5 or 10 minutes.
06 Mar 2019 00:02 #127840
by hase
Replied by hase on topic Systematic approach to tracking down latency issue
I never really found a systematic way.
I gave up when I found that all the measurement programs showed not only different but contradictory results.
Also, my measurements using external hardware I hacked together - like comparing a software-generated pin-toggle in every cycle of the base thread to a frequency generator - did not help.
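For readers who want to try the external-measurement route anyway: if the pin toggles are captured with timestamps (for example by a logic analyser, which is an assumption, the post does not say how the comparison was done), the worst deviation from the nominal base-thread period can be computed as in this minimal sketch. The function name is illustrative.

```python
def period_jitter(edge_times, nominal_period):
    """Return (worst_deviation, worst_index) over successive edge intervals.

    edge_times: ascending timestamps of successive pin toggles.
    nominal_period: expected interval, in the same unit as edge_times.
    worst_index is the index of the edge that ended the worst interval,
    or None if no interval deviated at all.
    """
    worst, worst_index = 0.0, None
    for i in range(1, len(edge_times)):
        interval = edge_times[i] - edge_times[i - 1]
        deviation = abs(interval - nominal_period)
        if deviation > worst:
            worst, worst_index = deviation, i
    return worst, worst_index
```

Knowing when the worst interval occurred makes it possible to line it up with whatever else the machine was doing at that moment, which is the part latency-test alone does not give.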
I was also unable to rebuild even an official kernel - that is: one shipped with LinuxCNC.
I can confirm that all the Nvidia graphics cards I tried make the PC useless for LinuxCNC.
Stay away from anything Nvidia.
The Radeons work well for me.
One reason I was able to identify through elimination: when I swapped an older hard drive for one supporting SMART, a perfectly fine working controller became useless.
Turning off SMART support in the BIOS Setup fixed it.
Obviously, handling SMART diagnosis has a higher priority than the timer interrupt - which it should not have.
As a result of all my frustration I just switched to Mesa FPGA cards for step generation.
Their documentation was a bit confusing at first, but eventually I figured it out.
I have one controller with a PCI FPGA card and the Mesa driver/breakout card, another one with just the (cheap!) FPGA card and my own little driver/breakout board.
In both controllers the base thread is now empty except for a pin toggle on a parallel port. I feed this 50%-duty PWM to an external watchdog (a window comparator checking the analog voltage to be between 40% and 60% logic level of the parport tripping an RS-Flipflop; so if the base thread stops completely, the watchdog turns off the power to the drives and spindle).
A bit of jitter in this function is no problem (covered by the rather large window) and the FPGA makes silky-smooth steps.
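The watchdog logic can be illustrated with a little arithmetic. This is a sketch under the assumption that the toggled pin is low-pass filtered before the window comparator, so the comparator sees the duty cycle as a fraction of the logic level; the function names and defaults are illustrative, only the 40% and 60% thresholds come from the description above.

```python
def filtered_level(duty_cycle, v_high=1.0):
    """Mean voltage of an ideal PWM signal after low-pass filtering."""
    return duty_cycle * v_high


def watchdog_trips(duty_cycle, low=0.40, high=0.60):
    """True if the averaged level leaves the comparator window."""
    level = filtered_level(duty_cycle)
    return not (low <= level <= high)
```

A healthy base thread toggles the pin every cycle and gives a duty cycle near 0.5, well inside the window. A stalled thread leaves the pin stuck low or high (duty cycle 0.0 or 1.0), which falls outside the window and trips the flip-flop. Moderate jitter that shifts the average to, say, 0.45 is tolerated.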
And btw: I stocked up on Core 2 Duo based Dell Optiplexes and Fujitsus when they were still available.
They have enough CPU horsepower for LinuxCNC with Axis and perform much better latency-wise than anything more modern I tried.
The more nifty stuff the chipset has built in, the more hiccups the BIOS seems to be able to cause.
But again: sorry, I have no systematic approach, just anecdotal evidence.