Reducing latency on multicore pc's - Success!

10 Dec 2012 10:03 - 14 Dec 2012 00:39 #27492 by Nebur
Hi,

I think I should introduce myself a little, as this is my first post:
I'm a "new" emc2 user. Actually I'm fairly new to Linux too, although I played around with Debian and read a few books some years ago while setting up some home server stuff. To keep the story short: I went from Mach3 to emc2 because of the possibilities and the freedom to do things that I can't even imagine yet (I already love it!). I'm not an OS tinkerer unless there is a need (I prefer to play with things that fly, drive, move or make chips rather than with operating systems), but as the ultimate goal is making chips I decided to take the trouble...

And now to the topic:
My emc2 PC is a Core 2 Duo, 2.4 GHz. For graphics I have a bunch of ATI and Nvidia cards that all work OK and don't influence the latency.
I went a bit crazy about the latency value and played around until I reached less than 4000 ns of jitter under every load I could think of (e.g. playing two full-HD videos simultaneously, copying 100 GB of data over the network, dd'ing two USB sticks and two HDDs, running 5x glxgears).

My initial max jitter value was 30000 ns. With the goal of dedicating one core to the RT engine, I researched and found the isolcpus magic. This improved things quite a bit, but I was still above 10000 ns. A little more research on dedicating CPUs helped reduce the latency to the current value of less than 4000 ns. What I do is manipulate the IRQ affinity. As I have not found anything about this topic related to LinuxCNC, I thought it might make sense to share the resulting scripts, confs and so on...

First of all my bios settings:
CPU
- disable Vanderpool Technology (CPU virtualization stuff for VMs)
- disable C1E support
- enable TM support (temperature monitoring: it throttles the CPU if it overheats, e.g. because of a fan malfunction; that will mess things up, but it is better than a crash and a toasted chip)

GENERAL
- disable anything that suggests it could save power or automatically improve performance (overclocking)
- enable APIC (advanced programmable interrupt controller); on my machine Linux refused to see two separate cores if it was not enabled, and we need it to redirect the IRQs anyway
- disable anything suggested by the makers of LinuxCNC (although on my machine nothing else made a difference)

Linux related changes:
- edit /etc/default/grub and add the kernel options "isolcpus=1 acpi_irq_nobalance noirqbalance" (run update-grub afterwards)
- make sure that the "irqbalance" package is NOT installed; remove it if it is there (Ubuntu Software Center -> installed software -> search for irqbalance -> remove)
- add the upstart script "irq-affinity.conf" to /etc/init (see attachment; it moves the IRQ handling to the first core)
- add the sh scripts set-irq-affinity and watchirqs to /usr/local/sbin (the first lets you set the affinity mask manually; the second opens a console window that shows live how the IRQs are distributed across the cores -> all numeric IRQs except 0 should be handled by cpu0)

File Attachment: irqstuff.zip (1 KB)

Well, that's it - hope it helps somebody!

Cheers,
Ruben

PS:
Could somebody explain the lut5 function in detail? I used it to configure my combined homing/limit switches for X and Y by modifying the generated HAL file. The signal routing is clear, but I really don't understand the function itself, even after reading its description more than 10 times. What I understand is that it has 5 inputs, one output and a 32-bit mask that sets the desired in->out behaviour. I would be grateful for some further explanation of that mask. (I'm a software guy, so feel free to keep it technical.)
Last edit: 14 Dec 2012 00:39 by Nebur. Reason: added attachment

10 Dec 2012 12:38 #27493 by PCW
The LUT 5 function probably confused you because it is too simple :-)

The output bit is selected from the mask by the 5 input bits. That is, the 5 input bits form an address into the bits of the mask, and the output is simply the mask value at that bit address. So an input value of 00000b returns mask bit 0, an input of 00001b returns mask bit 1, and so on.
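
A minimal Python sketch of this lookup may make it concrete. It is only a model of the behaviour described in this post, not the HAL component's source; the name `lut5` for the Python function and the argument names `in0`..`in4` are chosen for illustration, with `in0` taken as the least significant address bit (which matches the examples later in the thread).

```python
def lut5(function, in0=False, in1=False, in2=False, in3=False, in4=False):
    """Model of a 5-input look-up table.

    `function` is the 32-bit mask. The five inputs are packed into a
    5-bit address (in0 is the least significant bit), and the result is
    the mask bit found at that address.
    """
    address = (
        int(in0)
        | (int(in1) << 1)
        | (int(in2) << 2)
        | (int(in3) << 3)
        | (int(in4) << 4)
    )
    return bool((function >> address) & 1)
```

With a mask of 0x18 (bits 3 and 4 set), `lut5(0x18, in0=True, in1=True)` addresses bit 3 and `lut5(0x18, in2=True)` addresses bit 4, so both are true, while any other input combination addresses a cleared bit.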

10 Dec 2012 17:15 #27498 by ArcEye
Hi

Sounds interesting, look forward to examining the scripts etc.

[How are attachments added? Thought it would be as intuitive as it seems but it does not work.]


It will only accept certain extensions and sizes. Easiest is normally to roll everything up into a .zip, will accept up to 500K in this format.

You don't say so, but I assume you are using the Ubuntu 10.04 based install?

regards

10 Dec 2012 22:27 #27518 by BigJohnT

PCW wrote:
> The LUT 5 function probably confused you because it is too simple :-)
>
> The output bit is selected from the mask by the 5 input bits. That is, the 5 input bits form an address into the bits of the mask, and the output is simply the mask value at that bit address. So an input value of 00000b returns mask bit 0, an input of 00001b returns mask bit 1, and so on.

Confuses me too... so you can have either an and or an xor function, and the xor is obtained by summing the weights of bits. So for example, if I set the weight to 0x18, then if bit 2 is the only one on (0x10) the output is on, and if bit 0 and bit 1 are the only ones on (0x8) then the output is on. This gives me a combined xor plus and function. Getting clearer now. Dang, that man page needs some fixing up.

John

10 Dec 2012 23:21 - 10 Dec 2012 23:45 #27520 by PCW
This is hexadecimal, so a weight of 0x10 represents bit 4 and 0x18 is bit 3 and bit 4,
so 0x18 would generate a function that is true only when the inputs
= 3 or 4, that is, inputs = 00011b or 00100b (little b means binary)

The 'weights' are a little confusing, as they are just a way to easily
specify the 32 by one bit look-up table data as a single 32-bit data word,
so a weight of 1 sets bit 0 in the table
a weight of 2 sets bit 1
a weight of 4 sets bit 2
...
a weight of 0x80000000 sets bit 31 (the last bit)


If you think of the inputs as an address to the bits (0 through 31) in the function
it should be clearer. So basically you have a look up table that instead of using an
array for the table data, uses a 32 bit data word where the inputs are an index into the
individual bits of the data word.
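
The weight arithmetic above can be sketched in a few lines of Python. The helper names `function_from_codes` and `codes_from_function` are invented for this illustration; an "input code" here means the 5 inputs read as a number from 0 to 31.

```python
def function_from_codes(true_codes):
    """Build the 32-bit function word from the input codes (0..31)
    for which the output should be 1: each code sets the bit with
    that index, i.e. adds a weight of 2**code."""
    value = 0
    for code in true_codes:
        if not 0 <= code <= 31:
            raise ValueError("input code must be in 0..31")
        value |= 1 << code
    return value


def codes_from_function(function):
    """Inverse direction: list the input codes whose table bit is set."""
    return [code for code in range(32) if (function >> code) & 1]
```

For the example in this post, `hex(function_from_codes([3, 4]))` gives `'0x18'` and `codes_from_function(0x18)` gives `[3, 4]`.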

BTW I would really like to try the latency tweaks, and thanks to ArcEye for explaining the
attachment issues. I tried and failed with attachments before, and it is good to know the zip trick
Last edit: 10 Dec 2012 23:45 by PCW.

11 Dec 2012 01:03 - 11 Dec 2012 01:04 #27523 by BigJohnT
I'm sure I don't understand what you're saying...

Looking at the truth table in the man page

0x8 is bit 0 and bit 1
0x10 is xor bit 2

Testing this in hal and it does work as an and with bit 0 and 1 and as an xor with bit 2, any other combinations and the output is false.

John
Last edit: 11 Dec 2012 01:04 by BigJohnT.

11 Dec 2012 01:10 #27524 by PCW
Right
bit 0 and 1 is 3, so 0 _and_ 1 result in a 1 output,
as does bit 2 (4), so only input codes of 3 (00011b) and 4 (00100b) generate a '1' output
all other codes generate a 0

12 Dec 2012 02:07 #27596 by awallin
Hi Spainman,
I would be interested in trying your real-time tuning tips.
Can you try to post your attachment again? or maybe use pastebin or make a wiki page about it.

I assume you were trying this with Ubuntu 10.04lts and the RTAI kernel?

Anders

13 Dec 2012 20:45 #27664 by ArcEye

awallin wrote:
> Hi Spainman,
> I would be interested in trying your real-time tuning tips.
> Can you try to post your attachment again? or maybe use pastebin or make a wiki page about it.
>
> I assume you were trying this with Ubuntu 10.04lts and the RTAI kernel?
>
> Anders

Ditto again,

Having read up on the subject I can see what you are doing and probably replicate it.

What confuses me is the use of isolcpus as well.
If the kernel option CONFIG_HOTPLUG_CPU is set and allows the 'unmounting' of CPUs, with isolcpus (1,2,3) on a quad core for instance, where else but core 0 can the interrupts be sent?

Are you saying that even with the other cores isolated, irq_balance and APIC are still using the other cores for interrupts?

I shall have to have a play with it.
I can show that for instance using echo "1" > /proc/irq/NN/smp_affinity will move most of the interrupts on IRQ NN to CPU 0.
Mapping all my IRQs by this method to CPU 0, which I assume is what your script does at start up, showed a corresponding reduction in use of the other CPUs (albeit not a complete cessation in use)
(This is without isolcpus on the bootline)
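
One way to check how interrupts are actually spread across the cores is to total the per-CPU counters in /proc/interrupts before and after a change. The following Python sketch does that for the numeric IRQ rows only; the function name `per_cpu_irq_counts` is invented here, and it assumes the usual layout of that file (a header row of CPU names, then one row per IRQ with one counter per CPU).

```python
def per_cpu_irq_counts(text):
    """Sum the counters of all numeric IRQ rows per CPU.

    `text` is the content of /proc/interrupts. Rows such as NMI or LOC
    are skipped, because only the numbered IRQs can be steered through
    smp_affinity.
    """
    lines = text.strip("\n").splitlines()
    cpus = lines[0].split()  # e.g. ["CPU0", "CPU1"]
    totals = dict.fromkeys(cpus, 0)
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        if not label.strip().isdigit():
            continue
        # The first len(cpus) fields after the colon are the counters.
        for cpu, field in zip(cpus, rest.split()):
            totals[cpu] += int(field)
    return totals
```

Reading the file twice a few seconds apart, for example with `per_cpu_irq_counts(open("/proc/interrupts").read())`, and comparing the totals shows which cores are still servicing interrupts.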

I have various development kernels on this machine, so can try a few different things when time allows, but would be nice to get a bit more detail on your findings that led you here

regards

14 Dec 2012 00:54 #27692 by Nebur
Sorry... I was away for a few days and couldn't respond. The initial post is updated now and contains a zip.
Regarding the questions:
- I've installed the machine from the latest LinuxCNC ISO, so yes, it's Ubuntu 10.04 LTS.
- isolcpus alone definitely does not prevent IRQ handling on the isolated core (at least on my machine)
- echoing into the smp_affinity files is what I do; it did not work without the kernel options I mentioned in my first post

The lut5 function does not seem to be that simple ;-D as the discussion shows. I will have to read the posts in detail as soon as I have a little more spare time. Thanks for the information anyway - I'm sure it will help me figure out what it does exactly.

Cheers,
Ruben
