Random read errors on Mesa 7i92
- 10K
- Topic Author
- Offline
- Premium Member
- Posts: 132
- Thank you received: 31
I'm getting random read errors of the form:
hm2/hm2_7i92.0: error finishing read! iter=######
My observations on when this occurs are as follows:
- I have run for hours with no errors
- The errors seem to pop up during intervals when there's no activity (I'm off doing something else)
- Sometimes I get just one message, and sometimes they occur on about 5 second intervals
- Sometimes after I get the message, linuxcnc acts like it's working, and the DRO changes on screen. However, the lathe does not move at all. I've pinged the 7i92 when this happens to make sure the ethernet is still connected, and the 7i92 answers back
- Sometimes I can reboot linuxcnc to fix the problem. Often I have to reboot the computer
I'm not real sure where to start looking for the root problem. Any advice? This would be a lot easier to debug if someone has had a similar problem!
I have two ethernet ports. I'm using the internal one for linuxcnc, and a usb port to talk to my local LAN. I've thought about swapping the ports to see if that has any effect.
INI and HAL files are attached.
- PCW
- Online
- Moderator
- Posts: 17900
- Thank you received: 4783
I would try a different PC as the first test
When it gets into the "restarting linuxcnc doesn't fix the problem" state, what are the symptoms?
Does power cycling the 7I92 help?
- jepler
- Offline
- Administrator
- Posts: 74
- Thank you received: 39
When hm2_eth doesn't receive a read response packet, it waits a very long time before continuing. This is long enough to cause missed realtime deadlines and quite possibly enough to cause the 7i92's watchdog to bite.
If you have a stepper machine, the watchdog bite means that the stepper enables will turn off and the step and direction outputs will turn off, but linuxcnc will still get changing feedback position from the stepgen in the mesa card. The stepgen is still merrily counting away, but the pulses are not appearing on any physical pin or connector on the mesa board.
Improving this in LinuxCNC is a goal, and I've started work on it, but right now I can't say when the work will be available for general testing through development builds, let alone in a released version of linuxcnc.
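To make the timing argument above concrete, here is a minimal sketch of it as a toy model. The 1 ms servo period, the 5 ms watchdog timeout, and the 80 ms stall are illustrative assumptions, not values read from this configuration or from the hm2_eth source, and the function name is hypothetical.

```python
def watchdog_bites(stall_ns: int,
                   servo_period_ns: int = 1_000_000,       # assumed 1 ms servo thread
                   watchdog_timeout_ns: int = 5_000_000    # assumed 5 ms watchdog timeout
                   ) -> bool:
    """Return True if one stalled read leaves the watchdog unpetted for too long.

    Toy model: the watchdog is petted once per servo cycle, so a read that
    stalls for stall_ns stretches the gap between two pets to roughly one
    normal period plus the stall.
    """
    gap_ns = servo_period_ns + stall_ns
    return gap_ns > watchdog_timeout_ns


# A hypothetical 80 ms wait for a lost reply is far longer than the timeout.
print(watchdog_bites(stall_ns=80_000_000))
# With no stall, the normal 1 ms gap stays well inside the timeout.
print(watchdog_bites(stall_ns=0))
```

Under these assumptions, any stall longer than about 4 ms is enough for the watchdog to bite, which would match the "DRO moves but the machine doesn't" symptom.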
- PCW
It should be noted that getting _any_ packet drops is fairly unusual on a short link and in a relatively clean EMI environment. I have 2 test setups that have been running 24/7 for more than a year, and these have never dropped a packet (one is running at 4 kHz, so that's a lot of packets).
It does sound to me like a hardware issue of some kind.
If the watchdog bit, there should be a pop-up notification of this event.
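For a sense of scale on "a lot of packets", here is a quick back-of-the-envelope calculation. It assumes the 4 kHz figure is the servo thread rate and counts servo cycles only; each cycle involves at least one request/response exchange with the card, so the real packet count would be higher. The function name is just for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def servo_cycles(rate_hz: int, days: int) -> int:
    """Number of servo cycles in `days` of continuous running at `rate_hz`."""
    return rate_hz * SECONDS_PER_DAY * days


# One year of 24/7 operation at 4 kHz.
print(f"{servo_cycles(4000, 365):,}")  # 126,144,000,000
```

So a setup like that goes through on the order of 10^11 cycles per year without a single loss, which is why even occasional drops point toward hardware or environment.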
- 10K
I found that very infrequently, I was getting extremely high latencies (>10,000,000). Ordinarily, the latencies were around 20,000. That being said, I was able to have acceptable latencies for periods of up to 36 hours. So what I was looking for was a very infrequent event.
I messed around with the two network connections. I disabled each in turn to see if that had an effect. I also swapped the connection to the Mesa from the built-in network adapter to the USB network adapter. After several days, I convinced myself that it probably was not the network.
I also tried going into the UEFI setup, and turning off everything I could. Again, no discernible effect.
I didn't have this problem in my initial testing, and the only thing different was that I was using an ELO touch screen instead of a monitor and a mouse. I had been using the default linux drivers for both. I went to ELO's website and downloaded their linux drivers for the touchscreen and installed them.
Since then, I've run LinuxCNC up to six hours unattended, and have not gotten any errors. I have my fingers crossed that this was the problem.
- 10K
First, I found that making the computer use only one processor eliminated the huge latencies I was experiencing:
edit /etc/default/grub
change line
GRUB_CMDLINE_LINUX_DEFAULT="splash"
to
GRUB_CMDLINE_LINUX_DEFAULT="splash nomodeset maxcpus=1"
(tried isolcpus=0,1 idle=poll, but turning off processor works better)
Then
sudo update-grub
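If you'd rather not hand-edit the file, the edit above can be sketched as a small text transformation. This is only an illustration: the helper name is made up, it works on the file contents as a string rather than touching /etc/default/grub itself, and it assumes the value is enclosed in double quotes as shown above.

```python
import re


def add_kernel_params(grub_text: str, params: list[str]) -> str:
    """Append any missing params to GRUB_CMDLINE_LINUX_DEFAULT in grub_text."""
    pattern = re.compile(
        r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")', re.MULTILINE
    )

    def repl(m: re.Match) -> str:
        current = m.group(2).split()
        # Keep existing parameters and add only the ones not already present.
        merged = current + [p for p in params if p not in current]
        return m.group(1) + " ".join(merged) + m.group(3)

    return pattern.sub(repl, grub_text)


sample = 'GRUB_TIMEOUT=5\nGRUB_CMDLINE_LINUX_DEFAULT="splash"\n'
print(add_kernel_params(sample, ["nomodeset", "maxcpus=1"]))
```

Running it twice gives the same result, so it won't pile up duplicate parameters. You would still need to write the result back as root and run sudo update-grub afterwards.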
I also made a change to my /etc/network/interfaces file to turn off receive interrupt coalescing, which reduces the time the network adapter waits before reporting a received packet:
hardware-irq-coalesce-rx-usecs 0
After making these changes, I ceased to see huge latencies. Here's a recent graph showing latencies
This particular graph ran for about 15 minutes. I've run it and the Latency Test several times for > 12 hours, some while running GLXgears and playing MP3 files, and the maximum servo latency is 35,000 to 55,000, with most servo latencies in the 15,000 range. I would think that for RT-Preempt and the Mesa card, these are acceptable latencies.
However, I'm still getting errors after 1.5 to 6 hours of CNC use. They take the form of
unexpected real time delay on task 0. This message will only display once. iter = xxxxxxxx
hm2/hm2_7i92.0: error finishing read! iter = xxxxxxxxxx
The iter number is generally in the millions to hundreds of millions range.
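To relate an iter value to wall-clock run time, a rough conversion like the one below may help. It rests on two assumptions that are not confirmed in this thread: that iter counts servo-thread cycles, and that the servo period is 1 ms. The function name is hypothetical.

```python
from datetime import timedelta


def iter_to_elapsed(iter_count: int, servo_period_ns: int = 1_000_000) -> timedelta:
    """Convert an iteration count to elapsed time, assuming one iteration per servo period."""
    return timedelta(microseconds=iter_count * servo_period_ns // 1000)


print(iter_to_elapsed(5_400_000))   # 1:30:00
print(iter_to_elapsed(21_600_000))  # 6:00:00
```

If the servo thread runs at a different period, pass it in as servo_period_ns.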
Once I've seen this message, the LinuxCNC display will show the X and Z axes moving, but the servos don't run. I have to exit LinuxCNC and restart it.
I've tried a great number of things to get this working. I've changed the BIOS settings. I've changed the Grub settings. I've installed new drivers for the Intel video, the WiFi, and Bluetooth. Testing is a lengthy process when an error can take many hours to show itself!
I'm running out of ideas. Has anyone else encountered long duration errors that don't show up in the Latency Test? Are there any tools to figure out what's causing the problem? Any ideas? Thanks.
- PCW
It could also be EMI related, especially if you only see errors when noise sources like a VFD are on.
Assuming it's a dropped packet problem related to noise, I would first try re-routing the Ethernet cable away from noise sources (power and motor wires) and perhaps using shielded Cat5.
Another option is to try jepler's experimental eth-packet-loss branch of linuxcnc, which copes better with dropped packets.
Another quick way to bisect the problem is to try another PC.
- 10K
I didn't know they made shielded Ethernet cables. I don't have any, so I grabbed another regular cable and made the connection. I hung the wire from the ceiling so it was not around any potential sources of EMI.
First test - 8 hours without an error.
Second test - 13 hours without an error. Went to bed for the night, and in the morning had an error.
This is very promising, and these are the longest run times I've gotten yet. I've ordered a shielded cable to try out. Is there anything special I should do around the 7i92 board itself?
Here's a picture of my control box. The Mesa card is in the lower right corner. Other components are a switching power supply for the steppers, a 24VDC power supply, a 24VDC-5VDC buck, four stepper drivers, and two breakout boards. I would think that the switching power supply is OK since it's in a separate metal enclosure. The buck is not enclosed, and maybe it could be a problem. The computer is mounted on the back of a touchscreen, and it's possible that's making some EMI.
The old ethernet cable ran with some other cables. The only thing with any AC is the signal from the spindle encoder.
- PCW
The shielded cable is simple, so it's worth a try though.
(Also, unfortunately, random failures are well... random, so you might just be having a good day.)
I would still try the packet loss branch or a different PC, at least as a test.
- tommylight
- Away
- Moderator
- Posts: 19430
- Thank you received: 6517
Or you will have to add 470 ohm pull-up resistors to all inputs, test that whatever is connected to them can still pull them down as necessary, and invert them in HAL as needed. They are very susceptible to interference, all kinds of it.
I have to agree with PCW: it sure looks like interference from something, and not just RF. Any higher-power equipment will cause magnetic fields around its supply wires, so using a shielded cable should be the first order of business.
Regards,
Tom