Random read errors on Mesa 7i92

06 Apr 2016 18:32 #72864 by 10K
I'm running a Mesa 7i92, and I've been getting random read errors. They are of the format
hm2/hm2_7i92.0: error finishing read!  iter=######)
Where the ###### varies.

My observations on when this occurs are as follows:
  • I have run for hours with no errors
  • The errors seem to pop up during intervals when there's no activity (I'm off doing something else)
  • Sometimes I get just one message, and sometimes they occur on about 5 second intervals
  • Sometimes after I get the message, linuxcnc acts like it's working, and the DRO changes on screen. However, the lathe does not move at all. I've pinged the 7i92 when this happens to make sure the ethernet is still connected, and the 7i92 answers back
  • Sometimes I can reboot linuxcnc to fix the problem. Often I have to reboot the computer
I'm not real sure where to start looking for the root problem. Any advice? This would be a lot easier to debug if someone has had a similar problem!

I have two Ethernet ports. I'm using the internal one for linuxcnc, and a USB port to talk to my local LAN. I've thought about swapping the ports to see if that has any effect.

INI and HAL files are attached.

Attachments:
Monarch_10EE.hal (7 KB)
Monarch_10EE.ini (5 KB)

06 Apr 2016 18:51 #72866 by PCW
Sounds like a hardware problem of some kind (especially having to reboot the PC).
I would try a different PC as the first test.
When it gets into the "restarting linuxcnc doesn't fix the problem" state,
what are the symptoms?
Does power cycling the 7I92 help?

06 Apr 2016 22:09 #72875 by jepler
Right now in 2.7 and master branches, including all released 2.7.x versions, the hm2_eth driver does not recover well if a read request packet or its response is missing. So you need absolutely 0.0000000% ethernet packet loss for hm2_eth to work properly.

When hm2_eth doesn't receive a read response packet, it waits a very long time before continuing. This is long enough to cause missed realtime deadlines and quite possibly enough to cause the 7i92's watchdog to bite.

If you have a stepper machine, the watchdog bite means that the stepper enables will turn off, the step and direction outputs will turn off, but linuxcnc will still get changing feedback position from the stepgen in the mesa card---it is still merrily counting away, but the pulses are not appearing on any physical pin or connector on the mesa board.

Improving this in LinuxCNC is a goal, and I've started work on it, but right now I can't say when the work will be available for general testing through development builds, let alone in a released version of linuxcnc.

06 Apr 2016 23:24 #72877 by PCW
Graceful recovery from dropped packets is an important goal,
but it should be noted that getting _any_ packet drops is fairly unusual on a short
link in a relatively clean EMI environment. I have 2 test setups that have been running 24/7
for more than a year and these have never dropped a packet (one is running at 4 kHz, so that's a lot of packets).
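
For a sense of scale, here is a rough back-of-the-envelope sketch of how many packets a year of continuous running at 4 kHz involves. It assumes one read request and one read response per servo cycle and ignores write packets; the function name and the per-cycle count are illustrative, not taken from hm2_eth.

```python
# Rough packet count for an Ethernet link polled once per servo cycle.
# Assumption: 2 packets per cycle (read request + read response);
# write packets are not counted here.
SECONDS_PER_DAY = 24 * 60 * 60


def packets_exchanged(rate_hz: int, days: int, packets_per_cycle: int = 2) -> int:
    """Total packets on the wire after `days` of continuous running."""
    return rate_hz * SECONDS_PER_DAY * days * packets_per_cycle


one_year_at_4khz = packets_exchanged(rate_hz=4000, days=365)
print(f"{one_year_at_4khz:,} packets in one year at 4 kHz")
```

Under those assumptions the count is in the hundreds of billions, so even a single drop per day would be a very small loss rate by ordinary networking standards.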

It does sound to me like a hardware issue of some kind.

If the watchdog bit, there should be a pop-up notification of the event.

13 Apr 2016 21:27 #73257 by 10K
I've made some progress on this. First, I determined that the error is preceded by a latency error message. So I went back to testing latencies.

I found that very infrequently, I was getting extremely high latencies (>10,000,000). Ordinarily, the latencies were around 20,000. That being said, I was able to have acceptable latencies for periods of up to 36 hours. So what I was looking for was a very infrequent event.
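
To put those two figures side by side, here is a small sketch that expresses a latency as a fraction of the servo period. It assumes the latency numbers are in nanoseconds (the unit the LinuxCNC latency test reports) and a 1 ms servo thread; the actual period is whatever SERVO_PERIOD is set to in the INI file.

```python
# Express a latency reading as a number of servo periods.
# Assumptions: latencies are in nanoseconds, servo thread runs at 1 kHz.
SERVO_PERIOD_NS = 1_000_000  # assumed 1 ms; check SERVO_PERIOD in the INI


def latency_in_periods(latency_ns: int, period_ns: int = SERVO_PERIOD_NS) -> float:
    """How many servo periods a single latency spike spans."""
    return latency_ns / period_ns


for label, ns in [("typical", 20_000), ("rare spike", 10_000_000)]:
    print(f"{label}: {ns:,} ns = {latency_in_periods(ns):.2f} servo periods")
```

On those assumptions the typical figure is a small fraction of one period, while the rare spike spans several whole periods, which is why it shows up as a realtime error.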

I messed around with the two network connections. I disabled each in turn to see if that had an effect. I also swapped the connection to the Mesa from the built-in network adapter to the USB network adapter. After several days, I convinced myself that it probably was not the network.

I also tried going into the UEFI setup, and turning off everything I could. Again, no discernible effect.

I didn't have this problem in my initial testing, and the only thing different was that I was using an ELO touch screen instead of a monitor and a mouse. I had been using the default linux drivers for both. I went to ELO's website and downloaded their linux drivers for the touchscreen and installed them.

Since then, I've run LinuxCNC up to six hours unattended, and have not gotten any errors. I have my fingers crossed that this was the problem.

30 Jun 2016 14:49 #76847 by 10K
I've done some more work on this, but still have not resolved the problem.

First, I found that making the computer use only one processor eliminated the huge latencies I was experiencing:

Edit /etc/default/grub and change the line
	GRUB_CMDLINE_LINUX_DEFAULT="splash"
to
	GRUB_CMDLINE_LINUX_DEFAULT="splash nomodeset maxcpus=1"
	(I tried isolcpus=0,1 idle=poll, but limiting the kernel to one processor works better)

Then run
	sudo update-grub
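
After a reboot it is worth confirming that the new kernel parameters actually took effect. The following is a minimal sketch, assuming a Linux system that exposes /proc/cmdline; the helper names are made up for illustration, and the simple whitespace split does not handle quoted parameters that contain spaces.

```python
import os


def kernel_params(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to ''."""
    params = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        params[key] = value
    return params


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Return the running kernel's command line, or '' if it cannot be read."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return ""


if __name__ == "__main__":
    params = kernel_params(read_cmdline())
    print("maxcpus =", params.get("maxcpus", "(not set)"))
    print("nomodeset present:", "nomodeset" in params)
    print("CPUs reported by os.cpu_count():", os.cpu_count())
```

If maxcpus shows as not set, the edit to /etc/default/grub was probably not picked up by update-grub.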

I also added a line to my /etc/network/interfaces file to turn off receive interrupt coalescing, so the network adapter doesn't wait before signalling a received packet:

hardware-irq-coalesce-rx-usecs 0

After making these changes, I ceased to see huge latencies. Here's a recent graph showing latencies



This particular graph ran for about 15 minutes. I've run it and the Latency Test several times for > 12 hours, some while running GLXgears and playing MP3 files, and the maximum servo latency is 35,000 to 55,000, with most servo latencies in the 15,000 range. I would think that for RT-Preempt and the Mesa card, these are acceptable latencies.

However, I'm still getting errors after 1.5 to 6 hours of CNC use. They take the form of

unexpected real time delay on task 0. This message will only display once. iter = xxxxxxxx
hm2/hm2_7i92.0: error finishing read! iter = xxxxxxxxxx)


The iter number is generally in the millions to hundreds of millions range.
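
One way to cross-check the iter value against the observed run time is to convert it to elapsed time. This is only a sketch: it assumes that iter counts servo-thread cycles since LinuxCNC started and that the servo thread runs at 1 kHz, neither of which is confirmed in this thread.

```python
from datetime import timedelta

# Assumptions: iter counts servo cycles since startup, servo thread at 1 kHz.
SERVO_RATE_HZ = 1000


def iter_to_runtime(iterations: int, rate_hz: int = SERVO_RATE_HZ) -> timedelta:
    """Elapsed run time implied by an iteration count at a fixed servo rate."""
    return timedelta(seconds=iterations / rate_hz)


for iterations in (5_400_000, 21_600_000, 100_000_000):
    print(f"iter = {iterations:>11,} -> {iter_to_runtime(iterations)}")
```

If the reported iter does not line up with how long LinuxCNC had actually been running, one of the two assumptions is wrong for this setup.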

Once I've seen this message, the LinuxCNC display will show the X and Z axes moving, but the servos don't run. I have to exit LinuxCNC and restart it.

I've tried a great number of things to get this working. I've changed the BIOS settings. I've changed the Grub settings. I've installed new drivers for the Intel video, the WiFi, and Bluetooth. Testing is a lengthy process when an error can take many hours to show itself!

I'm running out of ideas. Has anyone else encountered long duration errors that don't show up in the Latency Test? Are there any tools to figure out what's causing the problem? Any ideas? Thanks.

30 Jun 2016 15:10 #76848 by PCW
This could be caused by a dropped packet, so it may be a hardware error on the host PC or the 7I92.
It could also be EMI related, especially if you only see errors when noise sources like a VFD are on.

Assuming it's a dropped packet problem related to noise, I would first try re-routing the Ethernet cable
away from noise sources (power and motor wires) and perhaps using shielded Cat5.

Another option is to try jepler's experimental eth-packet-loss branch of linuxcnc, which copes better with dropped packets.

Another quick way to bisect the problem is to try another PC.

02 Jul 2016 20:55 #76928 by 10K
EMI is something that I had considered before. I don't have a VFD, and I didn't think I had a source of high frequency interference, but I thought I'd look into it anyway.

I didn't know they made shielded Ethernet cables. I don't have any, so I grabbed another regular cable and made the connection. I hung the wire from the ceiling so it was not around any potential sources of EMI.

First test - 8 hours without an error.
Second test - 13 hours without an error. Went to bed for the night, and in the morning had an error.

This is very promising, and these are the longest run times I've gotten yet. I've ordered a shielded cable to try out. Is there anything special I should do around the 7i92 board itself?

Here's a picture of my control box. The Mesa card is in the lower right corner. Other components are a switching power supply for the steppers, a 24VDC power supply, a 24VDC-5VDC buck, four stepper drivers, and two breakout boards. I would think that the switching power supply is OK since it's in a separate metal enclosure. The buck is not enclosed, and maybe it could be a problem. The computer is mounted on the back of a touchscreen, and it's possible that's making some EMI.

The old ethernet cable ran with some other cables. The only thing with any AC is the signal from the spindle encoder.

02 Jul 2016 21:55 #76929 by PCW
Normally Ethernet has very good noise immunity, so I am still suspicious that something else is going on.
The shielded cable is simple, so it's worth a try though.

(Also, unfortunately, random failures are, well... random, so you might just be having a good day.)

I would still try the packet loss branch or a different PC, at least as a test.

04 Jul 2016 01:55 #76953 by tommylight
Are those BOBs the Chinese-made 5 axis ones? I would highly recommend changing them, especially if you are using the inputs on them.
Otherwise you will have to add 470 ohm pull-up resistors to all inputs, test whether whatever is connected to them can pull them down as necessary, and invert them in HAL as needed. They are very susceptible to interference, all kinds of it.
I have to agree with PCW, it sure looks like interference from something, and not just RF: any higher-powered equipment will cause magnetic fields around its supply wires, so using a shielded cable should be the first order of business.
Regards,
Tom
