Espruino Wifi Socket recovery #4686
Replies: 1 comment
-
Posted at 2018-09-29 by @allObjects I'm not the socket specialist, but I see some discrepancy between the code pattern as posted and as found on Espruino internet sockets documentation page. From your code I read that it is a client... and the events are on the client and not the socket directly, and the client is the result of a Posted at 2018-09-29 by SteveHayles Thanks @allObjects. I have rewritten the socket code many times to try and make this work but the result is always similar. I have refactored things now to be more inline with the guide (although the API reference shows a similar pattern to the one above). Here is the exact code of my running application
The EventBus is a simple implementation that queues messages in a simple pub / sub pattern. The EventBus.subscribe('gpsData',... code is simply pulling data out of a queue from the Serial Port. The server is a separate application that I cannot alter but I have stress tested it and it will handle hundreds of concurrent connections without difficulty. Debugging the state of the client when the 'end' event is called it looks like
and then 5 seconds later in the reconnection it appears to be properly initialized
but will immediately be closed again and won't sent data. Ignoring the 'end' and leaving the code still running will report (obviously!)"function called from system Uncaught Error: This socket is closed" I agree that blind socket reconnects don't always make sense but the link is still up the server is still running and the code doesn't call for a disconnect. Happy to do any amount of debugging but very keen to make this work. Thanks Posted at 2018-09-29 by Wilberforce Push out the 5 secs - make it say 30 or 60 seconds. I would guess that you are running out of sockets. You could tell by, on connect, print out the socket number. These should get reused - but I think you are re- connecting so quickly that they don't get a chance to get reused. Posted at 2018-09-29 by SteveHayles Hi, I will extend the timeout (I am fairly sure I have tried this already). I have also checked the socket number and it's the same on both occasions (when initially opening the connection and subsequently on the reconnect). I would have thought the library would have returned an error in the event of being starved of socket resources ? My feeling is that the socket 'end' event is erroneous in some way; My next attempt will be to build a small desktop 'proxy' so that I have more control over the other end of the link. I am also trying to capture some Wireshark logs but they are never easy to debug when the event is intermittent and hard to force. Thanks for your thoughts Posted at 2018-10-01 by SteveHayles Hi, I should have consulted my notes from a year ago, this problem is linked with an issues I raised here previously (http://forum.espruino.com/conversations/304601/#comment13692804 ) about sockets being closed erroneously during a 'send'. I did some work on this and had comms with @gfwilliams but then had to move on to other projects. The problem will occur very quickly if you open one socket and start sending data in a loop whilst you receive data on another socket. The EspruinoWifi code for sending via the AT commands of the ESP8266 has a final else statement that catches anything other than "OK", "SEND OK" , "Recv" or "Busy s..." and in that event nulls that socket.
It all looks OK as the IPDHandler should deal with anything else BUT the reality is that 'd'(the data provided in the at.cmd callback) often contains incoming data meant for the other socket. It looks like data is being 'lost' within the Espruino stack rather than being lost packets from the ESP8266. In my case the incoming and outgoing data will use checksums so some corruption is OK and to simply keep the link up and running I have amended the code as follows
This is a pretty ugly hack but it proves the issue and I have had the link running all night when it previously wouldn't last a few minutes. I tried passing the data back to the ipdHandler (commented line above) but it doesn't 'repair' the incoming data. One important observation is that the erroneous buffer always contains the 'end' of an incoming 'line' meant for the other socket. It looks very much like we are missing the start of the IPD sentence coming from the ESP2866. I will try and debug at the AT module level but hopefully it gives @gordon something to go on. Posted at 2018-10-01 by @gfwilliams Hi, I've run out of time for today, but I'll take a look at this tomorrow. Posted at 2018-10-02 by @gfwilliams Thanks for all your work looking into this - that helps a great deal! I just came up with a test for this where the WiFi opens two sockets to a server - one of them blasts out data, and then the WiFi tries to write to the other. I can get it to fail within a few seconds, and I'm working on a fix now. Posted at 2018-10-02 by SteveHayles Hi, I have spent more time on it and there is definitely some issues around the parsing inside the It does feel like the parsing is slightly fragile although the constraints of the environment and the language don't make it easy. Thanks for looking at it, I think it might be the cause of a few issues for others at times so well worthwhile. Thanks, Steve Posted at 2018-10-02 by @gfwilliams Ok, should now be fixed on cutting edge builds. It seems there were two issues:
edit: There is still some strangeness (which has been a longstanding thing with Espruino) where if you write a bunch of stuff to USB and you write a lot of stuff to Serial, occasionally you end up with a duplicated character. It means that if you turn on Posted at 2018-10-02 by SteveHayles OK, sounds great. Haven't loaded a cutting edge build, exactly which firmware is used on the Espruino Wifi ?? Thanks Steve Posted at 2018-10-02 by SteveHayles Hi, Looking at this fix (haven't tested it yet) it looks very simple but doesn't feel quite right. Are you sure it will handle the more standard situation with the "> " arriving in the next data after ("OK"). Maybe there is something in the underlying native code that needs updating as well ? Do I need a new firmware to go along with the changes in AT.js and EspruinoWifi.js ? Thanks, Steve Posted at 2018-10-02 by @gfwilliams You'd normally get the builds from http://www.espruino.com/binaries/travis/master/, and it's the You shouldn't need a new firmware... I think the handling is ok - it does Posted at 2018-10-02 by SteveHayles Hi @gfwilliams, Sorry that this is dragging on, for me adding the new files (AT.js and EspruinoWifi.js) results in not being able to do simple sends. No stack trace obviously but it looks like the At this stage (after the lineCallback and before I think this leaves the ESP8266 module waiting for data and therefore not sending anymore and we have effectively discarded the incoming "> ". If I add a check in the end of the loop as follows then it works but obviously this has no place in the AT module
I am trying to work out what I have missed but the files are straight cut and paste from your recent commits. Thanks for your ongoing help, this module is in a very important part of a project I am working on and we are very keen to get this working. Steve Posted at 2018-10-02 by @gfwilliams Hi Steve, Ok sorry about this - I assumed that the ESP8266 was actually just sending Give me a half hour or so and I'll get a new build. On new builds, AT and EspruinoWiFi libs are actually built in, so you shouldn't have to worry about updating them. A new firmware should have them included. Posted at 2018-10-02 by @gfwilliams If you try a build now, it should be fixed. Posted at 2018-10-02 by SteveHayles Awesome ! The simple fixes are the best and I really appreciate your support. Not wanting to to drag this out but I am sure you want to make this 'networking' as robust as possible. I see at least one 'edge' case that would result in a tight infinite loop right now (which I assume locks pretty hard on this platform). It's an easy fix I think and I am happy to do it my end but probably best you maintain your own code base. The handlers loop in the Probably just a simple precondition vs result check with a break would solve it and if the handler didn't consume anything then it must be OK to continue ? Thanks again, Steve Posted at 2018-10-03 by @gfwilliams Thanks - great spot! You could still ctrl-c out of that with Espruino, but it's definitely not good news! Just fixed :) Posted at 2018-10-03 by SteveHayles @gfwilliams, same fix as me, thanks for the continued support. This has been a massive step forward and the board has been running all night with one socket broadcasting UDP data from the serial port and another one receiving an incoming TCP stream. I think this will help a number of other libraries (MQTT etc) to appear way more stable. Nice work ! The only other small one I had to catch was quite subtle so I will report it. Occasionally the AT module only receives single bytes at a time and at the end of one of these periods there is sometimes a character missing. A typical sequence looks like this "\r\nRecv 31 bytes\r\n\r\nSEND OK\r\n" and everything is fine but sometimes the last \n is missing and the sequence looks like "\r\nRecv 31 bytes\r\n\r\nSEND OK\r\n" The logic of the parser then breaks as everything is split on \r\n. My (hacky) solution was to check for the situation within the AT module as follows, but it seems to be hiding an underlying issue.
I have seen other situations with a UART that will miss just the last character when receiving data. I have tried running the Wifi serial at everything from 115200 all the way through to 921600 with a similar result. Posting more for your information than anything else as I can fix it in my implementation. Regards, Steve Posted at 2018-10-03 by @gfwilliams Hi Steve, thanks! Any chance you could put a USB-TTL UART in there on the RX line so we can see if that missing Annoyingly I had another bug recently reported because the AT module used just to split on Unfortunately I think your fix may well fail in the case that you get socket data that contains just a In the current builds, presumably that error just causes a socket timeout which resets your connection? It's not then a big deal to trap and reconnect - which you'd have to do because WiFi could be flaky? Best solution could be to revert that AT command lib change and to just tweak the offending library to use a 'handler' rather than a 'linehandler' Posted at 2018-10-03 by SteveHayles Hi, I will dig out a USB-TTL converter and try my other UART code but I am I right in thinking that there is no exposed Rx line to the ESP8266 on an Espruino Wifi board ? The code wasn't really a proposed fix, I think it helps explain and debug the problem more than anything. I wouldn't want you to bounce around with the AT handler, this is a genuine exception created elsewhere. For sure I can catch the error and close / reopen the socket and it's not stopping me making progress. One interesting observation is that I have written a small conditional debug log for when this happens and in several hours of running I have had around 100 of these situations and every one of them has happened during the OK response to an AT+CIPSEND. Don't spend too long on it ! Regards, Steve Posted at 2018-10-04 by @gfwilliams
There isn't but the pin spacing on the ESP8266 isn't too bad - in fact there are little holes in the pads IIRC, so if you flip the board over you can just poke a bit of solid core wire in there.
Wow, that's a lot. Something I'd definitely want to get fixed. The fact it always happens in the same place makes me think it could be ESP8266 - in which case reverting the changes to the AT handler really might be the best bet |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
Posted at 2018-09-28 by SteveHayles
Hi,
Using an Espruino Wifi with the 'EspruinoWifi' module running Firmware 1.99.
The question may expand into other issues but directly my question is around an un-recoverable socket disconnection. I am using the 'Net' module and connecting to a simple Tcp Server. All works well until the socket appears to be closed (either from native code or the module itself but NOT from my code)
This might be a transient socket disconnection for whatever reason but the main problem is that I can't reconnect. Subscribing to the socket.on('close') event works and I have tried a simple reconnect via a timeout function. This function runs and the callback passes an empty 'err' object and a complete socket as it does first time round but no data is received and the 'close' event is triggered again instantly.
My code looks like this
In summary, it works great until the socket closed event gets called. The 2nd connect runs with no error and return a socket object but no data is received and the close event fires immediately.
Any thoughts on how to recover from the socket disconnection or maybe I can do something to prevent it. Thanks
Beta Was this translation helpful? Give feedback.
All reactions