Need network troubleshooting help #6666
Replies: 1 comment
-
Posted at 2015-12-16 by tve Couldn't get it into one post: the output of the trace(req) call is:
What looks a little suspicious to me is the
Posted at 2015-12-16 by tve Mhh, I'm now seeing that Posted at 2015-12-16 by tve Running this some more, I hit:
It does smell like some JSvars corruption... If you have any hints on what to look for, that might help me a lot tracking it down! Mhh, I have the sneaky feeling that the answer is going to be "you must be unlocking something that is still used afterwards"... If that's the case, how to best troubleshoot it? Posted at 2015-12-16 by @gfwilliams
Yes, when dumping, so as not to get in infinite loops, it only prints things once - so if it printed it above it'd add Sounds like an unlock as you say - IIRC the builtins are checked for by looking at The fact it's actually all vaguely intact (vars pointing to other vars) looks like it's not actual corruption. It is probably just accidentally freeing something. One thing that might help is to run Do you have any kind of debugger? Sometimes when this kind of thing happens (if it's repeatable) I figure out what the variable number was, and then add a conditional breakpoint on The other option is @Kolban had spent some time making a windows-based test harness for the ESP8266 code. I'm not sure if it's working, but it is in the repo. If you can run that locally you might be able to reproduce the issue on a PC, and then you can debug it better? Also, you can try editing Posted at 2015-12-17 by tve Gordon, could you tell me where 'key' gets locked in https://github.com/espruino/Espruino/blob/master/libs/network/jswrap_net.c#L169-L172
Posted at 2015-12-17 by tve I added some printf for each Alloc and each Free. Starting from the beginning, this is the original JS code:
Here is what happens when it crashes (always on the 255th http request!):
Notice how the proto links to #373 and that's occupied by on_connect. Posted at 2015-12-17 by tve Here's the debug output:
So #373 gets freed in clientRequestNew and the "1", "2", ... markers are from printf's that divide up the function:
So the proto somehow gets freed in Posted at 2015-12-17 by tve Mhh, to me it looks like connection objects just don't get freed. After running one HTTP request I see the following if I do a trace():
I thought the connection gets removed from the HttpCC array here: https://github.com/espruino/Espruino/blob/master/libs/network/socketserver.c#L548 and I verified with a printf that this indeed gets executed. I must not be understanding something correctly... Posted at 2015-12-17 by @gfwilliams
It's here? https://github.com/espruino/Espruino/blob/master/libs/network/jswrap_net.c#L161 When you create something new, there's one lock on it already. The lock count is effectively the amount of pointers there are floating around.
That's very interesting... If you compare the Could you add an https://github.com/espruino/Espruino/blob/master/src/jsvar.c#L566 .. so checking that adding 1 didn't cause the ref count to overflow. Generally I haven't bothered checking for overflowed refs because the reffing only gets done internally and doesn't usually go wrong, but it could be that in this case something has gone wrong.
It could be there's a circular reference - for instance the scope of a It'd mean it wasn't automatically freed, but a mark/sweep garbage collection pass (which happens when Espruino is idle or runs out of memory during execution) would realise and then free it. Posted at 2015-12-17 by @gfwilliams Strange about I guess it's possible that on normal Espruino boards this doesn't get hit (often) because the reference count is 16 bit. If that's the case then recompiling on Linux with 1000 vars (as suggested above) might let you reproduce it. I'll give it a try. Posted at 2015-12-17 by @allObjects Is automatic memory management (acquire/release) done with just reference counting? edit: no response needed... found what I looked for: mark/swep. Posted at 2015-12-17 by @gfwilliams Just reproduced it on PC :) I'll have a go at a fix - thanks for tracking this down... Posted at 2015-12-17 by @gfwilliams
No... There's reference counting to free 99% of variables right away, but then it does a mark/sweep when idle or if it can't allocate any memory. Ok, it's definitely an issue with references - see the r255 on the prototype here:
It's being referenced a lot, but not unreferenced. Looks like this could be a big issue with garbage collection, that when it frees things it isn't unreffing what they point to! I'll try and come up with some code to break it :) Posted at 2015-12-17 by @gfwilliams And yes, just broke it with:
Wow, that's a huge bug! Really surprised that's existed for so long and we never noticed :) Posted at 2015-12-17 by @gfwilliams Ok, just fixed. Thanks @tve - that's a hugely important one to have found! Sorry it was such a pain to track down. The issue was that the garbage collector would mark/sweep and would remove any vars it found that weren't referenced elsewhere. However, it assumed that once it found an object, that whole object would get removed so to do things quickly it just zeroed everything. Unfortunately if that object referenced something that shouldn't get garbage collected, the reference count in that non-gc'd object wouldn't get decremented :( In normal Espruino boards the reference count is 16 bits, and on Linux it's 32 bits, so most testing would never have hit it. Having said that, a normal Espruino board doing any HTTP request would fail after 65536 requests - so at one request a second it still would have broken after 24 hours! I'm actually kind of surprised about that because I've got an Espruino in the loft that's been doing HTTP requests every hour for months now - although I wonder whether we have enough power cuts here that it gets reset before there's a problem :) Posted at 2015-12-17 by Ollie @gfwilliams would this manifest itself just in HTTP or would anything dependent on sockets be impacted? Posted at 2015-12-17 by @gfwilliams I think anything using sockets - there are probably a few other cases where it could crop up too. If you try http://www.espruino.com/binaries/git/commits/master/ now, those issues should be fixed though. Posted at 2015-12-17 by @allObjects Interesting, nice little caveat... If mark/sweep is/was trumping reference count (which it should anyway), then (small) 'floating rings/clusters' have/would have 'gone after a while/with many marks and sweeps' anyway... but obviously, leaving other objects with incorrect reference count. In your breaking example, did if Early 90' when writing a oo vm in PL/1 on main pc for use on mainframe, I started out with reference counting... and it got pretty far... but only mark/sweep pulled true... so I abandoned reference count altogether (and would have lost the quick 99% freeing when memory would have been used with linked lists... ;-) ). Reference count was not even usable for validation/confirmation of freeing with single mark/sweep and multiples (I thought) I could not afford (next to the additional storage to remember that the last mark/sweep. Posted at 2015-12-17 by @gfwilliams Yes, Because of the way Espruino works, there are a lot of small allocations. For example What I really wanted to avoid was having code like this:
That had to pause in the middle of execution to garbage collect all the copies of the array Posted at 2015-12-17 by @gfwilliams Having said that, I guess it might be an option to actually remove reference counting on really resource constrained devices - it'd reduce code size and memory usage, at the expense of having to run a gc pass more often. Posted at 2015-12-17 by tve
Awesome, glad I produced enough chatter for you to realize where the issue was. ... and great fix turn-around! I'll give it a go later today. Posted at 2015-12-17 by tve It looks like the changes to JSvars make it impossible to upgrade a system that has something save()'ed? Notice how the names of variables are truncated:
Posted at 2015-12-17 by @gfwilliams Oh, that'll be because the ordering of the enums changed when I added Native Strings. To be honest that's always a problem - see the big red text at the bottom of http://www.espruino.com/Other+Boards The Espruino bootloader blows away any saved code on upgrade. To be honest even if something like that enum doesn't stop it from working, because it stores pointers to functions inside native functions any update is likely to stop it working. Posted at 2015-12-17 by tve Sounds like there should be some version number embedded in the save()'ed stuff so it can be discarded when the number changes? Posted at 2015-12-18 by @allObjects Great explanation, and I thinks it - chaining small blocks - is a perfect compromise between contiguous bytes that need always a marks/sweep for memory reuse, even if declared as not-used/refd-anymore by refCount. I moved away from the refCount even though I had the unused memory peaces linked... but going always through the link and find an unused 'string' greater or equal the needs and re-link the remainder was not an option... last but not least because it would have challenged the virtual memory implementation... In my memory manager (for MRAM/FRAM) I do though so but only limited: since only strings are used, I update as long as the string fits, and if shorter, it has to be short enough to also fit the linking for the remainder... Posted at 2015-12-18 by tve Well, the GC fix did the trick! My little test program is now past 10000 requests, including about 10 errored requests (it's wifi after all, I guess). So far I'm not seeing any memory leak and it hasn't gone kaboom either :-) Posted at 2015-12-18 by @gfwilliams
Thing is it would have to change for every compile, since if the address of one function changes, that could mess the symbol table up. The chances of making a change where the addresses of all the functions stay the same is pretty small. I guess you could just use the MD5sum of the binary as a version number, but IMO it's better to just delete the saved code each time you reflash. ... with the symbol table changes I had planned the situation might improve, but right now I'd suggest just erasing that block of memory. Posted at 2015-12-19 by tve Ah, this is something I haven't looked into at all... So where the saved code references native functions it contains the function's addresses? That would indeed be an issue... |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
Posted at 2015-12-16 by tve
I'm having a problem with http / sockets on the esp8266 that falls into the "how can this possibly be?" category at the moment. I have a little test program that issues a POST request every 200ms and after a few hundred requests the object returned by
http.request
ends up not having anend()
function! The end function is part of the prototype, if I understand correctly, so that must get messed-up somehow? I'm looking for some guidance about what to try, where to insert checks, etc.The same test program runs fine on linux, so either there is timing involved or something somewhere else in the esp8266 implementation that is messing JSvars up, at this point I don't understand what kind of messing up I should even be looking for...
The test program is:
The output is:
Beta Was this translation helpful? Give feedback.
All reactions