End-to-end recognition means that the whole image is fed into the network, and the network outputs the recognition result for the whole image.
The input to the network in ASTER is not the whole image, but a small cropped region containing the warped text. I think it is more accurate to call ASTER a recognition algorithm that can handle irregular text images.
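To make the distinction concrete, here is a minimal toy sketch of the two kinds of pipeline. Everything in it is hypothetical and for illustration only: `crop`, `two_stage_read`, `end_to_end_read`, and the stub detector, recognizer, and spotter are made-up names, not ASTER's actual code or API. The stubs just report the size of the input they receive, so you can see what each model actually gets to look at.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]          # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


def crop(image: Image, box: Box) -> Image:
    """Cut the region described by `box` out of `image`."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def two_stage_read(
    image: Image,
    detect: Callable[[Image], List[Box]],
    recognize: Callable[[Image], str],
) -> List[str]:
    """A separate detector finds text boxes; the recognizer sees only crops."""
    return [recognize(crop(image, box)) for box in detect(image)]


def end_to_end_read(image: Image, spot: Callable[[Image], List[str]]) -> List[str]:
    """A single model receives the whole image and returns all text in it."""
    return spot(image)


# Hypothetical stand-ins (assumptions, not real models): each one reports
# the height x width of whatever input it was given.
def toy_detect(image: Image) -> List[Box]:
    return [(1, 2, 3, 5)]  # pretend one text region was found


def toy_recognize(region: Image) -> str:
    return f"{len(region)}x{len(region[0])}"


def toy_spot(image: Image) -> List[str]:
    return [f"{len(image)}x{len(image[0])}"]


full_image: Image = [[0] * 6 for _ in range(4)]  # 4 rows, 6 columns

two_stage_result = two_stage_read(full_image, toy_detect, toy_recognize)
end_to_end_result = end_to_end_read(full_image, toy_spot)
print(two_stage_result, end_to_end_result)
```

In this sketch the recognizer in `two_stage_read` only ever sees the 2x3 crop, while the model in `end_to_end_read` sees the full 4x6 image. By the description above, ASTER plays the role of `recognize`, not `spot`.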