【consult🙋‍♂️】some problem about model&data #131

TITC · 2022-04-21T12:26:29Z

TITC
Apr 21, 2022
Collaborator

the encoder architecture used in this project is a Hybrid (ResNet&CNN) architecture, have you tried ViT without CNN backbone and how about it?
I know you recommend CNN backbone at some suggestions #112 (comment) ,but the author suggests, the original ViT is better than Hybrid except the small model size, so I have some curiosity.

Vision Transformers generally outperform ResNets with the same computational budget. Hybrids improve upon pure Transformers for smaller model sizes, but the gap vanishes for larger models.

Have you experimented with some other position embedding strategy? If I haven't misunderstood your meaning at 【consult🙋‍♂️】 some confusion about CustomVisionTransformer.forward_features #130 (comment), I think the position embedding manner is different with ViT suggest 1D position embedding.
How many data you used for training the released model? you mentioned wikipedia, arXiv, im2latex-100k dataset, and provide a pre-processed data link. But the number of LaTex list below not equal to the number of picture, so I guess maybe your training set is larger than released?

CROHME.zip
10822 handwritten dataset(CROHME) and 10846 line LaTex formula.
formula.zip
234884 line LaTex formula, 158480 for train, 6765 for val, and 30637 for test.

I have found some other dataset, such as I2L-140K, Im2latex-90k, Marmot Dataset, have you used them?

Answered by lukas-blecher

Apr 21, 2022

Initially, I used a pure ViT (6ecc3f4). But the encoder was just not performing very well. The model produced latex code but it has nothing to do with the input image. Given that the model is very small, I can confirm the authors findings.
Well it still is a 1D embedding. I use the same strategy as the authors only in a more generalized way.
That is all the data I rendered out for training. The reason the latex file has more entries is that it also contains formula of equations I wasn't able to render (because they threw some errors). I had to leave them in the file because I was lazy and my dataloader would break without them. I will probably return to that problem and compile a better,…

View full answer

lukas-blecher · 2022-04-21T12:54:40Z

lukas-blecher
Apr 21, 2022
Maintainer

Initially, I used a pure ViT (6ecc3f4). But the encoder was just not performing very well. The model produced latex code but it has nothing to do with the input image. Given that the model is very small, I can confirm the authors findings.
Well it still is a 1D embedding. I use the same strategy as the authors only in a more generalized way.
That is all the data I rendered out for training. The reason the latex file has more entries is that it also contains formula of equations I wasn't able to render (because they threw some errors). I had to leave them in the file because I was lazy and my dataloader would break without them. I will probably return to that problem and compile a better, bigger dataset.
I was not familiar with the Marmot dataset. Will have a look at that.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

【consult🙋‍♂️】some problem about model&data #131

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

【consult🙋‍♂️】some problem about model&data #131

Uh oh!

TITC Apr 21, 2022 Collaborator

Replies: 1 comment

Uh oh!

lukas-blecher Apr 21, 2022 Maintainer

TITC
Apr 21, 2022
Collaborator

lukas-blecher
Apr 21, 2022
Maintainer