Handwritten documents near-duplicate search for data intensive applications
- Authors: Varlamova K.1,2, Kaprielova M.1,2,3, Potyashin I.1,2, Chekhovich Y.1
- Affiliations:
- Antiplagiat Company
- Moscow Institute of Physics and Technology
- FRC CSC RAS
- Issue: No. 4 (2024)
- Pages: 129-139
- Section: ARTIFICIAL INTELLIGENCE
- URL: https://kazanmedjournal.ru/0002-3388/article/view/676402
- DOI: https://doi.org/10.31857/S0002338824040085
- EDN: https://elibrary.ru/UEFADS
- ID: 676402
Abstract
The problem of cheating in handwritten academic essays has become more significant over the last several years. One form of cheating is submitting the same paper photographed in a different environment (for example, from another angle, in different light, or in lower quality) or altered by automatic augmentation. Existing methods are not designed to work on large collections of handwritten documents. The proposed approach consists of three stages: embedding generation, retrieval of the closest candidates from the collection of handwritten documents, and similarity estimation between the query image and each of the candidates obtained at the previous stage. Our solution achieves Recall@1 of 80% and 59% with a false positive rate (FPR) of 4.8% and 5.5% on the Synthetic and Real data, respectively. The search latency is 5.5 seconds per query for a collection of 10 000 images. The results show that the developed method is robust enough to work on large collections of handwritten documents.
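The three stages described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the embedding model, the index, or the similarity estimator, so `embed` (flattened, normalised pixels), `top_k_candidates` (exhaustive cosine search), and `pairwise_similarity` (normalised cross-correlation) are hypothetical stand-ins, and the values of `k` and `threshold` are arbitrary.

```python
import numpy as np


def embed(images):
    """Stage 1 (placeholder): flatten each image and L2-normalise it.
    A real system would use a learned embedding model instead."""
    flat = images.reshape(len(images), -1).astype(np.float64)
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    return flat / np.maximum(norms, 1e-12)


def top_k_candidates(query_vec, index_vecs, k):
    """Stage 2: exhaustive cosine search over the collection embeddings.
    A large collection would likely need an approximate index instead."""
    scores = index_vecs @ query_vec
    k = min(k, len(scores))
    return np.argsort(-scores)[:k]


def pairwise_similarity(query_img, candidate_img):
    """Stage 3 (placeholder): normalised cross-correlation of pixel values,
    standing in for a more robust image-pair similarity estimator."""
    q = query_img.astype(np.float64).ravel()
    c = candidate_img.astype(np.float64).ravel()
    q = q - q.mean()
    c = c - c.mean()
    denom = np.linalg.norm(q) * np.linalg.norm(c)
    return float(q @ c / denom) if denom > 0 else 0.0


def find_near_duplicate(query_img, collection, index_vecs, k=5, threshold=0.9):
    """Run all three stages; return (index, score), with index None
    when no candidate reaches the similarity threshold."""
    query_vec = embed(query_img[None])[0]
    best_idx, best_score = None, -1.0
    for idx in top_k_candidates(query_vec, index_vecs, k):
        score = pairwise_similarity(query_img, collection[idx])
        if score > best_score:
            best_idx, best_score = int(idx), score
    if best_score >= threshold:
        return best_idx, best_score
    return None, best_score


# Toy data: 100 random "documents" of 32x32 pixels.
rng = np.random.default_rng(0)
collection = rng.random((100, 32, 32))
index_vecs = embed(collection)

# A simulated re-photographed copy of document 42: dimmer, with light noise.
query = 0.8 * collection[42] + 0.01 * rng.standard_normal((32, 32))
match, score = find_near_duplicate(query, collection, index_vecs)

# An unrelated image should be rejected by the threshold.
unrelated = rng.random((32, 32))
no_match, _ = find_near_duplicate(unrelated, collection, index_vecs)
```

Splitting retrieval from verification keeps the expensive pairwise comparison limited to `k` candidates per query, while the threshold in the final stage is what trades recall against the false positive rate.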

About the authors
K. Varlamova
Antiplagiat Company; Moscow Institute of Physics and Technology
Author for correspondence.
Email: kvarlamova@ap-team.ru
Russian Federation, Moscow; Moscow
M. Kaprielova
Antiplagiat Company; Moscow Institute of Physics and Technology; FRC CSC RAS
Email: kaprielova@ap-team.ru
Russian Federation, Moscow; Moscow; Moscow
I. Potyashin
Antiplagiat Company; Moscow Institute of Physics and Technology
Email: potyashin@ap-team.ru
Russian Federation, Moscow; Moscow
Yu. Chekhovich
Antiplagiat Company
Email: chehovich@ap-team.ru
Russian Federation, Moscow
