For now I only want to do sentiment analysis of half-yearly reports, but it’s proving difficult: I can’t just extract the PDF text (with PyPDF2), preprocess it with nltk, and apply a few models. Well, I can, but it misses a lot. The page formatting, the tables, and the various graphical elements make it hard to simply extract and tokenize the text. Is there any tool currently available that makes this process easier? In my head, doing document analysis through CNNs or something similar shouldn’t be that hard: we could extract the information, restructure and format it in a better way, and build a model that repeats the process on every financial statement we feed it.
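To make the extraction problem concrete, here is a rough sketch of a cleanup pass over text that has already been pulled out of a page (for example with PyPDF2). The helper names `looks_like_table_row` and `clean_page_text` and the 0.4 threshold are my own assumptions, not part of any library. The idea is only to drop lines that are mostly numbers (likely table rows) and to rejoin words hyphenated across line breaks before tokenizing.

```python
import re

# Matches tokens such as 1,200  14%  (250)  -3.5
NUMERIC_TOKEN = re.compile(r"[\(\-]?[\d.,]+%?\)?")


def looks_like_table_row(line, max_numeric_ratio=0.4):
    """Heuristic (assumed threshold): a line is a table row if most tokens are numbers."""
    tokens = line.split()
    if not tokens:
        return True  # treat blank lines as noise as well
    numeric = sum(1 for t in tokens if NUMERIC_TOKEN.fullmatch(t))
    return numeric / len(tokens) > max_numeric_ratio


def clean_page_text(raw):
    """Rejoin hyphenated line breaks, then keep only prose-like lines."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    kept = [ln.strip() for ln in text.splitlines() if not looks_like_table_row(ln)]
    return " ".join(kept)


if __name__ == "__main__":
    # Made-up sample page, only for illustration.
    sample = (
        "Revenue grew strongly in the first half, reflecting higher vol-\n"
        "umes in the core segment.\n"
        "Revenue 1,200 1,050 14%\n"
        "EBIT 300 (250) 20%\n"
        "\n"
        "The board remains confident."
    )
    print(clean_page_text(sample))
```

A heuristic like this will not recover multi-column layouts or text embedded in charts, so it is a starting point rather than a fix for the formatting problem.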
So, is there anything like this currently available? I found this on GitHub, but it’s slightly different and in R. I could do sentiment analysis on Twitter data too, but reading the Director’s Review sounds better since it gives a more complete picture. It could later be matched up with figures like EBIT and some ratio analysis to give a firm a rating or something. On a basic level I don’t even need CNNs, since this is just a text mining problem, but we can build it into a better tool. Discuss?
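As a baseline for the "just a text mining problem" view, something like the following lexicon count might be enough to get started. This is only a sketch: the word lists are made up for illustration (a finance-specific dictionary would be a better choice), and `sentiment_score` is a hypothetical helper, not an nltk function.

```python
import re

# Hypothetical toy lexicons; replace with a proper finance-specific word list.
POSITIVE = {"growth", "profit", "strong", "improved", "confident"}
NEGATIVE = {"loss", "decline", "weak", "impairment", "uncertain"}


def sentiment_score(text, positive=POSITIVE, negative=NEGATIVE):
    """Return (pos - neg) / (pos + neg) in [-1, 1], or 0.0 if no lexicon word occurs."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(1 for w in words if w in positive)
    neg = sum(1 for w in words if w in negative)
    total = pos + neg
    return (pos - neg) / total if total else 0.0


if __name__ == "__main__":
    # Made-up sentence in the style of a Director's Review.
    review = "Profit growth was strong despite a loss in one segment."
    print(sentiment_score(review))
```

The resulting score per report could then sit next to EBIT and the ratios in one table, which is where a rating would come from.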