Using arxiv-like docs as dataset for retrieval-based bot?

Hi there,

I want to develop a bot that answers machine learning / AI questions. The idea is to give objective answers, methods, or definitions for key ML concepts and tools. I thought the best way to achieve this would be a retrieval-based bot (at first I don’t see any need for generative methods) whose search space is well-known ML papers and books (Bengio, Yann LeCun, Goodfellow, Hastie, Bishop, etc.). The problem is that I’m just starting with NLP, so I don’t know how those documents (say, PDFs) could be used as my bot’s training dataset.

Any answer would be much appreciated.


Are they Q&A-type documents? A certain competitor based in Redmond has a service specifically for that…

@Vik If you are talking about MS’s QnA Maker, it really is quite basic: its algorithm is simply tf-idf based whole-document retrieval. It is designed to work with FAQ-like content, but beyond that, it’s hard to see how it could be sufficient for this use case.
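
To make "tf-idf based whole-document retrieval" concrete, here is a minimal sketch in plain Python. It is only an illustration of the general technique, not the code of any particular product: the function names, the smoothed idf formula, and the three toy documents are all my own assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def to_vector(tokens, idf):
    """Build an L2-normalized tf-idf vector, ignoring terms unseen at index time."""
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_index(docs):
    """Return (idf, vectors) for a list of document strings."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf (an assumed variant): rare terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [to_vector(tokens, idf) for tokens in tokenized]
    return idf, vectors


def retrieve(query, docs, idf, vectors):
    """Return (index, score) of the document most similar to the query."""
    q = to_vector(tokenize(query), idf)
    # Both sides are unit length, so the dot product is the cosine similarity.
    scores = [sum(w * v.get(t, 0.0) for t, w in q.items()) for v in vectors]
    best = max(range(len(docs)), key=scores.__getitem__)
    return best, scores[best]


# Toy corpus, made up for illustration only.
docs = [
    "Dropout is a regularization technique that randomly zeroes activations during training.",
    "A support vector machine finds the maximum margin hyperplane between classes.",
    "Backpropagation computes gradients of the loss with respect to network weights.",
]
idf, vectors = build_index(docs)
best, score = retrieve("how does dropout regularization work", docs, idf, vectors)
print(docs[best])
```

Note that this always returns a whole document. With books or papers as the corpus, that means the "answer" would be an entire chapter or paper, which is why this approach fits short FAQ entries much better than long technical texts.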

@gonesbuyo One area of research you could look at is reading comprehension. Take a look at SQuAD and the algorithms being submitted to it, and see if that works for you.
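
For orientation, reading comprehension in the SQuAD style frames each example as a passage, a question about it, and an answer that is a literal span of the passage. Below is a rough sketch of that data shape. The class and field names are my own and are not the official dataset schema, and the passage is made up.

```python
from dataclasses import dataclass


@dataclass
class Example:
    context: str       # passage the answer must be found in
    question: str
    answer_text: str   # answer, copied verbatim from the context
    answer_start: int  # character offset of the answer inside the context


def make_example(context, question, answer_text):
    """Build an example, checking that the answer is a span of the context."""
    start = context.find(answer_text)
    if start < 0:
        raise ValueError("answer must be a verbatim span of the context")
    return Example(context, question, answer_text, start)


# Made-up passage for illustration only.
passage = (
    "Dropout randomly sets a fraction of activations to zero "
    "during training to reduce overfitting."
)
ex = make_example(passage, "What does dropout reduce?", "overfitting")

# Recover the answer from the offset, as a span-prediction model would.
span = ex.context[ex.answer_start:ex.answer_start + len(ex.answer_text)]
print(span)
```

For your use case, that would mean first cutting the papers and books into passages, retrieving a likely passage for each question, and then letting a reading comprehension model pick the answer span inside it.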