I'm looking for someone to code a solid website scraper / crawler. I've already coded a version; however, it is not as good as I need it to be, so I need help creating a new, better version from scratch.
In short, I need to be able to manage (create/edit/delete) scraping tasks through a robust, flexible and advanced UI. The scraping script needs to check for due tasks at regular intervals (ideally as an update daemon service on my Ubuntu VPS instead of a cron job), with the scraped data inserted into a MySQL database. The sites in question are generally news sites relating to games and tech; the key data is headline, intro and/or full content, date published, author and URL to the full story (similar to what an RSS feed could provide, but these sites do not have RSS feeds).
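To make the data requirement concrete, here is a rough sketch of one scraped record and the matching insert statement. It is written in Python purely for brevity (the project itself is PHP), and the class name, table name and column names are placeholders I made up, not my current schema. It assumes a unique index on the URL column so that re-scraping the same story does not create duplicates.

```python
from dataclasses import dataclass, astuple
from datetime import datetime
from typing import Optional


@dataclass
class ScrapedItem:
    """One story as scraped from a news page (hypothetical field names)."""
    headline: str
    url: str
    intro: Optional[str] = None
    content: Optional[str] = None
    author: Optional[str] = None
    published_at: Optional[datetime] = None


# Hypothetical table and columns. INSERT IGNORE skips rows that would
# violate a unique index, here assumed to exist on `url`.
INSERT_SQL = (
    "INSERT IGNORE INTO scraped_items "
    "(headline, url, intro, content, author, published_at) "
    "VALUES (%s, %s, %s, %s, %s, %s)"
)


def to_row(item):
    """Return the item's values in the same order as the columns in INSERT_SQL."""
    return astuple(item)
```

The optional fields reflect that not every site exposes an author or a full date on the listing page.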
Beyond the use of PHP, jQuery and Ajax, I expect you to use something like Simple HTML DOM (which I used; if you prefer another library, that can be discussed) and DataTables for all types of tables (alternatively, Bootstrap tables).
Also note that I use a theme called Metronic – Admin Dashboard for my general UI design. I can provide a default template and a link in that regard.
Features that will be required
Advanced create/edit/delete task UI, so that as far as possible everything needed to get a page scraped for data can be configured via the UI.
Smart way to manage scraping multiple pages from the same website, e.g. when there is no way to fetch news, reviews and features from a single page.
List of tasks with relevant status; search, filter, sort and manage options
Update daemon that can run as a background process on an Ubuntu 14.04 VPS box. This manages all the tasks and fetches data based on each task's settings and interval criteria.
Error handling: able to recover from failed fetches and interruptions, re-schedule tasks etc., with logging of what is going on and of any errors that occurred.
Error management: a warning system that flags tasks that might have issues, e.g. we're no longer scraping a headline or an author because a site changed its code.
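To illustrate how I picture the daemon, error handling and warnings fitting together, here is a minimal sketch. Again it is in Python only for brevity, and every name in it (Task, REQUIRED_FIELDS, run_task, the backoff cap) is hypothetical. The idea: pick the tasks whose next run time has passed, back off and re-schedule on a failed fetch, and flag a task when an expected field comes back empty.

```python
from dataclasses import dataclass, field
from typing import List

# Fields I would expect every scraped item to have (an assumption).
REQUIRED_FIELDS = ("headline", "url", "author", "published_at")


@dataclass
class Task:
    name: str
    interval: int                 # seconds between successful runs
    next_run: float = 0.0         # timestamp of the next scheduled run
    failures: int = 0             # consecutive failed fetches
    warnings: List[str] = field(default_factory=list)


def due_tasks(tasks, now):
    """Return the tasks whose scheduled time has passed."""
    return [t for t in tasks if t.next_run <= now]


def missing_fields(item):
    """Return the required fields that are absent or empty in one item."""
    return [f for f in REQUIRED_FIELDS if not item.get(f)]


def run_task(task, fetch, now, max_backoff=3600):
    """Run one task; `fetch` is a callable returning a list of item dicts."""
    try:
        items = fetch(task)
    except Exception as exc:
        # Failed fetch: re-schedule with exponential backoff, capped.
        task.failures += 1
        delay = min(task.interval * 2 ** task.failures, max_backoff)
        task.next_run = now + delay
        return f"error: {exc}"
    task.failures = 0
    task.next_run = now + task.interval
    # Flag fields that went missing, e.g. after a site changed its markup.
    task.warnings = sorted({f for item in items for f in missing_fields(item)})
    if not items:
        task.warnings = ["no items found"]
    return "ok"
```

The daemon's main loop would then just call due_tasks on every tick, run each due task, log the returned status and surface task.warnings in the task list UI.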
Happy to answer any further questions, just ask.
IMPORTANT
Timeline/deadlines: while I would have loved to have this done yesterday, please give me an estimate of how much time you believe will be required to complete the project. A high level of English is also required. Offers that fail to provide this information will not be considered.
See attached images for a view of my current system.