Web Crawling Text Mining Specialist

$5000-25000 USD

Cancelled

Posted about 15 years ago

Paid on delivery

Our firm is looking for an experienced text mining developer/programmer to help with the development of our Journal Monitoring Project. We are developing this tool for a major BioTech company to search their products against PubMed. The search results will be downloaded and parsed to extract certain features, including: Author, Article Title, Publication Date, Journal Name (volume, pages), PubMed DOI, and Methods. The results from the searches and extraction will be stored in a DB to be used for some minor BI. We will also need to include links to the abstract and to the full-text article if it is available without a subscription.

The entire project will be built in the cloud using Amazon EC2/S3 and a LAMP stack. We are looking to use open-source software whenever possible, only customizing where essential to deliver superior results.

## Deliverables

Here is an outline of the architecture. We are open to suggestions and would prefer to work with someone who has experience in the area and could add value to the architecture team, not only the development side. Experience in searching PubMed journal articles would be a plus.

•Project Overview
•We are looking for a text mining specialist to assist with the development of automated data monitoring and analytics of product citation references in peer-reviewed journal articles in PubMed.
•Deliverables: The project involves creating a database of journal articles that contain references to the client's Antibodies and Related Products (list of products to be searched will be provided). Our client's products are referenced in thousands of journal articles, with more popular products appearing hundreds of times.
•PubMed contains peer-reviewed journal articles, which will be searched for our client's products. There are a couple of ways to access PubMed journal articles.
During this first phase of the project, we can use [login to view URL] or [login to view URL] to do the searches. More advanced search methods are available, and domain expertise in searching peer-reviewed journal articles is a plus. We do have the domain expertise to assist in this area, and we will work closely with you to make sure that we are getting the best results possible from the searches with our client.
•Scope - Journal Monitoring
•We will search across all scholarly journals indexed by PubMed to track references to our client's products and extract the following data: (Listed below)

Phase I of the project will include the following deliverables:

Set up a development/staging environment in the cloud using Amazon EC2. We will work together on the stack to be used. Ex. LAMP/PERL
Architecture - (develop a proof of concept.)
Create Search Query Parser - (unique solution for each site being searched?)
Web Crawling - Searching product and company name in [login to view URL] and/or [login to view URL] to find matches. There will need to be some qualification of the search results to make sure that the results are from journal articles that reference the client product and name together. (Ex. The product and client name are within 5 words of each other.) We will work together on the qualifying of results.

Sample of Product list
Anti-Rhodopsin
Anti-RhoG, clone 1F3 B3 E5
Anti-Riboflavin
Anti-Ribonucleotide Reductase, M1 subunit, clone AD203
Anti-Rig1/Robo3
Anti-MDR1b, ATP Binding Cassette Sub family B
Anti-RNA Polymerase II, clone ARNA-3
Anti-RO52
Anti-ROCK-1

Screen Scraper
Cache SERP (search engine results pages)
Search Results parser
Download to database
Document parsing - tokenizing?
Store links to abstract and full text
Feature Extraction
We will search across all scholarly journals indexed by PubMed to track references to our client's products and extract the following data:
Phase 1, Phase 2
•Product Names
•Product Catalog Part Numbers (if available)
•Journal Name, Volume Number and Pages
•PubMed Digital Object Identifier (DOI citation number)
•Author
•Article Title
•Publication Date
•Methods
•Applications
•Hyperlink to Abstract
•Hyperlink to Full Text
•Targets - Pathways or Diseases
•In Silico Pathway Analysis
•Contact Mining process will extract name, title, employer, postal address, phone, email and article hyperlink from articles where the data is available

Download all features to MySQL database
Indexing to allow for some simple searching
Provide XML file back to the client
Integrated testing and analysis

Phase II - not included in this initial proposal
Monthly Monitoring and feature extraction
Integrate added feature extraction
Advanced search and qualifying techniques
Expanded product synonym search
Additional search of data for:
Targets - Pathways or Disease
In Silico Pathway analysis
Additional contact mining

* * *

This broadcast message was sent to all bidders on Thursday Mar 12, 2009 4:26:13 PM:

I wanted to give a bit more specifics as to what the first phase of this project entails. Below is a description of what is needed. Please forward your proposals with these requirements. I welcome the opportunity to discuss this over the phone, but I would ask that you respond with something (the most difficult part, basic architecture, risks, etc.) to show that you understand the scope of the project and have the ability to deliver the desired results.
•Project Overview
•We are looking for a text mining specialist to assist with the development of automated data monitoring and analytics of product citation references in peer-reviewed journal articles in PubMed.
•Deliverables: The project involves creating a database of journal articles that contain references to the client's Antibodies and Related Products (list of products to be searched will be provided).
Our client's products are referenced in thousands of journal articles, with more popular products appearing hundreds of times.
•PubMed contains peer-reviewed journal articles, which will be searched for our client's products. There are a couple of ways to access PubMed journal articles. During this first phase of the project, we can use [login to view URL] or [login to view URL] to do the searches. More advanced search methods are available, and domain expertise in searching peer-reviewed journal articles is a plus. We do have the domain expertise to assist in this area, and we will work closely with you to make sure that we are getting the best results possible from the searches with our client.
•Scope - Journal Monitoring
•We will search across all scholarly journals indexed by PubMed to track references to our client's products and extract the following data: (Listed below)

Phase I of the project will include the following deliverables:

Set up a development/staging environment in the cloud using Amazon EC2. We will work together on the stack to be used. Ex. LAMP/PERL
Architecture - (develop a proof of concept.)
Create Search Query Parser - (unique solution for each site being searched?)
Web Crawling - Searching product and company name in [login to view URL] and/or [login to view URL] to find matches. There will need to be some qualification of the search results to make sure that the results are from journal articles that reference the client product and name together. (Ex. The product and client name are within 5 words of each other.) We will work together on the qualifying of [login to view URL] the use of PubMed MeSH, Synonym Citations, UMLS for more advanced search methods.
(We have domain expertise in this area to help, but domain knowledge would be helpful.)

Sample of Product list
Anti-Rhodopsin
Anti-RhoG, clone 1F3 B3 E5
Anti-Riboflavin
Anti-Ribonucleotide Reductase, M1 subunit, clone AD203
Anti-Rig1/Robo3
Anti-MDR1b, ATP Binding Cassette Sub family B
Anti-RNA Polymerase II, clone ARNA-3
Anti-RO52
Anti-ROCK-1

Screen Scraper
Cache SERP (search engine results pages)
Download to database
Search Results parser
Document parsing - tokenizing?
Download to database
Store links to abstract and full text
Feature Extraction
We will search across all scholarly journals indexed by PubMed to track references to our client's products and extract the following data:

Phase 1, Phase 2
•Product Names
•Product Catalog Part Numbers (if available)
•Journal Name, Volume Number and Pages
•PubMed Digital Object Identifier (DOI citation number)
•Author
•Article Title
•Publication Date
•Methods
•Applications
•Hyperlink to Abstract
•Hyperlink to Full Text
•Targets - Pathways or Diseases
•In Silico Pathway Analysis
•Contact Mining process will extract name, title, employer, postal address, phone, email and article hyperlink from articles where the data is available

Screen Shot of a Journal Page showing most of the required features to [login to view URL] all features to MySQL database
Indexing to allow for some simple searching
Provide XML file back to the client
Integrated testing and analysis

Phase II - not included in this initial proposal
Monthly Monitoring and feature extraction
Integrate added feature extraction
Advanced search and qualifying techniques
Expanded product synonym search
Additional search of data for:
Targets - Pathways or Disease
In Silico Pathway analysis
Additional contact mining
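
The qualification rule mentioned under Web Crawling (product and client name within 5 words of each other) could be prototyped roughly as follows. This is only a sketch under stated assumptions: the company name "Acme Biotech" is a made-up placeholder, the function names are hypothetical, and "within 5 words" is read here as "at most 5 words between the two phrases". The actual rule would be agreed on with the client.

```python
import re

def tokenize(text):
    # Lowercase and keep hyphen/slash-joined terms such as "anti-rig1/robo3" as one word.
    return re.findall(r"[a-z0-9]+(?:[-/][a-z0-9]+)*", text.lower())

def find_phrase(tokens, phrase):
    # Return every start index where the phrase (a token list) occurs in tokens.
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]

def within_n_words(text, product, company, max_gap=5):
    # True if product and company occur with at most max_gap words between them.
    # Assumption: "within 5 words" means the count of intervening words.
    tokens = tokenize(text)
    p, c = tokenize(product), tokenize(company)
    if not p or not c:
        return False
    for ps in find_phrase(tokens, p):
        pe = ps + len(p) - 1          # index of the last product token
        for cs in find_phrase(tokens, c):
            ce = cs + len(c) - 1      # index of the last company token
            if cs > pe:
                gap = cs - pe - 1     # company follows product
            elif ps > ce:
                gap = ps - ce - 1     # product follows company
            else:
                gap = 0               # the phrases overlap
            if gap <= max_gap:
                return True
    return False

# "Acme Biotech" is a placeholder; the real client name is not given in the posting.
near = "Sections were stained with Anti-Rhodopsin (Acme Biotech, 1:500) overnight."
far = ("Anti-Rhodopsin was used to label rod outer segments in all retinal "
       "sections, and secondary antibodies came from Acme Biotech.")
print(within_n_words(near, "Anti-Rhodopsin", "Acme Biotech"))
print(within_n_words(far, "Anti-Rhodopsin", "Acme Biotech"))
```

A word-window check like this is cheap to run over cached search results before any heavier parsing, but it would miss synonyms and catalog-number references, which is where the Phase II synonym search would come in.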

Software Architecture

Software Testing

Project ID: 3717085