Data Extraction from Word documents using Python or similar tool

$1500-3000 USD

In Progress

Posted

about 9 years ago

Paid on delivery

I need an expert in Python or a similar tool to extract numerical and text data from documents that I will provide, according to rules that I will also provide. The documents are in Russian, so the ability to read the Cyrillic alphabet is helpful, but as long as you can include Cyrillic text in the code, it is not essential that you understand the words.

Here is more detail: I have about 40,000 individual Microsoft Word documents (1-4 pages each), which are court decisions in criminal cases (in Russian). I want to create a dataset from the information contained in these documents: for example, the name of the judge, the crimes charged, and the sentence imposed. I have written “identification rules” (in English, but using Russian “trigger words”) that indicate how to extract or code all the variables I want.

For example, to extract the name of the defendant (accused criminal) in each case, the rule is “Extract the first Capitalized word following any one of these trigger words [УСТАНОВИЛ OR Установил OR У С Т А Н О В И Л OR установил].” Another example: to record whether a court reached a verdict, I want to create a variable “verdict,” which can be done by following the rule “Enter 1 or True if the title/header of the document is [П Р И Г О В О Р OR ПРИГОВОР].” Some rules require extracting text from the documents, and some just require a 1/0 or True/False depending on whether certain trigger words or phrases occur in each document.

You would have to code the rules that I provide so as to extract the information into a dataset that is readable in STATA (e.g., Excel, delimited such as CSV, dta, xml). As you can see, as long as you can reproduce the Russian words in your code, you need not understand them, although it would help. There are a lot of variables I want in the final dataset, and therefore a lot of identification rules: about 150, which take up about 30 pages in a Word document.

The rules vary in complexity, and some have multiple parts, unlike the examples above. That means a programmer will need to communicate with me if some rules are confusing or difficult to program, and I will try to revise them. I also anticipate that the first attempt at compiling the dataset will reveal problems that have to be addressed, and the programmer should be willing to help me address them. I am uploading the following to help you figure out if you can do this project: a document with just a few sample rules (“Sample Rules”), a document with just 2 samples of the cases from which the data is to be extracted (“Sample Cases”), and an Excel document that gives a sense of what I need the final product to look like (“Sample Output”), although it need not be in Excel, of course.
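The two sample rules quoted in the description could be coded roughly as follows. This is only a minimal Python sketch under stated assumptions, not the solution that was delivered: the function names `extract_defendant` and `has_verdict` are hypothetical, a “Capitalized word” is assumed to mean one uppercase Cyrillic letter followed by lowercase letters, and the “title/header” is assumed to be the first non-empty line of the document text.

```python
import re

# Trigger spellings copied from the two sample rules in the posting.
DEFENDANT_TRIGGERS = ["УСТАНОВИЛ", "Установил", "У С Т А Н О В И Л", "установил"]
VERDICT_TITLES = ("П Р И Г О В О Р", "ПРИГОВОР")

# Assumption: "Capitalized word" = uppercase Cyrillic letter + lowercase letters.
# After a trigger, skip forward lazily (across line breaks) to the first such word.
_TRIGGER_ALTERNATION = "|".join(re.escape(t) for t in DEFENDANT_TRIGGERS)
DEFENDANT_RE = re.compile(
    r"(?:%s).*?\b([А-ЯЁ][а-яё]+)" % _TRIGGER_ALTERNATION, re.DOTALL
)


def extract_defendant(text):
    """Return the first capitalized word after a trigger, or None if absent."""
    match = DEFENDANT_RE.search(text)
    return match.group(1) if match else None


def has_verdict(text):
    """Return 1 if the first non-empty line is one of the verdict titles, else 0."""
    for line in text.splitlines():
        if line.strip():
            return int(line.strip() in VERDICT_TITLES)
    return 0


if __name__ == "__main__":
    # Made-up text for illustration only; it is not taken from "Sample Cases".
    sample = "П Р И Г О В О Р\n...\nсуд УСТАНОВИЛ:\nИванов совершил кражу."
    print(extract_defendant(sample), has_verdict(sample))
```

Real decisions would likely need more care than this, for example when the text after the trigger starts with a capitalized word that is not a name, which is the kind of case the posting expects to resolve through rule revisions.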

Python

Project ID: 7289126

About the project

28 proposals

Remote project

Active 9 yrs ago

Awarded to:

@cugamelover

I'm a native Russian speaker and a computer science professional with a PhD degree and excellent Python skills. Natural Language Processing is my favorite field in which I have extensive knowledge. Please make sure to check my COMPLETE profile (ALL Skills) and see the reviews I received from other employers. It would be my pleasure to do your project and to discuss it with you. Please have a look at another large programming project I completed recently: https://www.freelancer.com/jobs/Excel-Mathematics/Convert-math-proofs-Excel-formulas/ As you can see I am perfectly capable of handling a $1,500 project to the complete satisfaction of my employer.

$2,000 USD in 21 days

5.0

(24 reviews)

5.6

28 freelancers are bidding on average $2,157 USD for this job

@kchg

Hi, I am a top 7th full-stack freelancer. Please get in touch and discuss in detail. I can get started right now. Best!

$3,092 USD in 30 days

5.0

(9 reviews)

6.7

@exansoft

Hello, I'm a Python developer and I'm very interested in your project. The Cyrillic alphabet is native to me because I'm from Ukraine. Please provide more details about the project requirements. Thanks.

$2,500 USD in 25 days

5.0

(24 reviews)

5.8

@vvadimov

Hello, I am an experienced Python programmer and I'd like to do this job for you. I'm also a native Russian speaker, which will simplify the work for me. I'm going to use the python-docx library for parsing the doc files (tell me if I cannot use it) and Python 3 (again, if you need a Python 2 script, tell me). I believe I understand the rules; however, implementing some of them will require significant effort. Thank you in advance. PS: The price, time, and milestone split are approximate and can be discussed. PPS: Don't be afraid, I'm not working for the Federal Security Service :)
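For context on the document-parsing step this bid refers to: a .docx file is a ZIP archive whose main text lives in `word/document.xml`, so paragraph text can be read even without a third-party package. The sketch below uses only the Python standard library rather than python-docx; the function name `docx_paragraphs` is hypothetical, and it assumes the court decisions are in .docx format (the posting does not say whether they are .docx or legacy .doc, which this approach cannot read).

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by paragraph (w:p) and text (w:t) elements.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(path_or_file):
    """Return the text of each paragraph in a .docx file, in document order."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter("{%s}p" % W_NS):
        # A paragraph's text is split across runs; join the pieces back together.
        pieces = [t.text or "" for t in paragraph.iter("{%s}t" % W_NS)]
        paragraphs.append("".join(pieces))
    return paragraphs
```

Joining the returned list with newlines gives plain text to which trigger-word rules can be applied. Headers, footers, and footnotes are stored in other parts of the archive and are not read by this sketch.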

$2,275 USD in 30 days

5.0

(75 reviews)

5.8

@Fortut

A proposal has not yet been provided

$1,500 USD in 30 days

5.0

(98 reviews)

5.8

@ergo1wish

Good day, I'm a computer scientist with 6+ years of experience in Natural Language Processing and, as a consequence, in converting Word documents to all possible formats. Just a few months ago I worked on a project where I had to convert a Word document to plain text with some specific tags. Based on that experience, I would say that Python is not the best option for scraping Word documents: python-docx is a package for processing Word documents, but it is still not good enough. I would suggest a combination of Java (docx4j) and Python, which I think is the best possible combination. If you are interested, contact me here. Please note that the price and time are indicative and depend on further specifications.

$2,500 USD in 30 days

5.0

(8 reviews)

4.1

@rekreacija

Hello, you can place your confidence in my Python/Russian-Cyrillic knowledge and experience. Please feel free to ask for any information. Greetings, Srdjan

$1,500 USD in 30 days

5.0

(15 reviews)

4.1

@dsrrathor

Hello, I've been working with Python for the last 3 years and have extensive experience in data processing with all types of characters (including the Cyrillic alphabet). I'll complete your task within the time frame at minimal cost and will provide complete support until the end. I hope to hear from you soon!

$1,666 USD in 4 days

5.0

(10 reviews)

4.3

@fhasanbd

Hi, I am a professional web data scraper specializing in Python programs, PHP scripts, .NET programs, crawlers, and bots. My tool can search data and get information from Aa to Zz with an existing list of English words. Below is the link for your reference as a sample related to my tool being developed. This demo captures doctors' names, addresses, zip codes, phone numbers, ratings, and reviews from 4 different sites. The final output will be saved in *.XLSX format or as you require. I can start as early as possible, depending on your approval and acceptance. I can assure you that I will deliver high-quality, reliable, efficient, and accurate output. Give me a try and I will try to get the best results and finish the project well before the deadline. Thanks, Ferdous

$2,500 USD in 30 days

4.8

(6 reviews)

3.8

@jorjun

I studied Russian for two years, so I can read Cyrillic letters at school level, though I don't understand much of the sample document. I have ten years of experience with Python, and although what you ask is complex, I believe I can solve the task. I expect to collaborate with you from time to time through written communication and regular web-based reports (I can set this up quickly, don't worry). I recommend a fast-insert NoSQL database (MongoDB) for the engine output, and a web report showing the rule engine results and perhaps the original document text for comparison. With this kind of architecture I believe we can make the best and most accurate progress.

$3,000 USD in 10 days

5.0

(2 reviews)

3.2

@prabakarm23

Greetings for the day! I have 6 years of experience in .NET, VBA macros, VB script, PS Script, and VB creation with applications such as SAP, Internet Explorer, Microsoft Outlook, PDF and text files, MS Access, and SQL Server databases. I have also worked on extraction from websites such as Amazon, Cellpex, and Costco. If awarded this project, I hope to deliver it with 100% accuracy and satisfaction. Please award me this project and contact me for further details. Thanks, Prabakar M

$2,500 USD in 30 days

5.0

(3 reviews)

2.5

@VadymVV

Hi there! I know Russian and have 2 years of experience in Python, so I'm able to build this parsing tool for you.

$1,500 USD in 12 days

4.6

(3 reviews)

2.3

@mitosistech

A proposal has not yet been provided

$1,500 USD in 10 days

4.1

(2 reviews)

2.3

@ivaaradiic

Hello, As far as I can tell, you want a file containing a summary for every verdict (defendant, judge, sentence etc), and you are providing rules for extraction of every item (variable) that is required in that summary. Do you mind giving me an example of a more complex rule? I am a student and have experience in data extraction using scripting languages, though on a much smaller scale. I also know Serbian Cyrillic (very similar to Russian), which might be helpful.

$1,800 USD in 30 days

5.0

(1 review)

1.4

@alexandr79vw

Good afternoon, my name is Alec and I'm from Ukraine. I know the Russian language perfectly, which I think will make it easier to solve your problem. I have experience programming in Python to scrape sites and online shops. Scraping Word documents is essentially no different. If you are interested, I can do an example for free. Thank you, I await your response.

$2,222 USD in 10 days

0.0

(0 reviews)

0.0

@Vartolomej

I'm a Serb and my native script is Cyrillic. I can finish this task because I have scraped data from many text documents.

$1,500 USD in 15 days

0.0

(0 reviews)

0.0

@sheenaoconnell

I have worked extensively in scraping and regular expression type projects. I've looked over the documents you provided and it doesn't seem hard at all. My approach would be to use Python for the whole lot. I will have a standard container that rules can be fit into in order to make it easy to append new rules to the code as needed. You did not mention how you would like the results of the script to be stored. I would suggest an sql database so that the results are easily queried.
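The “standard container that rules can be fit into” idea could look something like the registry below. This is a hedged sketch rather than the bidder's actual design: the names `RULES`, `rule`, `code_document`, and `write_csv` are hypothetical, the two registered rules are simplified stand-ins for the roughly 150 real ones, and the output is written as CSV (one of the Stata-readable formats named in the posting) instead of the SQL database suggested in the bid.

```python
import csv
import io

RULES = {}  # column name -> function(text) returning the coded value


def rule(name):
    """Register a function as the rule that fills the column `name`."""
    def decorator(func):
        RULES[name] = func
        return func
    return decorator


@rule("verdict")
def verdict(text):
    # Assumption: the title/header is the first line of the stripped text.
    lines = text.strip().splitlines()
    first_line = lines[0].strip() if lines else ""
    return int(first_line in ("П Р И Г О В О Р", "ПРИГОВОР"))


@rule("mentions_trigger")
def mentions_trigger(text):
    # Example of a 1/0 rule that only checks whether a trigger word occurs.
    return int("УСТАНОВИЛ" in text)


def code_document(doc_id, text):
    """Apply every registered rule to one document and return a dataset row."""
    row = {"doc_id": doc_id}
    for name, func in RULES.items():
        row[name] = func(text)
    return row


def write_csv(rows, handle):
    """Write rows to an open text handle, one column per registered rule."""
    writer = csv.DictWriter(handle, fieldnames=["doc_id"] + list(RULES))
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    rows = [code_document("a", "ПРИГОВОР\nсуд УСТАНОВИЛ: ...")]
    buffer = io.StringIO()
    write_csv(rows, buffer)
    print(buffer.getvalue())
```

With this layout, adding a new variable means adding one decorated function, and the CSV columns follow automatically from the registry.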

$2,222 USD in 30 days

0.0

(0 reviews)

0.0

@Sandir

Hello, we are from Novorossiya, so I think there will be no problems with the Russian language. To do everything you asked to a good standard, we need about a month; this includes development, testing, and delivery of a finished product at the end. We have built similar products, so I think there will be no problems. The deal will be carried out with payment in installments. Write to me and we will discuss it in more detail.

$2,000 USD in 30 days

0.0

(0 reviews)

0.0

@AlexandruLodin

Hello, my name is Alexandru. I have 3+ years of experience building complete custom applications in Python. Your project has very clear specifications, thank you for that, and it is very interesting. My solution for this project is based on Python 2.7 and PyQt4 for a nice GUI with which you can easily manipulate any document at any time. The application will also allow you to add or remove any rule without the help of a programmer. Please let me know if you are interested, after which I can provide more details. Looking forward to your reply, Alexandru

$3,333 USD in 15 days

0.0

(0 reviews)

0.0

@zhsoftstudio

Hello, I am an experienced software developer and a native Russian speaker. I am interested in your project. I can help you by writing Python code to scrape data from the text documents and produce a CSV file according to the rules you provide. Best regards, Volodymyr

$2,000 USD in 15 days

0.0

(0 reviews)

1.0

@dpune

Hi, I have more than 14 years of experience and I am an expert in this kind of work. I have completed more than 225 projects. Please look at the feedback left by my employers to learn more about my work. Waiting for your positive response. Thanks.

$2,850 USD in 60 days