In Progress

Data Extraction from Word documents using Python or a similar tool

I need an expert in Python or a similar tool to extract numerical and text data from text documents that I will provide, according to rules that I will also provide. The documents are in Russian, so the ability to read the Cyrillic alphabet is helpful, but as long as you can include Cyrillic text in your code, it’s not essential that you understand the words.

Here is more detail: I have about 40,000 individual Microsoft Word documents (1-4 pages each), which are court decisions in criminal cases (in Russian). I want to create a dataset from the information contained in these documents: for example, the name of the judge, the crimes charged, and the sentence imposed. I have written “identification rules” (in English, but using Russian “trigger words”) that indicate how to extract or code all the variables I want. For example, to extract the name of the defendant (accused criminal) in each case, the rule is “Extract the first Capitalized word following any one of these trigger words [УСТАНОВИЛ OR Установил OR У С Т А Н О В И Л OR установил].” Another example: to record whether a court reached a verdict, I want to create a variable “verdict,” which can be done by following the rule “Enter 1 or True if the title/header of the document is [П Р И Г О В О Р OR ПРИГОВОР].” Some rules require extracting text from the documents, and some rules just require a 1/0 or True/False depending on whether certain trigger words or phrases occur in each document. You would have to code the rules that I provide, so as to extract the information into a dataset that is readable in Stata (e.g., Excel, delimited text such as CSV, dta, or XML). As you can see, so long as you are able to replicate the Russian words in your code, you need not understand them, although it would help.
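To give bidders a sense of scale, here is a minimal Python sketch of the two example rules above, working on the plain text of one document. It is only an illustration under stated assumptions, not a required design: the function names are hypothetical, a “Capitalized word” is assumed to mean one uppercase Cyrillic letter followed by lowercase letters, and the “title/header” is assumed to be the first non-empty line. Getting the text out of the Word files is a separate step that is not shown.

```python
import re

# Trigger words and titles copied from the two example rules.
TRIGGERS = ("УСТАНОВИЛ", "Установил", "У С Т А Н О В И Л", "установил")
VERDICT_TITLES = {"ПРИГОВОР", "П Р И Г О В О Р"}

_TRIGGER_RE = re.compile("|".join(re.escape(t) for t in TRIGGERS))
# Assumption: a "Capitalized word" is an uppercase Cyrillic letter
# followed by one or more lowercase Cyrillic letters.
_CAPITALIZED_RE = re.compile(r"\b[А-ЯЁ][а-яё]+")


def extract_defendant(text):
    """Return the first capitalized word after the first trigger word, or None."""
    trigger = _TRIGGER_RE.search(text)
    if trigger is None:
        return None
    word = _CAPITALIZED_RE.search(text, trigger.end())
    return word.group(0) if word else None


def document_title(text):
    """Assumption: the title is the first non-empty line, with whitespace collapsed."""
    for line in text.splitlines():
        if line.strip():
            return " ".join(line.split())
    return ""


def is_verdict(text):
    """Return 1 if the title is one of the verdict titles, else 0."""
    return 1 if document_title(text) in VERDICT_TITLES else 0
```

Calling `extract_defendant(text)` and `is_verdict(text)` on each document's text would give two columns of the final dataset; real documents will likely need extra handling (for example, trigger words with different endings), which is the kind of issue I expect to discuss.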

There are a lot of variables I want in the final dataset, and therefore, a lot of identification rules – about 150, which take up about 30 pages in a Word document. The rules are of various complexity, and some have multiple parts, unlike the examples above. That means a programmer will need to communicate with me if some rules are confusing or difficult to program, and I will try to revise them. I also anticipate that the first attempt at compiling the dataset will reveal problems that have to be addressed, and the programmer should be willing to help me address them.
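Because there are about 150 rules and I expect to revise some of them, one possible approach is to keep the rules as data rather than as separate hand-written functions. The sketch below is only an assumption about how that might look: the `Rule` class, its `kind` values, and the function names are hypothetical, the two entries restate the example rules as regular expressions, and the title is again assumed to be the first non-empty line. The output is a CSV file, which Stata can import.

```python
import csv
import io
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str     # column name in the output dataset
    kind: str     # "flag" (1/0) or "extract" (text from capture group 1)
    pattern: str  # regular expression implementing the rule


# Hypothetical rule table; a real one would hold all the identification rules.
RULES = [
    # 1 if the first non-empty line is exactly one of the verdict titles.
    Rule("verdict", "flag", r"\A\s*(?:ПРИГОВОР|П Р И Г О В О Р)[ \t]*$"),
    # First capitalized Cyrillic word after any of the trigger words.
    Rule("defendant", "extract",
         r"(?:УСТАНОВИЛ|Установил|У С Т А Н О В И Л|установил)"
         r"[\s\S]*?\b([А-ЯЁ][а-яё]+)"),
]


def apply_rules(text, rules=RULES):
    """Apply every rule to one document's text and return one dataset row."""
    row = {}
    for rule in rules:
        match = re.search(rule.pattern, text, flags=re.MULTILINE)
        if rule.kind == "flag":
            row[rule.name] = 1 if match else 0
        else:
            row[rule.name] = match.group(1) if match else ""
    return row


def write_csv(rows, out, rules=RULES):
    """Write the rows to an open text file, one column per rule."""
    writer = csv.DictWriter(out, fieldnames=[r.name for r in rules])
    writer.writeheader()
    writer.writerows(rows)
```

With this layout, revising a confusing rule means editing one entry in the table and re-running the extraction, which should make the back-and-forth described above easier. When writing to a real file, opening it with `newline=""` and a UTF-8 encoding keeps the Cyrillic text and line endings intact.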

I am uploading the following to help you figure out if you can do this project: a document with just a few sample rules (“Sample Rules”), a document with just 2 samples of the cases from which the data is to be extracted (“Sample Cases”), and an Excel document that gives a sense for what I need the final product to look like (“Sample Output”) – although it need not be in Excel, of course.

Skills: Python


About the Employer:
( 1 review ) Cambridge, United States

Project ID: #7289126

Awarded to:


I'm a native Russian speaker and a computer science professional with a PhD degree and excellent Python skills. Natural Language Processing is my favorite field in which I have extensive knowledge. Please make sure to More

$2000 USD in 21 days
(1 Review)

29 freelancers are bidding on average $2159 for this job


Hi. This seems to be an interesting project and I decided to bid on it. Basically my idea about this is to create a config with all the rules we may encounter/need. Plus we can have the rules per document (if this is th More

$2500 USD in 7 days
(40 Reviews)

Hi, I have read your post and understood your requirement. I have great experience in handling /Python/PHP/Wordpress/Magento/Joomla/Drupal/Angular.js/node.js HTML5/CSS3/Java/Django/Javascript/MySQL/iOS/Android Kin More

$3092 USD in 30 days
(2 Reviews)

A proposal has not yet been provided

$1500 USD in 30 days
(36 Reviews)

Hello, I am an experienced Python programmer and I'd like to do this job for you. Also, I'm a native Russian speaker, so it will simplify the work for me. I'm going to use the python-docx library for doc file parsing (tell me More

$2275 USD in 30 days
(16 Reviews)

Good day, I'm a computer scientist with 6+ years of experience in Natural Language Processing and as a consequence converting word documents to all possible formats. Just a few months ago I was working on a project whe More

$2500 USD in 30 days
(4 Reviews)

Hello, I've been working with Python for the last 3 years and I have great experience in data processing with all types of characters (including the Cyrillic alphabet). I'll complete your task within the time frame at minimal co More

$1666 USD in 4 days
(8 Reviews)

Hello, I'm a Python developer and I'm very interested in your project. The Cyrillic alphabet is native for me because I'm from Ukraine. Please kindly provide more details related to the project requirements. Thanks.

$2500 USD in 25 days
(1 Review)

Hi, I am a top 7th full-stack freelancer. Please get in touch and discuss in detail. I can get started right now. Best!

$3092 USD in 30 days
(1 Review)

Greetings for the Day!!! I have 6 years of experience in .NET, VBA Macros, VB script,PS Script and VB creation with application like SAP, Internet explorer, Microsoft Outlook,PDF& Text files, MS Access and SQL Serv More

$2500 USD in 30 days
(1 Review)

hello, you can place your confidence in my Python/Russian-Cyrillic knowledge and experience. please feel free to ask for any information. greets, srdjan

$1500 USD in 30 days
(2 Reviews)

Hello, As far as I can tell, you want a file containing a summary for every verdict (defendant, judge, sentence etc), and you are providing rules for extraction of every item (variable) that is required in that summ More

$1800 USD in 30 days
(1 Review)

A proposal has not yet been provided

$1500 USD in 10 days
(2 Reviews)

Hi there! I know Russian and have 2 years of experience in Python, so I'm able to build this parsing tool for you.

$1500 USD in 12 days
(1 Review)


$1700 USD in 30 days
(2 Reviews)

I have rich experience with word documents and text processing using python, as well as results generation in CSV, will probably provide SQLite DB with the result data as well, it can be handy in possible future proces More

$1500 USD in 20 days
(0 Reviews)

Hi, I have more than 14 years of exp and I am expert in this kind of work. I have completed more than 225 projects. Please look at the feedback left by my employers to know more about my work. Waiting for your positive More

$2850 USD in 60 days
(0 Reviews)

I am experienced with Word / Excel automation tasks and I am a native Russian speaker, so I can understand these documents perfectly. Feel free to ask me for a demo application if you are interested.

$1500 USD in 7 days
(0 Reviews)

A proposal has not yet been provided

$2500 USD in 30 days
(1 Review)

A proposal has not yet been provided

$1666 USD in 25 days
(0 Reviews)

Hello, we are from Novorossiya, so I think there will be no problems with the Russian language. To do everything you asked for to a high standard, we need about a month; this includes development, testing, and creating More

$2000 USD in 30 days
(0 Reviews)