I need an expert in Python or a similar tool to extract numerical and text data from documents that I will provide, according to rules that I will also provide. The documents are in Russian, so the ability to read the Cyrillic alphabet is helpful, but as long as you are able to include Cyrillic text in your code, it’s not essential that you understand the words.
Here is more detail: I have about 40,000 individual Microsoft Word documents (1-4 pages each), which are court decisions in criminal cases (in Russian). I want to create a dataset from the information contained in these documents: for example, the name of the judge, the crimes charged, and the sentence imposed. I have written “identification rules” (in English, but using Russian “trigger words”) that indicate how to extract or code every variable I want. For example, to extract the name of the defendant (accused criminal) in each case, the rule is “Extract the first Capitalized word following any one of these trigger words [УСТАНОВИЛ OR Установил OR У С Т А Н О В И Л OR установил].” Another example: to record whether a court reached a verdict, I want to create a variable “verdict,” which can be done by following the rule “Enter 1 or True if the title/header of the document is [П Р И Г О В О Р OR ПРИГОВОР].” Some rules require extracting text from the documents, and some just require a 1/0 or True/False depending on whether certain trigger words or phrases occur in each document. You would have to code the rules that I provide, so as to extract the information into a dataset in a format that Stata can read (e.g., Excel, delimited text such as CSV, dta, or XML). As you see, so long as you are able to reproduce the Russian words in your code, you need not understand them, although it would help.
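To make the two example rules above concrete, here is a minimal Python sketch of how they might be coded. It assumes the text of each Word document has already been read into a plain string (that step needs a third-party library and is not shown). The function names, the definition of a "Capitalized word" as one uppercase letter followed by lowercase letters, and the choice to treat the first non-empty line as the title/header are all assumptions for illustration, not part of the rules themselves.

```python
import re

# Trigger words copied from the example rule in the posting.
TRIGGERS = ["УСТАНОВИЛ", "Установил", "У С Т А Н О В И Л", "установил"]
TRIGGER_RE = re.compile("|".join(re.escape(t) for t in TRIGGERS))

# Assumption: a "Capitalized word" is one uppercase Cyrillic or Latin
# letter followed by one or more lowercase letters.
CAPITALIZED_RE = re.compile(r"\b[А-ЯЁA-Z][а-яёa-z]+")

# Header spellings copied from the example "verdict" rule.
VERDICT_HEADERS = {"П Р И Г О В О Р", "ПРИГОВОР"}


def extract_defendant(text):
    """Return the first capitalized word after the first trigger, or None."""
    trigger = TRIGGER_RE.search(text)
    if trigger is None:
        return None
    word = CAPITALIZED_RE.search(text, trigger.end())
    return word.group(0) if word else None


def is_verdict(text):
    """Return True if the header is one of the verdict spellings.

    Assumption: the header is the first non-empty line of the document.
    """
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            return stripped in VERDICT_HEADERS
    return False


# Invented sample text, not taken from a real case.
sample = (
    "П Р И Г О В О Р\n"
    "Именем Российской Федерации\n"
    "Суд У С Т А Н О В И Л:\n"
    "Петров совершил кражу."
)
print(extract_defendant(sample))  # Петров
print(is_verdict(sample))         # True
```

Note that a plain substring match on the lowercase trigger would also fire inside longer words that begin with it, which is the kind of ambiguity that would need to be settled rule by rule.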
There are a lot of variables I want in the final dataset, and therefore a lot of identification rules – about 150, which take up about 30 pages in a Word document. The rules vary in complexity, and some have multiple parts, unlike the examples above. That means the programmer will need to communicate with me if some rules are confusing or difficult to program, and I will try to revise them. I also anticipate that the first attempt at compiling the dataset will reveal problems that have to be addressed, and the programmer should be willing to help me address them.
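Because there are many rules and some are likely to be revised, one possible approach is to keep the simple 1/0 rules in a table rather than writing each one as separate code, so that changing a rule means editing one entry. The sketch below is only an illustration under that assumption: the variable names, the document IDs, and the sample texts are hypothetical, and the multi-part rules would need their own functions.

```python
import csv
import io

# Hypothetical rule table: variable name -> phrases whose presence
# anywhere in the document sets the variable to 1.
FLAG_RULES = {
    "mentions_verdict": ["ПРИГОВОР", "П Р И Г О В О Р"],
    "mentions_established": [
        "УСТАНОВИЛ", "Установил", "У С Т А Н О В И Л", "установил",
    ],
}


def code_document(doc_id, text):
    """Apply every flag rule to one document and return one dataset row."""
    row = {"doc_id": doc_id}
    for variable, phrases in FLAG_RULES.items():
        row[variable] = int(any(phrase in text for phrase in phrases))
    return row


def to_csv(rows):
    """Serialize rows as comma-delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["doc_id", *FLAG_RULES], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Invented sample texts standing in for the real documents.
docs = {
    "case_001": "ПРИГОВОР\nСуд установил: ...",
    "case_002": "ПОСТАНОВЛЕНИЕ\n...",
}
rows = [code_document(doc_id, text) for doc_id, text in docs.items()]
csv_text = to_csv(rows)
print(csv_text)
# doc_id,mentions_verdict,mentions_established
# case_001,1,1
# case_002,0,0
```

In a real run the text would be written to a file and imported into Stata as delimited data, and the first pass over the full set of documents would show which table entries need revising.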
I am uploading the following to help you figure out whether you can do this project: a document with just a few sample rules (“Sample Rules”), a document with just 2 samples of the cases from which the data is to be extracted (“Sample Cases”), and an Excel document that gives a sense of what I need the final product to look like (“Sample Output”) – although it need not be in Excel, of course.
29 freelancers are bidding an average of $2159 for this job.
Hello, I'm a Python developer and I'm very interested in your project. The Cyrillic alphabet is native for me because I'm from Ukraine. Please kindly provide more details about the project requirements. Thanks.
Hello, you can place your confidence in my Python and Russian/Cyrillic knowledge and experience. Please feel free to ask for any information. Greetings, srdjan
I am experienced with Word/Excel automation tasks and I am a native Russian speaker, so I can understand these documents perfectly. Feel free to ask me for a demo application if you are interested.