
Text mining in Python to identify interesting document files

$30-250 CAD

Completed
Posted more than 11 years ago

Paid on delivery
I want to do a simple text mining task on a large number of files with Python code. The files are stored in a few large network shares and add up to about a million files of about 100 different filetypes. The "text and document" filetypes that are considered extra interesting are Microsoft Office files, PDF and text (doc, docx, xls, xlsx, xlsm, ppt, pptx, pdf, txt, html, xml, etc.). There are also a few binary files, movies and others that are considered less interesting.

I have a list of about 100 interesting words in a text file (txt), one word per row. I want to identify all files that contain one or more instances of these words in their filename, path or file contents.

I would like to get this task solved in Python. I am not an experienced Python programmer, and I would like the code to be well written, well annotated and easy to modify. Communication and code should be in English. The code should preferably work on Windows, Mac OS X and Linux.

I would like:

1) A script to list all the files (not folders) in a network share. Number the list, one file per row. List interesting file information, with columns separated by semicolons (;). Something like this:

File Counter; Full path; creation date; modification date; file owner; filetype; etc.
1; C:/mypath/[login to view URL]; date; date; owner; text document
2; C:/mypath/[login to view URL]; date; date; owner; Microsoft Word

2) One or several scripts to scan the names and contents of the files from script 1), based on the file of interesting words. "Document filetypes and textfiles" (see above) shall be scanned for both the content and the full path (filename + path). Other files don't have to be scanned for content but need to be scanned for the full path. The script(s) shall report the action on each file and the result (name scan: yes/no/error, content scan: yes/no/error). If an error occurs with a file (e.g. when reading or parsing), this needs to be stated in the result of the file scan, but it should not interrupt the scan. The number of matches in the content, and which words matched, should be stated. The number of unique matched words from the word list (i.e. between 0 and about 100) should also be stated for each file.

The final output should be a list similar to the one in 1), but with additional columns, e.g.:

Path Scan; Content Scan; #Matches; Words matched; #Unique matches
(Yes/No/Error); (Yes/No/Error); Integer; Monkey, Bananas; 0-100

Provide me with a way to run this search as easily and quickly as possible. There are a lot of files and speed is important. I don't want to wait two weeks for the search, and I don't want to find out that the scripts ran into an error after three hours and stopped.

Bidders with recommendations, high evaluations and a strong background in Python and text mining are preferred.
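A minimal sketch of how the two scripts described above could fit together, not a finished solution. It assumes Python 3, reads only plain-text filetypes (txt, html, xml) for content, and treats a word as matched when it appears as a case-insensitive substring. Office and PDF parsing, the file owner column and any parallelism are left out. The function names (`load_words`, `count_matches`, `scan_file`, `scan_tree`), the `Skipped` value for files whose content is not read, and the reading of Yes/No as "match found / no match" are all assumptions made for illustration.

```python
import csv
import os
from datetime import datetime

# Assumption: only these extensions are read as text in this sketch.
# Office and PDF files would need third-party parsers.
TEXT_EXTENSIONS = {".txt", ".html", ".xml"}

HEADER = ["File Counter", "Full path", "Creation date", "Modification date",
          "Filetype", "Path Scan", "Content Scan", "#Matches",
          "Words matched", "#Unique matches"]


def load_words(path):
    """Read the word list, one word per row, skipping blank rows."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def count_matches(text, words):
    """Return {word: count} for case-insensitive substring matches."""
    lowered = text.lower()
    counts = {word: lowered.count(word.lower()) for word in words}
    return {word: n for word, n in counts.items() if n}


def format_time(timestamp):
    """Format a POSIX timestamp as local date and time."""
    return datetime.fromtimestamp(timestamp).strftime("%Y-%m-%d %H:%M:%S")


def scan_file(counter, path, words):
    """Build one output row; OS errors are recorded in the row, not raised."""
    extension = os.path.splitext(path)[1].lower()
    created = modified = ""
    content_scan, hits = "Skipped", {}
    path_scan = "Yes" if count_matches(path, words) else "No"
    try:
        info = os.stat(path)
        # st_birthtime is not available everywhere; the st_ctime fallback is
        # the metadata change time on Linux, not the creation time.
        created = format_time(getattr(info, "st_birthtime", info.st_ctime))
        modified = format_time(info.st_mtime)
        if extension in TEXT_EXTENSIONS:
            with open(path, encoding="utf-8", errors="replace") as handle:
                hits = count_matches(handle.read(), words)
            content_scan = "Yes" if hits else "No"
    except OSError:
        # Metadata or content could not be read; report it and carry on.
        content_scan = "Error"
    return [counter, path, created, modified, extension, path_scan,
            content_scan, sum(hits.values()), ", ".join(sorted(hits)),
            len(hits)]


def scan_tree(root, words, out_path):
    """Walk root, write one semicolon-separated row per file, return the count."""
    counter = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out, delimiter=";")
        writer.writerow(HEADER)
        for folder, _subfolders, filenames in os.walk(root):
            for name in sorted(filenames):
                counter += 1
                full_path = os.path.join(folder, name)
                writer.writerow(scan_file(counter, full_path, words))
    return counter
```

For example, `scan_tree("/mnt/share", load_words("words.txt"), "result.csv")` would write one row per file; the paths here are placeholders. A file that raises an `OSError` while being read gets `Error` in its row, and the scan moves on to the next file.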
Project ID: 2473761

About the project

7 proposals
Remote project
Active 12 years ago

Awarded to:
I have significant experience with data extraction and machine learning, so creating a script like this should not be a problem for me. Documenting the code and communicating with you won't be an issue either, as I speak fluent English. Please check your PMB for some more info.
$220 CAD in 10 days
4.8 (20 reviews)
5.3
7 freelancers are bidding on average $203 CAD for this job
I have experience in text mining, but I will have to learn some things about network interfacing. I will complete this project in the time allotted.
$250 CAD in 14 days
4.9 (8 reviews)
4.7
I have very good experience in data mining algorithms and Python.
$200 CAD in 10 days
5.0 (2 reviews)
3.2
Sounds like a hefty challenge. I'm up for that. Based in Toronto
$250 CAD in 3 days
0.0 (1 review)
2.3
Custom software development (Removed by Admin)
$250 CAD in 1 day
0.0 (0 reviews)
0.0
Being a system administrator who deals in Python, this is right up my alley. We'll just have to clarify what your network setup is.
$200 CAD in 3 days
0.0 (0 reviews)
0.0
I have extensive knowledge of text mining and data mining, as I took courses related to those while at school (Stanford University).
$50 CAD in 2 days
0.0 (0 reviews)
0.0

About the client

Tullinge, Sweden
5.0
1
Payment method verified
Member since Sep 9, 2012

Freelancer ® is a registered Trademark of Freelancer Technology Pty Limited (ACN 142 189 759)
Copyright © 2024 Freelancer Technology Pty Limited (ACN 142 189 759)