This work was done for the Code Vector screening test.
- Downloaded 50 LinkedIn profiles, which are saved in the `LinkedInProfiles` folder of the repository.
`converter.py` contains the script for Tasks 2 and 3.
- Extracted the text from the first line and saved it in the first column of the task file.
- Extracted the 10 most frequently used words from every profile's data, excluding stopwords.
- Used tf-idf (term frequency-inverse document frequency) to score every word and return the 10 most important words (excluding stopwords), i.e. the words with the highest tf-idf scores.
The functions written for the tf-idf implementation have been extracted to `tfidf.py`.
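
The two word rankings described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the code in `converter.py` or `tfidf.py`: the function names, the tiny stopword list, and the smoothed idf formula `log((1 + N) / (1 + df)) + 1` are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (assumption; the repository may use a fuller one).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "for", "with", "on", "at"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def top_frequent(text, n=10):
    """Return the n most frequent non-stopword tokens in the text."""
    return [word for word, _ in Counter(tokenize(text)).most_common(n)]


def compute_idf(documents):
    """Compute a smoothed idf for every word seen in the document collection."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each word at most once per document.
        doc_freq.update(set(tokenize(doc)))
    return {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}


def top_tfidf(text, idf, n=10):
    """Return the n tokens with the highest tf-idf score in the text."""
    tokens = tokenize(text)
    if not tokens:
        return []
    counts = Counter(tokens)
    # Words missing from the collection are treated as maximally rare (assumption).
    default_idf = max(idf.values(), default=1.0)
    scores = {
        w: (c / len(tokens)) * idf.get(w, default_idf) for w, c in counts.items()
    }
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:n]
```

In this sketch, `compute_idf` would be called once over the text of all downloaded profiles, and `top_frequent` and `top_tfidf` would then be called per profile.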
Install the dependencies listed in `requirements.txt` in your environment and run `python converter.py` to generate the required task file. The code is commented for explanation.
I have used Django REST Framework to build the API. Navigate to the Django project `converter_api` to access it. The Django app `convert` contains the required APIs.
Run `python manage.py runserver` to serve it locally. Make sure you have installed the requirements first.
- At `pdf_to_text/` there is an API which takes `pdf_file` as input and returns the `text` within it as output.
- At `text_to_info/` there is an API which takes `text` as input and returns the top 10 most frequent words and the top 10 most important words in that piece of text. I have used tf-idf for the latter, calculating idf with respect to the data collected in Task-3.
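
As a rough illustration of calling the `text_to_info/` endpoint from a client, the sketch below builds the request with the standard library. Several details are assumptions, not facts from this repository: that the endpoint is mounted at the server root, that it accepts a `POST` with a JSON body, and that it replies with JSON. The helper names are hypothetical. The address `127.0.0.1:8000` is the default used by `runserver`.

```python
import json
from urllib import request

# Default address of Django's development server.
BASE_URL = "http://127.0.0.1:8000"


def build_text_to_info_request(text, base_url=BASE_URL):
    """Build (but do not send) a request carrying `text` to text_to_info/."""
    # Assumption: the API accepts a JSON body with a "text" field via POST.
    payload = json.dumps({"text": text}).encode("utf-8")
    return request.Request(
        url=f"{base_url}/text_to_info/",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_text_to_info(text, base_url=BASE_URL):
    """Send the request to a running server and decode the reply."""
    # Assumption: the response body is JSON.
    with request.urlopen(build_text_to_info_request(text, base_url)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `call_text_to_info("some profile text")` would return whatever the API responds with. If the app's URLs are included under a prefix in the project's URL configuration, `base_url` needs to include that prefix.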