Alex Hales
Asked: August 17, 2022 In: Elasticsearch

How to index a .PDF file in ElasticSearch


I found the code below in "Pdf to elastic search".
It extracts text from PDF files and puts it into Elasticsearch.

import PyPDF2
import re
import requests
import json
import os
from datetime import date

class ElasticModel:

    name = ""
    msg = ""

    def toJSON(self):
        return json.dumps(self, default=lambda o: o.__dict__, 
            sort_keys=True, indent=4)

def __readPDF__(path):
    # open the pdf in binary mode; the with-block closes the file when done
    with open(path, 'rb') as pdfFileObj:
        # pdf reader object (PdfReader replaces PdfFileReader, which PyPDF2 3.x removed)
        pdfReader = PyPDF2.PdfReader(pdfFileObj)
        # number of pages in pdf
        print(len(pdfReader.pages))
        # a page object; note that only the first page is read
        pageObj = pdfReader.pages[0]
        # extract the text of the page as a string
        line = pageObj.extract_text()
    # join the page text into a single line
    line = line.replace("\n", "")
    print(line)
    return line


#line = pageObj.extractText()

def __prepareElasticModel__(line, name):
    eModel = ElasticModel()

    eModel.name = name
    eModel.msg = line
    return eModel


def __sendToElasticSearch__(elasticModel):
    # use the function argument, not the global eModel
    print("Name : " + str(elasticModel.name))

    ############################################
    # CHANGE INDEX NAME IF NEEDED
    ############################################
    index = "samplepdf"

    url = "http://localhost:9200/" + index + "/_doc?pretty"
    # serialize the model to the JSON body of the request
    data = elasticModel.toJSON()
    response = requests.post(url, data=data, headers={
        'Content-Type': 'application/json',
        'Accept-Language': 'en'
    })
    print("Url : " + url)
    print("Data : " + str(data))

    # response.request is the request that was actually sent
    print("Request : " + str(response.request))
    print("Response : " + str(response))


#################################
#Change pdf dir path
###################################
pdfdir = "C:/Users/abhis/Desktop/TemplatesPDF/SamplePdf"

listFiles = os.listdir(pdfdir)
for file in listFiles:
    # build the full path to the pdf file
    path = os.path.join(pdfdir, file)
    print(path)

    line = __readPDF__(path)
    eModel = __prepareElasticModel__(line, file)
    __sendToElasticSearch__(eModel)

The above code extracts the text of the sample PDF.
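For reference, the document posted for each file is a small JSON object with the two fields name and msg. Here is a minimal sketch of building that request without sending it, assuming the index name samplepdf and the local host http://localhost:9200 from the code above. The function name build_index_request is hypothetical and not part of the original script.

```python
import json


def build_index_request(index, name, msg, host="http://localhost:9200"):
    """Return the (url, body) pair that would be POSTed for one PDF.

    Hypothetical helper for illustration; nothing is sent over the network.
    """
    # Same URL shape as the script: <host>/<index>/_doc?pretty
    url = f"{host}/{index}/_doc?pretty"
    # Same two fields the ElasticModel class carries
    body = json.dumps({"name": name, "msg": msg}, sort_keys=True, indent=4)
    return url, body


url, body = build_index_request("samplepdf", "sample.pdf", "Hello from page one")
print(url)
print(body)
```

Printing the pair before calling requests.post is a cheap way to check the payload while Elasticsearch is not running.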


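From the sample PDF above, a few fields (such as name and msg) are extracted and inserted into Elasticsearch. Note that the code shown imports re but never applies a regex: it stores the file name as name and the first page's text as msg. Hope this helps.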
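If individual fields do need to be pulled out of the extracted text with a regex, it might look like the sketch below. The layout of the sample PDF is not shown here, so the "Name: ... Msg: ..." labels, the pattern, and the function name extract_fields are all assumptions for illustration only.

```python
import re

# Assumed layout (hypothetical): the page text contains "Name: <value>"
# followed later by "Msg: <value>", all on one line after newlines are removed.
FIELD_PATTERN = re.compile(
    r"Name\s*:\s*(?P<name>.*?)\s*Msg\s*:\s*(?P<msg>.*)",
    re.DOTALL,
)


def extract_fields(text):
    """Return {'name': ..., 'msg': ...} or None if the labels are missing."""
    match = FIELD_PATTERN.search(text)
    if match is None:
        return None
    return {
        "name": match.group("name").strip(),
        "msg": match.group("msg").strip(),
    }


sample = "Name: John Doe Msg: This is a sample message."
print(extract_fields(sample))
```

The returned dictionary could then be assigned to the name and msg attributes of the model before sending. Returning None when nothing matches lets the caller skip PDFs that do not follow the assumed layout.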
