Java program using Apache Tika to extract text from various formats into a String object

Apache Tika can extract plain text from many file formats, including Microsoft Office documents and PDFs.

The tika-app jar can extract plain text from those files and print it to the console or to a file.

This sample Java program uses the same tika-app jar as a library and calls the Parser API to get the text into a String object.

TextExtractor.java

 

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

import org.xml.sax.ContentHandler;

class TextExtractor {
    private OutputStream outputstream;
    private ParseContext context;
    private Detector detector;
    private Parser parser;
    private Metadata metadata;
    private String extractedText;

    public TextExtractor() {
        context = new ParseContext();
        detector = new DefaultDetector();
        parser = new AutoDetectParser(detector);
        // Register the parser in the context so embedded documents are parsed too
        context.set(Parser.class, parser);
        outputstream = new ByteArrayOutputStream();
        metadata = new Metadata();
    }

    // Parses a local file or a URL and collects the body text in outputstream
    public void process(String filename) throws Exception {
        URL url;
        File file = new File(filename);
        if (file.isFile()) {
            url = file.toURI().toURL();
        } else {
            // Not a local file, so treat the argument as a URL
            url = new URL(filename);
        }
        InputStream input = TikaInputStream.get(url, metadata);
        try {
            ContentHandler handler = new BodyContentHandler(outputstream);
            parser.parse(input, handler, metadata, context);
        } finally {
            // Close the stream even if parsing fails
            input.close();
        }
    }

    public String getString() {
        // Get the text into a String object
        extractedText = outputstream.toString();
        return extractedText;
    }

    public static void main(String args[]) throws Exception {
        if (args.length == 1) {
            TextExtractor textExtractor = new TextExtractor();
            textExtractor.process(args[0]);
            // Do whatever you want with this String object.
            System.out.println(textExtractor.getString());
        } else {
            System.err.println("Usage: java TextExtractor <file-or-url>");
            System.exit(1);
        }
    }
}

 

 

Compile:

javac -cp ".:tika-app-1.0.jar" TextExtractor.java

 

Run:

java -cp ".:tika-app-1.0.jar" TextExtractor SomeWordDocument.doc

Note: Replace ":" with ";" if you are on Windows.

 

Posted via email from Art, Science & Technology

Python script to create files/folders from a template


I was following this Jenkins plugin tutorial today, http://javaadventure.blogspot.in/2008/01/writing-hudson-plug-in-part-1.html, which required me to create several directory structures. Creating them by hand was tedious, so I wrote a Python program that creates a directory structure by reading a template file.

The program takes a template file that describes the directory structure to create. The template file should look something like this:

foldername/
    filename.ext
    subfoldername/
        subsubfoldername/
            anotherfile.txt
    someotherfile.xml

The names and nesting can be anything you like. There are only two syntax requirements:
1. Indentation must be uniform (like in a Python program).
2. Folder names must end with a '/'; otherwise the program can't tell which entries are folders and which are files.
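
The two rules above are enough to classify any single template line. Here is a minimal sketch of that classification; `parse_line` is a hypothetical helper name used only for illustration and is not part of the script in this post.

```python
def parse_line(line):
    """Return (indent, name, is_folder), or None for a blank line.

    Hypothetical helper for illustration: indent is the number of leading
    whitespace characters, and a trailing '/' marks a folder.
    """
    name = line.strip()
    if not name:
        return None
    indent = len(line) - len(line.lstrip())
    return indent, name, name.endswith('/')


if __name__ == "__main__":
    for raw in ["foldername/\n", "    filename.ext\n", "\n"]:
        print(parse_line(raw))
```

Returning None for blank lines keeps the "skip empty lines" decision in one place instead of encoding it as a special indent value.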


 

Code:

 

#!/usr/bin/env python3
import os
import sys

# I'll be using this as a stack, pushing and popping indent levels
indentlevels = [0]

# Open the template file given as the first argument
f = open(sys.argv[1])

# If the argument is /hello/structure.txt, PARENTPATH should be /hello/
# (separators are normalised to '/' so the slicing below also works on Windows)
PARENTPATH = os.path.abspath(f.name).replace(os.sep, '/')
PARENTPATH = PARENTPATH[:PARENTPATH.rindex('/') + 1]

# We'll be reading line by line. This variable stores the previous line.
previous = ""

# Count the leading whitespace characters in a line.
# Blank or whitespace-only lines return -1 so they can be ignored later.
def countSpaces(data):
    return len(data) - len(data.lstrip()) if data.strip() else -1

# This function creates the files or folders
def touch(data):
    if data.endswith('/'):
        os.mkdir(PARENTPATH + data)
    else:
        open(PARENTPATH + data, "w").close()

# Main program starts. Iterate through the lines.
for line in f:
    # Get the indent level
    indentlevel = countSpaces(line)

    # Remove leading and trailing whitespace and keep only the name
    line = line.strip()

    # Ignore empty lines and continue the loop
    if indentlevel == -1:
        continue

    # If the indent increased,
    if indentlevel > indentlevels[-1]:
        # check whether the previous line ended with '/'.
        # Files and folders can go inside a folder, but not inside a file.
        if not previous.endswith('/'):
            print("SYNTAX ERROR.. You can indent a line further only if the above line specifies a folder (ends with /)")
            sys.exit(1)

        # Add the new indent level to indentlevels
        indentlevels.append(indentlevel)
        # and append the previous line as a folder level to PARENTPATH
        PARENTPATH = PARENTPATH + previous

    # If the indent level is reduced,
    elif indentlevel < indentlevels[-1]:
        # pop the last element from indentlevels
        indentlevels.pop()  # ref1

        # If the indent is reduced by more than one step (eg. dirstruct4),
        # pop indentlevels until the top equals indentlevel
        # and do the same for the parent path
        while indentlevels[-1] > indentlevel:
            indentlevels.pop()
            # Because I want /hai/bai/ to become /hai/
            PARENTPATH = PARENTPATH[:PARENTPATH[:-1].rindex('/') + 1]

        # If this check fails, the file is not indented uniformly
        if indentlevel != indentlevels[-1]:
            print("SYNTAX ERROR.. INDENTATION MUST BE UNIFORM")
            sys.exit(1)

        # Required for the pop on the line marked '#ref1'
        PARENTPATH = PARENTPATH[:PARENTPATH[:-1].rindex('/') + 1]

    # else: indentlevel == indentlevels[-1], so PARENTPATH is unchanged
    touch(line)
    previous = line

f.close()
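
One limitation worth knowing: os.mkdir raises an error if the folder already exists, so running the script a second time against the same template stops at the first folder. A possible variation, not what the script above does, is to create each entry in a way that tolerates existing targets. The helper name `create_entry` and the temporary base folder are assumptions made for this sketch.

```python
import os
import tempfile


def create_entry(base, relpath):
    """Create a folder (trailing '/') or an empty file under base.

    Hypothetical helper: unlike os.mkdir, it does not fail when the
    target already exists.
    """
    target = os.path.join(base, relpath)
    if relpath.endswith('/'):
        os.makedirs(target, exist_ok=True)
    else:
        parent = os.path.dirname(target)
        if parent:
            os.makedirs(parent, exist_ok=True)
        # Append mode creates the file if missing and keeps existing content
        open(target, "a").close()
    return target


if __name__ == "__main__":
    base = tempfile.mkdtemp()
    for _ in range(2):  # the second pass succeeds instead of raising
        create_entry(base, "foldername/")
        create_entry(base, "foldername/filename.ext")
    print(sorted(os.listdir(os.path.join(base, "foldername"))))
```

Opening files in append mode rather than "w" also means an existing file is not truncated on a rerun.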

 

 

Try this sample dir structure:

project/
   SimpleM2Project/
       pom.xml
       src/
           main/
               java/
                   com/
                       onedash/
                           hello/
                               Hello.java
           test/
               java/
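
Before creating anything on disk, it can help to preview which paths a template describes. The sketch below is a dry run under stated assumptions: `resolve_paths` is a hypothetical helper, it uses the same indent-stack idea as the script, and it skips the script's syntax checks (it does not complain about uneven indentation or about lines nested under a file).

```python
def resolve_paths(template_text):
    """Return the relative paths a template describes, in order.

    Hypothetical dry-run helper: it only builds path strings and
    creates nothing on disk.
    """
    paths = []
    stack = []  # (indent, folder path) for each currently open folder
    for raw in template_text.splitlines():
        name = raw.strip()
        if not name:
            continue
        indent = len(raw) - len(raw.lstrip())
        # Close folders that are not ancestors of this line
        while stack and stack[-1][0] >= indent:
            stack.pop()
        parent = stack[-1][1] if stack else ""
        path = parent + name
        paths.append(path)
        if name.endswith('/'):
            stack.append((indent, path))
    return paths


if __name__ == "__main__":
    sample = (
        "project/\n"
        "   SimpleM2Project/\n"
        "       pom.xml\n"
        "       src/\n"
        "           main/\n"
        "           test/\n"
    )
    for p in resolve_paths(sample):
        print(p)
```

For the shortened sample in the demo, the last line printed is project/SimpleM2Project/src/test/, which shows the stack dropping back from main/ to src/ when the indent returns to the same level.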
