Java program using Apache Tika to extract text from various formats into a String object

Apache Tika can extract plain text from many formats, such as Microsoft Office files and PDFs.

The Tika app JAR can extract plain text from those files and write it to a file or to the console.

This sample Java program uses the Tika JAR as a library and calls the Parser API to get the text into a String object.

TextExtractor.java

 

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

import org.xml.sax.ContentHandler;

class TextExtractor {
    private OutputStream outputstream;
    private ParseContext context;
    private Detector detector;
    private Parser parser;
    private Metadata metadata;
    private String extractedText;

    public TextExtractor() {
        context = new ParseContext();
        detector = new DefaultDetector();
        parser = new AutoDetectParser(detector);
        // Register the parser so embedded documents are parsed too
        context.set(Parser.class, parser);
        outputstream = new ByteArrayOutputStream();
        metadata = new Metadata();
    }

    public void process(String filename) throws Exception {
        // Accept either a local file name or a URL
        URL url;
        File file = new File(filename);
        if (file.isFile()) {
            url = file.toURI().toURL();
        } else {
            url = new URL(filename);
        }
        InputStream input = TikaInputStream.get(url, metadata);
        try {
            ContentHandler handler = new BodyContentHandler(outputstream);
            parser.parse(input, handler, metadata, context);
        } finally {
            // Close the stream even if parsing fails
            input.close();
        }
    }

    public void getString() {
        //Get the text into a String object
        extractedText = outputstream.toString();
        //Do whatever you want with this String object.
        System.out.println(extractedText);
    }

    public static void main(String args[]) throws Exception {
        if (args.length == 1) {
            TextExtractor textExtractor = new TextExtractor();
            textExtractor.process(args[0]);
            textExtractor.getString();
        } else {
            throw new Exception("Usage: java TextExtractor <file or URL>");
        }
    }
}

 

 

Compile:

javac -cp ".:tika-app-1.0.jar" TextExtractor.java

 

Run:

java -cp ".:tika-app-1.0.jar" TextExtractor SomeWordDocument.doc

Note: Replace ":" with ";" if you are on Windows.
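
If you would rather not remember which separator goes with which platform, a small helper can build the classpath for you. This is only a sketch in Python: build_classpath is a name made up for this example, and it assumes the jar is called tika-app-1.0.jar and sits in the current directory, as in the commands above.

```python
import os


def build_classpath(entries):
    """Join classpath entries with this platform's path separator."""
    # os.pathsep is ":" on Linux/macOS and ";" on Windows
    return os.pathsep.join(entries)


classpath = build_classpath([".", "tika-app-1.0.jar"])
print('javac -cp "%s" TextExtractor.java' % classpath)
print('java -cp "%s" TextExtractor SomeWordDocument.doc' % classpath)
```

The script only prints the two commands; copy them into your shell to run them.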

 

Posted via email from Art, Science & Technology

Python script to create files/folders from a template


I was following this Jenkins plugin tutorial today, http://javaadventure.blogspot.in/2008/01/writing-hudson-plug-in-part-1.html, which required me to create several directory structures. Creating them by hand was tedious, so I wrote a Python program that creates a directory structure by reading a template file.

You pass it a template file that specifies the directory structure to create. The template file should look something like this:

foldername/
    filename.ext
    subfoldername/
        subsubfoldername/
            anotherfile.txt
    someotherfile.xml

It can be anything like that. There are only two syntax requirements:
1. Indentation must be uniform (as in a Python program).
2. Folder names must end with a '/'; otherwise the program can't tell which entries are folders and which are files.
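
To make the two rules concrete, here is a minimal sketch that turns a template into a flat list of relative paths without touching the disk. It is not the script from this post: template_to_paths is a name invented for this illustration, and it raises ValueError where the real script prints a syntax error.

```python
def template_to_paths(text):
    """Return the relative paths described by an indented template."""
    paths = []
    levels = [0]    # stack of indent widths that are currently open
    parents = []    # folder names leading down to the current level
    previous = ""   # the last non-blank entry seen
    for raw in text.splitlines():
        name = raw.strip()
        if not name:
            continue  # skip blank lines
        indent = len(raw) - len(raw.lstrip())
        if indent > levels[-1]:
            # Rule 2: only a folder (name ending in '/') can have children
            if not previous.endswith("/"):
                raise ValueError("only a folder can have indented children")
            levels.append(indent)
            parents.append(previous)
        else:
            # Step back out until we reach a level we have seen before
            while indent < levels[-1]:
                levels.pop()
                parents.pop()
            # Rule 1: the indent must match an already open level
            if indent != levels[-1]:
                raise ValueError("indentation must be uniform")
        paths.append("".join(parents) + name)
        previous = name
    return paths


TEMPLATE = """foldername/
    filename.ext
    subfoldername/
        subsubfoldername/
            anotherfile.txt
    someotherfile.xml
"""

for path in template_to_paths(TEMPLATE):
    print(path)
```

Entries ending in '/' are folders and everything else is a file, so the list can be fed to os.mkdir and open in order.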

[Screenshot: Pydir3]

 

Code:

 

#!/usr/bin/python
import os
import sys

# I'll be using this as a stack, pushing and popping indent levels
indentlevels = [0]

# Open the template file given as the first argument
f = open(sys.argv[1])

# If the argument is /hello/structure.txt, PARENTPATH should be /hello/
PARENTPATH = os.path.dirname(os.path.abspath(f.name)).rstrip('/') + '/'

# We'll be reading line by line. This variable stores the previous line.
previous = ""

# Count the leading whitespace characters in a line.
# Blank lines return -1 so that they can be ignored later.
def countSpaces(data):
    if not data.strip():
        return -1
    return len(data) - len(data.lstrip())

# This function creates the file or folder
def touch(data):
    if data.endswith('/'):
        # Skip folders that already exist so the script can be re-run
        if not os.path.isdir(PARENTPATH + data):
            os.mkdir(PARENTPATH + data)
    else:
        open(PARENTPATH + data, "w").close()

# Main program starts. Iterate through the lines.
for line in f:
    # Get the indent level
    indentlevel = countSpaces(line)

    # Remove leading and trailing spaces and keep only the name
    line = line.strip()

    # Ignore empty lines and continue the loop
    if indentlevel == -1:
        continue

    # If the indent is increased,
    if indentlevel > indentlevels[-1]:
        # check whether the previous line ended with '/',
        # because files and folders can go inside a folder but not inside a file
        if not previous.endswith('/'):
            print("SYNTAX ERROR.. You can indent a line further only if the above line specifies a folder (ends with /)")
            sys.exit(1)

        # add the new indent level to indentlevels
        indentlevels.append(indentlevel)
        # and append the previous line as a folder level to PARENTPATH
        PARENTPATH = PARENTPATH + previous

    # If the indent level is reduced,
    elif indentlevel < indentlevels[-1]:
        # pop the last element from indentlevels
        indentlevels.pop()  # ref1

        # If the indent is reduced by more than one step (eg. dirstruct4),
        # pop indentlevels until it matches indentlevel
        # and do the same for the parent path
        while indentlevels[-1] > indentlevel:
            indentlevels.pop()
            # Because I want /hai/bai/ to become /hai/
            PARENTPATH = PARENTPATH[:PARENTPATH[:-1].rindex('/') + 1]

        # If this condition fails, the file is not indented uniformly
        if indentlevel != indentlevels[-1]:
            print("SYNTAX ERROR.. INDENTATION MUST BE UNIFORM")
            sys.exit(1)

        # required for line '# ref1'
        PARENTPATH = PARENTPATH[:PARENTPATH[:-1].rindex('/') + 1]

    # else: indentlevel == indentlevels[-1], nothing to adjust
    touch(line)
    previous = line

f.close()

 

 

Try this sample dir structure:

project/
    SimpleM2Project/
        pom.xml
        src/
            main/
                java/
                    com/
                        onedash/
                            hello/
                                Hello.java
            test/
                java/
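
To check what the script created, you can walk the new tree and print it back in the same template format. The following is a rough sketch for that: tree_to_template and the 4-space indent are choices made for this example, not part of the script above.

```python
import os


def tree_to_template(root, indent=4):
    """Describe the contents of root in the indented template format."""
    lines = []

    def walk(folder, depth):
        # Sort the names so the output order is predictable
        for name in sorted(os.listdir(folder)):
            path = os.path.join(folder, name)
            prefix = " " * (indent * depth)
            if os.path.isdir(path):
                lines.append(prefix + name + "/")
                walk(path, depth + 1)
            else:
                lines.append(prefix + name)

    walk(root, 0)
    return "\n".join(lines)


if __name__ == "__main__":
    # Describe the current directory; pass another path to inspect it instead
    print(tree_to_template("."))
```

Run it in the folder that holds the template file and compare the output with the template. Entries are listed alphabetically, so the order may differ from the order you wrote them in.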


Inspiring Quotes from the author of "Learn Python the Hard Way"

I read the book "Learn Python the Hard Way" some time ago. The book itself is very basic, since it is intended for people new to programming, so it wasn't really my type of book. What I liked was the epilogue at the end, called "Advice From An Old Programmer."

Link to that:

http://learnpythonthehardway.org/book/advice.html

I really, really loved it. The following are my favorites:

My manager used to say this frequently:

Which programming language you learn and use doesn't matter. Do not get sucked into the religion surrounding programming languages as that will only blind you to their true purpose of being your tool for doing interesting things.

 

Programming is an art form. This is how I have always thought of it:

Programming as an intellectual activity is the only art form that allows you to create interactive art. You can create projects that other people can play with, and you can talk to them indirectly. No other art form is quite this interactive. Movies flow to the audience in one direction. Paintings do not move. Code goes both ways.

 

Sad truth:

People who can code in the world of technology companies are a dime a dozen and get no respect. People who can code in biology, medicine, government, sociology, physics, history, and mathematics are respected and can do amazing things to advance those disciplines.

 

Finally, my favorite part of all, because of my own experiences:

Finally, I'll say that learning to create software changes you and makes you different. Not better or worse, just different. You may find that people treat you harshly because you can create software, maybe using words like "nerd". Maybe you'll find that because you can dissect their logic that they hate arguing with you. You may even find that simply knowing how a computer works makes you annoying and weird to them.

To this I have just one piece of advice: they can go to hell. The world needs more weird people who know how things work and who love to figure it all out. When they treat you like this, just remember that this is your journey, not theirs. Being different is not a crime, and people who tell you it is are just jealous that you've picked up a skill they never in their wildest dreams could acquire.

You can code. They cannot. That is pretty damn cool.
