Lucene Full Text Search in Java

October 31, 2009

Categorized Under: Java

 
This tutorial is based on the following software environment.

Windows Vista Home Premium, Eclipse 3.5, JDK 1.6, Tomcat 6, Lucene 2.9, MySQL 5, MySQL JDBC driver 5.1

Imagine you run a news portal that publishes a number of articles every day. Your customers see the latest news on the home page and inner pages. Now suppose a customer wants to find a particular article that was published years ago. This is where full text search becomes important. Full text search means searching the entire content based on user input. Google, for example, is an Internet search engine giant that crawls websites, indexes them, and provides fast and relevant search results.

Apache Lucene is an open source Java library for indexing and searching text. The Lucene API consists of core and contributed packages. Lucene does not know anything about web crawling, database access and so on. Any data to be indexed is represented as a Document object, in which the information is kept in various fields. The index is stored in files under a specified directory. Lucene returns a collection of Document objects as the search result.

So Lucene simply indexes and searches based on a unit called Document. To index, you create a Document object, create Field objects, add the Fields to the Document, and index it. Suppose you want to add full text search for Employee information held in an RDBMS. The generalized steps are given below.

  • Get a reference to the collection of all records in the Employee table using JDBC or any other API.
  • Iterate through the Employee records. For each record, create a Document object, create a Field for every column that should be indexed, populate each Field with the name and value of its column, and add the Fields to the Document.
  • Create the index from these Documents using the Lucene API.
  • When a user searches, execute the search against the index. Lucene returns the Document objects in which the search terms are found. The search is executed against the fields of the Document objects.
  • Iterate through the returned Documents and display the results.
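
To make the steps above concrete, here is a minimal sketch of the index-then-search idea in plain Java, with no Lucene involved. It is only an illustration and not how Lucene stores its index: the class name ToyIndex, its methods and the "field:term" key format are made up for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Toy illustration only: this is NOT how Lucene stores its index.
class ToyIndex {
    // Each "document" is a map of field name -> field value.
    private final List<Map<String, String>> documents =
            new ArrayList<Map<String, String>>();
    // Inverted index: "field:term" -> ids of the documents containing that term.
    private final Map<String, TreeSet<Integer>> postings =
            new HashMap<String, TreeSet<Integer>>();

    // Index one document and return its internal id.
    int addDocument(Map<String, String> fields) {
        int docId = documents.size();
        documents.add(fields);
        for (Map.Entry<String, String> field : fields.entrySet()) {
            String[] terms = field.getValue().toLowerCase(Locale.ENGLISH).split("\\s+");
            for (String term : terms) {
                if (term.length() == 0) {
                    continue; // skip the empty piece produced by leading whitespace
                }
                String key = field.getKey() + ":" + term;
                TreeSet<Integer> ids = postings.get(key);
                if (ids == null) {
                    ids = new TreeSet<Integer>();
                    postings.put(key, ids);
                }
                ids.add(docId);
            }
        }
        return docId;
    }

    // Return the documents whose given field contains the given word.
    List<Map<String, String>> search(String field, String word) {
        List<Map<String, String>> hits = new ArrayList<Map<String, String>>();
        TreeSet<Integer> ids = postings.get(field + ":" + word.toLowerCase(Locale.ENGLISH));
        if (ids != null) {
            for (int id : ids) {
                hits.add(documents.get(id));
            }
        }
        return hits;
    }

    // Convenience helper that builds an employee "document".
    static Map<String, String> employee(String name, String age, String designation) {
        Map<String, String> doc = new LinkedHashMap<String, String>();
        doc.put("name", name);
        doc.put("age", age);
        doc.put("designation", designation);
        return doc;
    }
}
```

Adding two employees and then calling search("designation", "programmer") returns only the documents whose designation field contains that word, which is the same round trip the Lucene code in this tutorial performs with Document, Field, IndexWriter and IndexSearcher.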

 

The Lucene API is vast: in addition to the core API it includes a number of contributed APIs such as highlighter and spellchecker. Let us examine some important components that are commonly used in Lucene-based search implementations.

  • org.apache.lucene.document.Document – The unit of indexing and search. You create Document objects to index data, and you receive Document objects back as search results.
  • org.apache.lucene.document.Field – Fields are the sub-units of a Document. Every field has a name and a corresponding value. Values are represented as a String or a Reader.
  • org.apache.lucene.index.IndexWriter – IndexWriter is used to create and update an index. In the constructor you specify whether you want to create a new index or update an existing one, and you pass the Directory object in which the index is stored. Note that Lucene can store an index in files, in memory or in a database; use the appropriate subclass of Directory.
  • org.apache.lucene.index.IndexReader – Provides an interface for accessing an index. This is an abstract class with several concrete subclasses.
  • org.apache.lucene.search.IndexSearcher – Used to search over an IndexReader or a given directory. IndexSearcher is thread safe, so a single instance can be shared by multiple threads.
  • org.apache.lucene.analysis.Analyzer – An Analyzer is used to analyse text. It consists of tokenizers and filters. For example, it first splits the string into tokens according to some rule, such as splitting on spaces. It then applies filters, such as removing common stop words like is, and, it. Ready-made subclasses such as StandardAnalyzer are available, but you can also implement your own. Use the same analyzer for both indexing and searching; otherwise the terms produced at search time may not match the terms in the index.
  • org.apache.lucene.index.Term – A Term is the unit of search. It consists of two parts: the text of the word and the name of the field in which the word appears.
  • org.apache.lucene.search.Query – A Query object is used to search the index. A query contains terms, which are the units of search. Query is an abstract class with a number of concrete subclasses. TermQuery is one such subclass; it finds documents that contain a given term. BooleanQuery is used to combine multiple queries, such as TermQuery objects.
  • org.apache.lucene.queryParser.QueryParser – Building a query can sometimes be a complex task. QueryParser provides a convenient way to build a Query object from text entered by a person. If you are building queries programmatically, it is better to use the Query API directly. But suppose you offer a full text search box in your web application and return results for whatever the user types. In such cases you may use QueryParser: it tokenizes and filters the string using an analyzer and creates the appropriate Query object.
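
The tokenizer-plus-filters idea behind an Analyzer can be sketched in a few lines of plain Java. This is a simplified stand-in, not StandardAnalyzer itself: the class name ToyAnalyzer and its tiny stop word list are assumptions made for illustration, and StandardAnalyzer's real tokenization rules and stop word list differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simplified stand-in for an analyzer: tokenize, lower-case, drop stop words.
class ToyAnalyzer {
    // A tiny, made-up stop word list used only for this sketch.
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("is", "and", "it", "this", "a", "the"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<String>();
        // Tokenizer step: split on anything that is not a letter or digit.
        for (String raw : text.split("[^A-Za-z0-9]+")) {
            // Filter step 1: normalise to lower case.
            String token = raw.toLowerCase(Locale.ENGLISH);
            // Filter step 2: drop empty pieces and stop words.
            if (token.length() > 0 && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

Running the same analyze method over the indexed text and over the user's query is what makes "PROGRAMMER" typed in a search box line up with "Programmer" stored in the database, which is why the same analyzer should be used on both sides.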

 

Now we are going to put everything into action.

If you have not yet set up the Eclipse IDE with Tomcat, refer to the following tutorial.

Installing and Configuring Eclipse with Tomcat in Windows

If you are not comfortable with Servlets, refer to the following tutorial.

Introduction to Servlets

If you have not looked at JDBC yet, refer to the following tutorial.

Introduction to JDBC

Well, let us get Lucene first. Go to http://www.apache.org/dyn/closer.cgi/lucene/java/ and download the zip version. Here I am using lucene-2.9.0.zip. Unzip it into a convenient folder on your system. The archive contains a core JAR file named lucene-core-[version].jar. This is the only JAR file from it that we need for our application.

This tutorial uses MySQL, a popular open source database. If you want to know how to install MySQL on your local Windows machine, please refer to the ODBC tutorial.

Lotus Notes/Domino RDBMS integration using ODBC 

If a MySQL database is available on your network, you may access it using the following information.

  • Host Name – Name of the machine where MySQL is installed. Alternatively you may use IP address.
  • User Name / Password – for accessing MySQL

 

Ask your database administrator for the above details. You may download any SQL client, such as SQLyog, for accessing the database.

http://www.webyog.com/en/

Ok, first connect to MySQL from your MySQL client. Then create a database named mydb and switch to it. Create an employee table with four fields.

  • employeeid int not null auto_increment
  • name varchar(100)
  • age int
  • designation varchar(100)

Set employeeid as the primary key. Insert the following records into the employee table.

Name             Age   Designation
Jinoy George     30    Programmer
Albert George    25    Programmer
Ravi Kumar       45    Manager
Anitha Kumari    50    Accountant

There is no need to supply a value for the employeeid column, as it is an auto-increment field.
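
If you prefer to set up the table from Java rather than from an SQL client, the following sketch shows one possible way to do it over JDBC. The class name EmployeeTableSetup and its run method are invented for this example, the SQL is my reading of the table described above, and the code assumes that the mydb database already exists and that the MySQL JDBC driver is on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper: creates the employee table and inserts the sample rows.
class EmployeeTableSetup {
    static final String CREATE_TABLE =
            "CREATE TABLE IF NOT EXISTS employee ("
            + "employeeid INT NOT NULL AUTO_INCREMENT, "
            + "name VARCHAR(100), "
            + "age INT, "
            + "designation VARCHAR(100), "
            + "PRIMARY KEY (employeeid))";

    // employeeid is omitted on purpose: MySQL fills it in automatically.
    static final String INSERT_ROW =
            "INSERT INTO employee (name, age, designation) VALUES (?, ?, ?)";

    static final Object[][] ROWS = {
        {"Jinoy George", 30, "Programmer"},
        {"Albert George", 25, "Programmer"},
        {"Ravi Kumar", 45, "Manager"},
        {"Anitha Kumari", 50, "Accountant"},
    };

    // Note: running this twice inserts the sample rows twice.
    static void run(String url, String user, String password) throws SQLException {
        Connection con = DriverManager.getConnection(url, user, password);
        try {
            Statement stmt = con.createStatement();
            try {
                stmt.executeUpdate(CREATE_TABLE);
            } finally {
                stmt.close();
            }
            PreparedStatement insert = con.prepareStatement(INSERT_ROW);
            try {
                for (Object[] row : ROWS) {
                    insert.setString(1, (String) row[0]);
                    insert.setInt(2, (Integer) row[1]);
                    insert.setString(3, (String) row[2]);
                    insert.executeUpdate();
                }
            } finally {
                insert.close();
            }
        } finally {
            con.close();
        }
    }
}
```

A call such as EmployeeTableSetup.run("jdbc:mysql://localhost/mydb", "root", "your-password") would then create and fill the table; replace the host, user and password with your own values. If the driver is not picked up automatically, register it first with Class.forName("com.mysql.jdbc.Driver"), as the IndexBuilder class later in this tutorial does.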

Let us move to the Eclipse side. Open Eclipse. The Java EE perspective is the default perspective. If you are not sure you are in it, go to Window–>Open Perspective–>Other, select Java EE and click OK.

Go to File–>New–>Dynamic Web Project. The project wizard appears.

Give the project name as LuceneExample. The other options do not need modification if you are using Eclipse with Tomcat as the server. If you have configured multiple servers, you may change the target runtime. Click the Finish button. Now we have to add the project to the server. Go to the Servers view, right click on the server entry and select Add and Remove. Move the project from the left list to the right list in the dialog. Click the Finish button.

Now we need to put two JAR files in the Tomcat–>lib directory. As I specified in the JDBC tutorial, external JAR files are normally placed in the application root–>WEB-INF–>lib folder. Alternatively, you may place them in a folder on the application server's classpath. With Tomcat, any JAR file placed under the Tomcat root directory/lib folder is shared by all applications and by Tomcat's internal classes. All JAR files in this folder are also available on the build path.

If you do not have the JDBC driver for MySQL yet, go to http://www.mysql.com/products/connector/ and download it. Unzip the downloaded file. It contains a JAR file named mysql-connector-java-<version>-bin.jar, which holds the JDBC driver classes. This is the only file we need for our application.

Go to Windows Explorer. Open the lib folder under the Tomcat installation directory. Now put the JDBC JAR file and the Lucene core JAR file inside this folder.

We have to create a directory to hold the Lucene index files. Although the index can be kept in memory, that is not practical for a large index. Create a directory named lucene on your C: drive.

Ok, let us now create a class to index the employee information. In the Project Explorer, expand the project. Right click on the src folder and select New–>Class. The Java Class wizard appears. Give the package name as example and the class name as IndexBuilder. Press the Finish button. IndexBuilder.java opens in the editor. Now replace its content with the following code snippet.

 
package example;
 
import java.io.File;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
 
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
 
 
public class IndexBuilder {
 
	static final String LUCENE_INDEX_DIRECTORY = "C:\\lucene";
	static final String DB_HOST_NAME = "localhost";
	static final String DB_USER_NAME = "root";
	static final String DB_PASSWORD = "jinoy";
 
	//method for indexing
	public void buildIndex(){
 
		Connection con = null;
		Statement stmt = null;
		ResultSet rs = null;
		IndexWriter writer=null;
		StandardAnalyzer analyzer = null;		
		File file = null;
		try{
			System.out.println("Start indexing");
			//get a reference to index directory file
			file = new File(LUCENE_INDEX_DIRECTORY);
			analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
			writer = new IndexWriter(
					FSDirectory.open(file),
					analyzer,
					true,
					IndexWriter.MaxFieldLength.LIMITED
					);
 
 
			//initialize the driver class
			Class.forName("com.mysql.jdbc.Driver").newInstance();
			//get connection object
			con = DriverManager.getConnection(
					"jdbc:mysql://"+DB_HOST_NAME+"/mydb",
					DB_USER_NAME, DB_PASSWORD);
			//create statement object
			stmt = con.createStatement();
			//execute query
			rs = stmt.executeQuery("SELECT * FROM employee");
			//iterate through result set
			while(rs.next()){
				String id = rs.getString("employeeid");
				String name = rs.getString("name");
				String age = rs.getString("age");
				String designation = rs.getString("designation");
				//create a full text field which contains name,
				//age and designation
				String fulltext = name + " " + age +
				" " + designation;
 
				//create document object
				Document document = new Document();
				//create field objects and add to document				
				Field idField = new Field("employeeid", 
						id, Field.Store.YES,
						Field.Index.NO);
				document.add(idField);
				Field nameField = new Field("name",
						name, Field.Store.YES,
						Field.Index.ANALYZED);
				document.add(nameField);
				Field ageField = new Field("age",
						age, Field.Store.YES,
						Field.Index.NOT_ANALYZED);
				document.add(ageField);
				Field designationField = new Field("designation",
						designation, Field.Store.YES,
						Field.Index.ANALYZED);
				document.add(designationField);
				Field fulltextField = new Field("fulltext",
						fulltext, Field.Store.NO,
						Field.Index.ANALYZED);
				document.add(fulltextField);
				//add the document to writer
				writer.addDocument(document);
			}
			//optimize the index
			System.out.println("Optimizing index");
			writer.optimize();
 
		}catch(Exception e){
			e.printStackTrace();
		}finally{
			try{
				if(writer!=null)
					writer.close();
				if(rs!=null)
					rs.close();
				if(stmt!=null)
					stmt.close();
				if(con!=null)
					con.close();
				System.out.println("Finished indexing");
 
			}catch(Exception ex){
				ex.printStackTrace();
			}
 
		}
 
 
	}
 
	public static void main(String[] args) throws Exception {
 
		IndexBuilder builder = new IndexBuilder();
		builder.buildIndex();
	}
 
}

In the above code you have to make some changes. A few static final variables are defined, which you should change according to your setup. The first, LUCENE_INDEX_DIRECTORY, is the directory where you want to store the index; give its full path. The next three relate to the MySQL database: change the host name, user name and password according to your setup. Once you have made the changes, save the file.

Now let us run this Java application. Right click on IndexBuilder.java and select Run As–>Java Application.

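If everything went smoothly, you get the “Finished indexing” message in the console. Because this message is printed from the finally block, also check that no stack trace appears before it. Now go to the Lucene index directory and you can see the index files.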

Now let us walk through the code. In the main method we create an instance of IndexBuilder and call its buildIndex() method. In buildIndex() we create a File object which represents our index directory. Then we create a StandardAnalyzer object. StandardAnalyzer extends the Analyzer class. It splits the text into words, converts them to lower case and removes some common words like is, this etc. You may use any other analyzer or create your own.

The IndexWriter object is instantiated by passing parameters to its constructor. The first parameter is a Directory object. Since we are using a file-based index, we use FSDirectory, which extends the abstract class Directory. The second parameter is the analyzer object we already created. The third parameter specifies whether to create a new index (true), overwriting any index already in that directory, or to update the existing one (false). In our case we are creating the index. The fourth parameter limits the number of tokens indexed per field. We use the default, MaxFieldLength.LIMITED, which is 10,000. If a field's text consists of more than 10,000 words excluding stop words, the rest are ignored. You can also choose an unlimited number of tokens with MaxFieldLength.UNLIMITED.

The next section uses the JDBC API. We establish a connection with the database, query the employee table and iterate through the result set. During iteration we create a Document object for each row. Field objects are created for all columns from the column name/value pairs. We create another field called fulltext by combining the values of the name, age and designation fields. This is the default field in which we search.

We pass four parameters to the Field constructor. The first parameter is the name of the field; we use the column name. The second parameter is the field value, which is the column value. The third parameter specifies whether the field value should be stored in the index. If we store the value, we can later retrieve it in its original form from the index rather than querying the database again. This is fine for small fields but not recommended for large text fields. In our example we store all fields except fulltext. The fourth parameter controls indexing. We do not index the employeeid field, which means it is not searchable. Since the id is a unique integer that merely identifies the employee record, there is no point in searching it. The name and designation fields are indexed and tokenized: the field value is split into tokens and each token is indexed. The ANALYZED constant represents this. The age field is indexed but not tokenized (NOT_ANALYZED); since it is a single integer value, there is no point in splitting it. The fulltext field is not stored, but it is indexed and tokenized. Finally we add all the fields to the Document object, and the Document object is added to the IndexWriter.
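
The difference between the three indexing choices can be illustrated without Lucene. In the sketch below, the class ToyFieldIndexing and its Mode enum are made up for this example; the enum constants merely borrow the names used with Field.Index above, and the whitespace splitting is a crude stand-in for what a real analyzer does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration of which terms become searchable under each indexing choice.
class ToyFieldIndexing {
    // Names borrowed from the Field.Index constants for readability only.
    enum Mode { NO, NOT_ANALYZED, ANALYZED }

    static List<String> searchableTerms(String value, Mode mode) {
        List<String> terms = new ArrayList<String>();
        switch (mode) {
            case NO:
                // Nothing is indexed, so the field cannot be searched.
                break;
            case NOT_ANALYZED:
                // The whole value becomes one term, exactly as given.
                terms.add(value);
                break;
            case ANALYZED:
                // The value is split into lower-cased tokens (crude stand-in
                // for a real analyzer), and each token becomes a term.
                for (String word : value.toLowerCase(Locale.ENGLISH).split("\\s+")) {
                    if (word.length() > 0) {
                        terms.add(word);
                    }
                }
                break;
        }
        return terms;
    }
}
```

With these rules, "Ravi Kumar" in an analyzed field can be found by searching for either ravi or kumar, while a non-analyzed field only matches a search for its exact value. Storing a field is a separate decision: it controls whether the original value can be read back from a hit, not whether the field can be searched.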

Finally we optimize the index and close all resources like IndexWriter and Connection objects.

Ok, now our index is ready. Let us create a servlet to test it. Right click on the src folder and select New–>Servlet. Give the package name as example and the servlet name as SearchServlet. Press the Finish button. Replace the content with the following code snippet.

 
package example;
 
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
 
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
 
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
 
/**
 * Servlet implementation class SearchServlet
 */
public class SearchServlet extends HttpServlet {
	private static final long serialVersionUID = 1L;
	static final String LUCENE_INDEX_DIRECTORY = "C:\\lucene";
 
    /**
     * @see HttpServlet#HttpServlet()
     */
    public SearchServlet() {
        super();
        // TODO Auto-generated constructor stub
    }
 
	/**
	 * @see HttpServlet#doGet(HttpServletRequest request,
	 *  HttpServletResponse response)
	 */
	protected void doGet(HttpServletRequest request,
			HttpServletResponse response)
			throws ServletException,IOException {
		// TODO Auto-generated method stub
		doPost(request, response);
	}
 
	/**
	 * @see HttpServlet#doPost(HttpServletRequest request,
	 *  HttpServletResponse response)
	 */
	protected void doPost(HttpServletRequest request,
			HttpServletResponse response)
			throws ServletException, IOException {
		// TODO Auto-generated method stub
		response.setContentType("text/html");
		PrintWriter pw = response.getWriter();
		//print HTML header information
		pw.println("<HTML>");
		pw.println("<HEAD><TITLE>Lucene Example</TITLE></HEAD>");
		pw.println("<BODY>");
 
		//print the HTML form to search
		pw.println("<FORM ACTION=\"SearchServlet\" METHOD=\"POST\">");
		pw.println("<TABLE BORDER=\"0\">");
		pw.println("<TR>");
		pw.println("<TD>Enter Text</TD>");
		pw.println("<TD><INPUT NAME=\"query\" TYPE=\"TEXT\"></TD>");
		pw.println("</TR>");
		pw.println("<TR>");
		pw.println("<TD COLSPAN=\"2\"><INPUT TYPE=\"SUBMIT\"></TD>");
		pw.println("</TR>");
		pw.println("</TABLE>");
		//close the search form opened above
		pw.println("</FORM>");
 
		//check whether this page is opened for first time or after
		//submitting search
		if(request.getParameter("query")==null ||
				request.getParameter("query").equals("")){
			pw.println("</BODY>");
			pw.println("</HTML>");
			return;
		}
 
		IndexReader reader = null;
		StandardAnalyzer analyzer = null;
		IndexSearcher searcher = null;
		TopScoreDocCollector collector = null;
		QueryParser parser = null;
		Query query = null;
		ScoreDoc[] hits = null;
 
		try{
			//store the parameter value in query variable
			String userQuery = request.getParameter("query");
			//create standard analyzer object
			analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
			//create File object of our index directory
			File file = new File(LUCENE_INDEX_DIRECTORY);
			//create index reader object
			reader = IndexReader.open(FSDirectory.open(file),true);
			//create index searcher object
			searcher = new IndexSearcher(reader);
			//create topscore document collector
			collector = TopScoreDocCollector.create(1000, false);
			//create query parser object
			parser = new QueryParser("fulltext", analyzer);
			//parse the query and get reference to Query object
			query = parser.parse(userQuery);
			//search the query
			searcher.search(query, collector);
			hits = collector.topDocs().scoreDocs;
			//check whether the search returns any result
			if(hits.length>0){
				//print heading
				pw.println("<P><TABLE BORDER=\"1\">");
				pw.println("<TR><TD>Name</TD><TD>Age</TD><TD>" +
						"Designation</TD></TR>");				
 
				//iterate through the collection and display result
				for(int i=0; i<hits.length; i++){
					int scoreId = hits[i].doc;
					//now get reference to document
					Document document = searcher.doc(scoreId);
					pw.println("<TR><TD>"+document.getField(
					"name").stringValue()+"</TD><TD>"+
					document.getField("age").stringValue()+
					"</TD><TD>"+document.getField(
					"designation").stringValue()+
					"</TD></TR>");
				}
				pw.println("</TABLE>");
			}else{
				pw.println("<P>No records found");
			}
 
 
		}catch(Exception e){
			e.printStackTrace();
		}finally{
			if(reader!=null)
				reader.close();
		}
		pw.println("</BODY>");
		pw.println("</HTML>");
 
	}
 
}

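If your index directory is not the default C:\lucene directory, change the LUCENE_INDEX_DIRECTORY static field accordingly (in Java source the backslash must be escaped, as in "C:\\lucene"). Ok, save the file. Now let us run the servlet. With the Tomcat server started from Eclipse, open a browser, enter the following URL and hit enter.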

http://localhost:8080/LuceneExample/SearchServlet

You get an input form in the browser. Now enter some text in the field and press the submit button. For example, just type programmer and hit enter. If you get results, congratulations! You have finished your first step towards full text search in Java. You can search for any word or combination of words from the name, designation and age fields.

Now let us walk through the servlet code. First we print the HTML header information. Then we print the HTML form for entering the search text. We check whether the query parameter is null or empty; query is the name attribute of the input field, so if the user has submitted a search this will not be empty. Then we create the IndexReader and IndexSearcher objects. Here too we use StandardAnalyzer, because we used the same analyzer while indexing. The QueryParser object allows us to create a Query object from text entered by a person. It is created from the analyzer and the default field used for searching. Here we search the fulltext field because it contains the information from all fields. TopScoreDocCollector collects the top-scoring hits; as a parameter we specify the maximum number of hits to be returned (1000 here). We call the search method of IndexSearcher with the query and collector objects. From the TopScoreDocCollector we get an array of ScoreDoc objects, each of which gives us an internal document id. This id is used to get the Document object from the IndexSearcher. The field values are extracted from the Document object and printed in tabular format.
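
The idea of collecting only the best N hits can be sketched with a small priority queue. This is a conceptual illustration, not Lucene's implementation: the class ToyTopCollector, its Hit type and the method names are invented for this example, and the real collector deals with details that are left out here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Conceptual sketch of a "keep the top N scoring hits" collector.
class ToyTopCollector {
    static final class Hit {
        final int doc;      // internal document id
        final float score;  // relevance score of that document

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    private final int maxHits;
    // Min-heap on score: the weakest hit kept so far sits at the head.
    private final PriorityQueue<Hit> queue;

    ToyTopCollector(int maxHits) {
        this.maxHits = maxHits;
        this.queue = new PriorityQueue<Hit>(Math.max(1, maxHits), new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                return Float.compare(a.score, b.score);
            }
        });
    }

    // Called once for every matching document.
    void collect(int doc, float score) {
        if (maxHits <= 0) {
            return;
        }
        if (queue.size() < maxHits) {
            queue.add(new Hit(doc, score));
        } else if (score > queue.peek().score) {
            queue.poll();                    // evict the current weakest hit
            queue.add(new Hit(doc, score));
        }
    }

    // Returns the kept hits, best score first.
    List<Hit> topDocs() {
        List<Hit> hits = new ArrayList<Hit>(queue);
        Collections.sort(hits, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                return Float.compare(b.score, a.score);
            }
        });
        return hits;
    }
}
```

Because only maxHits entries are ever held, memory use stays bounded no matter how many documents match, which is the practical reason for passing a hit limit to the collector in the servlet.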

Finally we close the IndexReader object and write the closing HTML tags.
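
One thing the servlet does not do is escape the field values before writing them into the HTML table, so a name containing characters such as < or & would break the page. A small helper along the following lines could be applied to each value before printing; the class name HtmlEscaper is made up for this sketch and is not part of Lucene or the Servlet API.

```java
// Hypothetical helper for escaping text before it is written into HTML.
class HtmlEscaper {
    static String escape(String text) {
        if (text == null) {
            return "";
        }
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```

In the result loop this would be used as HtmlEscaper.escape(document.getField("name").stringValue()), and similarly for the other fields.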

cheers,

13 Responses to “Lucene Full Text Search in Java”

  1. lynn says:

    Thank you for the great tutorial. Is’t possible to retrieve the employee image as well??? Thank you..

  2. jigar gandhi says:

    thank u very much…

    but there is 1 mistake in servlet code

    replace this “parser = new QueryParser(“fulltext”, analyzer);”

    by
    parser = new QueryParser(Version.LUCENE_30,”fulltext”, analyzer);

    then it works fine…….

  3. Braden says:

    Hey,

    Thanks for the tut but I noticed that the constructor for IndexWriter you used is now deprecated…What would you use out of the newer supported constructors?

    Thanks again,
    Braden

  5. dabens says:

    You the best thank You =)

  6. KGhosh says:

    Great tutorial! Thanks a lot for your detail guide.

  7. Thanhtin says:

    Great!

  8. jones says:

    great tutorial!

  9. Richard Campos says:

    Congratulations, your website is great, and this tutorial was very helpful for me. Thank you very much, for real.

  10. Chris Johse says:

    Thanks a lot for your detail guide, keep your work going and I will looking forward to see more..

  11. Brijesh V says:

    Great work Jinoy..keep it up..Looking forward to see more from you.

  12. mkyong says:

    Thanks for the detail guide, Lucene and Hyper Estraier are both my favorite full text search engine

  13. kris says:

    Grea!

    Thanks a lot, was looking for something like this since yesterday and lucky me, I’ve found it on Dzone!
