6/01/2016

Setting up Full-Text Search for PDF files

Has this ever happened to you?  You're assigned to work on a new project with a new vendor.  The vendor has a website that's very difficult to browse for documentation.  Some of it's under one domain name, some of its under another domain name.  Some of the relevant documentation for the version you're on is actually published under a prior version or edition.  There's plenty of documentation, it's scattered about, and it's all...PDF files.

Faced with this very issue, I decided to setup a local SQL Server Full-Text Search.
Some of the cool things Full-Text Search will give you, over and above, a standard search include the following:
  • One or more specific words or phrases (simple term)
  • A word or a phrase where the words begin with specified text (prefix term)
  • Inflectional forms of a specific word (generation term)
  • A word or phrase close to another word or phrase (proximity term)
  • Synonymous forms of a specific word (thesaurus)
  • Words or phrases using weighted values (weighted term)
In order to get stared with the setup, it's important to know that the Full-Text Search architecture relies on filters for searching various file types.  This is important for this example because the PDF filter is not installed by default.  So, for starters, we need to go download and install the PDF ifilter (PDFFilter64Setup.msi).

This is a next, next, finish install:



There is a trick with getting the installation setup correctly; you have to setup a system variable.  Here's how I did it.


Environment Variables:

From Environment Variables I first created a new Variable "AdobeiFilter" and pointed it to the bin directory associated with the install:



I then clicked Edit on the Path Variable, in environment Variables, and appended %AdobeiFilter% to the end:

I can validate this by opening run and typing "%AdobeiFilter%" which drops me to my bin directory here: "C:\Program Files\Adobe\Adobe PDF iFilter 11 for 64-bit platforms\bin"


You may also need to grant your SQL Server Service account access to this folder.

Next we'll configure SQL Server Full-Text.  You can check to see if it's already installed by running the following:
SELECT Serverproperty('IsFullTextInstalled') 

If it's not installed you can run through the standard SQL Installation choosing "New SQL Server stand-alone installation or add features to an existing installation":

The feature is listed as "Full-Text and Semantic Extractions for search":



Next we run a couple of sp_fulltext_service commands:
--Load operating system filters and word breakers 
EXEC Sp_fulltext_service 
  @action='load_os_resources', 
  @value=1; 

--Do not verify whether binaries are signed 
EXEC Sp_fulltext_service 
  @action='verify_signature', 
  @value=0; 

Validate SQL is associating Full-Text and the PDF filter:
SELECT document_type, 
       path 
FROM   sys.fulltext_document_types 
WHERE  document_type = '.pdf' 

You may need to restart the SQL Service, possibly the box, at this point if the association isn't there.

For this blog I created a new database, "FULLTEXTDEMO".  By default full-text indexing was enabled at the time of creation.  This can also be checked with the following query:
SELECT NAME, 
       is_fulltext_enabled 
FROM   sys.databases 
ORDER  BY NAME 

Next I created a table to hold my full-text pdf files:
USE [FULLTEXTDEMO] 
go 

CREATE TABLE [dbo].[files] 
  ( 
     [id]       [INT] IDENTITY(1, 1) NOT NULL, 
     [filename] [NVARCHAR](max) NULL, 
     [filetype] VARCHAR(5) NULL, 
     --File type is required for creating a Full-Text Index 
     [file]     [VARBINARY](max) NULL, 
     CONSTRAINT [PK_files] PRIMARY KEY CLUSTERED ( [id] ASC )WITH (pad_index = 
     OFF, statistics_norecompute = OFF, ignore_dup_key = OFF, allow_row_locks = 
     on, allow_page_locks = on) ON [PRIMARY] 
  ) 
ON [PRIMARY] 
textimage_on [PRIMARY] 

Next we create a Full-Text Catalog
USE [FULLTEXTDEMO] 
go 
CREATE fulltext catalog [FILESCATALOG] 
go 

...and a full-text index to go with it:


Statistical Semantic Search is really interesting, and it requires a separate setup.  Details are here, if you're interested:
Install and Configure Semantic Search





Next, I loaded a couple of files:

INSERT INTO [FULLTEXTDEMO].[dbo].[files] 
            ([filename], 
             [filetype], 
             [file]) 
SELECT 'D:\Files\Administrators Guide.pdf', 
       '.pdf', 
       bulkcolumn 
FROM   OPENROWSET(BULK 'D:\Files\Administrators Guide.pdf', single_blob) AS 
       input 

INSERT INTO [FULLTEXTDEMO].[dbo].[files] 
            ([filename], 
             [filetype], 
             [file]) 
SELECT 'D:\Files\Installation Guide.pdf', 
       '.pdf', 
       bulkcolumn 
FROM   OPENROWSET(BULK 'D:\Files\Installation Guide.pdf', single_blob) AS input 

Finally, some fun with Query with Full-Text Search

--Simple Term 
SELECT [filename] 
FROM   [FULLTEXTDEMO].[dbo].[files] 
WHERE  CONTAINS([file], '"SQL Server"') 

--Prefix Term 
SELECT [filename] 
FROM   [FULLTEXTDEMO].[dbo].[files] 
WHERE  CONTAINS([file], '"MS*"') 

--Generation Term 
SELECT [filename] 
FROM   [FULLTEXTDEMO].[dbo].[files] 
WHERE  CONTAINS([file], 'FORMSOF(INFLECTIONAL, "install")') 

--Proximity Term 
SELECT [filename] 
FROM   [FULLTEXTDEMO].[dbo].[files] 
WHERE  CONTAINS([file], 'NEAR((".NET","SQL Server"),2)') 

--Weighted Term 
SELECT F.[filename], 
       KEY_TBL.rank 
FROM   [FULLTEXTDEMO].[dbo].[files] AS F 
       INNER JOIN CONTAINSTABLE([FULLTEXTDEMO].[dbo].[files], [file], 
                  'ISABOUT ("Install*",           Windows WEIGHT(0.1),           SQL WEIGHT(0.9)          ) ' ) AS 
       KEY_TBL 
               ON F.id = KEY_TBL.[key] 
ORDER  BY KEY_TBL.rank DESC 

--PROTIP: if you do it like this you can get a clickable link 
SELECT ( 'file://' + [filename] ) AS [File] 
FROM   [FULLTEXTDEMO].[dbo].[files] 
WHERE  CONTAINS([file], '"SQL Server"') 
FOR xml raw 

You can also get crazy with custom synonyms via: Configure and Manage Thesaurus Files for Full-Text Search

1 comment: