6/23/2016

Full-Text Search with Page Numbers

In my last blog post, Setting up Full-Text Search for PDF files, I detailed how to get things setup.  If you tried this you may have noticed that although the searches worked, what you got back was a file name.  This isn't so helpful if your document is an all encompassing 538 pages.  So, how do we get a page number back?  The best I've come up with so far is to split the 538 pages into 538 documents and load / search on those.

My first google search on how to split a pdf into pages came back with, http://www.splitpdf.com/, so I went ahead and used that.  I'm sure there is a way to do this through acrobat or even roll your own split functionality via the API.

The website was simple enough.  Drag and drop a file, check the "Extract all pages into separate files", click "Split!", download the zip, and extract.


Now we have to load 538 pages into our table.  There are multiple ways to do this; it boils down to a directory listing and string parsing to generate the inserts.  Here's how I did it:

--enable xp_cmdshell
Sp_configure 'show advanced options',1 
RECONFIGURE
go

Sp_configure 'xp_cmdshell', 1 
RECONFIGURE
go

SET nocount ON
go

CREATE TABLE #inputfiles
             ( 
                          seqid [NUMERIC](3, 0) IDENTITY (1,1), 
                          fn NVARCHAR(max) 
             );

INSERT INTO #inputfiles 
--displays full path 
EXEC xp_cmdshell 'dir "D:\Files\User Guide" /s/b' 

--I ended up with a null row for some reason...
DELETE #inputfiles WHERE  fn IS NULL 
--Loop the temp table performing the inserts

DECLARE @Counter  INT
DECLARE @Max      INT
DECLARE @SQL      NVARCHAR(max)
DECLARE @FileName NVARCHAR(max)

SET @Counter = 1;
SET @Max (SELECT Max(seqid) FROM #inputfiles);

WHILE @Counter <= @Max 
BEGIN 
  SET @FileName = (SELECT #inputfiles.fn FROM #inputfiles WHERE seqid = @Counter)  SET @SQL ='INSERT INTO [FULLTEXTDEMO].[dbo].[files] ([filename], [filetype], [file]) SELECT '+Char(39)+@FileName+Char(39)+','+Char(39)+'.pdf'+Char(39)+',bulkcolumn FROM OPENROWSET(BULK '+Char(39)+@FileName+Char(39)+', single_blob) AS input'
  EXECUTE sp_executesql @SQL 
  SET @Counter = @Counter + 1 
END 

--Cleanup
DROP TABLE #inputfiles

SET nocount OFF
go

Sp_configure 'xp_cmdshell', 0 
RECONFIGURE
go

Sp_configure 'show advanced options',0 
RECONFIGURE
go

You can verify the population by querying the FULLTEXTCATALOGPROPERTY PopulateStatus:

SELECT Fulltextcatalogproperty(c.NAME, 'PopulateStatus') 
FROM   sys.fulltext_catalogs c 

Here's a search I ran on the example data:
SELECT ( 'file://' + [filename] ) AS [File] 
FROM   [FULLTEXTDEMO].[dbo].[files] 
WHERE  CONTAINS([file], 'NEAR(("Admin*","Client"),2)') 
FOR xml raw 

Click

Click
Boom

No comments:

Post a Comment