In my last blog post, Setting up Full-Text Search for PDF files, I detailed how to get things setup. If you tried this you may have noticed that although the searches worked, what you got back was a file name. This isn't so helpful if your document is an all encompassing 538 pages. So, how do we get a page number back? The best I've come up with so far is to split the 538 pages into 538 documents and load / search on those.
My first google search on how to split a pdf into pages came back with, http://www.splitpdf.com/, so I went ahead and used that. I'm sure there is a way to do this through acrobat or even roll your own split functionality via the API.
The website was simple enough. Drag and drop a file, check the "Extract all pages into separate files", click "Split!", download the zip, and extract.
Now we have to load 538 pages into our table. There are multiple ways to do this; it boils down to a directory listing and string parsing to generate the inserts. Here's how I did it:
Sp_configure 'show advanced options',1
Sp_configure 'xp_cmdshell', 1
SET nocount ON
CREATE TABLE #inputfiles
seqid [NUMERIC](3, 0) IDENTITY (1,1),
INSERT INTO #inputfiles
--displays full path
EXEC xp_cmdshell 'dir "D:\Files\User Guide" /s/b'
--I ended up with a null row for some reason...
DELETE #inputfiles WHERE fn IS NULL
--Loop the temp table performing the inserts
DECLARE @Counter INT
DECLARE @Max INT
DECLARE @SQL NVARCHAR(max)
DECLARE @FileName NVARCHAR(max)
SET @Counter = 1;
SET @Max = (SELECT Max(seqid) FROM #inputfiles);
WHILE @Counter <= @Max
SET @FileName = (SELECT #inputfiles.fn FROM #inputfiles WHERE seqid = @Counter) SET @SQL ='INSERT INTO [FULLTEXTDEMO].[dbo].[files] ([filename], [filetype], [file]) SELECT '+Char(39)+@FileName+Char(39)+','+Char(39)+'.pdf'+Char(39)+',bulkcolumn FROM OPENROWSET(BULK '+Char(39)+@FileName+Char(39)+', single_blob) AS input'
EXECUTE sp_executesql @SQL
SET @Counter = @Counter + 1
DROP TABLE #inputfiles
SET nocount OFF
Sp_configure 'xp_cmdshell', 0
Sp_configure 'show advanced options',0
You can verify the population by querying the FULLTEXTCATALOGPROPERTY PopulateStatus:
SELECT Fulltextcatalogproperty(c.NAME, 'PopulateStatus')
FROM sys.fulltext_catalogs c
Here's a search I ran on the example data:
SELECT ( 'file://' + [filename] ) AS [File]
WHERE CONTAINS([file], 'NEAR(("Admin*","Client"),2)')
FOR xml raw