10/11/2018

Oracle CLOB to Solr via the Data Import Handler


Image result for clobber time meme
no copyright infringement is intended


In this example I build off of the previous example, by adding an Oracle CLOB field.

All in all, this is pretty simple and requires just a few updates:
  1. Update the Oracle table to have a CLOB column
  2. Update the collections data config file to use the ClobTransformer
  3. Update the managed-schema to use a TextField data type

Update the Oracle table to have a CLOB column

CREATE TABLE "[schema]"."SOLR_TEST" 
( 
"ID" NUMBER generated always AS identity, 
"TIME_STAMP" timestamp (6) DEFAULT sys_extract_utc(systimestamp), 
"CATEGORY"   VARCHAR2(255 byte), 
"TYPE"       VARCHAR2(255 byte), 
"SERVERNAME" VARCHAR2(255 byte), 
"CODE"       VARCHAR2(255 byte), 
"MSG"        CLOB 
);

Update the collections data config file to use the ClobTransformer

<dataconfig> 
<datasource driver="oracle.jdbc.OracleDriver" url="jdbc:oracle:thin:...[put your connections string here] />
<document>
<entity name = "SOLR_TEST" transformer="ClobTransformer" "query="SELECT IDto_char(TIME_STAMP,'YYYY-MM-DD&quot;T&quot;HH24:MI:SS&quot;Z&quot;'AS TIME_STAMP, CATEGORY, TYPE, SERVERNAME, CODE, MSG FROM SOLR_TEST" deltaImportQuery="SELECT ID, to_char(TIME_STAMP,'YYYY-MM-DD&quot;T&quot;HH24:MI:SS&quot;Z&quot;') AS TIME_STAMP, CATEGORY, TYPE, SERVERNAME, CODE, MSG FROM SOLR_TEST WHERE  ID = ${dataimporter.delta.ID}" deltaQuery="SELECT ID
FROM SOLR_TEST WHERE  TIME_STAMP > TO_TIMESTAMP('${dataimporter.last_index_time}', 'YYYY-MM-DD HH24:MI:SS')">

<field column = "ID" name="id" />
<field column = "TIME_STAMP" name="time_stamp" />
<field column = "CATEGORY" name="category" />
<field column = "TYPE" name="type" />
<field column = "SERVERNAME" name="servername" />
<field column = "CODE" name="code" />
<field column = "MSG" name="msg" clob="true" sourceColName="MSG" />
</entity>
</document>
</dataconfig>

Update the managed-schema to use a TextField data type


...
<field name="msg" type="text_en" indexed="true" stored="true" multiValued="false" />
...
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100"> <analyzer type="index"> <tokenizer class="solr.StandardTokenizerFactory" /> <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" /> <filter class="solr.LowerCaseFilterFactory" /> <filter class="solr.EnglishPossessiveFilterFactory" /> <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" /> <filter class="solr.PorterStemFilterFactory" /> </analyzer> <analyzer type="query"> <tokenizer class="solr.StandardTokenizerFactory" /> <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" /> <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" /> <filter class="solr.LowerCaseFilterFactory" /> <filter class="solr.EnglishPossessiveFilterFactory" /> <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" /> <filter class="solr.PorterStemFilterFactory" /> </analyzer> </fieldType>
...

Full disclosure, I built off of the "example-data-driven-schema" and the field type comes directly from there.

That's it!  You can now run a full import, make some additions, and run a delta. The most difficult part of this process, for me, was getting 4000+ characters inserted into the clob field in the Oracle table.  Here's what that looks like:


DECLARE
  my_clob CLOB;
BEGIN
  my_clob := '4000 plus characters...';
  INSERT INTO [SCHEMA].solr_test
              (
                          category,
                          TYPE,
                          servername,
                          code,
                          msg
              )
              VALUES
              (
                          'Manual',
                          'SQL Insert',
                          'hostname',
                          'BEA-100000',
                          my_clob
              );

END;