Showing posts with label Dgidx. Show all posts

Ignore character accents when indexing text - Endeca Record Search Best Practice

How to ignore character accents when indexing text? 
By default, café is indexed separately from cafe.
Setting the --diacritic-folding flag on Dgidx indexes café as cafe, and setting it on the Dgraph lets a search for either term (cafe or café) match.

How to configure at both Dgidx and Dgraph level?
Dgidx:
Using the --diacritic-folding flag on Dgidx causes accented characters to be mapped to simple ASCII equivalents.

Dgraph:
Using the --diacritic-folding flag on the Dgraph allows Anglicized search queries such as cafe to match result text containing accented characters, such as café.
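
As a rough sketch, the flag could be added to the argument lists of both components in the Deployment Template configuration. The element layout is borrowed from the Dgidx and dgraph-defaults examples in the other posts on this page; the id and host-id values are placeholders, and the other child elements (properties, directories and so on) are left out for brevity.

```xml
<!-- Sketch only: ids and hosts are placeholders, other child elements omitted. -->

<!-- Dgidx component: fold accented characters at index time -->
<dgidx id="Dgidx" host-id="ITLHost">
  <args>
    <arg>-v</arg>
    <arg>--diacritic-folding</arg>
  </args>
</dgidx>

<!-- Dgraph defaults: fold accented characters at query time -->
<dgraph-defaults>
  <args>
    <arg>--diacritic-folding</arg>
  </args>
</dgraph-defaults>
```

A baseline update would presumably be needed afterwards so that the index is rebuilt with the folded terms.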

Stemming and Thesaurus - Endeca Features


Stemming - The stemming dictionary is based on a common English dictionary and does not pluralize proper nouns, brand names, and similar terms. To stem the plural of a word that is not in common use, either create a two-way thesaurus entry in Workbench or update the stemming dictionary.

The stemming dictionary is found at the MDEX level: /opt/app/endeca//MDEX/6.5.1/conf/stemming

Thesaurus - The thesaurus is intended for specifying concept-level mappings between words and phrases.
There are two options available when configuring thesaurus entries:
1) One-Way - This mapping specifies only one direction of equivalence.
   Ex: Assume you define a one-way mapping from the phrase “red wine” to the phrases “merlot” and “cabernet sauvignon”. This one-way mapping ensures that a search for “red wine” also returns any matches containing the more specific terms “merlot” or “cabernet sauvignon”. A search for “merlot” does not return matches for “red wine”.
2) Two-Way - This mapping makes the listed words equivalent in both directions.
   Ex: A two-way mapping between “laptop”, “desktop”, and “notebook” means that a search for any one of these words always returns all results matching any of them.

Note: Only one global thesaurus is supported for an Endeca data domain. In other words, language-specific thesauruses are not supported (for example, one thesaurus for English, a second for French, and so on).  

Endeca ENEConnectionException - How to enable backwards compatibility so that the Dgraph can communicate with previous versions of the Presentation API

ERROR MESSAGE in Endeca JSP reference application: ENEConnectionException com.endeca.navigation.ENEConnectionException: Error establishing connection to retrieve Navigation Engine request

To fix this, enable backwards compatibility so that the Dgraph can communicate with previous versions of the Presentation API. In addition to the currently supported version of the Presentation API, the following previous full versions are supported: 6.0.x, 5.1.x, 5.0.x and 4.8.x. Therefore, the value for <api-version> (the argument that follows --back_compat) must be one of the following:
  • 601 for all 6.0.x versions of the API.
  • 510 for all 5.1.x versions and 500 for all 5.0.x versions of the API.
  • 480 for the 4.8.x versions of the API, including the Perl API.
<dgraph-defaults>
    <properties>
     ...
    </properties>
    <directories>
     ...
    </directories>
    <args>
       <arg>--back_compat</arg>
       <arg>601</arg>
       ...
    </args>
    <startup-timeout>120</startup-timeout>
</dgraph-defaults>


Note: Starting with version 6, the Endeca Presentation API is part of the Platform Services package. Use the version of Platform Services that is compatible with the current version of the MDEX Engine.

How to add Endeca Record Sort options in ATG 11.1? - Core Endeca

1) Open /opt/app/endeca/apps/CRS/config/mdex/record_sort_config.xml and add the properties you want to be able to sort the result set by.
Example:
<RECORD_SORT_CONFIG>
  <RECORD_SORT NAME="product.displayName"/>
  <RECORD_SORT NAME="product.salePrice"/>
  <RECORD_SORT NAME="product.bvRatings"/>
</RECORD_SORT_CONFIG>

2) Run a baseline index to create the sorts in the MDEX.

Endeca - Stop words

Endeca Stop words
Stop words are words that are ignored by the MDEX Engine when the words are part of a keyword search.

Where can I find the OOTB stop words?
Endeca 3.1.X
/opt/apps/endeca/apps/CRS/config/CRS.stop_words.xml
Endeca 11.X
/opt/app/endeca/apps/CRS/config/mdex/CRS.stop_words.xml

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<!DOCTYPE STOP_WORDS SYSTEM "stop_words.dtd">
<STOP_WORDS>

  <STOP_WORD>a</STOP_WORD>
  <STOP_WORD>do</STOP_WORD>
  <STOP_WORD>me</STOP_WORD>
  <STOP_WORD>when</STOP_WORD>
  <STOP_WORD>about</STOP_WORD>

  <STOP_WORD>find</STOP_WORD>
  <STOP_WORD>not</STOP_WORD>
  <STOP_WORD>where</STOP_WORD>
  <STOP_WORD>above</STOP_WORD>

  <STOP_WORD>for</STOP_WORD>
  <STOP_WORD>or</STOP_WORD>
  <STOP_WORD>why</STOP_WORD>
  <STOP_WORD>an</STOP_WORD>
  <STOP_WORD>from</STOP_WORD>
  <STOP_WORD>over</STOP_WORD>
  <STOP_WORD>with</STOP_WORD>
  <STOP_WORD>and</STOP_WORD>
  <STOP_WORD>have</STOP_WORD>
  <STOP_WORD>show</STOP_WORD>
  <STOP_WORD>you</STOP_WORD>
  <STOP_WORD>any</STOP_WORD>
  <STOP_WORD>how</STOP_WORD>
  <STOP_WORD>the</STOP_WORD>
  <STOP_WORD>your</STOP_WORD>
  <STOP_WORD>are</STOP_WORD>
  <STOP_WORD>I</STOP_WORD>
  <STOP_WORD>under</STOP_WORD>
  <STOP_WORD>can</STOP_WORD>
  <STOP_WORD>is</STOP_WORD>
  <STOP_WORD>what</STOP_WORD>
</STOP_WORDS>

You can add application-specific stop words to the file mentioned above and then run an Endeca baseline update.
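
A minimal sketch of what such an addition might look like follows. The words "buy" and "please" are hypothetical application-specific entries chosen for illustration, and only two of the OOTB entries are repeated here to keep the example short; in the real file, all existing entries would stay in place.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<!DOCTYPE STOP_WORDS SYSTEM "stop_words.dtd">
<STOP_WORDS>
  <!-- Existing OOTB entries (shortened here) -->
  <STOP_WORD>a</STOP_WORD>
  <STOP_WORD>the</STOP_WORD>

  <!-- Hypothetical application-specific additions -->
  <STOP_WORD>buy</STOP_WORD>
  <STOP_WORD>please</STOP_WORD>
</STOP_WORDS>
```

With entries like these, a search for "buy the camera" would be expected to match on "camera" alone once the baseline update has run.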

Endeca Forge process completed but fails at Dgidx process - Endeca Tips

Problem Statement:
The Forge process completes, but the baseline update then fails at the Dgidx step.

$ ./baseline_update.sh
 BASELINE SCRIPT STARTED
[07.12.15 19:37:32] INFO: Released lock 'update_lock'.
 Copy crawl files to incoming and processing folder
[07.12.15 19:37:33] INFO: Checking definition from AppConfig.xml against existing EAC provisioning.
[07.12.15 19:37:34] INFO: Definition has not changed.
[07.12.15 19:37:34] INFO: Starting baseline update script.
[07.12.15 19:37:34] INFO: Acquired lock 'update_lock'.
[07.12.15 19:37:34] INFO: [ITLHost] Starting shell utility 'cleanDir_processing'.
[07.12.15 19:37:35] INFO: [ITLHost] Starting shell utility 'move_-_to_processing'.
[07.12.15 19:37:37] INFO: [ITLHost] Starting copy utility 'fetch_config_to_input_for_forge_Forge'.
[07.12.15 19:37:38] INFO: [ITLHost] Starting backup utility 'backup_log_dir_for_component_ConfigurationGeneratorForge'.
[07.12.15 19:37:39] INFO: [ITLHost] Starting component 'ConfigurationGeneratorForge'.
[07.12.15 19:38:08] INFO: [ITLHost] Starting backup utility 'backup_log_dir_for_component_Forge'.
[07.12.15 19:38:10] INFO: [ITLHost] Starting component 'Forge'.
[07.12.15 19:38:24] INFO: [ITLHost] Starting backup utility 'backup_log_dir_for_component_Dgidx'.
[07.12.15 19:38:25] INFO: [ITLHost] Starting component 'Dgidx'.
[07.12.15 19:38:27] SEVERE: Batch component  'Dgidx' failed. Refer to component logs in /opt/app/endeca/apps/APP_NAME/./logs/dgidxs/Dgidx on host ITLHost.
Occurred while executing line 26 of valid BeanShell script:[[

23|        Forge.archiveLogDir();
24|        Forge.run();
25|        Dgidx.archiveLogDir();
26|        Dgidx.run();
27|
28|        // distributed index, update Dgraphs
29|        DistributeIndexAndApply.run();]]
[07.12.15 19:38:27] SEVERE: Caught an exception while invoking method 'run' on object 'BaselineUpdate'. Releasing locks.
Caused by java.lang.reflect.InvocationTargetException
sun.reflect.NativeMethodAccessorImpl invoke0 - null
Caused by com.endeca.soleng.eac.toolkit.exception.AppControlException
com.endeca.soleng.eac.toolkit.script.Script runBeanShellScript - Error executing valid BeanShell script.
Caused by com.endeca.soleng.eac.toolkit.exception.EacComponentControlException
com.endeca.soleng.eac.toolkit.component.BatchComponent run - Batch component  'Dgidx' failed. Refer to component logs in /opt/app/endeca/apps/APP_NAME/./logs/dgidxs/Dgidx on host ITLHost.

[07.12.15 19:38:27] INFO: Released lock 'update_lock'.

Solution:
Check the Dgidx log at the location below:
/opt/app/endeca/apps/APP_NAME/logs/dgidxs/Dgidx/Dgidx.log
FATAL   07/13/15 00:38:26.899 UTC (1436747906899)       DGIDX   {dgidx,baseline}        ENE Indexer: Error processing records file.
WARN    07/13/15 00:38:26.899 UTC (1436747906899)       DGIDX   {dgidx,baseline}        Lexer/OLT log: level=-1: 2015/07/12 19:38:26 | INFO    | Disabling log callback


If you see the FATAL error above in the Dgidx log, it means that the Forge process completed successfully but produced no output for Dgidx to index, which is why the Dgidx process failed.

Refer to the blog post below for more details:
http://ravihonakamble.blogspot.com/2015/06/how-to-read-data-from-endeca-cas-record.html

Endeca MDEX (Dgraphs) - A few items on the Endeca Dgraph side to improve index and query time


    1) Set the number of Dgraph threads to the number of cores minus 1.
       Example: with 4 cores, set it to 3.
       <arg>--threads</arg>
       <arg>3</arg>

    2) Disable "Why did it match" in production.
       Remove <arg>--whymatch</arg> from the Dgraph defaults.

    3) Remove spelling correction and "Did you mean" from the Dgraph defaults if the front end does not use them:
       <arg>--spl</arg>
       <arg>--dym</arg>

    4) Remove compound dimension search (the --compoundDimSearch argument) from the Dgidx definition in DataIngest.xml if the application does not use it:

<dgidx id="Dgidx" host-id="ITLHost">
    <properties>
      <property name="numLogBackups" value="10" />
      <property name="numIndexBackups" value="3" />
    </properties>
    <args>
      <arg>-v</arg>
      <arg>--compoundDimSearch</arg>
      <arg>--lang</arg>
      <arg>${LANGUAGE_ID}</arg>
    </args>
    <log-dir>./logs/dgidxs/Dgidx</log-dir>
    <input-dir>./data/forge_output</input-dir>
    <output-dir>./data/dgidx_output</output-dir>
    <temp-dir>./data/temp</temp-dir>
    <run-aspell>true</run-aspell>
  </dgidx>
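
For comparison, here is a sketch of the same Dgidx definition with the compound dimension search argument taken out. Everything else is copied unchanged from the definition above; whether the argument can safely be dropped depends on whether the application relies on compound dimension search.

```xml
<dgidx id="Dgidx" host-id="ITLHost">
    <properties>
      <property name="numLogBackups" value="10" />
      <property name="numIndexBackups" value="3" />
    </properties>
    <args>
      <arg>-v</arg>
      <!-- compoundDimSearch argument removed here -->
      <arg>--lang</arg>
      <arg>${LANGUAGE_ID}</arg>
    </args>
    <log-dir>./logs/dgidxs/Dgidx</log-dir>
    <input-dir>./data/forge_output</input-dir>
    <output-dir>./data/dgidx_output</output-dir>
    <temp-dir>./data/temp</temp-dir>
    <run-aspell>true</run-aspell>
</dgidx>
```

A baseline update is needed after the change, since Dgidx arguments only take effect when the index is rebuilt.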