uk.ac.man.entitytagger.matching.matchers
Class VariantDictionaryMatcher

java.lang.Object
  extended by uk.ac.man.entitytagger.matching.Matcher
      extended by uk.ac.man.entitytagger.matching.matchers.VariantDictionaryMatcher
All Implemented Interfaces:
Sizeable

public class VariantDictionaryMatcher
extends Matcher
implements Sizeable

Performs NER dictionary matching against text. The dictionaries should contain all variants of the strings that one would like to match. Objects are created from database details and, optionally, an identifier (tag); on first use, the matcher loads the dictionary terms and identifiers for that tag from the database.

Author:
Martin Gerner

Nested Class Summary
 
Nested classes/interfaces inherited from class uk.ac.man.entitytagger.matching.Matcher
Matcher.Disambiguation
 
Field Summary
private  java.sql.Connection conn
          SQL database connection and table names, from which dictionaries should be loaded initially
private  boolean ignoreCase
           
private  long size
          very rough estimate of the dictionary memory footprint, in bytes
private  java.lang.String[] tableNames
           
private  java.lang.String tag
          the identifier for this particular matcher; dictionary terms will only be loaded from the database where this tag matches a tag column
private  java.lang.String[] terms
          array containing all terms in the dictionary
private  java.lang.String[][] termToIdsMap
          array mapping terms to dictionary identifiers: termToIdsMap[i] contains all IDs for the term terms[i]
private  java.util.regex.Pattern tokenizationPattern
           
 
Constructor Summary
VariantDictionaryMatcher(java.sql.Connection conn, java.lang.String[] tableNames, java.lang.String tag, boolean ignoreCase)
           
VariantDictionaryMatcher(java.lang.String[][] termToIdsMap, java.lang.String[] terms, boolean ignoreCase)
           
 
Method Summary
private  java.util.List<java.lang.Integer> getMatchIds(java.util.List<Pair<java.lang.Integer>> tokenLocations, int i, java.lang.String matchText)
          Performs dictionary matching on the text, finding terms that start at the token given by i.
private  void init()
          Loads the dictionary terms and identifiers from the database, sorts the terms, and sets up the proper mappings
static VariantDictionaryMatcher load(java.io.File inFile, boolean ignoreCase)
           
static VariantDictionaryMatcher load(java.io.InputStream stream, boolean ignoreCase)
           
private static java.util.Map<java.lang.String,java.util.Map<java.lang.String,java.util.Set<java.lang.String>>> loadFileSeparated(java.io.File inFile, boolean ignoreCase)
           
private  java.util.Map<java.lang.String,java.util.Set<java.lang.String>> loadFromDB()
          Loads a dictionary from a database
static java.util.Map<java.lang.String,Matcher> loadSeparated(java.io.File[] inFiles, boolean ignoreCase)
           
static java.util.Map<java.lang.String,Matcher> loadSeparatedFromDB(java.sql.Connection conn, java.lang.String[] tableNames, boolean ignoreCase)
           
static CacheMap<java.lang.String,VariantDictionaryMatcher> loadSeparatedFromDBCached(java.sql.Connection conn, java.lang.String[] tableNames, boolean ignoreCase, long maxSize, java.util.logging.Logger logger)
          Creates a Map from tag identifiers to VariantDictionaryMatcher objects.
private static java.util.Map<java.lang.String,java.util.Set<java.lang.String>> loadStream(java.io.InputStream inputStream, boolean ignoreCase)
           
 java.util.List<Mention> match(java.lang.String text, Document doc)
          Search a given text for mentions
 int size()
           
 long sizeof()
          Gives a rough estimate of the memory consumption of this object.
 
Methods inherited from class uk.ac.man.entitytagger.matching.Matcher
combineMatches, detectEnumerations, disambiguate, isValidMatch, match, match, match, performAcronymResolution
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

terms

private java.lang.String[] terms
array containing all terms in the dictionary


termToIdsMap

private java.lang.String[][] termToIdsMap
array mapping terms to dictionary identifiers: termToIdsMap[i] contains all IDs for the term terms[i]
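
The parallel-array layout described for terms and termToIdsMap can be illustrated with a minimal sketch. It assumes that terms is kept sorted (as the init() description suggests) so that a binary search can find the index i shared by both arrays. The class name, the sample terms, and the identifiers below are hypothetical and are not taken from the library.

```java
import java.util.Arrays;

public class ParallelArrayLookup {
    // Hypothetical sample data: TERMS is sorted, and TERM_TO_IDS[i]
    // holds every identifier for TERMS[i].
    static final String[] TERMS = {"brca1", "p53", "tp53"};
    static final String[][] TERM_TO_IDS = {{"ID:1"}, {"ID:2"}, {"ID:2"}};

    /** Returns the identifiers for a term, or an empty array if the term is unknown. */
    public static String[] idsFor(String term) {
        int i = Arrays.binarySearch(TERMS, term); // requires TERMS to be sorted
        return i >= 0 ? TERM_TO_IDS[i] : new String[0];
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(idsFor("p53")));
        System.out.println(Arrays.toString(idsFor("unknown")));
    }
}
```

Two variants ("p53" and "tp53") map to the same identifier here, which is the situation a variant dictionary is meant to cover.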


conn

private java.sql.Connection conn
SQL database connection and table names, from which dictionaries should be loaded initially


tableNames

private java.lang.String[] tableNames

tag

private java.lang.String tag
the identifier for this particular matcher; dictionary terms will only be loaded from the database where this tag matches a tag column


tokenizationPattern

private final java.util.regex.Pattern tokenizationPattern

ignoreCase

private boolean ignoreCase

size

private long size
very rough estimate of the dictionary memory footprint, in bytes

Constructor Detail

VariantDictionaryMatcher

public VariantDictionaryMatcher(java.lang.String[][] termToIdsMap,
                                java.lang.String[] terms,
                                boolean ignoreCase)

VariantDictionaryMatcher

public VariantDictionaryMatcher(java.sql.Connection conn,
                                java.lang.String[] tableNames,
                                java.lang.String tag,
                                boolean ignoreCase)
Parameters:
conn - Connection to the database from which the dictionary should be loaded
tableNames - Names of the table(s) where terms should be loaded from
tag - identifier (for example a species identifier) specifying what part of the tables to load
ignoreCase - whether to ignore case when matching or not
Method Detail

size

public int size()
Overrides:
size in class Matcher

load

public static VariantDictionaryMatcher load(java.io.File inFile,
                                            boolean ignoreCase)

load

public static VariantDictionaryMatcher load(java.io.InputStream stream,
                                            boolean ignoreCase)

init

private void init()
Loads the dictionary terms and identifiers from the database, sorts the terms, and sets up the proper mappings
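
As a rough illustration of the "sort the terms and set up the proper mappings" step, the sketch below turns a term-to-identifiers map (the shape returned by loadFromDB()) into two index-aligned arrays. How the real init() does this is not documented here; the class name DictionaryIndexSketch and the use of a TreeMap for sorting are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class DictionaryIndexSketch {
    public final String[] terms;          // sorted terms
    public final String[][] termToIdsMap; // termToIdsMap[i] = IDs for terms[i]

    public DictionaryIndexSketch(Map<String, Set<String>> termToIds) {
        // A TreeMap iterates its keys in natural String order, which gives
        // sorted terms and keeps each term aligned with its identifiers.
        TreeMap<String, Set<String>> sorted = new TreeMap<>(termToIds);
        terms = new String[sorted.size()];
        termToIdsMap = new String[sorted.size()][];
        int i = 0;
        for (Map.Entry<String, Set<String>> e : sorted.entrySet()) {
            terms[i] = e.getKey();
            termToIdsMap[i] = e.getValue().toArray(new String[0]);
            i++;
        }
    }

    public static void main(String[] args) {
        // Hypothetical dictionary content.
        Map<String, Set<String>> loaded = new HashMap<>();
        loaded.put("tp53", new HashSet<>(Arrays.asList("ID:2")));
        loaded.put("brca1", new HashSet<>(Arrays.asList("ID:1")));
        DictionaryIndexSketch index = new DictionaryIndexSketch(loaded);
        System.out.println(Arrays.toString(index.terms));
    }
}
```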


loadSeparatedFromDB

public static java.util.Map<java.lang.String,Matcher> loadSeparatedFromDB(java.sql.Connection conn,
                                                                          java.lang.String[] tableNames,
                                                                          boolean ignoreCase)

loadSeparatedFromDBCached

public static CacheMap<java.lang.String,VariantDictionaryMatcher> loadSeparatedFromDBCached(java.sql.Connection conn,
                                                                                            java.lang.String[] tableNames,
                                                                                            boolean ignoreCase,
                                                                                            long maxSize,
                                                                                            java.util.logging.Logger logger)
Creates a Map from tag identifiers to VariantDictionaryMatcher objects. Using the map, dictionaries for particular tags can be retrieved and used for gene NER matching for that particular species. The map is cached: the first time a particular tag is retrieved, its dictionary is loaded from the database; subsequent accesses use the pre-loaded dictionary. If the total size of the dictionaries (the sum of their sizeof() values) exceeds maxSize, rarely used dictionaries are unloaded from memory. Accessing an unloaded dictionary causes it to be loaded from the database again.

Parameters:
conn - Connection to the database from which the dictionaries should be loaded
tableNames - The tables from which the dictionaries should be loaded
maxSize - The maximum size that the user would like the dictionaries to occupy, in bytes. The map will try to adhere to this, roughly. Also note that the size estimates of the dictionaries are very rough.
logger -
Returns:
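
The caching behaviour described for loadSeparatedFromDBCached (load on first access, unload rarely used entries once a size budget is exceeded, reload on later access) can be sketched as follows. This is not the CacheMap implementation, whose internals are not documented here. The class SizeBoundedCache, its loader and sizer parameters, and the least-recently-used eviction policy are all assumptions chosen for illustration.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.ToLongFunction;

public class SizeBoundedCache<K, V> {
    // accessOrder = true: iteration runs from least to most recently used.
    private final LinkedHashMap<K, V> entries = new LinkedHashMap<>(16, 0.75f, true);
    private final Function<K, V> loader;    // stands in for "load from the database"
    private final ToLongFunction<V> sizer;  // stands in for sizeof()
    private final long maxSize;
    private long totalSize = 0;

    public SizeBoundedCache(Function<K, V> loader, ToLongFunction<V> sizer, long maxSize) {
        this.loader = loader;
        this.sizer = sizer;
        this.maxSize = maxSize;
    }

    public V get(K key) {
        V value = entries.get(key); // a hit also marks the entry as recently used
        if (value == null) {
            value = loader.apply(key);
            entries.put(key, value);
            totalSize += sizer.applyAsLong(value);
            evictUntilWithinBudget(key);
        }
        return value;
    }

    // Drops least recently used entries, but never the one that was just loaded.
    private void evictUntilWithinBudget(K justLoaded) {
        Iterator<Map.Entry<K, V>> it = entries.entrySet().iterator();
        while (totalSize > maxSize && it.hasNext()) {
            Map.Entry<K, V> eldest = it.next();
            if (eldest.getKey().equals(justLoaded)) {
                continue;
            }
            totalSize -= sizer.applyAsLong(eldest.getValue());
            it.remove();
        }
    }

    public boolean isLoaded(K key) {
        return entries.containsKey(key); // does not affect the access order
    }

    public static void main(String[] args) {
        SizeBoundedCache<String, String> cache =
            new SizeBoundedCache<String, String>(k -> k.toUpperCase(), String::length, 6);
        cache.get("abc");
        cache.get("def");
        cache.get("abc"); // "def" is now the least recently used entry
        cache.get("ghi"); // exceeds the budget, so "def" is unloaded
        System.out.println(cache.isLoaded("def"));
    }
}
```

Because the just-loaded entry is never evicted, a single oversized entry can push the cache past maxSize, which is in line with the note that the budget is only adhered to roughly.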

loadSeparated

public static java.util.Map<java.lang.String,Matcher> loadSeparated(java.io.File[] inFiles,
                                                                    boolean ignoreCase)

loadFileSeparated

private static java.util.Map<java.lang.String,java.util.Map<java.lang.String,java.util.Set<java.lang.String>>> loadFileSeparated(java.io.File inFile,
                                                                                                                                 boolean ignoreCase)

loadStream

private static java.util.Map<java.lang.String,java.util.Set<java.lang.String>> loadStream(java.io.InputStream inputStream,
                                                                                          boolean ignoreCase)

loadFromDB

private java.util.Map<java.lang.String,java.util.Set<java.lang.String>> loadFromDB()
Loads a dictionary from a database

Returns:

match

public java.util.List<Mention> match(java.lang.String text,
                                     Document doc)
Description copied from class: Matcher
Search a given text for mentions

Specified by:
match in class Matcher
Parameters:
text - the text to search for mentions
doc - object containing the document ID for the mentions
Returns:
a list of gene mentions found in the text using the loaded dictionary

getMatchIds

private java.util.List<java.lang.Integer> getMatchIds(java.util.List<Pair<java.lang.Integer>> tokenLocations,
                                                      int i,
                                                      java.lang.String matchText)
Performs dictionary matching on the text, finding terms that start at the given token.

Parameters:
tokenLocations - a list of all token coordinates in the text
i - the token at which to start the scan (presumably its index in tokenLocations)
matchText - the text to scan
Returns:
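
The idea of scanning from a token start can be sketched as below: token coordinates are computed once, and for each starting token the candidate span is extended one token at a time and checked against the dictionary. The real method appears to work against the sorted terms array; this sketch swaps in a HashSet lookup for brevity. The class name, the tokenization pattern, and the maxTokens limit are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenScanSketch {
    // Hypothetical tokenizer: a token is a run of letters or digits.
    static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}]+");

    /** Returns {start, end} character offsets of dictionary terms aligned to token boundaries. */
    public static List<int[]> findMatches(String text, Set<String> dictionary, int maxTokens) {
        // Collect the start/end coordinates of every token.
        List<int[]> tokenLocations = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokenLocations.add(new int[]{m.start(), m.end()});
        }

        List<int[]> matches = new ArrayList<>();
        for (int i = 0; i < tokenLocations.size(); i++) {
            int start = tokenLocations.get(i)[0];
            // Extend the candidate span one token at a time, starting at token i.
            for (int j = i; j < tokenLocations.size() && j < i + maxTokens; j++) {
                int end = tokenLocations.get(j)[1];
                if (dictionary.contains(text.substring(start, end))) {
                    matches.add(new int[]{start, end});
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("p53", "tumor protein p53"));
        for (int[] span : findMatches("the p53 gene and tumor protein p53", dictionary, 3)) {
            System.out.println(span[0] + "-" + span[1]);
        }
    }
}
```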

sizeof

public long sizeof()
Gives a rough estimate of the memory consumption of this object.

Specified by:
sizeof in interface Sizeable
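
A rough estimate of the kind sizeof() describes could be computed along these lines. The actual formula used by the class is not documented; the per-string overhead and reference size below are assumed constants, and the class name SizeEstimateSketch is hypothetical.

```java
public class SizeEstimateSketch {
    static final long STRING_OVERHEAD = 40; // assumed bytes per String object, excluding characters
    static final long REFERENCE = 8;        // assumed bytes per array slot

    /** Very rough footprint, in bytes, of the two dictionary arrays. */
    public static long estimate(String[] terms, String[][] termToIdsMap) {
        long total = 0;
        for (String term : terms) {
            total += REFERENCE + STRING_OVERHEAD + 2L * term.length(); // 2 bytes per char assumed
        }
        for (String[] ids : termToIdsMap) {
            total += REFERENCE; // slot holding the inner array
            for (String id : ids) {
                total += REFERENCE + STRING_OVERHEAD + 2L * id.length();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        String[] terms = {"brca1", "tp53"};
        String[][] ids = {{"ID:1"}, {"ID:2"}};
        System.out.println(estimate(terms, ids) + " bytes (rough)");
    }
}
```

An estimate like this is cheap to compute and good enough for a cache that only needs to compare dictionaries against a budget, which matches the warning that the size figures are very rough.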