java.lang.Object
  uk.ac.man.entitytagger.matching.Matcher
    uk.ac.man.entitytagger.matching.matchers.VariantDictionaryMatcher
public class VariantDictionaryMatcher
Class for performing NER dictionary matching against text. The dictionaries should contain a list of all possible variations of the strings that one would like to match. Objects of this class are created using database details and, optionally, an identifier; when first used, the matcher loads the dictionary terms and identifiers for the specified identifier from the database.
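The load-on-first-use behaviour described above can be illustrated with a minimal sketch. It is not the library's code: the class name LazyDictionary, the Supplier standing in for the database query, and the loadCount counter are all hypothetical and exist only for this example.

```java
import java.util.function.Supplier;

/** Hypothetical illustration of load-on-first-use; not the library's code. */
final class LazyDictionary {
    private final Supplier<String[]> loader; // stands in for the database query
    private String[] terms;                  // stays null until the first use
    private int loadCount = 0;               // only here to make the laziness visible

    LazyDictionary(Supplier<String[]> loader) {
        this.loader = loader;
    }

    /** Loads the terms once; later calls reuse the loaded array. */
    private synchronized void init() {
        if (terms == null) {
            terms = loader.get();
            loadCount++;
        }
    }

    int size() {
        init();
        return terms.length;
    }

    int loadCount() {
        return loadCount;
    }

    public static void main(String[] args) {
        LazyDictionary d = new LazyDictionary(() -> new String[] {"human", "homo sapiens"});
        System.out.println("loads before first use: " + d.loadCount());
        System.out.println("size: " + d.size());
        System.out.println("loads after use: " + d.loadCount());
    }
}
```

Constructing the object is cheap in this sketch; the cost of loading is paid only by the first call that needs the terms.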
Nested Class Summary |
---|
Nested classes/interfaces inherited from class uk.ac.man.entitytagger.matching.Matcher |
---|
Matcher.Disambiguation |
Field Summary | |
---|---|
private java.sql.Connection | conn: SQL database connection and table names, from which dictionaries should be loaded initially |
private boolean | ignoreCase |
private long | size: very rough estimate of the dictionary memory footprint, in bytes |
private java.lang.String[] | tableNames |
private java.lang.String | tag: the identifier for this particular matcher; dictionary terms will only be loaded from the database where this tag matches a tag column |
private java.lang.String[] | terms: array containing all terms in the dictionary |
private java.lang.String[][] | termToIdsMap: array mapping terms to dictionary identifiers: termToIdsMap[i] contains all IDs for the term terms[i] |
private java.util.regex.Pattern | tokenizationPattern |
Constructor Summary | |
---|---|
VariantDictionaryMatcher(java.sql.Connection conn, java.lang.String[] tableNames, java.lang.String tag, boolean ignoreCase) | |
VariantDictionaryMatcher(java.lang.String[][] termToIdsMap, java.lang.String[] terms, boolean ignoreCase) | |
Method Summary | |
---|---|
private java.util.List<java.lang.Integer> | getMatchIds(java.util.List<Pair<java.lang.Integer>> tokenLocations, int i, java.lang.String matchText): Performs dictionary matching on the text, finding terms that start with the token 'token'. |
private void | init(): Will load the dictionary terms and identifiers from the database, sort the terms and set up the proper mappings |
static VariantDictionaryMatcher | load(java.io.File inFile, boolean ignoreCase) |
static VariantDictionaryMatcher | load(java.io.InputStream stream, boolean ignoreCase) |
private static java.util.Map<java.lang.String,java.util.Map<java.lang.String,java.util.Set<java.lang.String>>> | loadFileSeparated(java.io.File inFile, boolean ignoreCase) |
private java.util.Map<java.lang.String,java.util.Set<java.lang.String>> | loadFromDB(): Loads a dictionary from a database |
static java.util.Map<java.lang.String,Matcher> | loadSeparated(java.io.File[] inFiles, boolean ignoreCase) |
static java.util.Map<java.lang.String,Matcher> | loadSeparatedFromDB(java.sql.Connection conn, java.lang.String[] tableNames, boolean ignoreCase) |
static CacheMap<java.lang.String,VariantDictionaryMatcher> | loadSeparatedFromDBCached(java.sql.Connection conn, java.lang.String[] tableNames, boolean ignoreCase, long maxSize, java.util.logging.Logger logger): Creates a Map from tag identifiers to VariantDictionaryMatcher objects. |
private static java.util.Map<java.lang.String,java.util.Set<java.lang.String>> | loadStream(java.io.InputStream inputStream, boolean ignoreCase) |
java.util.List<Mention> | match(java.lang.String text, Document doc): Search a given text for mentions |
int | size() |
long | sizeof(): Gives a rough estimate of the memory consumption of this object. |
Methods inherited from class uk.ac.man.entitytagger.matching.Matcher |
---|
combineMatches, detectEnumerations, disambiguate, isValidMatch, match, match, match, performAcronymResolution |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
private java.lang.String[] terms
private java.lang.String[][] termToIdsMap
private java.sql.Connection conn
private java.lang.String[] tableNames
private java.lang.String tag
private final java.util.regex.Pattern tokenizationPattern
private boolean ignoreCase
private long size
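The relationship between the terms and termToIdsMap fields (termToIdsMap[i] holds all IDs for terms[i]) can be sketched as a lookup over two parallel arrays. This is an illustrative sketch, not the class's actual code: the name TermLookup and the method idsFor are hypothetical, the use of binary search assumes the terms array is sorted in natural String order, and the sketch assumes that terms were lower-cased at load time when ignoreCase is set. The identifiers in main are example values only.

```java
import java.util.Arrays;
import java.util.Locale;

/** Hypothetical parallel-array lookup; names and behaviour are assumptions. */
final class TermLookup {
    private final String[] terms;          // assumed sorted in natural String order
    private final String[][] termToIdsMap; // termToIdsMap[i] holds the IDs for terms[i]
    private final boolean ignoreCase;

    TermLookup(String[] sortedTerms, String[][] termToIdsMap, boolean ignoreCase) {
        this.terms = sortedTerms;
        this.termToIdsMap = termToIdsMap;
        this.ignoreCase = ignoreCase;
    }

    /** Returns all IDs for the candidate string, or an empty array if it is not a term. */
    String[] idsFor(String candidate) {
        // Assumption: when ignoreCase is set, the stored terms are already lower case.
        String key = ignoreCase ? candidate.toLowerCase(Locale.ROOT) : candidate;
        int i = Arrays.binarySearch(terms, key);
        return i >= 0 ? termToIdsMap[i] : new String[0];
    }

    public static void main(String[] args) {
        String[] terms = {"e. coli", "homo sapiens", "human"};
        String[][] ids = {{"562"}, {"9606"}, {"9606"}};
        TermLookup lookup = new TermLookup(terms, ids, true);
        System.out.println(Arrays.toString(lookup.idsFor("Human")));
        System.out.println(Arrays.toString(lookup.idsFor("mouse")));
    }
}
```

Keeping the two arrays index-aligned means one search over terms is enough to reach every identifier for a variant.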
Constructor Detail |
---|
public VariantDictionaryMatcher(java.lang.String[][] termToIdsMap, java.lang.String[] terms, boolean ignoreCase)
public VariantDictionaryMatcher(java.sql.Connection conn, java.lang.String[] tableNames, java.lang.String tag, boolean ignoreCase)
Parameters:
conn - Connection to the database from which the dictionary should be loaded
tableNames - Names of the table(s) from which terms should be loaded
tag - identifier (documented as "species" in the source) specifying what part of the tables to load
ignoreCase - whether to ignore case when matching
Method Detail |
---|
public int size()
size in class Matcher
public static VariantDictionaryMatcher load(java.io.File inFile, boolean ignoreCase)
public static VariantDictionaryMatcher load(java.io.InputStream stream, boolean ignoreCase)
private void init()
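According to the method summary, init() loads the terms and identifiers, sorts the terms and sets up the mappings. The sorting and mapping step might look roughly like the sketch below, which starts from the Map<String, Set<String>> shape returned by loadFromDB() and loadStream(). The class name DictionaryArrays is hypothetical, and using a TreeMap to obtain the sort order is an assumption, not necessarily what the class does.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical conversion from a term-to-IDs map into sorted parallel arrays. */
final class DictionaryArrays {
    final String[] terms;          // sorted terms
    final String[][] termToIdsMap; // termToIdsMap[i] holds the IDs for terms[i]

    DictionaryArrays(Map<String, Set<String>> termToIds) {
        // A TreeMap iterates its keys in natural (sorted) String order.
        TreeMap<String, Set<String>> sorted = new TreeMap<>(termToIds);
        terms = new String[sorted.size()];
        termToIdsMap = new String[sorted.size()][];
        int i = 0;
        for (Map.Entry<String, Set<String>> e : sorted.entrySet()) {
            terms[i] = e.getKey();
            termToIdsMap[i] = e.getValue().toArray(new String[0]);
            i++;
        }
    }

    public static void main(String[] args) {
        Map<String, Set<String>> loaded = new HashMap<>();
        loaded.put("human", new LinkedHashSet<>(Arrays.asList("9606")));
        loaded.put("e. coli", new LinkedHashSet<>(Arrays.asList("562")));
        DictionaryArrays arrays = new DictionaryArrays(loaded);
        System.out.println(Arrays.toString(arrays.terms));
        System.out.println(Arrays.deepToString(arrays.termToIdsMap));
    }
}
```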
public static java.util.Map<java.lang.String,Matcher> loadSeparatedFromDB(java.sql.Connection conn, java.lang.String[] tableNames, boolean ignoreCase)
public static CacheMap<java.lang.String,VariantDictionaryMatcher> loadSeparatedFromDBCached(java.sql.Connection conn, java.lang.String[] tableNames, boolean ignoreCase, long maxSize, java.util.logging.Logger logger)
Parameters:
conn - connection to the database from which the dictionaries should be loaded
tableNames - The tables from which the dictionaries should be loaded
maxSize - The maximum size that the user would like the dictionaries to occupy, in bytes. The map will try to adhere to this, roughly. Also note that the size estimates of the dictionaries are very rough.
logger -
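The idea of a map that roughly adheres to a maxSize budget in bytes can be sketched as follows. This is not the CacheMap class, whose behaviour is not documented on this page: the class SizeBoundedCache, its least-recently-used eviction policy and the caller-supplied size function are all assumptions made for illustration.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToLongFunction;

/** Hypothetical size-bounded cache; the LRU policy is an assumption. */
final class SizeBoundedCache<K, V> {
    // accessOrder = true: iteration runs from least to most recently used.
    private final LinkedHashMap<K, V> entries = new LinkedHashMap<>(16, 0.75f, true);
    private final ToLongFunction<V> sizeOf; // rough size estimate per value, in bytes
    private final long maxSize;
    private long currentSize = 0;

    SizeBoundedCache(long maxSize, ToLongFunction<V> sizeOf) {
        this.maxSize = maxSize;
        this.sizeOf = sizeOf;
    }

    V get(K key) {
        return entries.get(key); // also marks the entry as recently used
    }

    void put(K key, V value) {
        V old = entries.put(key, value);
        if (old != null) {
            currentSize -= sizeOf.applyAsLong(old);
        }
        currentSize += sizeOf.applyAsLong(value);
        // Evict least recently used entries while over budget, always keeping the newest one.
        Iterator<Map.Entry<K, V>> it = entries.entrySet().iterator();
        while (currentSize > maxSize && entries.size() > 1 && it.hasNext()) {
            Map.Entry<K, V> eldest = it.next();
            currentSize -= sizeOf.applyAsLong(eldest.getValue());
            it.remove();
        }
    }

    boolean containsKey(K key) {
        return entries.containsKey(key);
    }

    long currentSize() {
        return currentSize;
    }

    public static void main(String[] args) {
        SizeBoundedCache<String, String> cache = new SizeBoundedCache<>(10, String::length);
        cache.put("a", "12345");
        cache.put("b", "1234");
        cache.get("a");        // "b" is now the least recently used entry
        cache.put("c", "123"); // pushes the total over the budget
        System.out.println(cache.containsKey("b"));
        System.out.println(cache.currentSize());
    }
}
```

Because a single oversized entry is never evicted on insertion, the budget is only followed approximately, which is consistent with the "roughly" in the parameter description.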
public static java.util.Map<java.lang.String,Matcher> loadSeparated(java.io.File[] inFiles, boolean ignoreCase)
private static java.util.Map<java.lang.String,java.util.Map<java.lang.String,java.util.Set<java.lang.String>>> loadFileSeparated(java.io.File inFile, boolean ignoreCase)
private static java.util.Map<java.lang.String,java.util.Set<java.lang.String>> loadStream(java.io.InputStream inputStream, boolean ignoreCase)
private java.util.Map<java.lang.String,java.util.Set<java.lang.String>> loadFromDB()
public java.util.List<Mention> match(java.lang.String text, Document doc)
Matcher
match in class Matcher
Parameters:
text - the text to search for mentions
doc - object containing the document ID for the mentions
private java.util.List<java.lang.Integer> getMatchIds(java.util.List<Pair<java.lang.Integer>> tokenLocations, int i, java.lang.String matchText)
Parameters:
tokenLocations - a list of all token coordinates in the text
token - the token at which the scan should start (presumably the parameter i in the signature)
text - the text to scan (presumably the parameter matchText in the signature)
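A scan that starts at one token and looks for dictionary terms beginning there can be sketched as below. This is a simplified illustration under stated assumptions, not the method's implementation: the class TokenScan, the \w+ tokenization pattern, the int[] start/end pairs (standing in for Pair<Integer>) and the exhaustive extension over all following tokens are hypothetical. A real implementation would likely stop extending as soon as no term can share the current prefix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher; // java.util.regex.Matcher, not the entitytagger Matcher
import java.util.regex.Pattern;

/** Hypothetical token-based dictionary scan; all names are assumptions. */
final class TokenScan {
    // Assumed tokenization: runs of word characters.
    private static final Pattern TOKEN = Pattern.compile("\\w+");

    /** Returns the [start, end) character offsets of every token in the text. */
    static List<int[]> tokenLocations(String text) {
        List<int[]> locations = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            locations.add(new int[] {m.start(), m.end()});
        }
        return locations;
    }

    /** Returns the indices (into sortedTerms) of all terms that start at token i. */
    static List<Integer> matchIdsFrom(List<int[]> locations, int i, String text, String[] sortedTerms) {
        List<Integer> hits = new ArrayList<>();
        int start = locations.get(i)[0];
        // Extend the candidate one token at a time and look each candidate up.
        for (int j = i; j < locations.size(); j++) {
            String candidate = text.substring(start, locations.get(j)[1]);
            int index = Arrays.binarySearch(sortedTerms, candidate);
            if (index >= 0) {
                hits.add(index);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "the homo sapiens genome";
        String[] terms = {"homo", "homo sapiens", "human"};
        List<int[]> locations = tokenLocations(text);
        System.out.println(matchIdsFrom(locations, 1, text, terms));
    }
}
```

Returning indices rather than strings lets the caller read the matching identifiers straight from an index-aligned array such as termToIdsMap.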
public long sizeof()
Specified by: sizeof in interface Sizeable
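A rough memory estimate of the kind sizeof() is described as returning could be computed along these lines. The per-object constants below are assumptions chosen for illustration; actual overheads depend on the JVM and its settings, and the class name RoughSize is hypothetical.

```java
/** Hypothetical rough footprint estimate; the constants are assumed, not measured. */
final class RoughSize {
    static final long STRING_OVERHEAD = 40; // assumed fixed cost per String, in bytes
    static final long REFERENCE = 8;        // assumed cost per array slot, in bytes

    /** Estimates the bytes used by the terms and their identifiers. */
    static long estimate(String[] terms, String[][] termToIdsMap) {
        long bytes = 0;
        for (String term : terms) {
            bytes += REFERENCE + STRING_OVERHEAD + 2L * term.length(); // assumes 2 bytes per char
        }
        for (String[] ids : termToIdsMap) {
            bytes += REFERENCE; // slot holding the inner array
            for (String id : ids) {
                bytes += REFERENCE + STRING_OVERHEAD + 2L * id.length();
            }
        }
        return bytes;
    }

    public static void main(String[] args) {
        String[] terms = {"human", "homo sapiens"};
        String[][] ids = {{"9606"}, {"9606"}};
        System.out.println(estimate(terms, ids) + " bytes (rough estimate)");
    }
}
```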