Class SocialGeo

java.lang.Object
org.opensextant.extractors.geo.social.SocialGeo
Direct Known Subclasses:
GeoInferencer

public abstract class SocialGeo extends Object
A base class that provides hooks for logging, dev/test/evaluation, common dictionaries and resources, and connectivity helpers.
Author:
ubaldino
  • Field Details

    • log

      protected final org.slf4j.Logger log
    • evalMode

      protected boolean evalMode
    • inferencerID

      public String inferencerID
    • inferencerDescription

      public String inferencerDescription
    • countries

      protected org.opensextant.util.GeonamesUtility countries
    • allCountries

      protected Map<String,org.opensextant.data.Country> allCountries
      If you populate allCountries with
    • basicCountryNames

      protected Map<String,org.opensextant.data.Country> basicCountryNames
      A particular hashing of the list of country names.
    • US_STATES

      protected final Map<String,org.opensextant.data.Place> US_STATES
  • Constructor Details

    • SocialGeo

      public SocialGeo()
  • Method Details

    • configure

      public abstract void configure() throws org.opensextant.ConfigException
      Configure your implementation.
      Throws:
      org.opensextant.ConfigException
    • close

      public abstract void close()
      Release resources quietly.
    • isValue

      public static boolean isValue(String v)
      Generally useful test of string values.
      Parameters:
      v -
      Returns:
    • scoreCountryPrediction

      public int scoreCountryPrediction(org.opensextant.data.Country C, org.opensextant.data.social.Tweet tw)
      This score serves as a boost for disambiguating ties or close scores in predictions.
      
       Points:  A Country may score in 0 or more of these three categories: TZ, UTC, LANG.
      
            TZ
       +3 - Country contains timezone named by Tweet.timezone
      
            UTC
       +3 - Country contains UTC offset named by Tweet.utcOffset (Hours);
       +4 - Or if Tweet is in period of DST and Country observes that DST offset.
            This is slightly less believable because users apparently do not always adjust TZ and time on devices.
            Just the same, if the country uses DST and so does the user, then that is more significant than a match without DST.
      
            LANG
       +3 - Language of User and of Text are both Primary language of Country
       +2 - either language is Primary language of Country
       +1 - language of text is spoken in Country
      
            LON
            TODO:  consider (Country.LatLon ~ Tweet.UTC) ? within 5 degrees.  Countries vary in size, so this
            makes little sense for them.  But for Cities and States it makes more sense.
      
       MAX score is 3 + 4 + 3 = 10
       
      Parameters:
      C - a country prediction for the tweet.
      tw - the tweet
      Returns:
      score from 0 to 10, per the point scheme above
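      The point scheme above can be sketched as follows. This is a minimal illustration, not the library's implementation: the class name CountryScoreSketch and the six boolean evidence flags are hypothetical stand-ins for the checks made against Country and Tweet. It assumes the two UTC awards are mutually exclusive, which is what the stated maximum of 3 + 4 + 3 = 10 implies.

```java
public class CountryScoreSketch {

    /**
     * Hypothetical scoring helper. Each flag stands in for one check that the
     * real method would make against a Country and a Tweet (assumption).
     */
    public static int score(boolean tzMatch,
                            boolean utcMatch,
                            boolean dstMatch,
                            boolean userLangPrimary,
                            boolean textLangPrimary,
                            boolean textLangSpoken) {
        int points = 0;

        // TZ: +3 if the country contains the tweet's named timezone.
        if (tzMatch) {
            points += 3;
        }

        // UTC: +4 for a DST-aware offset match, otherwise +3 for a plain offset match.
        if (dstMatch) {
            points += 4;
        } else if (utcMatch) {
            points += 3;
        }

        // LANG: +3 if both languages are primary, +2 if either is, +1 if text language is spoken.
        if (userLangPrimary && textLangPrimary) {
            points += 3;
        } else if (userLangPrimary || textLangPrimary) {
            points += 2;
        } else if (textLangSpoken) {
            points += 1;
        }
        return points;
    }

    public static void main(String[] args) {
        // Every category matched at its highest tier: 3 + 4 + 3.
        System.out.println(score(true, true, true, true, true, true));
    }
}
```

      Because the score is small and additive, it is best suited to breaking ties between otherwise similar country predictions rather than ranking on its own.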
    • populateBasicCountryNames

      public void populateBasicCountryNames()
      Create a lookup of the most common country names. This is just a pure ASCII listing... of ISO country names. To get more country names, populateAllCountries() should be used.
    • populateAllCountries

      public void populateAllCountries(SolrGazetteer gaz)
      Populate the allCountries listing. Not all pipeline apps make use of SolrGazetteer or do geo work so this is not part of setup.
      Parameters:
      gaz -
    • loadProvinceNames

      public void loadProvinceNames() throws IOException
      Geonames helpers. Attach the province name if useful. Ideally, keep data coded in databases and render the name at presentation or export time, if needed. There is no need to store superfluous name data that merely reflects what is already coded.
      Throws:
      IOException
    • getUSStateByName

      public org.opensextant.data.Place getUSStateByName(String name)
      Lookup US States.
      Parameters:
      name -
      Returns:
    • getUSStateByCode

      public org.opensextant.data.Place getUSStateByCode(String code)
      Lookup a US state by a dot-separated code: country code + FIPS numeric code.
      Parameters:
      code - CC.FF
      Returns:
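      To illustrate the CC.FF shape described above, here is a small sketch that builds and splits such a code. The class StateCodeSketch and its methods are hypothetical and are not part of SocialGeo; the sample value US.06 is only an example of the format (06 is the FIPS numeric code for California), not a confirmed key in the US_STATES map.

```java
import java.util.Locale;

public class StateCodeSketch {

    /** Joins a country code and a FIPS numeric code into the "CC.FF" form. */
    public static String join(String countryCode, String fipsCode) {
        return countryCode.toUpperCase(Locale.ROOT) + "." + fipsCode;
    }

    /**
     * Splits a "CC.FF" code into its two parts.
     * Returns null if the value is null or does not have text on both sides of a dot.
     */
    public static String[] split(String code) {
        if (code == null) {
            return null;
        }
        int dot = code.indexOf('.');
        if (dot <= 0 || dot == code.length() - 1) {
            return null;
        }
        return new String[] { code.substring(0, dot), code.substring(dot + 1) };
    }

    public static void main(String[] args) {
        String code = join("us", "06");
        String[] parts = split(code);
        System.out.println(code + " -> " + parts[0] + ", " + parts[1]);
    }
}
```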
    • loadUSStates

      public void loadUSStates() throws IOException
      CAVEAT: For now this only loads US states, despite us loading
      Throws:
      IOException
    • getConfidence

      protected double getConfidence(double c)
    • getCountryNamed

      public org.opensextant.data.Country getCountryNamed(String nm)
      Parameters:
      nm -
      Returns:
    • inferPlaceRecursively

      public org.opensextant.data.Place inferPlaceRecursively(SolrGazetteer gaz, org.opensextant.data.Geocoding poi) throws org.apache.solr.client.solrj.SolrServerException, IOException
      Throws:
      org.apache.solr.client.solrj.SolrServerException
      IOException
    • setProvinceName

      public void setProvinceName(org.opensextant.data.Place somePlace)
      Set the province name from the codes given on somePlace.
      Parameters:
      somePlace - Place object with CC and ADM1 codes set.
    • inferPlaceRecursively

      public org.opensextant.data.Place inferPlaceRecursively(SolrGazetteer gaz, org.opensextant.data.Geocoding poi, boolean requireADM1) throws org.apache.solr.client.solrj.SolrServerException, IOException
      Try to find the closest P/PPL* (city or village), or a local site or landmark, within 5 km. If none is found, try a radius of 10 km, then 30 km. If there is still no match, try a region, say an ADM1 or ADM2 place boundary, if one is nearby within 100 km. If not, then the point may be in a remote, sparse territory or over water.

      TODO: Province ID is helpful for many things; missing ADM1 codes are a general problem. Fix missing ADM1 codes in the gazetteer, e.g., use ESRI free data, geonames.org, etc. NOTE: there are not any missing ADM1 codes; USGS is solid.

      TODO: Solr 4.x has a major memory problem when trying to find closest points. In theory the field should be indexed rather well; however, for geodetic search Solr tries to load ALL of the index for that 'geo' field into RAM (based on experience). From there it can sort geodetically to find a closest point. This is not helpful, so as a workaround this recursive outward search finds any items close by (SolrGazetteer.placeAt() sorts results outside of Solr). The issue is that the search returns only the first 25 rows of unsorted results, which are then sorted geodetically. So for this workaround, try to minimize results by selecting feature types or similar.
      Parameters:
      gaz - an initialized SolrGazetteer
      poi - point of interest
      requireADM1 - true if ADM1 level resolution is desired.
      Returns:
      a single place that appears to be closest to POI
      Throws:
      org.apache.solr.client.solrj.SolrServerException
      IOException
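      The outward search described for this method can be sketched without Solr as follows. This is a simplified illustration under stated assumptions, not the actual implementation: ExpandingRadiusSketch and Candidate are hypothetical names, an in-memory list replaces the SolrGazetteer query made at each step, feature types are reduced to "P" (populated place) and "A" (administrative region), and the local site or landmark case is omitted.

```java
import java.util.Arrays;
import java.util.List;

public class ExpandingRadiusSketch {

    /** Hypothetical stand-in for a gazetteer entry; not an OpenSextant type. */
    public static final class Candidate {
        public final String name;
        public final String featureClass; // "P" = populated place, "A" = admin region
        public final double lat;
        public final double lon;

        public Candidate(String name, String featureClass, double lat, double lon) {
            this.name = name;
            this.featureClass = featureClass;
            this.lat = lat;
            this.lon = lon;
        }
    }

    // Steps from the description: 5, 10, 30 km for places, then 100 km for regions.
    private static final double[] RADII_KM = {5, 10, 30, 100};
    private static final String[] FEATURE_CLASS = {"P", "P", "P", "A"};
    private static final double EARTH_RADIUS_KM = 6371.0;

    /** Great-circle distance in kilometers using the haversine formula. */
    public static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    /** Returns the nearest candidate found at the first step that has any match, else null. */
    public static Candidate closest(List<Candidate> places, double lat, double lon) {
        for (int step = 0; step < RADII_KM.length; step++) {
            Candidate best = null;
            double bestKm = Double.MAX_VALUE;
            for (Candidate c : places) {
                if (!FEATURE_CLASS[step].equals(c.featureClass)) {
                    continue; // restrict each step to one feature class to keep results small
                }
                double km = haversineKm(lat, lon, c.lat, c.lon);
                if (km <= RADII_KM[step] && km < bestKm) {
                    best = c;
                    bestKm = km;
                }
            }
            if (best != null) {
                return best;
            }
        }
        return null; // remote, sparse territory or over water
    }

    public static void main(String[] args) {
        List<Candidate> places = Arrays.asList(
                new Candidate("Town", "P", 0.0, 0.05),
                new Candidate("Region", "A", 0.0, 0.5));
        Candidate hit = closest(places, 0.0, 0.0);
        System.out.println(hit == null ? "no match" : hit.name);
    }
}
```

      In the sample data, the town lies roughly 5.6 km from the point of interest, so it is missed at the 5 km step and found at the 10 km step. Starting small and widening keeps each result set short, which matters when only the first rows of an unsorted result are sorted by distance.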
    • flattenPrecision

      protected static void flattenPrecision(org.opensextant.data.Geocoding geo, org.opensextant.extractors.xcoord.GeocoordPrecision prec)
      Facilitates getting a simple precision metric. +/- 1 m is sufficient for tracking points extracted from text.
      Parameters:
      geo -
      prec -