Wednesday, 19 September 2018

Classifying tags in R with grepl in ifelse

I am having an issue with some R code. I am trying to classify text values from one column into a new column. My data is a collection of tags used on the gis.stackexchange site, about 2,500 rows. My goal is to classify each tag as COTS, FOSS, or other. Reviewing the tags, there are two "scenarios": tags that appear only once (e.g. anaconda) and tags that share a common term across several variants (e.g. qgis, qgis-desktop, qgis-server). Both scenarios occur for COTS and FOSS tags alike.

My approach was to do the following:

  1. Create a vector with all tags that represent FOSS
  2. Create a vector with all tags that represent COTS
  3. Create a new column called software and populate it using ifelse
  4. In the ifelse, where tagName is %in% the FOSS vector, code the row as FOSS
  5. Also in the ifelse, use grep with the FOSS vector to pattern-match tags that share a term (e.g. qgis) and code those rows as FOSS
  6. Repeat steps 4 and 5 for COTS

The problem is that tags which should be picked up by the last grep (COTS) are being coded as FOSS. Obviously something is wrong, but I cannot figure out what. Below is the code and a link to the source data.

Shared folder with source CSV

Tag vectors -- FOSS and COTS

foss <- c("anaconda", "android", "apache", "aptana", "google", "blender", "cordova", "docker", "drupal", "eclipse", "facebook", "firefox", "ftools", "fwtools", "geodjango", "geopandas", "geomoose", "geonetwork", "geonode", "geotools", "ggmap", "ggplot2", "gimp", "github", "gme", "chrome", "gvsig", "h2gis", "hadoop", "inkscape", "lastools", "laszip", "mongodb", "neo4j", "numpy", "open-data-kit", "opencv", "opendronemap", "openev", "opengeo-suite-composer", "opengl", "openjump", "openstreetmap", "opentopomap", "opentripplanner", "openwind", "orfeo-toolbox", "pandas", "pdal", "pgrouting", "pg2shape", "phonegap", "plpgsql", "ppygis", "pydev", "pygdal", "pyproj", "pyqspatialite", "rasterlite", "raster2pgsql", "rdal", "saga", "shapely", "shp2pgsql", "sp", "sf", "spatialite-gui", "three-js", "unity3d", "wordpress", "youtube", "bing-maps", "dropbox", "instagram", "sketchup", "carto", "django", "gdal", "geoserver", "grass", "jupyter", "leaflet", "mapbox", "matplotlib", "mysql", "ogr", "openlayers", "osgeo", "osm", "pgadmin", "postgis", "postgresql", "proj4", "pyqgis", "qgis", "qt", "scikit", "scipy", "tilemill")

cots <- c("autodesk", "bentley", "cityengine", "drone2map", "ecognition", "envi", "er-mapper", "et-geowizards", "excel", "geomatica", "geosoft", "global-mapper", "illustrator", "mac", "matlab", "microstation", "modelbuilder", "pix4d", "plsql", "powerpoint", "silverlight", "spss", "tableau", "xtools-pro", "mapinfo", "arc", "oracle", "erdas", "esri", "fme", "microsoft", "-analyst")

Create new column with classified values calculated based on tag vector

tags$software <- ifelse(tags$tagName %in% foss, "FOSS", 
ifelse(grep(foss, tags$tagName, fixed = TRUE), "FOSS",
ifelse(tags$tagName %in% cots, "COTS", 
ifelse(grep(cots, tags$tagName, fixed = TRUE), "COTS", 
  "other"))))

When I run the code, the following warning is produced: "argument 'pattern' has length > 1 and only the first element will be used"
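
The message points at how grep treats its pattern argument: it takes a single pattern, so only the first element of foss or cots is searched for. In addition, grep returns the positions of the matching elements, whereas ifelse expects one TRUE/FALSE value per row, which is what grepl returns. A small sketch with a hypothetical tag_names vector (not the real data) shows the difference:

```r
# Hypothetical sample of tag names, for illustration only
tag_names <- c("qgis", "qgis-server", "arcgis-desktop", "anaconda")

# grep() returns the positions of the matching elements
idx <- grep("qgis", tag_names, fixed = TRUE)
print(idx)   # 1 2

# grepl() returns one logical value per element, which suits ifelse()
hit <- grepl("qgis", tag_names, fixed = TRUE)
print(hit)   # TRUE TRUE FALSE FALSE
```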

I am sure it is a very simple issue, but I cannot seem to figure it out.
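
One possible way to get the intended classification is sketched below. It is only a sketch under stated assumptions: the tags data frame and the foss and cots vectors are shortened, hypothetical stand-ins for the real ones, and matches_any is a helper name made up for this example. The idea is to run grepl once per pattern and combine the results, so that each row ends up with a single TRUE/FALSE value. Because an exact match is also a substring match, the separate %in% checks are not needed here.

```r
# Hypothetical, shortened stand-ins for the real data and tag vectors
tags <- data.frame(
  tagName = c("anaconda", "qgis-server", "arcgis-desktop",
              "spatial-analyst", "coordinate-system"),
  stringsAsFactors = FALSE
)
foss <- c("anaconda", "qgis", "postgis")
cots <- c("arc", "esri", "-analyst")

# Hypothetical helper: TRUE for each element of x that contains
# at least one of the literal (non-regex) patterns
matches_any <- function(patterns, x) {
  hits <- lapply(patterns, function(p) grepl(p, x, fixed = TRUE))
  Reduce(`|`, hits, rep(FALSE, length(x)))
}

is_foss <- matches_any(foss, tags$tagName)
is_cots <- matches_any(cots, tags$tagName)

# FOSS is checked first, so a tag matching both lists is coded as FOSS
tags$software <- ifelse(is_foss, "FOSS",
                 ifelse(is_cots, "COTS", "other"))
print(tags)
```

With the full vectors, short patterns such as "sp", "sf", "qt" or "mac" would match any tag that merely contains those letters, so those entries may need exact matching with %in% instead. A single regular expression built with paste(patterns, collapse = "|") and fixed = FALSE would be an alternative, provided no pattern contains regex metacharacters.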
