I am having an issue with some R code. I am trying to classify text values from one column into a new column. My data is a collection of tags used on the gis.stackexchange site, about 2,500 rows. My goal is to classify each tag as COTS, FOSS, or other. Reviewing the tags, there are two scenarios: tags that appear only once (e.g. anaconda) and tags that share a term across several variants (e.g. qgis, qgis-desktop, qgis-server). Both scenarios occur for COTS and FOSS tags.
My approach was to do the following:
- create a vector with all tags that represent FOSS
- create a vector with all tags that represent COTS
- create a new column called software and fill it using nested ifelse calls
- in the ifelse, where tagName is %in% the FOSS vector, code it as FOSS
- in the ifelse, use grep on the FOSS vector to pattern-match tags whose term appears in several variants (e.g. qgis) and code those as FOSS
- repeat both steps for COTS
The problem is that tags matched by the last grep (COTS) are being coded as FOSS. Obviously there is something wrong, but I cannot seem to figure out the issue. Below is the code and a link to the source data.
Tag vectors -- FOSS and COTS
foss <- c("anaconda", "android", "apache", "aptana", "google", "blender", "cordova", "docker", "drupal", "eclipse", "facebook", "firefox", "ftools", "fwtools", "geodjango", "geopandas", "geomoose", "geonetwork", "geonode", "geotools", "ggmap", "ggplot2", "gimp", "github", "gme", "chrome", "gvsig", "h2gis", "hadoop", "inkscape", "lastools", "laszip", "mongodb", "neo4j", "numpy", "open-data-kit", "opencv", "opendronemap", "openev", "opengeo-suite-composer", "opengl", "openjump", "openstreetmap", "opentopomap", "opentripplanner", "openwind", "orfeo-toolbox", "pandas", "pdal", "pgrouting", "pg2shape", "phonegap", "plpgsql", "ppygis", "pydev", "pygdal", "pyproj", "pyqspatialite", "rasterlite", "raster2pgsql", "rdal", "saga", "shapely", "shp2pgsql", "sp", "sf", "spatialite-gui", "three-js", "unity3d", "wordpress", "youtube", "bing-maps", "dropbox", "instagram", "sketchup", "carto", "django", "gdal", "geoserver", "grass", "jupyter", "leaflet", "mapbox", "matplotlib", "mysql", "ogr", "openlayers", "osgeo", "osm", "pgadmin", "postgis", "postgresql", "proj4", "pyqgis", "qgis", "qt", "scikit", "scipy", "tilemill")
cots <- c("autodesk", "bentley", "cityengine", "drone2map", "ecognition", "envi", "er-mapper", "et-geowizards", "excel", "geomatica", "geosoft", "global-mapper", "illustrator", "mac", "matlab", "microstation", "modelbuilder", "pix4d", "plsql", "powerpoint", "silverlight", "spss", "tableau", "xtools-pro", "mapinfo", "arc", "oracle", "erdas", "esri", "fme", "microsoft", "-analyst")
Create new column with classified values calculated based on tag vector
tags$software <- ifelse(tags$tagName %in% foss, "FOSS",
                 ifelse(grep(foss, tags$tagName, fixed = TRUE), "FOSS",
                 ifelse(tags$tagName %in% cots, "COTS",
                 ifelse(grep(cots, tags$tagName, fixed = TRUE), "COTS",
                 "other"))))
When I run the code, the following warning is produced: "argument 'pattern' has length > 1 and only the first element will be used"
I am sure it is a very simple issue, but I cannot seem to figure it out.
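For reference, here is a minimal sketch of what the nested ifelse seems to be aiming for, under a few assumptions. The warning arises because grep accepts a single pattern, so only the first element of foss ("anaconda") is ever searched for. In addition, grep returns the indices of matching rows, whereas ifelse needs one TRUE/FALSE per row, which is what grepl provides. The toy tags data frame, the shortened foss and cots vectors, and the names foss_pattern and cots_pattern below are illustrative stand-ins, not the real data.

```r
# Toy stand-ins for the real data (assumption: tagName is a character column).
tags <- data.frame(
  tagName = c("anaconda", "qgis-server", "arcgis-desktop", "excel", "coordinate-system"),
  stringsAsFactors = FALSE
)
foss <- c("anaconda", "qgis", "postgis")
cots <- c("excel", "arc", "esri")

# grep()/grepl() take a single pattern, so join each vector into one
# alternation regex, e.g. "anaconda|qgis|postgis".
# fixed = TRUE is dropped because "|" must be interpreted as a regex operator.
foss_pattern <- paste(foss, collapse = "|")
cots_pattern <- paste(cots, collapse = "|")

# grepl() returns one logical value per row, which is what ifelse() expects.
# A substring match also covers exact matches, so the %in% tests are not needed.
tags$software <- ifelse(grepl(foss_pattern, tags$tagName), "FOSS",
                 ifelse(grepl(cots_pattern, tags$tagName), "COTS", "other"))

print(tags)
```

One caveat with this approach: the patterns match anywhere inside a tag, so short entries such as "sp" or "arc" can match unrelated tags, and because FOSS is tested first, a tag that matches both lists ends up as FOSS. Anchoring the patterns or reordering the branches may be needed for the full tag list.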