Summary
Research and develop a machine learning model that encodes the industry that a client firm is working in.
Business context
Industry, or sector, is one of the most prominent features of a firm. It is used to group firms for business analytics (reports, dashboards) and as an input to machine learning models for prediction. Industry is also among the better-known features of our clients: a so-called NAICS code (North American Industry Classification System) is assigned to close to 100% of them. However, for firms that do not bank with ING we have no industry classification.
Project context
We would like to have a model that classifies the industries that a firm operates in.
A firm's buyers and suppliers are largely dictated by the industry that the firm operates in. A hotel is unlikely to receive large sums from a bakery. At ING we see our clients' buyers and suppliers, but we only know the buyers' and suppliers' industries if they are in turn ING clients. We thus have only partial information on these business partners, typically more complete for the smaller businesses, at least within NL.
The model should consider the industries of the buyers and suppliers, and potentially how much is paid to or received from them, and infer the industry of the firm. It can be trained on our own clients, for which we know the industry, and then applied to external firms to estimate their industries.
Where classical machine learning problems have a fixed set of features as input, this case (initially) does not: every firm has a different number of buyers and suppliers. Furthermore, there is no order in these; one supplier does not "go before" another. The model needs to be able to deal with this. An easy solution is to embed the buyer and supplier industries into a TF-IDF type vector, but other solutions may be out there.
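The TF-IDF idea can be sketched in a few lines. The example below is only an illustration under stated assumptions: the firm names, the two-digit sector codes attached to them, and the function name `tfidf_vectors` are all hypothetical, and the smoothed IDF formula is one common choice rather than a prescribed one. It shows how a variable-length, unordered list of counterparty industries becomes a fixed-length vector.

```python
import math
from collections import Counter


def tfidf_vectors(firms):
    """Turn each firm's bag of counterparty sectors into a fixed-length vector.

    firms: dict mapping a firm id to a list of counterparty sector codes.
    The lists may have any length and any order.
    """
    n_firms = len(firms)

    # Document frequency: in how many firms does each sector occur at least once?
    df = Counter()
    for codes in firms.values():
        df.update(set(codes))

    vocab = sorted(df)
    # Smoothed IDF (an assumption): sectors seen at every firm keep a positive weight.
    idf = {c: math.log((1 + n_firms) / (1 + df[c])) + 1.0 for c in vocab}

    vectors = {}
    for firm, codes in firms.items():
        counts = Counter(codes)
        total = sum(counts.values()) or 1  # guard against firms without counterparties
        # Term frequency (share of counterparties in sector c) times IDF.
        vectors[firm] = [counts[c] / total * idf[c] for c in vocab]
    return vocab, vectors


# Hypothetical toy data: counterparty sectors per firm.
firms = {
    "hotel_a": ["72", "44", "56", "44"],
    "bakery_b": ["11", "44"],
    "farm_c": ["11", "11", "31"],
}
vocab, vectors = tfidf_vectors(firms)
for firm, vec in vectors.items():
    print(firm, [round(x, 3) for x in vec])
```

Because the vector is built from counts, shuffling a firm's counterparties leaves it unchanged. Replacing the counts with summed payment amounts would be one way to bring in how much is paid to or received from each industry.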
There are degrees of being wrong. If a firm is a pig farm, but the model classifies it as a sheep farm, then the model is less far off than when it classifies it as an electric power generation company. NAICS codes luckily contain a hierarchy that can be used to assess how far off the model is: the first two digits give the sector, the third digit the subsector, the fourth the industry group, and so on, up to six digits.
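One possible way to turn this hierarchy into a number is to count how many levels two codes share. The sketch below assumes full six-digit codes given as strings. The function names are hypothetical, and the example codes (112210 for hog and pig farming, 112410 for sheep farming, 221112 for fossil fuel electric power generation) are used for illustration and should be checked against the official NAICS manual.

```python
N_LEVELS = 5  # sector, subsector, industry group, and the two finer levels


def shared_depth(code_a, code_b):
    """Number of NAICS hierarchy levels shared by two six-digit codes.

    Level 1 is the two-digit sector; each further digit adds one level.
    Simplification: sectors that span several two-digit codes
    (e.g. manufacturing, 31-33) are treated as distinct sectors here.
    """
    if code_a[:2] != code_b[:2]:
        return 0
    depth = 1
    for i in range(2, 6):
        if code_a[i] != code_b[i]:
            break
        depth += 1
    return depth


def hierarchical_error(true_code, predicted_code):
    """0 for an exact match, up to N_LEVELS when even the sector differs."""
    return N_LEVELS - shared_depth(true_code, predicted_code)


pig_farm, sheep_farm, power_plant = "112210", "112410", "221112"
print(hierarchical_error(pig_farm, sheep_farm))   # same sector and subsector
print(hierarchical_error(pig_farm, power_plant))  # different sector
```

An error of this kind could serve as an evaluation metric or as the cost term in a learning objective, for instance by weighting each misclassification by its hierarchical error.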
Research tasks
- Perform literature research into methods that can handle variable length and unordered inputs
- Choose a number of models that are suitable for the task, the input, the data size, etc.
These should include but don't have to be limited to well-known methods such as logistic regression, boosted decision trees, neural networks, k-means or geometric models (e.g. Node2vec)
- Construct one or more learning objectives that incorporate degrees of wrongness
- Construct a number of metrics that measure performance at different levels of wrongness
- Create a pipeline with which you can train and test different combinations of models and objectives
- Error analysis: search for specific characteristics for which a model gets the industry wrong, or wrong more often. E.g. when there are only a few buyers or suppliers, for certain buyers or suppliers, for little money spent or received, etc.
This important step can help you improve the model, and also helps us understand how reliable the predicted industry is for different characteristics.
- Compare models and objectives, perform error analysis, iterate and improve
- Apply the best competing model(s) to our own clients to spot firms that may have been wrongly labeled, or that may operate in multiple industries
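As a rough illustration of the metric and error-analysis tasks above, the sketch below computes accuracy at several NAICS prefix lengths and then splits it by the number of counterparties a firm has. Everything in it is an assumption made for the example: the function names, the bucket edges (5 and 20), and the toy codes and counts, which are not real results.

```python
from collections import defaultdict


def accuracy_at_digits(true_codes, predicted_codes, digits):
    """Share of firms whose predicted code matches on the first `digits` digits."""
    hits = sum(t[:digits] == p[:digits] for t, p in zip(true_codes, predicted_codes))
    return hits / len(true_codes)


def bucket_label(n, edges=(5, 20)):
    """Assign a counterparty count to a coarse bucket (edges are an assumption)."""
    if n < edges[0]:
        return f"<{edges[0]}"
    if n < edges[1]:
        return f"{edges[0]}-{edges[1] - 1}"
    return f">={edges[1]}"


def accuracy_by_bucket(true_codes, predicted_codes, n_counterparties, digits):
    """Accuracy at a given prefix length, split by number of counterparties."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p, n in zip(true_codes, predicted_codes, n_counterparties):
        label = bucket_label(n)
        totals[label] += 1
        hits[label] += t[:digits] == p[:digits]
    return {label: hits[label] / totals[label] for label in totals}


# Hypothetical toy data, for illustration only.
true_codes = ["112210", "112410", "221112", "722511"]
predicted_codes = ["112410", "112410", "221118", "445110"]
n_counterparties = [3, 12, 40, 2]

for digits in (2, 3, 4, 6):
    print(digits, accuracy_at_digits(true_codes, predicted_codes, digits))
print(accuracy_by_bucket(true_codes, predicted_codes, n_counterparties, digits=2))
```

The same bucketing could be repeated for other characteristics, such as total amount paid or received, to see where the predicted industry is least reliable.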
Research goals
- Compare various models and objectives for the task
- Discuss the strengths (where it performs well) and weaknesses (where it performs badly) of the best competing model(s)
- Document all methods, models, tests and results in reproducible documentation
- Suggest clients that may have been wrongly labeled, or that may operate in multiple industries
The salary is €700