The loan amount is a continuous feature (even though it is recorded as an integer amount), while grade and home ownership are categorical. The loan_status is our classification target, which means it is also a discrete variable, but we don’t consider it a feature and won’t process it as such. Unlike some other libraries and frameworks, scikit-learn requires you to handle categorical features explicitly in most cases. That takes more work, but it gives you more control over the processing and a better idea of what happens to your data. The encoding that is most appropriate depends on the model you’re using, but there are some general encoding schemes that are frequently used.

One option is to replace each category with an integer code. We could create a new dataframe using these integer codes, which could then be interpreted by a machine learning model. However, this is often problematic because it imposes an order and a distance between the categories that might not accurately reflect the semantics of the data. Both pandas and scikit-learn by default use the lexical ordering of categories, so MORTGAGE corresponds to 0, OWN to 1, and RENT to 2. We could specify our own ordering, say RENT, MORTGAGE, OWN (describing degrees of ownership), but this is also not entirely satisfactory: if we encode it using integers, we postulate that the difference between RENT and MORTGAGE is the same as the difference between MORTGAGE and OWN, and that the difference between RENT and OWN is twice the difference between MORTGAGE and OWN.
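
As a rough sketch of the integer encoding described above, the following uses pandas categoricals to produce both the default lexical codes and codes for a hand-picked ordering. The toy dataframe, the column name `home_ownership`, and the variable names are assumptions made for illustration; they are not taken from the original dataset.

```python
import pandas as pd

# Hypothetical toy data; the column name and values are assumptions for illustration.
df = pd.DataFrame({"home_ownership": ["RENT", "OWN", "MORTGAGE", "RENT"]})

# Default behaviour: categories are sorted lexically,
# so MORTGAGE -> 0, OWN -> 1, RENT -> 2.
default_codes = df["home_ownership"].astype("category").cat.codes

# Custom ordering: RENT -> 0, MORTGAGE -> 1, OWN -> 2.
ownership_dtype = pd.CategoricalDtype(
    categories=["RENT", "MORTGAGE", "OWN"], ordered=True
)
custom_codes = df["home_ownership"].astype(ownership_dtype).cat.codes

# New dataframe holding the original strings next to both integer encodings.
encoded = df.assign(default_code=default_codes, custom_code=custom_codes)
print(encoded)
```

Either way, the integer codes are evenly spaced, so a model that treats them as numbers sees RENT and OWN as twice as far apart as MORTGAGE and OWN under the custom ordering, whether or not that matches the data.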