Using Modularization to Handle Complex Projects - Python Forum - 经管之家

2021-12-18

When you start programming in Python, it is tempting to put all your program
code in a single file. Nothing stops you from defining functions and classes in the
same file as your main program. This option appeals to beginners because the program
is easy to run and there is no need to manage code across multiple files. But
a single-file approach does not scale to medium and large projects: it becomes
hard to keep track of all the functions and classes that you define.
For such projects, modular programming is the way to go. Modularity is a key tool
for reducing the complexity of a project. Modularization also facilitates efficient
programming, easier debugging and management, collaboration, and reusability.
In this chapter, we will discuss how to build and consume modules and
packages in Python.
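
As a rough illustration of building and consuming a module, the sketch below writes a tiny module to a temporary directory and then imports it. The module name `geometry_demo` and its `area` function are hypothetical, invented for this example; in a real project the module would simply be a `.py` file sitting next to the main program.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical module source; normally this would live in its own .py file.
MODULE_SOURCE = '''
def area(width, height):
    """Return the area of a rectangle."""
    return width * height
'''


def build_and_import(module_name, source):
    """Write `source` to <tmpdir>/<module_name>.py and import it as a module."""
    tmp_dir = tempfile.mkdtemp()
    Path(tmp_dir, f"{module_name}.py").write_text(source)
    sys.path.insert(0, tmp_dir)    # make the new file importable
    importlib.invalidate_caches()  # pick up files created after startup
    return importlib.import_module(module_name)


# The "main program" consumes the module instead of defining area() itself.
geometry = build_and_import("geometry_demo", MODULE_SOURCE)
print(geometry.area(3, 4))
```

The main program only needs the module's name and public functions, which is what keeps larger projects manageable.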

All replies

2021-12-18 17:52:28

宽客老丁 posted on 2021-12-18 10:39
When you start programming in Python, it is very tempting to put all your program
code in a single ...

Thanks for sharing.

2021-12-19 16:38:27

Thanks for sharing.

2021-12-20 08:59:53

2022-1-15 03:07:57

it's good to know it.

2022-1-19 09:14:13

Supervised vs. Unsupervised
The field of machine learning has two major branches, supervised learning and unsupervised learning, and plenty of sub-branches that bridge the two.
In supervised learning, the AI agent has access to labels, which it can use to improve its performance on some task. In the email spam filter problem, we have a dataset of emails containing the full text of every email. We also know which of these emails are spam and which are not (the so-called labels). These labels are very valuable in helping the supervised learning AI separate the spam emails from the rest.
In unsupervised learning, labels are not available. Therefore, the task of the AI agent is not well-defined, and performance cannot be so clearly measured. Consider the email spam filter problem, this time without labels. Now, the AI agent will attempt to understand the underlying structure of the emails, separating the database of emails into different groups such that emails within a group are similar to each other but different from emails in other groups.
This unsupervised learning problem is less clearly defined than the supervised learning problem and harder for the AI agent to solve. But, if handled well, the solution is more powerful.
Here’s why: the unsupervised learning AI may find several groups that it later tags as “spam,” but it may also find groups that it later tags as “important” or categorizes as “family,” “professional,” “news,” “shopping,” etc. In other words, because the problem does not have a strictly defined task, the AI agent may find interesting patterns above and beyond what we were initially looking for.
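To make the contrast concrete, here is a minimal sketch on synthetic data; the two Gaussian blobs stand in for email features and are an assumption, not real spam data. The supervised part uses the labels to build one centroid per class, while the unsupervised part (a bare-bones k-means) sees only the features and has to discover the groups on its own.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for email features: two well-separated 2-D blobs.
ham = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
spam = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
X = np.vstack([ham, spam])
y = np.array([0] * 50 + [1] * 50)  # labels: 0 = not spam, 1 = spam


def nearest_center(points, centers):
    """Index of the closest center for every point."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)


# Supervised: the labels tell us which rows to average into each class centroid.
class_centroids = np.array([X[y == k].mean(axis=0) for k in (0, 1)])
supervised_pred = nearest_center(X, class_centroids)
accuracy = (supervised_pred == y).mean()  # measurable because labels exist

# Unsupervised: k-means sees only X and must find the groups itself.
centers = X[[0, -1]].copy()  # simple initialisation: first and last row
for _ in range(10):
    groups = nearest_center(X, centers)
    centers = np.array([X[groups == k].mean(axis=0) for k in (0, 1)])

print("supervised accuracy:", accuracy)
print("cluster sizes:", np.bincount(groups))
```

Note that the unsupervised part has no accuracy to report: without labels, the groups still have to be inspected and tagged afterwards.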

2022-1-19 09:14:45

The Strengths and Weaknesses of Supervised Learning
Supervised learning excels at optimizing performance in well-defined tasks with plenty of labels. For example, consider a very large dataset of images of objects, where each image is labeled. If the dataset is sufficiently large and we train using the right machine learning algorithms (e.g., convolutional neural networks) on powerful enough computers, we can build a very good supervised learning-based image classification system.
As the supervised learning AI trains on the data, it can measure its performance (via a cost function) by comparing its predicted image label with the true image label that we have on file. The AI explicitly tries to minimize this cost function so that its error on never-before-seen images (from a holdout set) is as low as possible.
This is why labels are so powerful: they help guide the AI agent by providing it with an error measure. The AI uses the error measure to improve its performance over time. Without such labels, the AI does not know how successful it is (or isn’t) in correctly classifying images.
However, manually labeling an image dataset is costly, and even the best-curated image datasets have only thousands of labels. This is a problem because supervised learning systems will be very good at classifying images of objects for which they have labels but poor at classifying images of objects for which they have no labels.
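As a small illustration of an error measure computed from labels, the sketch below implements binary cross-entropy (log loss), one common cost function; the post does not say which cost function is used, so this choice is an assumption. The holdout labels and predicted probabilities are made up for the example.

```python
import numpy as np


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy between true labels and predicted probabilities."""
    p = np.clip(p_pred, eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))


# Hypothetical holdout labels and two sets of predicted probabilities.
y_holdout = np.array([1, 0, 1, 1, 0])
mostly_right = np.array([0.9, 0.1, 0.8, 0.7, 0.2])
uninformed = np.full(5, 0.5)  # a model that always answers "50/50"

print(log_loss(y_holdout, mostly_right))
print(log_loss(y_holdout, uninformed))
```

The lower value for the first set of predictions is the signal a supervised learner would follow during training; without `y_holdout` neither number could be computed.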

2022-1-19 09:15:12

The Strengths and Weaknesses of Unsupervised Learning
Supervised learning will trounce unsupervised learning at narrowly defined tasks for which we have well-defined patterns that do not change much over time and sufficiently large, readily available labeled datasets.
However, for problems where patterns are unknown or constantly changing or for which we do not have sufficiently large labeled datasets, unsupervised learning truly shines.
Instead of being guided by labels, unsupervised learning works by learning the underlying structure of the data it has trained on. It does this by trying to represent the data it trains on with a set of parameters that is significantly smaller than the number of examples available in the dataset. By performing this representation learning, unsupervised learning is able to identify distinct patterns in the dataset.
In the image dataset example (this time without labels), the unsupervised learning AI may be able to identify and group images based on how similar they are to each other and how different they are from the rest. For example, all the images that look like chairs will be grouped together, all the images that look like dogs will be grouped together, etc.

2022-1-19 09:15:42

Data acquisition and exploration
Before we work with the dimensionality reduction algorithms, let’s load the libraries we will use:
# Import libraries
'''Main'''
import numpy as np
import pandas as pd
import os, time
import pickle, gzip
'''Data Viz'''
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
import matplotlib as mpl
%matplotlib inline
'''Data Prep and Model Evaluation'''
from sklearn import preprocessing as pp
from scipy.stats import pearsonr
from numpy.testing import assert_array_almost_equal
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import log_loss
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.metrics import confusion_matrix, classification_report
'''Algos'''
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import lightgbm as lgb

2022-1-19 09:16:11

Linear Projection vs. Manifold Learning
There are two major branches of dimensionality reduction. The first is known as linear projection, which involves linearly projecting data from a high-dimensional space to a low-dimensional space. This includes techniques such as principal component analysis, singular value decomposition, and random projection.
The second is known as manifold learning, which is also referred to as nonlinear dimensionality reduction. This involves techniques such as isomap, which learns the curved distance (also called the geodesic distance) between points rather than the Euclidean distance. Other techniques include multidimensional scaling (MDS), locally linear embedding (LLE), t-distributed stochastic neighbor embedding (t-SNE), dictionary learning, random trees embedding, and independent component analysis.
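A minimal sketch of one technique from each branch, using scikit-learn on a synthetic "swiss roll" dataset. The dataset, sample size, and `n_neighbors` value are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

# Synthetic 3-D data lying on a curved 2-D sheet.
X, _ = make_swiss_roll(n_samples=300, random_state=0)

# Linear projection: PCA keeps the two directions of greatest variance.
X_pca = PCA(n_components=2).fit_transform(X)

# Manifold learning: Isomap approximates geodesic distances along a
# nearest-neighbour graph instead of straight-line Euclidean distances.
X_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

print(X.shape, X_pca.shape, X_iso.shape)
```

Both calls reduce three dimensions to two; they differ in which distances between points the two-dimensional result tries to preserve.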

2022-1-19 09:16:41

Dictionary Learning
One such method is dictionary learning, which learns the sparse representation of the original data. The resulting matrix is known as the dictionary, and the vectors in the dictionary are known as atoms.
Assuming there are d features in the original data and n atoms in the dictionary, we can have a dictionary that is either undercomplete, where n < d, or overcomplete, where n > d.
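The undercomplete and overcomplete cases can be sketched with scikit-learn's `DictionaryLearning`. The random data, the helper name `learn_dictionary`, and the `alpha` and `max_iter` values are illustrative assumptions; only the shapes matter here.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))  # synthetic data with d = 8 features


def learn_dictionary(data, n_atoms):
    """Fit a dictionary with `n_atoms` atoms; return (dictionary, sparse codes)."""
    model = DictionaryLearning(n_components=n_atoms, alpha=1.0,
                               max_iter=20, random_state=0)
    codes = model.fit_transform(data)
    return model.components_, codes


under_dict, under_codes = learn_dictionary(X, n_atoms=4)   # undercomplete: n < d
over_dict, over_codes = learn_dictionary(X, n_atoms=12)    # overcomplete: n > d

# Each row of a dictionary is one atom with d entries.
print(under_dict.shape, over_dict.shape)
```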

2022-1-19 09:17:32

Search for the Optimal Number of Principal Components
Now, let’s run a few experiments, reducing the number of principal components that PCA generates and evaluating the fraud detection results each time. The PCA-based fraud detection solution needs a high enough reconstruction error on the rare cases to separate fraud from normal transactions meaningfully. At the same time, the error must not be uniformly low or uniformly high across all transactions, because then the rare and the normal transactions become virtually indistinguishable.
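The trade-off can be sketched with a NumPy-only PCA on synthetic data. The data generator, the helper `reconstruction_error`, and the values of k tried are all assumptions for illustration, not the experiment from the post. Keeping every component reconstructs all rows perfectly, so the error stops separating anything.

```python
import numpy as np


def reconstruction_error(X_train, X, n_components):
    """Squared error after projecting X onto the top principal components of X_train."""
    mean = X_train.mean(axis=0)
    # Rows of vt are the principal directions of the centred training data.
    _, _, vt = np.linalg.svd(X_train - mean, full_matrices=False)
    components = vt[:n_components]
    X_rec = (X - mean) @ components.T @ components + mean
    return ((X - X_rec) ** 2).sum(axis=1)


rng = np.random.default_rng(0)
# Synthetic "normal" data: 10 features driven by 3 latent factors plus small noise.
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
normal = latent @ mixing + 0.05 * rng.normal(size=(500, 10))
# Synthetic "rare" points that ignore that structure.
rare = rng.normal(scale=3.0, size=(5, 10))

for k in (1, 3, 10):
    normal_err = reconstruction_error(normal, normal, k).mean()
    rare_err = reconstruction_error(normal, rare, k).mean()
    print(f"k={k:2d}  normal={normal_err:.4f}  rare={rare_err:.4f}")
```

In this toy setup the gap between the two averages is what a search over the number of components would be looking for.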

2022-1-19 09:17:55

Results using normal PCA and 27 principal components
As you can see, we are able to catch 80% of the fraud with 75% precision. This is very impressive considering that we did not use any labels. To make these results more tangible, consider that there are 190,820 transactions in the training set and only 330 are fraudulent.
Using PCA, we calculated the reconstruction error for each of these 190,820 transactions. If we sort the transactions by reconstruction error (also referred to as the anomaly score) in descending order and take the top 350, we find that 264 of them are fraudulent.
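The quoted percentages follow directly from these counts, as this short check shows (the variable names are ours):

```python
# Figures quoted in the post.
total_fraud = 330        # fraudulent transactions in the training set
flagged = 350            # top transactions by anomaly score
fraud_in_flagged = 264   # flagged transactions that are actually fraudulent

precision = fraud_in_flagged / flagged     # share of flagged that are fraud
recall = fraud_in_flagged / total_fraud    # share of all fraud that was flagged

print(f"precision = {precision:.1%}, recall = {recall:.1%}")
# prints: precision = 75.4%, recall = 80.0%
```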

2022-1-19 09:18:32

Kernel PCA Anomaly Detection
Now let’s design a fraud detection solution using kernel PCA, which is a nonlinear form of PCA and is useful if the fraud transactions are not linearly separable from the nonfraud transactions.
We need to specify the number of components we would like to generate, the kernel (we will use the RBF kernel as we did in the previous chapter), and the gamma (which is set to 1/n_features by default, so 1/30 in our case). We also need to set fit_inverse_transform to True in order to use the built-in inverse_transform function provided by Scikit-Learn.
Finally, because kernel PCA is so expensive to train, we will train on just the first two thousand samples in the transactions dataset. This is not ideal, but it is necessary to perform experiments quickly.
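A minimal sketch of that setup with scikit-learn's `KernelPCA`. The random 30-feature data, the 200-row training subset, and `n_components=10` are stand-in assumptions to keep the example fast; they are not the post's dataset or settings.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30))  # synthetic stand-in with 30 features

kpca = KernelPCA(
    n_components=10,             # illustrative choice
    kernel="rbf",
    gamma=None,                  # None means 1 / n_features, here 1/30
    fit_inverse_transform=True,  # needed before inverse_transform can be called
)
kpca.fit(X[:200])                # fit on a subset to keep training cheap

# Map every row into kernel PCA space and back, then score by squared error.
X_rec = kpca.inverse_transform(kpca.transform(X))
anomaly_score = ((X - X_rec) ** 2).sum(axis=1)
print(X_rec.shape, anomaly_score.shape)
```

Fitting on a subset while scoring all rows mirrors the speed trade-off described in the post.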

2022-1-19 09:18:55

Dictionary Learning Anomaly Detection
Let’s use dictionary learning to develop a fraud detection solution. Recall that, in dictionary learning, the algorithm learns the sparse representation of the original data. Using the vectors in the learned dictionary, each instance in the original data can be reconstructed as a weighted sum of these learned vectors.
For anomaly detection, we want to learn an undercomplete dictionary so that the vectors in the dictionary are fewer in number than the original dimensions. With this constraint, it will be easier to reconstruct the more frequently occurring normal transactions and much more difficult to reconstruct the rarer fraud transactions.
In our case, we will generate 28 vectors (or components).
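The reconstruction step can be sketched as follows with scikit-learn's `DictionaryLearning`. The random 10-feature data, the 6 atoms, and the `alpha` and `max_iter` values are assumptions for a quick demonstration, not the 28-component setup on the transactions data.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # synthetic stand-in for the transactions

# Undercomplete dictionary: fewer atoms (6) than original features (10).
model = DictionaryLearning(n_components=6, alpha=1.0, max_iter=20, random_state=0)
codes = model.fit(X).transform(X)   # sparse weights, one row per instance
X_rec = codes @ model.components_   # weighted sum of the learned atoms

# Rows that reconstruct poorly get a high anomaly score.
anomaly_score = ((X - X_rec) ** 2).sum(axis=1)
top5 = np.argsort(anomaly_score)[::-1][:5]
print(X_rec.shape, top5)
```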

2022-1-19 09:19:37

Rules-Based vs. Machine Learning
Using a rules-based approach, we can design a spam filter with explicit rules to catch spam, such as flagging emails that contain “u” instead of “you,” “4” instead of “for,” “BUY NOW,” etc. But this system would be difficult to maintain over time as bad guys change their spam behavior to evade the rules. If we used a rules-based system, we would have to adjust the rules manually and frequently just to stay up-to-date. It would also be very expensive to set up: think of all the rules we would need to create to make this a well-functioning system.
