2014-05-13
Level 1
Scenario:
There is a file called “Big_Data” and a file called “Needed_Data”. The “Needed_Data” file contains a list of fields that need to be pulled from the “Big_Data” file. When the code runs, it should create a file called “Output_Data” containing all the needed fields plus the primary key of the “Big_Data” file.
Input files
File one:
•        File Name:        Big_Data
•        File Type:        SAS dataset
•        Records:        10 million records.
•        Variables:        5 thousand variables per record.
•        Primary key:        Account_number


File two:
•        File Name:        Needed_Data.
•        File Type:        SAS dataset.
•        Records:        1 to X number.
•        Variables:        1
•        Varname:
o        Keep_list: Contains the name of a single variable that would be on the Big_Data file.  

Example data:
  Keep_list
  Apples
  Oranges         
  Grapes


Processing requirement:
Output file “Output_Data” should contain all the fields that were requested in the “Needed_Data” file plus the primary key.

Output and Usage requirement:
None.

Error handling requirement:
None.

Suggestion:
For now assume the “Needed_Data” file will always contain variables that are on the “Big_Data” file.
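
One possible way to approach Level 1 is sketched below. It assumes both datasets live in the WORK library, that every value of Keep_list is a valid SAS variable name, and that the whole list fits in one macro variable (SAS caps a macro variable at 65,534 characters). The macro variable name keep_vars is made up for this illustration.

```sas
/* Collect the requested variable names into one space-separated
   macro variable. keep_vars is a hypothetical name. */
proc sql noprint;
    select Keep_list
        into :keep_vars separated by ' '
        from Needed_Data;
quit;

/* KEEP= on the input dataset: only the primary key and the
   requested variables are brought into the DATA step. */
data Output_Data;
    set Big_Data (keep=Account_number &keep_vars);
run;
```

With the example data, the SET statement resolves to keep=Account_number Apples Oranges Grapes. Putting KEEP= on the input dataset rather than on the output dataset is a deliberate choice here, since Big_Data carries 5 thousand variables per record and most of them are not wanted.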

Level 2

All requirements identical to Level 1 except for the following changes.

Input files
File two:
•        File Name:        Needed_Data.
•        File Type:        SAS dataset.
•        Records:        1 to X number.
•        Variables:        3
•        Varname:       
o        Keep_list: Name of a single variable that is on the “Big_Data” file.
o        Where_list: The expected value of the variable in the keep list.
o        Rename_list: The name the variable should be given in the “Output_Data” file. May be blank, in which case the variable keeps its original name.

Example data:
  Keep_list        Where_list        Rename_list
  Apples           Red               Ambrosia
  Oranges          Orange
  Grapes           Green             Seedless

Processing requirement:
Output file “Output_Data” should contain all the fields that were requested in the “Needed_Data” file plus the primary key. The output fields should be renamed wherever a new name was supplied.

Example Output_data:
  Account_number
  Ambrosia
  Oranges
  Seedless
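
The renaming step could be handled along the following lines. This is only a sketch: it assumes the datasets are in WORK, that the combined name lists fit in a 32,767-character variable, and it does not use Where_list, because the processing requirement does not say how that column should be applied. The macro variable names keep_vars and rename_opt are invented for the example.

```sas
/* Defaults, so the later SET statement still resolves cleanly
   when no row asks for a rename. */
%let keep_vars=;
%let rename_opt=;

data _null_;
    length keep_str rename_str $32767;
    retain keep_str rename_str;
    set Needed_Data end=last;

    /* Every requested variable goes on the keep list. */
    keep_str = catx(' ', keep_str, Keep_list);

    /* Only rows with a new name contribute an old=new pair. */
    if not missing(Rename_list) then
        rename_str = catx(' ', rename_str,
                          catx('=', Keep_list, Rename_list));

    if last then do;
        call symputx('keep_vars', keep_str);
        if not missing(rename_str) then
            call symputx('rename_opt',
                         cats('rename=(', rename_str, ')'));
    end;
run;

/* KEEP= is applied before RENAME=, so KEEP= uses the old names. */
data Output_Data;
    set Big_Data (keep=Account_number &keep_vars &rename_opt);
run;
```

For the example data, rename_opt would resolve to rename=(Apples=Ambrosia Grapes=Seedless), and Oranges keeps its name because its Rename_list value is blank.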

Output and Usage requirement:
None.

Error handling requirement:
Do not expect every field requested in the “Needed_Data” file to be on the “Big_Data” file. If a field is missing, it should not appear on the “Output_Data” file, and a note should be added to the log indicating that the data field was not available. The code should then continue with the remaining fields.

Suggestion:
Do not assume the “Needed_Data” file will always contain variables that are on the “Big_Data” file.
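
One way to meet the error handling requirement is to compare the requested names against the column metadata of Big_Data before building the keep list. The sketch below assumes Big_Data is in the WORK library; the table names Found_Vars and Missing_Vars are hypothetical and exist only for this illustration.

```sas
proc sql noprint;
    /* Requested rows whose variable really exists on Big_Data.
       DICTIONARY.COLUMNS stores libname and memname in uppercase. */
    create table Found_Vars as
        select *
        from Needed_Data
        where upcase(Keep_list) in
              (select upcase(name)
               from dictionary.columns
               where libname = 'WORK' and memname = 'BIG_DATA');

    /* Requested rows with no matching variable. */
    create table Missing_Vars as
        select Keep_list
        from Needed_Data
        where upcase(Keep_list) not in
              (select upcase(name)
               from dictionary.columns
               where libname = 'WORK' and memname = 'BIG_DATA');
quit;

/* Write one log line per unavailable field, then carry on. */
data _null_;
    set Missing_Vars;
    put 'NOTE: Requested variable ' Keep_list
        'is not on Big_Data and was skipped.';
run;
```

Found_Vars can then stand in for Needed_Data when the KEEP= and RENAME= lists are built, so a missing field never reaches the SET statement and the remaining fields are processed as usual.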
